Class 32: Intro to NLP

  1. Say hello on Zoom

  2. Share a sentence on the doc linked on Prismia

import numpy as np

Reviewing Confidence Intervals

# %load http://drsmb.co/310
def classification_confint(acc, n):
    '''
    Compute the 95% confidence interval for a classification problem.
      acc -- classification accuracy
      n   -- number of observations used to compute the accuracy
    Returns a tuple (lb,ub)
    '''
    interval = 1.96*np.sqrt(acc*(1-acc)/n)
    lb = max(0, acc - interval)
    ub = min(1.0, acc + interval)
    return (lb,ub)

If you trained two classifiers on the same data and evaluated each on the same 50 test samples, getting accuracies of 78% and 90%, is the difference significant?

To check, we compute the confidence interval for each.

classification_confint(.78,50)
(0.6651767828355258, 0.8948232171644742)
classification_confint(.9,50)
(0.816844242532462, 0.983155757467538)

Then we check whether the intervals overlap. They do, so the accuracies are not significantly different.

This means that while 78% and 90% seem meaningfully different, with only 50 test samples the difference is not statistically significant: we can't formally guarantee that the two classifiers have reliably different performance.
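If we wanted to check the overlap programmatically, here is a minimal sketch (intervals_overlap is a helper defined here for illustration, not part of the course code):

def intervals_overlap(int_a, int_b):
    '''
    Check whether two (lb,ub) confidence intervals overlap.
    '''
    return int_a[0] <= int_b[1] and int_b[0] <= int_a[1]

intervals_overlap(classification_confint(.78, 50), classification_confint(.9, 50))
True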

If we had more samples, it could be significant; for example, with 200 samples the intervals no longer overlap, so the accuracies are significantly different.

N = 200
classification_confint(.9,N)
(0.8584221212662311, 0.941577878733769)
classification_confint(.78,N)
(0.722588391417763, 0.8374116085822371)

Natural Language Processing

The first thing we need to do to model text is to transform it into a numerical representation. We can't use any of the models we've seen so far, or most other models, on non-numerical data.

terms:

  • document: unit of text we’re analyzing (one sample)

  • token: sequence of characters in some particular document that are grouped together as a useful semantic unit for processing (basically a word)

  • stop words: common words that carry little meaning for the task (like a, the, an), so we often remove them. Note that which words count as stop words is context dependent (see the sketch after this list).
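For example, the CountVectorizer we use below can remove English stop words for us. A minimal sketch, with a made-up sentence:

from sklearn.feature_extraction.text import CountVectorizer

# stop_words='english' drops sklearn's built-in English stop word list
counts_nostop = CountVectorizer(stop_words='english')
counts_nostop.fit_transform(['the cat is on the mat'])
counts_nostop.vocabulary_
{'cat': 0, 'mat': 1}

Only “cat” and “mat” remain; “the”, “is”, and “on” were removed as stop words.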

Representation

We will use a vector, or bag of words, representation, implemented by the CountVectorizer: each document becomes a vector of counts of the words in it.

Some sample text:

# %load http://drsmb.co/310
text = {
'Demeus Alves':'Hope everybody is staying safe',
'Ryan Booth':'The power is out where I live, might be forced to leave soon',
'Brianna MacDonald':'Rainy days',
'Jair Delgado':'Can not wait for lunch... hungry',
'Shawn Vincent':'I am excited for Thanksgiving',
'Jacob Afonso':'Short weeks are the best!',
'Ryan Buquicchio':'The sentence is sentence. (Best sentence ever)',
'Nick McCaffery':'Very windy today',
'David Perrone':'this is a sentence',
'Masoud':'It is rainy here. What about there?',
'Rony Lopes':'I get to relax later this week',
'Patrick Dowd':'It is cold out today',
'Ruifang Kuang':'Happy Thanksgiving!',
}
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import euclidean_distances

Let’s try it on one:

s1 = text['Demeus Alves']
s1
'Hope everybody is staying safe'

First we initialize the object:

counts = CountVectorizer()

Then we can fit and transform at once; this builds the representation and returns the input represented that way.

counts.fit_transform([s1])
<1x5 sparse matrix of type '<class 'numpy.int64'>'
	with 5 stored elements in Compressed Sparse Row format>

This tells us the size and that it's a “sparse matrix”, but that doesn't display much more. To see more, we can cast it to a regular array.

counts.fit_transform([s1]).toarray()
array([[1, 1, 1, 1, 1]])

This doesn't tell us much, because every word in this short sentence occurs exactly once.
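If a word repeats within a document, its count goes above one. For example, one sentence in the dataset above repeats the word “sentence” three times; a minimal sketch, refitting on just that document:

counts.fit_transform([text['Ryan Buquicchio']]).toarray()
array([[1, 1, 1, 3, 1]])

The columns are the alphabetically sorted vocabulary (best, ever, is, sentence, the), so the 3 is the count of “sentence”.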

We can also look at the “vocabulary” (also called the “dictionary”) for the whole representation; it maps each word to its column index.

counts.vocabulary_
{'hope': 1, 'everybody': 0, 'is': 2, 'staying': 4, 'safe': 3}

We can instead apply it to the whole dataset.

counts.fit_transform(text.values())
<13x48 sparse matrix of type '<class 'numpy.int64'>'
	with 65 stored elements in Compressed Sparse Row format>

Now there are more rows (samples/documents) and more columns (words in the vocabulary).

counts.vocabulary_
{'hope': 16,
 'everybody': 9,
 'is': 18,
 'staying': 34,
 'safe': 30,
 'the': 36,
 'power': 27,
 'out': 26,
 'where': 46,
 'live': 22,
 'might': 24,
 'be': 3,
 'forced': 12,
 'to': 39,
 'leave': 21,
 'soon': 33,
 'rainy': 28,
 'days': 7,
 'can': 5,
 'not': 25,
 'wait': 42,
 'for': 11,
 'lunch': 23,
 'hungry': 17,
 'am': 1,
 'excited': 10,
 'thanksgiving': 35,
 'short': 32,
 'weeks': 44,
 'are': 2,
 'best': 4,
 'sentence': 31,
 'ever': 8,
 'very': 41,
 'windy': 47,
 'today': 40,
 'this': 38,
 'it': 19,
 'here': 15,
 'what': 45,
 'about': 0,
 'there': 37,
 'get': 13,
 'relax': 29,
 'later': 20,
 'week': 43,
 'cold': 6,
 'happy': 14}

We can save the transformed data to a variable.

mat = counts.fit_transform(text.values()).toarray()
mat
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1,
        1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0,
        0, 0, 1, 0],
       [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0],
       [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
        0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
        0, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0],
       [0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
        1, 0, 0, 0],
       [0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0,
        0, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
        0, 0, 0, 0],
       [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0,
        0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
        0, 1, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0,
        0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1,
        0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0,
        0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
        0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0]])

To make it easier to read, we can use a DataFrame.

import pandas as pd

The index is the keys of the dictionary of sentences; the columns are the words from the vocabulary. The get_feature_names method returns the vocabulary as a list, sorted by column index, instead of a dictionary mapping words to numbers.

text_df = pd.DataFrame(data=mat, index = text.keys(), columns=counts.get_feature_names() )
text_df
about am are be best can cold days ever everybody ... this to today very wait week weeks what where windy
Demeus Alves 0 0 0 0 0 0 0 0 0 1 ... 0 0 0 0 0 0 0 0 0 0
Ryan Booth 0 0 0 1 0 0 0 0 0 0 ... 0 1 0 0 0 0 0 0 1 0
Brianna MacDonald 0 0 0 0 0 0 0 1 0 0 ... 0 0 0 0 0 0 0 0 0 0
Jair Delgado 0 0 0 0 0 1 0 0 0 0 ... 0 0 0 0 1 0 0 0 0 0
Shawn Vincent 0 1 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
Jacob Afonso 0 0 1 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 1 0 0 0
Ryan Buquicchio 0 0 0 0 1 0 0 0 1 0 ... 0 0 0 0 0 0 0 0 0 0
Nick McCaffery 0 0 0 0 0 0 0 0 0 0 ... 0 0 1 1 0 0 0 0 0 1
David Perrone 0 0 0 0 0 0 0 0 0 0 ... 1 0 0 0 0 0 0 0 0 0
Masoud 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
Rony Lopes 0 0 0 0 0 0 0 0 0 0 ... 1 1 0 0 0 1 0 0 0 0
Patrick Dowd 0 0 0 0 0 0 1 0 0 0 ... 0 0 1 0 0 0 0 0 0 0
Ruifang Kuang 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

13 rows × 48 columns
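Note: in scikit-learn 1.0 and later, get_feature_names was replaced by get_feature_names_out, which can be used the same way here:

# equivalent for scikit-learn >= 1.0
text_df = pd.DataFrame(data=mat, index=text.keys(),
                       columns=counts.get_feature_names_out())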

To compute the distances we use the euclidean_distances function. To make this easy to read, we will put this in a dataframe as well.

dist_df = pd.DataFrame(data = euclidean_distances(text_df),
                       index=  text.keys(), columns= text.keys())
dist_df
Demeus Alves Ryan Booth Brianna MacDonald Jair Delgado Shawn Vincent Jacob Afonso Ryan Buquicchio Nick McCaffery David Perrone Masoud Rony Lopes Patrick Dowd Ruifang Kuang
Demeus Alves 0.000000 3.872983 2.645751 3.316625 3.000000 3.162278 4.000000 2.828427 2.449490 3.162278 3.316625 2.828427 2.645751
Ryan Booth 3.872983 0.000000 3.741657 4.242641 4.000000 3.872983 4.582576 3.872983 3.605551 4.123106 4.000000 3.605551 3.741657
Brianna MacDonald 2.645751 3.741657 0.000000 2.828427 2.449490 2.645751 3.872983 2.236068 2.236068 2.645751 2.828427 2.645751 2.000000
Jair Delgado 3.316625 4.242641 2.828427 0.000000 2.828427 3.316625 4.358899 3.000000 3.000000 3.605551 3.464102 3.316625 2.828427
Shawn Vincent 3.000000 4.000000 2.449490 2.828427 0.000000 3.000000 4.123106 2.645751 2.645751 3.316625 3.162278 3.000000 2.000000
Jacob Afonso 3.162278 3.872983 2.645751 3.316625 3.000000 0.000000 3.741657 2.828427 2.828427 3.464102 3.316625 3.162278 2.645751
Ryan Buquicchio 4.000000 4.582576 3.872983 4.358899 4.123106 3.741657 0.000000 4.000000 2.828427 4.242641 4.358899 4.000000 3.872983
Nick McCaffery 2.828427 3.872983 2.236068 3.000000 2.645751 2.828427 4.000000 0.000000 2.449490 3.162278 3.000000 2.449490 2.236068
David Perrone 2.449490 3.605551 2.236068 3.000000 2.645751 2.828427 2.828427 2.449490 0.000000 2.828427 2.645751 2.449490 2.236068
Masoud 3.162278 4.123106 2.645751 3.605551 3.316625 3.464102 4.242641 3.162278 2.828427 0.000000 3.605551 2.828427 3.000000
Rony Lopes 3.316625 4.000000 2.828427 3.464102 3.162278 3.316625 4.358899 3.000000 2.645751 3.605551 0.000000 3.316625 2.828427
Patrick Dowd 2.828427 3.605551 2.645751 3.316625 3.000000 3.162278 4.000000 2.449490 2.449490 2.828427 3.316625 0.000000 2.645751
Ruifang Kuang 2.645751 3.741657 2.000000 2.828427 2.000000 2.645751 3.872983 2.236068 2.236068 3.000000 2.828427 2.645751 0.000000

How can we find whose sentence was most similar to Masoud's?

We can select his column and take the min.

dist_df['Masoud'].min()
0.0

But this returns zero, because it includes the distance from the sentence to itself, so we drop that row of the column first.

dist_df['Masoud'].drop('Masoud')
Demeus Alves         3.162278
Ryan Booth           4.123106
Brianna MacDonald    2.645751
Jair Delgado         3.605551
Shawn Vincent        3.316625
Jacob Afonso         3.464102
Ryan Buquicchio      4.242641
Nick McCaffery       3.162278
David Perrone        2.828427
Rony Lopes           3.605551
Patrick Dowd         2.828427
Ruifang Kuang        3.000000
Name: Masoud, dtype: float64

Then min gives us the value that is the minimum.

dist_df['Masoud'].drop('Masoud').min()
2.6457513110645907

If we want to know whose sentence it was, rather than the distance value, we can use idxmin instead.

dist_df['Masoud'].drop('Masoud').idxmin()
'Brianna MacDonald'

Try it yourself

  1. Which two people wrote the most similar sentences?

  2. Using the feature space defined by the text above, what would each of the following sentences be as a vector?

  • “Thanksgiving is a short week”

  • “Rainy, windy days are cold”

  3. What word was used the most in the whole set of sentences?