Class 32: Intro to NLP¶
say hello on zoom
share a sentence on the doc linked on prismia
import numpy as np
Reviewing Confidence Intervals¶
# %load http://drsmb.co/310
def classification_confint(acc, n):
'''
Compute the 95% confidence interval for a classification problem.
acc -- classification accuracy
n -- number of observations used to compute the accuracy
Returns a tuple (lb,ub)
'''
interval = 1.96*np.sqrt(acc*(1-acc)/n)
lb = max(0, acc - interval)
ub = min(1.0, acc + interval)
return (lb,ub)
If you trained two classifiers on the same data and evaluated them on 50 test samples, getting accuracies of 78% and 90%, is the difference significant?
To check, we compute the confidence interval for each.
classification_confint(.78,50)
(0.6651767828355258, 0.8948232171644742)
classification_confint(.9,50)
(0.816844242532462, 0.983155757467538)
Then we check to see if the intervals overlap. They do, so these are not significantly different.
This means that while those accuracies seem meaningfully different, with only 50 test samples, 78% vs 90% is not statistically significantly different. We can't formally guarantee that the two classifiers have reliably different performance.
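As a quick sketch, we could also check the overlap in code instead of by eye, just by comparing the endpoints returned by classification_confint above (the variable names here are just for illustration).
# sketch: do the two intervals overlap?
lb_78, ub_78 = classification_confint(.78, 50)
lb_90, ub_90 = classification_confint(.9, 50)
lb_90 <= ub_78  # True here: the intervals overlap, so the difference is not significant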
With more test samples it could be significant; for example, with 200 samples the two intervals no longer overlap.
N =200
classification_confint(.9,N)
(0.8584221212662311, 0.941577878733769)
classification_confint(.78,N)
(0.722588391417763, 0.8374116085822371)
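As a rough sketch, assuming the accuracies stayed at 78% and 90%, we could even search for the smallest test set size where the two intervals separate, reusing classification_confint from above.
# sketch: smallest n where the 78% and 90% intervals stop overlapping
# (for these accuracies it lands somewhere between 50 and 200)
n = 50
while classification_confint(.9, n)[0] <= classification_confint(.78, n)[1]:
    n += 1
n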
Natural Language Processing¶
The first thing we need to do to model text is transform it to a numerical representation. We can't use any of the models we've seen so far, or most other models, on non-numerical data.
terms:
document: unit of text we’re analyzing (one sample)
token: sequence of characters in some particular document that are grouped together as a useful semantic unit for processing (basically a word)
stop words: words that carry little meaning on their own, so we usually don't need them (like a, the, an). Note that this is context dependent; see the stop_words sketch after the single-sentence example below. more info
Representation¶
vector or bag of words representation, implemented by the CountVectorizer
Some sample text:
# %load http://drsmb.co/310
text = {
'Demeus Alves':'Hope everybody is staying safe',
'Ryan Booth':'The power is out where I live, might be forced to leave soon',
'Brianna MacDonald':'Rainy days',
'Jair Delgado':'Can not wait for lunch... hungry',
'Shawn Vincent':'I am excited for Thanksgiving',
'Jacob Afonso':'Short weeks are the best!',
'Ryan Buquicchio':'The sentence is sentence. (Best sentence ever)',
'Nick McCaffery':'Very windy today',
'David Perrone':'this is a sentence',
'Masoud':'It is rainy here. What about there?',
'Rony Lopes':'I get to relax later this week',
'Patrick Dowd':'It is cold out today',
'Ruifang Kuang':'Happy Thanksgiving!',
}
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import euclidean_distances
Let’s try it on one:
s1 = text['Demeus Alves']
s1
'Hope everybody is staying safe'
First we initialize the object.
counts = CountVectorizer()
Then we can fit and transform at once; this builds the representation and returns the input represented that way.
counts.fit_transform([s1])
<1x5 sparse matrix of type '<class 'numpy.int64'>'
with 5 stored elements in Compressed Sparse Row format>
It tells us the size and that it's a "sparse matrix", but that doesn't display much more. To see more, we can cast it to a regular array.
counts.fit_transform([s1]).toarray()
array([[1, 1, 1, 1, 1]])
This doesn't tell us much on its own, because every word in this one sentence appears exactly once.
We can also look at the "vocabulary" (also called the "dictionary") for the representation.
counts.vocabulary_
{'hope': 1, 'everybody': 0, 'is': 2, 'staying': 4, 'safe': 3}
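As an aside, notice that the stop word "is" is kept by default. CountVectorizer takes a stop_words parameter that drops a built-in English stop word list; a minimal sketch (counts_no_stop is just an illustrative name):
# sketch: drop built-in English stop words, so "is" no longer appears in the vocabulary
counts_no_stop = CountVectorizer(stop_words='english')
counts_no_stop.fit([s1])
counts_no_stop.vocabulary_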
We can instead apply to the whole dataset.
counts.fit_transform(text.values())
<13x48 sparse matrix of type '<class 'numpy.int64'>'
with 65 stored elements in Compressed Sparse Row format>
Now there are more rows (samples/documents) and more columns (words in the vocabulary).
counts.vocabulary_
{'hope': 16,
'everybody': 9,
'is': 18,
'staying': 34,
'safe': 30,
'the': 36,
'power': 27,
'out': 26,
'where': 46,
'live': 22,
'might': 24,
'be': 3,
'forced': 12,
'to': 39,
'leave': 21,
'soon': 33,
'rainy': 28,
'days': 7,
'can': 5,
'not': 25,
'wait': 42,
'for': 11,
'lunch': 23,
'hungry': 17,
'am': 1,
'excited': 10,
'thanksgiving': 35,
'short': 32,
'weeks': 44,
'are': 2,
'best': 4,
'sentence': 31,
'ever': 8,
'very': 41,
'windy': 47,
'today': 40,
'this': 38,
'it': 19,
'here': 15,
'what': 45,
'about': 0,
'there': 37,
'get': 13,
'relax': 29,
'later': 20,
'week': 43,
'cold': 6,
'happy': 14}
We can save the transformed data to a variable
mat = counts.fit_transform(text.values()).toarray()
mat
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0],
[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1,
1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0,
0, 0, 1, 0],
[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0],
[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
0, 0, 0, 0],
[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0],
[0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
1, 0, 0, 0],
[0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0,
0, 0, 0, 1],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
0, 0, 0, 0],
[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0,
0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
0, 1, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0,
0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1,
0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0,
0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0]])
To make it easier to read, we can use a DataFrame.
import pandas as pd
The index is the keys of the dictionary of sentences. The columns are the words from the vocabulary. The get_feature_names method returns them as a sorted list instead of a dictionary mapping words to column numbers.
text_df = pd.DataFrame(data=mat, index = text.keys(), columns=counts.get_feature_names() )
text_df
about | am | are | be | best | can | cold | days | ever | everybody | ... | this | to | today | very | wait | week | weeks | what | where | windy | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Demeus Alves | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Ryan Booth | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
Brianna MacDonald | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Jair Delgado | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
Shawn Vincent | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Jacob Afonso | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
Ryan Buquicchio | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Nick McCaffery | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
David Perrone | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Masoud | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
Rony Lopes | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
Patrick Dowd | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Ruifang Kuang | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
13 rows × 48 columns
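Depending on your scikit-learn version, get_feature_names may have been replaced by get_feature_names_out, in which case the DataFrame construction would look like this sketch:
# newer scikit-learn versions: get_feature_names_out replaces get_feature_names
text_df = pd.DataFrame(data=mat, index=text.keys(),
                       columns=counts.get_feature_names_out())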
To compute the distances, we use the euclidean_distances function. To make this easy to read, we will put the result in a DataFrame as well.
dist_df = pd.DataFrame(data = euclidean_distances(text_df),
index= text.keys(), columns= text.keys())
dist_df
Demeus Alves | Ryan Booth | Brianna MacDonald | Jair Delgado | Shawn Vincent | Jacob Afonso | Ryan Buquicchio | Nick McCaffery | David Perrone | Masoud | Rony Lopes | Patrick Dowd | Ruifang Kuang | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Demeus Alves | 0.000000 | 3.872983 | 2.645751 | 3.316625 | 3.000000 | 3.162278 | 4.000000 | 2.828427 | 2.449490 | 3.162278 | 3.316625 | 2.828427 | 2.645751 |
Ryan Booth | 3.872983 | 0.000000 | 3.741657 | 4.242641 | 4.000000 | 3.872983 | 4.582576 | 3.872983 | 3.605551 | 4.123106 | 4.000000 | 3.605551 | 3.741657 |
Brianna MacDonald | 2.645751 | 3.741657 | 0.000000 | 2.828427 | 2.449490 | 2.645751 | 3.872983 | 2.236068 | 2.236068 | 2.645751 | 2.828427 | 2.645751 | 2.000000 |
Jair Delgado | 3.316625 | 4.242641 | 2.828427 | 0.000000 | 2.828427 | 3.316625 | 4.358899 | 3.000000 | 3.000000 | 3.605551 | 3.464102 | 3.316625 | 2.828427 |
Shawn Vincent | 3.000000 | 4.000000 | 2.449490 | 2.828427 | 0.000000 | 3.000000 | 4.123106 | 2.645751 | 2.645751 | 3.316625 | 3.162278 | 3.000000 | 2.000000 |
Jacob Afonso | 3.162278 | 3.872983 | 2.645751 | 3.316625 | 3.000000 | 0.000000 | 3.741657 | 2.828427 | 2.828427 | 3.464102 | 3.316625 | 3.162278 | 2.645751 |
Ryan Buquicchio | 4.000000 | 4.582576 | 3.872983 | 4.358899 | 4.123106 | 3.741657 | 0.000000 | 4.000000 | 2.828427 | 4.242641 | 4.358899 | 4.000000 | 3.872983 |
Nick McCaffery | 2.828427 | 3.872983 | 2.236068 | 3.000000 | 2.645751 | 2.828427 | 4.000000 | 0.000000 | 2.449490 | 3.162278 | 3.000000 | 2.449490 | 2.236068 |
David Perrone | 2.449490 | 3.605551 | 2.236068 | 3.000000 | 2.645751 | 2.828427 | 2.828427 | 2.449490 | 0.000000 | 2.828427 | 2.645751 | 2.449490 | 2.236068 |
Masoud | 3.162278 | 4.123106 | 2.645751 | 3.605551 | 3.316625 | 3.464102 | 4.242641 | 3.162278 | 2.828427 | 0.000000 | 3.605551 | 2.828427 | 3.000000 |
Rony Lopes | 3.316625 | 4.000000 | 2.828427 | 3.464102 | 3.162278 | 3.316625 | 4.358899 | 3.000000 | 2.645751 | 3.605551 | 0.000000 | 3.316625 | 2.828427 |
Patrick Dowd | 2.828427 | 3.605551 | 2.645751 | 3.316625 | 3.000000 | 3.162278 | 4.000000 | 2.449490 | 2.449490 | 2.828427 | 3.316625 | 0.000000 | 2.645751 |
Ruifang Kuang | 2.645751 | 3.741657 | 2.000000 | 2.828427 | 2.000000 | 2.645751 | 3.872983 | 2.236068 | 2.236068 | 3.000000 | 2.828427 | 2.645751 | 0.000000 |
How can we find whose sentence was most similar to Masoud's?
We can select his column and take the min.
dist_df['Masoud'].min()
0.0
But this returns zero, because it's the distance from Masoud's sentence to itself, so we can drop that row of the column.
dist_df['Masoud'].drop('Masoud')
Demeus Alves 3.162278
Ryan Booth 4.123106
Brianna MacDonald 2.645751
Jair Delgado 3.605551
Shawn Vincent 3.316625
Jacob Afonso 3.464102
Ryan Buquicchio 4.242641
Nick McCaffery 3.162278
David Perrone 2.828427
Rony Lopes 3.605551
Patrick Dowd 2.828427
Ruifang Kuang 3.000000
Name: Masoud, dtype: float64
Then min gives us the value that's the minimum.
dist_df['Masoud'].drop('Masoud').min()
2.6457513110645907
To get the name instead of the value, we can use idxmin.
dist_df['Masoud'].drop('Masoud').idxmin()
'Brianna MacDonald'
Try it yourself¶
Which two people wrote the most similar sentences?
Using the feature space defined by the text above, what would the following sentences be as vectors?
"Thanksgiving is a short week"
“Rainy, windy days are cold”
What word was used the most in the whole set of sentences?