23. Intro to NLP - representing text data#
from sklearn.feature_extraction import text
from sklearn.metrics.pairwise import euclidean_distances
from sklearn import datasets
import pandas as pd
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
ng_X, ng_y = datasets.fetch_20newsgroups(categories=['comp.graphics', 'sci.crypt'],
                                          return_X_y=True)
23.1. Text as Data#
Let’s make a small dataset of sentences.
sentences = [
'The semester is almost over. - Professor Brown',
'The weather is getting much warmer outside. - Ben',
'Dr. Pepper is the best soda ever - Ebrahima',
'Only a few more weeks before graduation. -Gabe',
'Hello friends. -Zach',
'Mr. Owl ate my metal worm. -Sath',
'I wonder what is for dinner? - Vinai',
'Hello everyone- jay ',
]
How can we analyze these? All of the machine learning models we have seen only use numerical features organized into a table with one row per sample and one column per feature.
That’s actually generally true. All ML models require numerical features at some point. The process of taking data that is not numerical and tabular, which is called unstructured, and getting it into the structured (tabular) format we require is called feature extraction. There are many, many ways to do that; we’ll see a few over the course of the rest of the semester. Some more advanced models hide the feature extraction by putting it in the same function, but it’s always there.
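As a tiny illustration (not a feature we will actually use), even just counting characters turns these unstructured sentences into a one-column table; n_chars is a made-up name here.

# a minimal, made-up feature extraction: one numeric feature per sentence
pd.DataFrame({'n_chars': [len(s) for s in sentences]})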
sentences[0]
'The semester is almost over. - Professor Brown'
s1 = sentences[0]
23.2. Terms#
document: unit of text we’re analyzing (one sample)
token: sequence of characters in some particular document that are grouped together as a useful semantic unit for processing (basically a word)
stop words: words that carry little meaning on their own, so we usually don’t need them (like a, the, an). Note that this is context dependent (see the sketch after this list)
dictionary: all of the possible words that a given system knows how to process
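For example, here is a small sketch of what dropping stop words looks like with scikit-learn's built-in English stop word list (the stop_words='english' option of CountVectorizer). The variable name is just for illustration; we do not remove stop words in the rest of this section.

# a sketch: drop scikit-learn's built-in English stop words while counting
demo_stop = text.CountVectorizer(stop_words='english')
demo_stop.fit_transform(['the semester is almost over'])
demo_stop.vocabulary_  # stop words like 'the' and 'is' are gone from the dictionary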
23.3. Bag of Words Representation#
We’re going to learn a representation called the bag of words. It ignores the order of the words within a document. To do this, we’ll first tokenize the documents (extract all of the tokens) and then count how many times each word appears. This will be our numerical representation of the data.
Then we initialize our transformer and use the fit_transform method to fit the vectorizer model and apply it to this sentence.
counts = text.CountVectorizer()
counts.fit_transform([s1])
<1x7 sparse matrix of type '<class 'numpy.int64'>'
with 7 stored elements in Compressed Sparse Row format>
We see it returns a sparse matrix. A sparse matrix is one that has a lot of 0s or missing values in it, so we only store the values that are actually there. If we were to store all of the 0s or NAs, we would need the same amount of memory for a sparse matrix as for a full one.
Consider a matrix that’s 10x10; that means 100 total values to store if we store it dense. If only a small number of values are not 0, say 5%, storing all of that seems like a lot. As a sparse matrix, we can instead store each nonzero value along with its location. So 5% is 5 nonzero values, but we store 3 numbers for each (row, column, value), for 15 total. 15 is still a lot less than 100. So, roughly, as long as less than 1/3 of the values are nonzero, the sparse format is an advantage.
import numpy as np
mfull = np.asarray([[1,0,0,0,0],[0,0,1,0,0],[0,0,0,1,0]])
But as a sparse matrix, we could store fewer values.
[[0,0,1],[1,2,1],[2,3,1]] # the matrix above as (row, column, value) triples
[[0, 0, 1], [1, 2, 1], [2, 3, 1]]
Text data will often be sparse.
So, for any matrix where the number of nonzero values is low enough, we can store it more efficiently by tracking the locations and values instead of all of the zeros.
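Here is a small sketch of that trade-off using scipy (which scikit-learn uses for its sparse matrices under the hood, so it is already installed); the matrix and the variable names are made up for illustration.

import numpy as np
from scipy import sparse

m_dense = np.zeros((10, 10))           # 100 values stored densely
m_dense[0, 0] = m_dense[3, 7] = m_dense[9, 2] = 1
m_sparse = sparse.csr_matrix(m_dense)  # stores only the 3 nonzero values and their locations
m_sparse.nnz                           # number of stored (nonzero) values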
To actually see the vectorizer’s output, though, we have to cast it out of the sparse format into a regular (dense) array.
counts.fit_transform([s1]).toarray()
array([[1, 1, 1, 1, 1, 1, 1]])
We can also examine attributes of the object.
counts.vocabulary_
{'the': 6,
'semester': 5,
'is': 2,
'almost': 0,
'over': 3,
'professor': 4,
'brown': 1}
We see that what it does is create an ordered list of words (the values are the order) as the parameters of this model (a name ending in _ is an attribute of the object, or a parameter of the model). It puts the words in the vocabulary_ attribute (aka the dictionary) in alphabetical order.
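If we want to go the other direction, from column number to word, we can invert that dictionary (a small sketch; reversed_vocab is our own name).

# map column index -> word by flipping the keys and values of vocabulary_
reversed_vocab = {col: word for word, col in counts.vocabulary_.items()}
reversed_vocab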
23.4. Reformatting data#
Now we can transform the whole dataset, but first we will process it a little.
sentences_split = [sent_attr.split('-') for sent_attr in sentences]
sentences_split
[['The semester is almost over. ', ' Professor Brown'],
['The weather is getting much warmer outside. ', ' Ben'],
['Dr. Pepper is the best soda ever ', ' Ebrahima'],
['Only a few more weeks before graduation. ', 'Gabe'],
['Hello friends. ', 'Zach'],
['Mr. Owl ate my metal worm. ', 'Sath'],
['I wonder what is for dinner? ', ' Vinai'],
['Hello everyone', ' jay ']]
text_dict = {attr.strip():sentence.strip() for sentence,attr in sentences_split}
text_dict
{'Professor Brown': 'The semester is almost over.',
'Ben': 'The weather is getting much warmer outside.',
'Ebrahima': 'Dr. Pepper is the best soda ever',
'Gabe': 'Only a few more weeks before graduation.',
'Zach': 'Hello friends.',
'Sath': 'Mr. Owl ate my metal worm.',
'Vinai': 'I wonder what is for dinner?',
'jay': 'Hello everyone'}
And now we transform.
mat = counts.fit_transform(text_dict.values()).toarray()
mat
array([[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1,
0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0],
[0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0],
[0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0,
0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0,
0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
[0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0],
[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
From this we can see that the representation is the count of how many times each word appears.
Now we can apply it to all of the sentences, or our whole corpus. We can get the dictionary out in order using the get_feature_names_out
method. This method has a generic name, not specific to text, because it’s a property of transformers in general.
counts.get_feature_names_out()
array(['almost', 'ate', 'before', 'best', 'dinner', 'dr', 'ever',
'everyone', 'few', 'for', 'friends', 'getting', 'graduation',
'hello', 'is', 'metal', 'more', 'mr', 'much', 'my', 'only',
'outside', 'over', 'owl', 'pepper', 'semester', 'soda', 'the',
'warmer', 'weather', 'weeks', 'what', 'wonder', 'worm'],
dtype=object)
We can use a dataframe again to see this more easily. We can put labels on both the index and the column headings.
sentence_df = pd.DataFrame(data = mat, columns = counts.get_feature_names_out(),
index = text_dict.keys())
sentence_df
|  | almost | ate | before | best | dinner | dr | ever | everyone | few | for | ... | pepper | semester | soda | the | warmer | weather | weeks | what | wonder | worm |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Professor Brown | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| Ben | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
| Ebrahima | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | ... | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| Gabe | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| Zach | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Sath | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| Vinai | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| jay | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

8 rows × 34 columns
23.5. Basic text analysis#
We can find the most common word. One guess:
sentence_df.max()
almost 1
ate 1
before 1
best 1
dinner 1
dr 1
ever 1
everyone 1
few 1
for 1
friends 1
getting 1
graduation 1
hello 1
is 1
metal 1
more 1
mr 1
much 1
my 1
only 1
outside 1
over 1
owl 1
pepper 1
semester 1
soda 1
the 1
warmer 1
weather 1
weeks 1
what 1
wonder 1
worm 1
dtype: int64
This is the maximum number of times each word appears in a single “document”, and it’s not sorted, it’s alphabetical. It does not show us which word appears the most times overall.
To get what we want we need to sum, which by default is along the columns, or per word.
sentence_df.sum(axis=0).sort_values(ascending=False)
is 4
the 3
hello 2
almost 1
pepper 1
only 1
outside 1
over 1
owl 1
semester 1
much 1
soda 1
warmer 1
weather 1
weeks 1
what 1
wonder 1
my 1
mr 1
ate 1
more 1
metal 1
graduation 1
getting 1
friends 1
for 1
few 1
everyone 1
ever 1
dr 1
dinner 1
best 1
before 1
worm 1
dtype: int64
Then we get the location of the max with idxmax.
sentence_df.sum().idxmax()
'is'
What is the total number of unique words across all sentences?
_, n_unique = sentence_df.shape
n_unique
34
This is also the size of the dictionary.
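We can confirm that it matches the size of the fitted dictionary directly.

# the number of columns is the same as the number of words in the dictionary
len(counts.vocabulary_)
34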
We can get the total number of words in all of the sentences by summing twice.
sentence_df.sum().sum()
40
Whose sentence had the most words in it?
sentence_df.sum(axis=1)
Professor Brown 5
Ben 7
Ebrahima 7
Gabe 6
Zach 2
Sath 6
Vinai 5
jay 2
dtype: int64
Summing across the rows shows how many words are in each sentence. (Note that the default tokenizer only keeps tokens of two or more characters, so single-character words like “a” and “I” are not counted.)
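To answer the question directly, we can take idxmax along that row-wise sum as well (a small sketch; in case of a tie, idxmax returns whichever writer appears first in the index).

# writer of the sentence with the most counted words (first one in case of a tie)
sentence_df.sum(axis=1).idxmax()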
23.6. Distances in text#
We can now use a distance function to calculate how far apart the different sentences are.
dists = euclidean_distances(sentence_df)
dists
array([[0. , 2.82842712, 2.82842712, 3.31662479, 2.64575131,
3.31662479, 2.82842712, 2.64575131],
[2.82842712, 0. , 3.16227766, 3.60555128, 3. ,
3.60555128, 3.16227766, 3. ],
[2.82842712, 3.16227766, 0. , 3.60555128, 3. ,
3.60555128, 3.16227766, 3. ],
[3.31662479, 3.60555128, 3.60555128, 0. , 2.82842712,
3.46410162, 3.31662479, 2.82842712],
[2.64575131, 3. , 3. , 2.82842712, 0. ,
2.82842712, 2.64575131, 1.41421356],
[3.31662479, 3.60555128, 3.60555128, 3.46410162, 2.82842712,
0. , 3.31662479, 2.82842712],
[2.82842712, 3.16227766, 3.16227766, 3.31662479, 2.64575131,
3.31662479, 0. , 2.64575131],
[2.64575131, 3. , 3. , 2.82842712, 1.41421356,
2.82842712, 2.64575131, 0. ]])
This distance is only in terms of actual reused words. It does not capture anything about the meaning of the words.
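To see what that means, here is a small made-up example (the sentences and the variable names are ours, purely for illustration): two sentences that mean nearly the same thing but share few words end up farther apart than two that reuse words while meaning opposite things.

# made-up sentences: bag-of-words distance tracks word overlap, not meaning
demo_counts = text.CountVectorizer()
demo_mat = demo_counts.fit_transform(['the movie was great',
                                      'the film was excellent',
                                      'the movie was terrible']).toarray()
euclidean_distances(demo_mat)

The first and second sentences are near-paraphrases but share only “the” and “was”, so they are farther apart than the first and third, which disagree in meaning but reuse “the movie was”.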
We can make the distance matrix for our sentences easier to read by making it a DataFrame.
dist_df = pd.DataFrame(data=dists, index= text_dict.keys(),
columns=text_dict.keys())
dist_df
|  | Professor Brown | Ben | Ebrahima | Gabe | Zach | Sath | Vinai | jay |
|---|---|---|---|---|---|---|---|---|
| Professor Brown | 0.000000 | 2.828427 | 2.828427 | 3.316625 | 2.645751 | 3.316625 | 2.828427 | 2.645751 |
| Ben | 2.828427 | 0.000000 | 3.162278 | 3.605551 | 3.000000 | 3.605551 | 3.162278 | 3.000000 |
| Ebrahima | 2.828427 | 3.162278 | 0.000000 | 3.605551 | 3.000000 | 3.605551 | 3.162278 | 3.000000 |
| Gabe | 3.316625 | 3.605551 | 3.605551 | 0.000000 | 2.828427 | 3.464102 | 3.316625 | 2.828427 |
| Zach | 2.645751 | 3.000000 | 3.000000 | 2.828427 | 0.000000 | 2.828427 | 2.645751 | 1.414214 |
| Sath | 3.316625 | 3.605551 | 3.605551 | 3.464102 | 2.828427 | 0.000000 | 3.316625 | 2.828427 |
| Vinai | 2.828427 | 3.162278 | 3.162278 | 3.316625 | 2.645751 | 3.316625 | 0.000000 | 2.645751 |
| jay | 2.645751 | 3.000000 | 3.000000 | 2.828427 | 1.414214 | 2.828427 | 2.645751 | 0.000000 |
Who wrote the sentence most similar to mine? Which two were most similar to one another?
text_dict
{'Professor Brown': 'The semester is almost over.',
'Ben': 'The weather is getting much warmer outside.',
'Ebrahima': 'Dr. Pepper is the best soda ever',
'Gabe': 'Only a few more weeks before graduation.',
'Zach': 'Hello friends.',
'Sath': 'Mr. Owl ate my metal worm.',
'Vinai': 'I wonder what is for dinner?',
'jay': 'Hello everyone'}
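One way to answer those questions is to blank out the diagonal of the distance matrix (each sentence compared with itself is 0) and look for the smallest remaining distance. This is just one sketch of how to do it; off_diag and the other names are our own.

import numpy as np

# ignore the 0s on the diagonal, then find the closest pair of different writers
off_diag = dist_df.mask(np.eye(len(dist_df), dtype=bool))
closest_col = off_diag.min().idxmin()         # one member of the closest pair
closest_row = off_diag[closest_col].idxmin()  # the other member
closest_row, closest_col

From the table above, the closest pair is Zach and jay (“Hello friends.” and “Hello everyone”), at a distance of about 1.41.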