21. Representing Text#

from sklearn.feature_extraction import text
from sklearn.metrics.pairwise import euclidean_distances
from sklearn import datasets
import pandas as pd
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split

ng_X,ng_y = datasets.fetch_20newsgroups(categories =['comp.graphics','sci.crypt'],
                                       return_X_y = True)

21.1. Text as Data#

Let’s make a variable with a sentence in it:

s1 = 'I like cheese'

How can we analyze text like this? All of the machine learning models we have seen so far only use numerical features organized into a table with one row per sample and one column per feature.

That’s generally true: all ML models require numerical features at some point. The process of converting data that is not numerical and tabular, which is called unstructured, into the structured (tabular) format we require is called feature extraction. There are many, many ways to do that; we’ll see a few over the rest of the semester. Some more advanced models hide the feature extraction by folding it into the same function, but it’s always there.

21.2. Terms#

  • document: unit of text we’re analyzing (one sample)

  • token: sequence of characters in some particular document that are grouped together as a useful semantic unit for processing (basically a word)

  • stop words: words with little standalone meaning that we usually drop (like a, the, an). Note that which words count as stop words is context dependent (see the sketch after this list)

  • dictionary: all of the possible words that a given system knows how to process
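For example, jumping ahead to the CountVectorizer we introduce below, here is a small sketch of removing stop words with sklearn’s built-in English list:

# CountVectorizer can drop common English stop words for us
stop_demo = text.CountVectorizer(stop_words='english')
stop_demo.fit_transform(['the cat and the dog'])
stop_demo.get_feature_names_out()  # only 'cat' and 'dog' remain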

21.3. Bag of Words Representation#

We’re going to learn a representation called bag of words. It ignores the order of the words within a document. To do this, we’ll first extract all of the tokens from (tokenize) the documents and then count how many times each word appears. This will be our numerical representation of the data.

Then we initialize our transformer and use the fit_transform method to fit the vectorizer model and apply it to this sentence.

counts = text.CountVectorizer()
counts.fit_transform([s1])
<1x2 sparse matrix of type '<class 'numpy.int64'>'
	with 2 stored elements in Compressed Sparse Row format>

We see it returns a sparse matrix. A sparse matrix is one with a lot of 0s (or missing values) in it, so we only store the nonzero values and their locations. If we were to store all of the 0s or NAs explicitly, a sparse matrix would use the same amount of memory as a full one.

21.3.1. Sparse Matrices#

Consider a matrix that’s 10x10; that means 100 total values to store if we store it dense. If only a small number of values are not 0, say 5%, storing all 100 seems wasteful. As a sparse matrix, we can instead store each nonzero value along with its location. 5% of 100 is 5 values, and we store 3 numbers for each (value, column, row), for 15 numbers total. 15 is still a lot less than 100. So as long as fewer than 1/3 of the values are nonzero, the sparse format is an advantage.

import numpy as np

mfull = np.asarray([[1,0,0,0,0],[0,0,1,0,0],[0,0,0,1,0]])

but as a sparse matrix, we could store fewer values.

[[0,0,1],[1,2,1],[2,3,1]]  # (row, column, value) triples for the matrix above
[[0, 0, 1], [1, 2, 1], [2, 3, 1]]
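To make this concrete, here is a small sketch using scipy.sparse (an aside; these notes do not otherwise use scipy directly) that stores the mfull matrix from above in coordinate (COO) format:

from scipy import sparse

m_coo = sparse.coo_matrix(mfull)
# COO keeps three parallel arrays: one (row, column, value) triple
# per nonzero entry, matching the list above
print(m_coo.row)   # [0 1 2]
print(m_coo.col)   # [0 2 3]
print(m_coo.data)  # [1 1 1]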

Text data will often be sparse.

So for any matrix where the number of nonzero values is low enough, we can store it more efficiently by tracking the locations and values instead of all of the zeros.

To actually see the contents, though, we have to cast the sparse matrix to a regular (dense) array.

21.3.2. Viewing it fully#

counts.fit_transform([s1]).toarray()
array([[1, 1]])

Since it’s only one sentence, it’s all 1s, because each word occurs once.

We can also see that it creates an ordered list of words (the order determines which column is which) as the parameters of this model:

counts.get_feature_names_out()
array(['cheese', 'like'], dtype=object)

If we add more words, the vocabulary size increases.

s1 = 'I like cheese very much; '
counts.fit_transform([s1]).toarray()
array([[1, 1, 1, 1]])

Again all ones, but more of them. Note that ‘I’ does not appear: the default tokenizer only keeps tokens of two or more characters. Here is the new vocabulary:

counts.get_feature_names_out()
array(['cheese', 'like', 'much', 'very'], dtype=object)
s1 = 'I like cheese very much; I cannot wait to eat mac n cheese on Thursday'
counts.fit_transform([s1]).toarray()
array([[1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
counts.get_feature_names_out()
array(['cannot', 'cheese', 'eat', 'like', 'mac', 'much', 'on', 'thursday',
       'to', 'very', 'wait'], dtype=object)

Since cheese was in the document twice, there is a 2 now, in the ‘cheese’ column. Also notice that the vectorizer lowercases everything: ‘Thursday’ became ‘thursday’.

To make it more interesting, we need two documents, so we’ll add another sentence.

s2 = 'I go to URI and my favorite class is CSC310'

When we transform both together

counts.fit_transform([s1,s2]).toarray()
array([[0, 1, 2, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1],
       [1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0]])

we get the counts of each word per document, and some words appear in both documents.

counts.get_feature_names_out()
array(['and', 'cannot', 'cheese', 'class', 'csc310', 'eat', 'favorite',
       'go', 'is', 'like', 'mac', 'much', 'my', 'on', 'thursday', 'to',
       'uri', 'very', 'wait'], dtype=object)

We can put all of that together into a dataframe to make it easier to interpret and analyze further.

vocabulary = counts.get_feature_names_out()
count_mat = counts.fit_transform([s1,s2]).toarray()
count_df = pd.DataFrame(data = count_mat, columns =vocabulary)
count_df
   and  cannot  cheese  class  csc310  eat  favorite  go  is  like  mac  much  my  on  thursday  to  uri  very  wait
0    0       1       2      0       0    1         0   0   0     1    1     1   0   1         1   1    0     1     1
1    1       0       0      1       1    0         1   1   1     0    0     0   1   0         0   1    1     0     0

From this we can see that the representation is the count of how many times each word appears.

21.4. Basic text analysis#

We can use this to get the total number of times each word appears across all of the documents:

count_df.sum()
and         1
cannot      1
cheese      2
class       1
csc310      1
eat         1
favorite    1
go          1
is          1
like        1
mac         1
much        1
my          1
on          1
thursday    1
to          2
uri         1
very        1
wait        1
dtype: int64

or the total number of words in each document:

count_df.sum(axis=1)
0    12
1     9
dtype: int64
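As a small extension on the same count_df (a sketch; the variable names are from above), we can ask which word is most frequent overall:

totals = count_df.sum()
# 'cheese' and 'to' are tied at 2; idxmax returns the first
totals.idxmax(), totals.max()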

21.5. Representing text manually#

As an exercise, we can build the bag of words representation by hand. Given the vocabulary:

vocabulary = ['and', 'are', 'cat', 'cats', 'dogs', 'pets', 'popular', 'videos']

the sentence “cats and dogs are pets” becomes:

vector = [1,1,0,1,1,1,0,0]

and “cat videos are popular” becomes:

vector = [0,1,1,0,0,0,1,1]
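We can check the manual vectors against sklearn: CountVectorizer accepts a fixed vocabulary, so this sketch should reproduce the vectors above:

manual_counts = text.CountVectorizer(vocabulary=vocabulary)
manual_counts.fit_transform(['cats and dogs are pets',
                             'cat videos are popular']).toarray()
# expected: [[1, 1, 0, 1, 1, 1, 0, 0],
#            [0, 1, 1, 0, 0, 0, 1, 1]]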

21.6. Distances in text#

We can now use a distance function to calculate how far apart the different sentences are.

dists = euclidean_distances(count_df)
dists
array([[0.        , 4.58257569],
       [4.58257569, 0.        ]])

This distance is only in terms of actually reused words. It does not capture anything about the meaning of the words.
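For example (a hypothetical pair of sentences), two documents with similar meaning but no shared words end up far apart:

meaning_docs = ['the movie was great', 'that film is excellent']
meaning_X = text.CountVectorizer().fit_transform(meaning_docs).toarray()
# no tokens are shared, so all 8 word counts differ:
# the distance is sqrt(8), about 2.83, even though the meanings are similar
euclidean_distances(meaning_X)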

21.7. Classifying Text Data#

We can classify text data.

We imported this dataset from sklearn at the top of these notes.

Labels are the topic of the article: 0 = computer graphics, 1 = cryptography.
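A quick peek at the raw data (a sketch; the indices are arbitrary):

print(ng_X[0][:200])   # first 200 characters of the first post
print(ng_y[:10])       # the numeric topic labels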

First, we transform the features. We need to transform before we split into train and test because, if we transformed them separately, the two sets could end up with different vocabularies.
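Here is a small sketch, with made-up sentences, of the problem with fitting separate vectorizers:

v1 = text.CountVectorizer().fit(['cats and dogs'])
v2 = text.CountVectorizer().fit(['dogs are pets'])
# the vocabularies do not line up, so the columns of the two
# transformed matrices would mean different things
print(v1.get_feature_names_out())   # ['and' 'cats' 'dogs']
print(v2.get_feature_names_out())   # ['are' 'dogs' 'pets']

Fitting a single vectorizer on all of the documents avoids this: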

count_vec = text.CountVectorizer()
ng_vec = count_vec.fit_transform(ng_X)

We can then instantiate the classifier. This is Naive Bayes, like the Gaussian Naive Bayes we saw before.

It assumes each feature is a count, and that the features are independent given the class.

clf = MultinomialNB()

Now we split to train and test.

ng_vec_train, ng_vec_test, ng_y_train, ng_y_test = train_test_split(ng_vec,ng_y)
clf.fit(ng_vec_train,ng_y_train).score(ng_vec_test,ng_y_test)
0.9661016949152542
We can also inspect the training data itself:

ng_vec_train
<884x24257 sparse matrix of type '<class 'numpy.int64'>'
	with 145752 stored elements in Compressed Sparse Row format>
euclidean_distances(ng_vec_train).mean()
46.68188089992422
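Once fit, the vectorizer and classifier can be applied to new documents. Here is a sketch with made-up sentences; note that we call transform, not fit_transform, so the vocabulary stays fixed:

new_docs = ['the rendering engine draws shaded polygons',
            'the protocol encrypts each message with a public key']
clf.predict(count_vec.transform(new_docs))   # labels: 0 = comp.graphics, 1 = sci.crypt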

21.8. Questions after class#

21.8.1. How similar is this method to the methods that companies such as Twitter use for sentiment analysis?#

Sentiment analysis is done by labeling documents as positive or negative and training a model to predict that.

You could do sentiment analysis with a bag of words representation.
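A minimal sketch of that idea, with a tiny made-up dataset just to show the shape of the workflow:

sent_docs = ['I love this product', 'terrible, would not buy again',
             'great quality and fast shipping', 'awful experience, very bad']
sent_labels = [1, 0, 1, 0]   # 1 = positive, 0 = negative

sent_vec = text.CountVectorizer()
sent_clf = MultinomialNB().fit(sent_vec.fit_transform(sent_docs), sent_labels)
sent_clf.predict(sent_vec.transform(['fast shipping, great product']))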