23. Intro to NLP - representing text data#

from sklearn.feature_extraction import text
from sklearn.metrics.pairwise import euclidean_distances
from sklearn import datasets
import pandas as pd
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split

ng_X, ng_y = datasets.fetch_20newsgroups(categories=['comp.graphics', 'sci.crypt'],
                                         return_X_y=True)
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

All of the machine learning models we have seen so far only use numerical features organized into a table with one row per sample and one column per feature.

That’s actually true in general: all ML models require numerical features at some point. The process of taking data that is not numerical and tabular, which is called unstructured data, and converting it into the structured (tabular) format we require is called feature extraction. There are many, many ways to do that; we’ll see a few over the course of the rest of the semester. Some more advanced models hide the feature extraction by putting it in the same function, but it is always there.
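To see what unstructured data looks like, we can peek at the newsgroups posts we loaded above; each one is just a raw string of text with no features yet (the 100-character slice here is only to keep the output short):

type(ng_X), type(ng_X[0]), ng_X[0][:100]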

23.1. Terms#

  • document: unit of text we’re analyzing (one sample)

  • token: sequence of characters in some particular document that are grouped together as a useful semantic unit for processing (basically a word)

  • stop words: words that carry little meaning for the task, so we drop them (like a, the, an). Note that which words count as stop words is context dependent (see the example after this list)

  • dictionary: all of the possible words that a given system knows how to process
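As a small illustration of dropping stop words, scikit-learn's CountVectorizer (which we use below) can remove a built-in list of English stop words; the short sentence here is made up just to show the effect:

stop_demo = CountVectorizer(stop_words='english')   # use the built-in English stop word list
stop_demo.fit_transform(['the cat sat on the mat'])
stop_demo.get_feature_names_out()   # 'the' and 'on' are dropped, leaving 'cat', 'mat', 'sat'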

23.2. Bag of Words Representation#

We’re going to learn a representation called the bag of words. It ignores the order of the words within a document. To do this, we’ll first extract all of the tokens (tokenize the documents) and then count how many times each word appears. This will be our numerical representation of the data.

First we define a sentence to work with and initialize our transformer, the vectorizer.

sentence = 'I walked a dog. I had fun'
counts = CountVectorizer()

Then we use the fit_transform method to fit the vectorizer model and apply it to this sentence:

counts.fit_transform([sentence])
<1x4 sparse matrix of type '<class 'numpy.int64'>'
	with 4 stored elements in Compressed Sparse Row format>

We see it returns a sparse matrix. A sparse matrix is one that has a lot of 0s in it, so instead of storing every entry we only store the nonzero values and where they are.

For example

mfull = np.asarray([[1,0,0,0,0],[0,0,1,0,0],[0,0,0,1,0]])

but as a sparse matrix, we could store fewer values.

[[0,0,1],[1,2,1],[2,3,1]]  # (row, column, value) for each nonzero entry of mfull above
[[0, 0, 1], [1, 2, 1], [2, 3, 1]]

So for any matrix where the number of nonzero values is low enough, we can store it more efficiently by tracking their locations and values instead of storing all of the zeros.
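For instance, scipy (which scikit-learn already uses for its sparse output) can convert mfull to a sparse coordinate format that stores exactly these (row, column, value) triples; this is just a quick sketch to show the idea:

from scipy import sparse
m_sparse = sparse.coo_matrix(mfull)   # coordinate (COO) format keeps only the nonzero entries
list(zip(m_sparse.row, m_sparse.col, m_sparse.data))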

To actually see the counts, though, we have to cast out of the sparse format into a regular array.

counts.fit_transform([sentence]).toarray()
array([[1, 1, 1, 1]])

For only one sentence it’s all ones, because each word in its small vocabulary occurs exactly once (the single-character tokens 'I' and 'a' are dropped by the default tokenizer).

We can look at the vocabulary the vectorizer learned from that one sentence; the vocabulary_ attribute maps each token to its column index.

counts.vocabulary_
{'walked': 3, 'dog': 0, 'had': 2, 'fun': 1}
We can make it more interesting by fitting on a list of several sentences:

sentence_list = [sentence, 'This is a sentence', 'I have a dog']
mat = counts.fit_transform(sentence_list).toarray()
mat
array([[1, 1, 1, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 1, 1, 1, 0],
       [1, 0, 0, 1, 0, 0, 0, 0]])

We can also aggregate this matrix, for example to count how many times each word occurs in total and how many (counted) words each document contains.

total_occurences_per_word = mat.sum(axis=0)
total_words_per_document = mat.sum(axis=1)
total_occurences_per_word, total_words_per_document
(array([2, 1, 1, 1, 1, 1, 1, 1]), array([4, 3, 2]))

We can also get the names out as an array instead of as a dictionary:

counts.get_feature_names_out()
array(['dog', 'fun', 'had', 'have', 'is', 'sentence', 'this', 'walked'],
      dtype=object)

and this makes it easier to put the counts in a DataFrame so that we can see the representation more clearly:

sentence_df = pd.DataFrame(data=mat, columns =counts.get_feature_names_out())
sentence_df
   dog  fun  had  have  is  sentence  this  walked
0    1    1    1     0   0         0     0       1
1    0    0    0     0   1         1     1       0
2    1    0    0     1   0         0     0       0

From this we can see that the representation is the count of how many times each word appears.
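As a rough sanity check, we can count one word in the first sentence by hand (simple substring counting, so only a rough check) and compare it to the corresponding DataFrame entry:

sentence_list[0].count('dog'), sentence_df.loc[0, 'dog']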

Now we can apply the same process to a whole corpus, like the newsgroups posts we loaded above. We can get the dictionary out in order using the get_feature_names_out method. This method has a generic name, not specific to text, because it is shared by scikit-learn transformers in general.
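A minimal sketch of that next step, reusing the ng_X documents loaded at the top (the ng_counts and ng_vec names here are just illustrative):

ng_counts = CountVectorizer()
ng_vec = ng_counts.fit_transform(ng_X)   # sparse document-term matrix, one row per post
ng_vec.shape, len(ng_counts.get_feature_names_out())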