30. Intro to NLP - representing text data#

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import euclidean_distances
import pandas as pd
# %load http://drsmb.co/310read

# add an entry to the following dictionary with a name as the key and a sentence as the value
# share a sentence about how you're doing this week
# remember this will be python code, don't use
# You can remain anonymous (this page & the notes will be fully public)
# by attributing it to a celebrity or pseudonym, but include *some* sort of attribution
sentence_dict = {
'Professor Brown':"I'm excited for Thanksgiving.",
'Matt Langton':"I'm doing pretty good, I'll be taking the days off to catch up on various classwork.",
'Evan':"I'm just here so my grade doesn't get fined",
'Greg Bassett':"I'm doing well, my birthday is today. I'm looking forward to seeing my family this Thursday, I haven't seen a lot of them in a long time.",
'Noah N':"I'm doing well! I can't wait to take opportuity of this long weekend to catch up on various HW's, projects, etc.",
'Tuyetlinh':"I'm struggling to get all my work done before break, but I'm excited to have that time off when I'm all done.",
'Kenza Bouabdallah':"I am doing good. How are you ?",
'Chris Kerfoot':"I'm doing pretty good. I'm happy to have some days off this week because of Thanksgiving!",
'Kang Liu': "New week, new start",
'Aiden Hill':"I am very much enjoying this class.",
'Muhammad S':"I am doing pretty well. I am looking forward to taking a few days off.",
'Max Mastrorocco':"Cannot wait for a break.",
'Daniel':"I am doing well. I am ready and excited for break!",
'Nate':"I'm just vibing right now, ready for break ",
'Jacob':"I am going to eat Turkey.",
'Anon':"nom nom nom"
}

How can we analyze these? All of the machine learning models we have seen only use numerical features organized into a table with one row per sample and one column per feature.

That’s actually generally true. All ML models require numerical features at some point. The process of taking data that is not numerical and tabular (called unstructured) and putting it into the structured (tabular) format we require is called feature extraction. There are many, many ways to do that; we’ll see a few over the rest of the semester. Some more advanced models hide the feature extraction by putting it in the same function, but it’s always there.
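
As a toy illustration of feature extraction (not the representation we will actually use), we could hand-craft a couple of numeric features per sentence, such as its length in characters and its number of whitespace-separated words. The column names here (n_characters, n_words) are our own invention for this sketch:

# hand-crafted numeric features: one row per document, one column per feature
toy_features = pd.DataFrame({
    'n_characters': [len(s) for s in sentence_dict.values()],
    'n_words': [len(s.split()) for s in sentence_dict.values()],
}, index=sentence_dict.keys())
toy_features.head()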

30.1. Terms#

  • document: unit of text we’re analyzing (one sample)

  • token: sequence of characters in some particular document that are grouped together as a useful semantic unit for processing (basically a word)

  • stop words: words that carry little meaning for our analysis, so we drop them (like a, the, an). Note that this is context dependent

  • dictionary: all of the possible words that a given system knows how to process
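
As a rough illustration of these terms, we can tokenize one of the documents above and drop some stop words. The tokenization here is crude and the stop word list is made up for this sketch; CountVectorizer below handles both of these for us:

doc = sentence_dict['Matt Langton']   # one document
tokens = doc.lower().replace(',', '').replace('.', '').split()   # a crude tokenization
my_stop_words = {'a', 'an', 'the', 'to', 'on', 'be'}   # a made-up stop word list
[t for t in tokens if t not in my_stop_words]   # the tokens we would keep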

We’ll start by taking out one sentence and analyzing that:

s1 = sentence_dict['Professor Brown']

s1
"I'm excited for Thanksgiving."

30.2. Bag of Words Representation#

We’re going to learn a representation called the bag of words. It ignores the order of the words within a document. To do this, we’ll first extract all of the tokens from (tokenize) the documents and then count how many times each word appears. This will be our numerical representation of the data.
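
Before using scikit-learn, here is a minimal sketch of the counting step with Python’s built-in Counter (with a crude tokenization, just to show the idea; CountVectorizer tokenizes more carefully):

from collections import Counter

doc = sentence_dict['Professor Brown'].lower().replace('.', '').replace(',', '')
Counter(doc.split())   # word -> count; the order of the words in the sentence is lost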

Then we initialize our transformer

counts = CountVectorizer()

We can use the fit_transform method to fit the vectorizer model and apply it to this sentence.

counts.fit_transform([s1])
<1x3 sparse matrix of type '<class 'numpy.int64'>'
	with 3 stored elements in Compressed Sparse Row format>

To see the output better, we use the toarray method.

counts.fit_transform([s1]).toarray()
array([[1, 1, 1]])

We can also examine attributes of the object.

counts.vocabulary_
{'excited': 0, 'for': 1, 'thanksgiving': 2}

We see that it creates an ordered list of words (the values are the positions in that order) as the parameters of this model (a name ending in _ is an attribute of the object, or a parameter of the fit model).

Try it yourself

What other model parameters have we seen? How have we used model parameters in the past?

To see what happens a bit more, let’s add a second sentence.

s2 = sentence_dict['Kang Liu']
s2
'New week, new start'
counts.fit_transform([s1,s2])
counts.vocabulary_
{'excited': 0, 'for': 1, 'thanksgiving': 4, 'new': 2, 'week': 5, 'start': 3}

Now we can see that it puts the words in the vocabulary_ attribute (aka the dictionary) in alphabetical order.
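
We can confirm this: sorting the vocabulary alphabetically gives the words in the same order as their assigned indices, 0 through 5:

sorted(counts.vocabulary_.keys())   # alphabetical order matches the assigned indices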

counts.fit_transform([s1,s2]).toarray()
array([[1, 1, 0, 0, 1, 0],
       [0, 0, 2, 1, 0, 1]])

From this we can see that the representation is the count of how many times each word appears.

Now we can apply it to all of the sentences, or our whole corpus:

mat = counts.fit_transform(sentence_dict.values()).toarray()
mat
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 1, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

We can get the dictionary out in order using the get_feature_names method. This method has a generic name, not specific to text, because it’s a property of transformers in general.

counts.get_feature_names()
/opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/sklearn/utils/deprecation.py:87: FutureWarning: Function get_feature_names is deprecated; get_feature_names is deprecated in 1.0 and will be removed in 1.2. Please use get_feature_names_out instead.
  warnings.warn(msg, category=FutureWarning)
['all',
 'am',
 'and',
 'are',
 'be',
 'because',
 'before',
 'birthday',
 'break',
 'but',
 'can',
 'cannot',
 'catch',
 'class',
 'classwork',
 'days',
 'doesn',
 'doing',
 'done',
 'eat',
 'enjoying',
 'etc',
 'excited',
 'family',
 'few',
 'fined',
 'for',
 'forward',
 'get',
 'going',
 'good',
 'grade',
 'happy',
 'have',
 'haven',
 'here',
 'how',
 'hw',
 'in',
 'is',
 'just',
 'll',
 'long',
 'looking',
 'lot',
 'much',
 'my',
 'new',
 'nom',
 'now',
 'of',
 'off',
 'on',
 'opportuity',
 'pretty',
 'projects',
 'ready',
 'right',
 'seeing',
 'seen',
 'so',
 'some',
 'start',
 'struggling',
 'take',
 'taking',
 'thanksgiving',
 'that',
 'the',
 'them',
 'this',
 'thursday',
 'time',
 'to',
 'today',
 'turkey',
 'up',
 'various',
 'very',
 'vibing',
 'wait',
 'week',
 'weekend',
 'well',
 'when',
 'work',
 'you']
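
As the warning says, get_feature_names is deprecated; on scikit-learn 1.0 or newer the equivalent call would be:

counts.get_feature_names_out()   # the replacement method in scikit-learn 1.0 and later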

We can use a dataframe again to see this more easily. We can put labels on both the index and the column headings.

sentence_df = pd.DataFrame(data = mat, columns =counts.get_feature_names(),
                          index=sentence_dict.keys())

sentence_df
all am and are be because before birthday break but ... various very vibing wait week weekend well when work you
Professor Brown 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
Matt Langton 0 0 0 0 1 0 0 0 0 0 ... 1 0 0 0 0 0 0 0 0 0
Evan 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
Greg Bassett 0 0 0 0 0 0 0 1 0 0 ... 0 0 0 0 0 0 1 0 0 0
Noah N 0 0 0 0 0 0 0 0 0 0 ... 1 0 0 1 0 1 1 0 0 0
Tuyetlinh 2 0 0 0 0 0 1 0 1 1 ... 0 0 0 0 0 0 0 1 1 0
Kenza Bouabdallah 0 1 0 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
Chris Kerfoot 0 0 0 0 0 1 0 0 0 0 ... 0 0 0 0 1 0 0 0 0 0
Kang Liu 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 1 0 0 0 0 0
Aiden Hill 0 1 0 0 0 0 0 0 0 0 ... 0 1 0 0 0 0 0 0 0 0
Muhammad S 0 2 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 1 0 0 0
Max Mastrorocco 0 0 0 0 0 0 0 0 1 0 ... 0 0 0 1 0 0 0 0 0 0
Daniel 0 2 1 0 0 0 0 0 1 0 ... 0 0 0 0 0 0 1 0 0 0
Nate 0 0 0 0 0 0 0 0 1 0 ... 0 0 1 0 0 0 0 0 0 0
Jacob 0 1 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
Anon 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

16 rows × 87 columns

30.3. How can we find the most commonly used word?#

One guess

sentence_df.max()
all        2
am         2
and        1
are        1
be         1
          ..
weekend    1
well       1
when       1
work       1
you        1
Length: 87, dtype: int64

This is the maximum number of times each word appears in a single “document”, but it’s also not sorted, it’s alphabetical.

After sorting, the last entry is the word that appears the most times in a single document:

sentence_df.max().sort_values()
looking    1
start      1
some       1
so         1
seen       1
          ..
my         2
done       2
am         2
all        2
nom        3
Length: 87, dtype: int64

To get what we want we need to sum; by default, sum adds down each column, giving one total per word.

sentence_df.sum()
all        2
am         7
and        1
are        1
be         1
          ..
weekend    1
well       4
when       1
work       1
you        1
Length: 87, dtype: int64
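
If we instead wanted one total per document (how many counted words each sentence contains), we could sum along the other axis; a quick sketch:

sentence_df.sum(axis=1)   # one total per document (row) instead of per word (column)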

The per-word sums are again unsorted, but we can apply max:

sentence_df.sum().max
<bound method NDFrame._add_numeric_operations.<locals>.max of all        2
am         7
and        1
are        1
be         1
          ..
weekend    1
well       4
when       1
work       1
you        1
Length: 87, dtype: int64>

Here we forgot the parentheses, so we got the method itself back instead of calling it; calling sentence_df.sum().max() would give only the maximum value, though, and we want the word. When we summed we got back a Series with the words in the index:

sentence_df.sum().index
Index(['all', 'am', 'and', 'are', 'be', 'because', 'before', 'birthday',
       'break', 'but', 'can', 'cannot', 'catch', 'class', 'classwork', 'days',
       'doesn', 'doing', 'done', 'eat', 'enjoying', 'etc', 'excited', 'family',
       'few', 'fined', 'for', 'forward', 'get', 'going', 'good', 'grade',
       'happy', 'have', 'haven', 'here', 'how', 'hw', 'in', 'is', 'just', 'll',
       'long', 'looking', 'lot', 'much', 'my', 'new', 'nom', 'now', 'of',
       'off', 'on', 'opportuity', 'pretty', 'projects', 'ready', 'right',
       'seeing', 'seen', 'so', 'some', 'start', 'struggling', 'take', 'taking',
       'thanksgiving', 'that', 'the', 'them', 'this', 'thursday', 'time', 'to',
       'today', 'turkey', 'up', 'various', 'very', 'vibing', 'wait', 'week',
       'weekend', 'well', 'when', 'work', 'you'],
      dtype='object')

So we can use idxmax:

sentence_df.sum().idxmax()
'to'
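
Another way to see the most common words, along with their counts, is to sort the sums; a quick sketch:

sentence_df.sum().sort_values(ascending=False).head()   # most common words with their totals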

30.4. Distances in text#

We can now use a distance function to calculate how far apart the different sentences are.
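
For intuition, the Euclidean distance between two documents here is just the distance between their count vectors: the square root of the sum of squared differences in the word counts. A minimal by-hand check with numpy, using two names from our dictionary:

import numpy as np

a = sentence_df.loc['Professor Brown'].values
b = sentence_df.loc['Kang Liu'].values
np.sqrt(np.sum((a - b) ** 2))   # should match the corresponding entry of the matrix below

Applying the scikit-learn function to the whole DataFrame computes this for every pair of documents at once: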

euclidean_distances(sentence_df)
array([[0.        , 4.24264069, 3.31662479, 5.19615242, 4.89897949,
        5.09901951, 3.        , 3.87298335, 3.        , 3.        ,
        4.12310563, 2.23606798, 3.16227766, 2.82842712, 2.82842712,
        3.46410162],
       [4.24264069, 0.        , 4.79583152, 5.91607978, 4.69041576,
        5.83095189, 4.12310563, 4.12310563, 4.58257569, 4.58257569,
        4.12310563, 4.35889894, 4.89897949, 4.69041576, 4.24264069,
        4.89897949],
       [3.31662479, 4.79583152, 0.        , 5.29150262, 5.38516481,
        5.38516481, 3.74165739, 4.69041576, 3.74165739, 3.74165739,
        4.69041576, 3.46410162, 4.35889894, 3.60555128, 3.60555128,
        4.12310563],
       [5.19615242, 5.91607978, 5.29150262, 0.        , 5.56776436,
        6.244998  , 5.29150262, 5.47722558, 5.47722558, 5.29150262,
        5.29150262, 5.29150262, 5.56776436, 5.56776436, 5.19615242,
        5.74456265],
       [4.89897949, 4.69041576, 5.38516481, 5.56776436, 0.        ,
        6.164414  , 5.        , 5.        , 5.19615242, 5.        ,
        5.19615242, 4.79583152, 5.29150262, 5.29150262, 4.69041576,
        5.47722558],
       [5.09901951, 5.83095189, 5.38516481, 6.244998  , 6.164414  ,
        0.        , 5.56776436, 5.56776436, 5.56776436, 5.56776436,
        5.74456265, 5.19615242, 5.65685425, 5.47722558, 5.09901951,
        5.83095189],
       [3.        , 4.12310563, 3.74165739, 5.29150262, 5.        ,
        5.56776436, 0.        , 4.        , 3.46410162, 3.16227766,
        3.74165739, 3.16227766, 3.31662479, 3.60555128, 3.        ,
        3.87298335],
       [3.87298335, 4.12310563, 4.69041576, 5.47722558, 5.        ,
        5.56776436, 4.        , 0.        , 4.24264069, 4.24264069,
        4.24264069, 4.24264069, 4.79583152, 4.58257569, 4.12310563,
        4.79583152],
       [3.        , 4.58257569, 3.74165739, 5.47722558, 5.19615242,
        5.56776436, 3.46410162, 4.24264069, 0.        , 3.46410162,
        4.47213595, 3.16227766, 4.12310563, 3.60555128, 3.31662479,
        3.87298335],
       [3.        , 4.58257569, 3.74165739, 5.29150262, 5.        ,
        5.56776436, 3.16227766, 4.24264069, 3.46410162, 0.        ,
        4.        , 3.16227766, 3.60555128, 3.60555128, 3.        ,
        3.87298335],
       [4.12310563, 4.12310563, 4.69041576, 5.29150262, 5.19615242,
        5.74456265, 3.74165739, 4.24264069, 4.47213595, 4.        ,
        0.        , 4.24264069, 3.60555128, 4.58257569, 3.60555128,
        4.79583152],
       [2.23606798, 4.35889894, 3.46410162, 5.29150262, 4.79583152,
        5.19615242, 3.16227766, 4.24264069, 3.16227766, 3.16227766,
        4.24264069, 0.        , 3.31662479, 2.64575131, 3.        ,
        3.60555128],
       [3.16227766, 4.89897949, 4.35889894, 5.56776436, 5.29150262,
        5.65685425, 3.31662479, 4.79583152, 4.12310563, 3.60555128,
        3.60555128, 3.31662479, 0.        , 3.46410162, 3.46410162,
        4.47213595],
       [2.82842712, 4.69041576, 3.60555128, 5.56776436, 5.29150262,
        5.47722558, 3.60555128, 4.58257569, 3.60555128, 3.60555128,
        4.58257569, 2.64575131, 3.46410162, 0.        , 3.46410162,
        4.        ],
       [2.82842712, 4.24264069, 3.60555128, 5.19615242, 4.69041576,
        5.09901951, 3.        , 4.12310563, 3.31662479, 3.        ,
        3.60555128, 3.        , 3.46410162, 3.46410162, 0.        ,
        3.74165739],
       [3.46410162, 4.89897949, 4.12310563, 5.74456265, 5.47722558,
        5.83095189, 3.87298335, 4.79583152, 3.87298335, 3.87298335,
        4.79583152, 3.60555128, 4.47213595, 4.        , 3.74165739,
        0.        ]])

We can make this easier to read by making it a DataFrame.

dist_df = pd.DataFrame(data = euclidean_distances(sentence_df),
            index= sentence_dict.keys(), columns= sentence_dict.keys())
dist_df
Professor Brown Matt Langton Evan Greg Bassett Noah N Tuyetlinh Kenza Bouabdallah Chris Kerfoot Kang Liu Aiden Hill Muhammad S Max Mastrorocco Daniel Nate Jacob Anon
Professor Brown 0.000000 4.242641 3.316625 5.196152 4.898979 5.099020 3.000000 3.872983 3.000000 3.000000 4.123106 2.236068 3.162278 2.828427 2.828427 3.464102
Matt Langton 4.242641 0.000000 4.795832 5.916080 4.690416 5.830952 4.123106 4.123106 4.582576 4.582576 4.123106 4.358899 4.898979 4.690416 4.242641 4.898979
Evan 3.316625 4.795832 0.000000 5.291503 5.385165 5.385165 3.741657 4.690416 3.741657 3.741657 4.690416 3.464102 4.358899 3.605551 3.605551 4.123106
Greg Bassett 5.196152 5.916080 5.291503 0.000000 5.567764 6.244998 5.291503 5.477226 5.477226 5.291503 5.291503 5.291503 5.567764 5.567764 5.196152 5.744563
Noah N 4.898979 4.690416 5.385165 5.567764 0.000000 6.164414 5.000000 5.000000 5.196152 5.000000 5.196152 4.795832 5.291503 5.291503 4.690416 5.477226
Tuyetlinh 5.099020 5.830952 5.385165 6.244998 6.164414 0.000000 5.567764 5.567764 5.567764 5.567764 5.744563 5.196152 5.656854 5.477226 5.099020 5.830952
Kenza Bouabdallah 3.000000 4.123106 3.741657 5.291503 5.000000 5.567764 0.000000 4.000000 3.464102 3.162278 3.741657 3.162278 3.316625 3.605551 3.000000 3.872983
Chris Kerfoot 3.872983 4.123106 4.690416 5.477226 5.000000 5.567764 4.000000 0.000000 4.242641 4.242641 4.242641 4.242641 4.795832 4.582576 4.123106 4.795832
Kang Liu 3.000000 4.582576 3.741657 5.477226 5.196152 5.567764 3.464102 4.242641 0.000000 3.464102 4.472136 3.162278 4.123106 3.605551 3.316625 3.872983
Aiden Hill 3.000000 4.582576 3.741657 5.291503 5.000000 5.567764 3.162278 4.242641 3.464102 0.000000 4.000000 3.162278 3.605551 3.605551 3.000000 3.872983
Muhammad S 4.123106 4.123106 4.690416 5.291503 5.196152 5.744563 3.741657 4.242641 4.472136 4.000000 0.000000 4.242641 3.605551 4.582576 3.605551 4.795832
Max Mastrorocco 2.236068 4.358899 3.464102 5.291503 4.795832 5.196152 3.162278 4.242641 3.162278 3.162278 4.242641 0.000000 3.316625 2.645751 3.000000 3.605551
Daniel 3.162278 4.898979 4.358899 5.567764 5.291503 5.656854 3.316625 4.795832 4.123106 3.605551 3.605551 3.316625 0.000000 3.464102 3.464102 4.472136
Nate 2.828427 4.690416 3.605551 5.567764 5.291503 5.477226 3.605551 4.582576 3.605551 3.605551 4.582576 2.645751 3.464102 0.000000 3.464102 4.000000
Jacob 2.828427 4.242641 3.605551 5.196152 4.690416 5.099020 3.000000 4.123106 3.316625 3.000000 3.605551 3.000000 3.464102 3.464102 0.000000 3.741657
Anon 3.464102 4.898979 4.123106 5.744563 5.477226 5.830952 3.872983 4.795832 3.872983 3.872983 4.795832 3.605551 4.472136 4.000000 3.741657 0.000000

Who wrote the most similar sentence to mine?

dist_df['Professor Brown'].drop('Professor Brown').idxmin()
'Max Mastrorocco'

30.5. Questions After Class#

30.5.1. How can this be used for training a classifier?#

To train a classifier, we would also need target variables, but the mat variable we had above can be used as the X for any sklearn estimator object. For more complex tasks you would need appropriate data: for example, articles labeled as real or fake to train a fake news classifier (this is provided for a12).
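
For example, here is a minimal sketch of that workflow with made-up labels, purely to show that the bag of words matrix plugs in as X (the labels and the prediction sentence are illustrative only, not the a12 data):

from sklearn.naive_bayes import MultinomialNB

# made-up target values, one per sentence, only to illustrate that the shapes line up
y = [0, 1] * 8   # 16 labels to match the 16 rows of mat

clf = MultinomialNB()
clf.fit(mat, y)   # mat (the bag of words counts) serves as the X
clf.predict(counts.transform(["I cannot wait for break"]).toarray())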

30.5.2. How are ram tokens tracked?#

Ram Tokens are tracked in the Ram Token Bank: http://drsmb.co/ramtoken form. You’ll get e-mails when you earn or use them; however, no one has submitted for any yet. You still can, though (especially if you were advised to and forgot).

30.6. More Practice#

  1. Which two people wrote the most similar sentences?

  2. Do you think this representation captures all cases of similarity? Can you generate a case where it doesn’t do well?

  3. Try