36. Intro to NLP- representing text data#

Important

Wednesday class will be on zoom, using the office hours link that you can find on the course GitHub Organization Page

If you do not see a list of links at the top, you might need to accept an invite

The link will also be sent on prismia at class time.

Wednesday office hours will be 3-4:30pm instead of 7-8:30pm

36.1. Confidence Intervals review#

import numpy as np


def classification_confint(acc, n):
    '''
    Compute the 95% confidence interval for a classification problem.
      acc -- classification accuracy
      n   -- number of observations used to compute the accuracy
    Returns a tuple (lb, ub)
    '''
    interval = 1.96*np.sqrt(acc*(1-acc)/n)
    lb = max(0, acc - interval)
    ub = min(1.0, acc + interval)
    return (lb, ub)
N = 50
classification_confint(.78,N) , classification_confint(.9,N)
((0.6651767828355258, 0.8948232171644742),
 (0.816844242532462, 0.983155757467538))

These intervals overlap, so we cannot say the two accuracies are different.

N = 200
classification_confint(.78,N) , classification_confint(.9,N)
((0.722588391417763, 0.8374116085822371),
 (0.8584221212662311, 0.941577878733769))

With more samples the intervals shrink, and they stop overlapping.
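The shrinking follows from the formula: the half-width 1.96*sqrt(acc*(1-acc)/n) scales as 1/sqrt(n), so quadrupling the number of samples halves the interval. A minimal sketch of this, restating the function so the block is self-contained; the helper `intervals_overlap` and the list of sample sizes are assumptions added for illustration:

```python
import numpy as np


def classification_confint(acc, n):
    # 95% normal-approximation interval, clipped to [0, 1]
    interval = 1.96 * np.sqrt(acc * (1 - acc) / n)
    return (max(0, acc - interval), min(1.0, acc + interval))


def intervals_overlap(a, b):
    # two intervals overlap when each one starts before the other ends
    return a[0] <= b[1] and b[0] <= a[1]


# arbitrary sample sizes, chosen only to show the trend
for n in [50, 100, 200, 400]:
    ci_a = classification_confint(.78, n)
    ci_b = classification_confint(.9, n)
    width_a = ci_a[1] - ci_a[0]
    print(n, round(width_a, 3), intervals_overlap(ci_a, ci_b))
```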

36.2. Text as Data#

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import euclidean_distances
import pandas as pd
sentence_list = ['The class is just starting to feel settled for me. - Dr. Brown',
 'Hello, I like sushi! - ',
 'Why squared  aka the mask - is a computer science student.Data science is fun',
 'Hello my fellow gaymers - Sun Tzu',
 'Soccer is a sport -Obama',
 'Hello, I love pizza - Bear',
 'This class is CSC/DSP 310. - Student',
 'It is 2:21pm -',
 'Pizza conquers all- Beetlejuice',
 'ayyy whaddup wit it - frankie',
 'This is a sentence - George W Bush',
 'Steam is the best place to play videogames change my mind. - Todd Howard',
 'This is a hello -',
 'Hello how are you -',
 'The monkey likes bananas. - A banana',
 'Just type a random sentence - Rosa Parks',
 'I love CSC. - Everyone',
 'The quick brown fox jumps over the lazy dog - Brendan Chadwick',
 'I like computers - David',
 'The fitness gram pacer test is a multi aerobic capacity test - Matt 3',
 'Sally sells seashells by the seashore. - Narrator',
 'I would like to take a nap. - Tom Cruise,']

How can we analyze these? All of the machine learning models we have seen only use numerical features organized into a table with one row per sample and one column per feature.

That’s actually generally true. All ML models require numerical features at some point. The process of converting data that is not numerical and tabular, which is called unstructured, into the structured (tabular) format we require is called feature extraction. There are many, many ways to do that. We’ll see a few over the course of the rest of the semester. Some more advanced models hide the feature extraction by putting it in the same function, but it’s always there.

df = pd.DataFrame(data=[s.split('-') for s in sentence_list],
                 columns = ['sentence','attribution'])

We can make it a dataframe, but we cannot compute statistics on it yet, because the values are still sentences (strings) rather than numbers.

df
    sentence                                            attribution
0   The class is just starting to feel settled for...   Dr. Brown
1   Hello, I like sushi!
2   Why squared aka the mask                            is a computer science student.Data science is...
3   Hello my fellow gaymers                             Sun Tzu
4   Soccer is a sport                                   Obama
5   Hello, I love pizza                                 Bear
6   This class is CSC/DSP 310.                          Student
7   It is 2:21pm
8   Pizza conquers all                                  Beetlejuice
9   ayyy whaddup wit it                                 frankie
10  This is a sentence                                  George W Bush
11  Steam is the best place to play videogames cha...   Todd Howard
12  This is a hello
13  Hello how are you
14  The monkey likes bananas.                           A banana
15  Just type a random sentence                         Rosa Parks
16  I love CSC.                                         Everyone
17  The quick brown fox jumps over the lazy dog         Brendan Chadwick
18  I like computers                                    David
19  The fitness gram pacer test is a multi aerobic...   Matt 3
20  Sally sells seashells by the seashore.              Narrator
21  I would like to take a nap.                         Tom Cruise,
s1 = sentence_list[4]
s1
'Soccer is a sport -Obama'

36.3. Terms#

  • document: unit of text we’re analyzing (one sample)

  • token: sequence of characters in some particular document that are grouped together as a useful semantic unit for processing (basically a word)

  • stop words: words that carry little meaning for the analysis, so we can drop them (like a, the, an). Note that which words count as stop words is context dependent

  • dictionary: all of the possible words that a given system knows how to process
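These terms can be made concrete with a few lines of plain Python. This is only a sketch: the stop word list is a made-up example, and the regular expression is one simple way to tokenize, not the only one.

```python
import re

# hypothetical stop word list for illustration; real lists are longer
# and depend on the task
stop_words = {'a', 'an', 'the', 'is'}

# one document (one sample)
document = 'Soccer is a sport -Obama'

# tokens: lowercase the text, then pull out runs of word characters
tokens = re.findall(r'\w+', document.lower())

# remove the stop words
content_tokens = [t for t in tokens if t not in stop_words]

# dictionary: the distinct tokens this tiny system knows how to process
dictionary = sorted(set(content_tokens))

print(tokens)
print(content_tokens)
print(dictionary)
```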

36.4. Bag of Words Representation#

We’re going to learn a representation called the bag of words. It ignores the order of the words within a document. To do this, we’ll first extract all of the tokens from (tokenize) the documents and then count how many times each word appears. This will be our numerical representation of the data.

Then we initialize our transformer and use the fit_transform method to fit the vectorizer model and apply it to this sentence.

counts = CountVectorizer()
counts.fit_transform([s1])
<1x4 sparse matrix of type '<class 'numpy.int64'>'
	with 4 stored elements in Compressed Sparse Row format>

We see it returns a sparse matrix. A sparse matrix is one that has a lot of 0s in it, so we store only the nonzero entries instead of every value.

For example

mfull = np.asarray([[1,0,0,0,0],[0,0,1,0,0],[0,0,0,1,0]])

but as a sparse matrix, we could store only the location and value of each nonzero entry.

[[0,0,1],[1,2,1],[2,3,1]]  # (row, column, value) for each nonzero entry of mfull
[[0, 0, 1], [1, 2, 1], [2, 3, 1]]

So for any matrix where the number of nonzero values is low enough, we can store it more efficiently by tracking the locations and values of the nonzero entries instead of storing all of the zeros.
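The triples can be turned back into the full matrix with scipy.sparse. This sketch uses the coordinate (COO) format because it matches the (row, column, value) triples directly; CountVectorizer itself returns the Compressed Sparse Row format, which stores the same information in a different layout.

```python
import numpy as np
from scipy.sparse import coo_matrix

mfull = np.asarray([[1,0,0,0,0],[0,0,1,0,0],[0,0,0,1,0]])

# (row, column, value) for each nonzero entry
triples = [[0, 0, 1], [1, 2, 1], [2, 3, 1]]
rows, cols, vals = zip(*triples)

# build the sparse matrix from the coordinates and values
msparse = coo_matrix((vals, (rows, cols)), shape=mfull.shape)

print(msparse.nnz)        # number of stored (nonzero) elements
print(msparse.toarray())  # expand back to a regular array
```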

To actually see the values, though, we have to convert the sparse matrix into a regular array.

counts.fit_transform([s1]).toarray()
array([[1, 1, 1, 1]])

For only one sentence it’s all ones, because the vocabulary is just the words of that sentence and each one appears exactly once.

We can make it more interesting, by picking a second sentence

s2 = sentence_list[19]
counts.fit_transform([s1,s2]).toarray()
array([[0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0],
       [1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 2, 1]])

We can also examine attributes of the object.

counts.vocabulary_
{'soccer': 9,
 'is': 4,
 'sport': 10,
 'obama': 7,
 'the': 12,
 'fitness': 2,
 'gram': 3,
 'pacer': 8,
 'test': 11,
 'multi': 6,
 'aerobic': 0,
 'capacity': 1,
 'matt': 5}

We see that it builds a mapping from each word to its column position in the count matrix (the values are the order) and stores it as a parameter of this model (a name ending in _ is an attribute of the object that is set when the model is fit).

It numbers the words in the vocabulary_ attribute (aka the dictionary) in alphabetical order.
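One way to check the ordering is to sort the words by their stored index and compare that to plain alphabetical order. A small sketch using the single sentence from above; the variable name `words_in_column_order` is just for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer

s1 = 'Soccer is a sport -Obama'
counts = CountVectorizer()
counts.fit([s1])

# sort the words by the column index stored in vocabulary_
words_in_column_order = sorted(counts.vocabulary_, key=counts.vocabulary_.get)

# sorting a dict sorts its keys, so this is plain alphabetical order
alphabetical = sorted(counts.vocabulary_)

print(words_in_column_order)
print(words_in_column_order == alphabetical)
```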

Now we can transform the whole dataset:

mat = counts.fit_transform(df['sentence']).toarray()
mat
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 1, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 1, 0]])

From this we can see that the representation is the count of how many times each word appears.
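The same counts can be reproduced by hand, which makes it clear that nothing more than counting is happening. This sketch imitates CountVectorizer's default behavior (lowercasing, and keeping only tokens of two or more word characters) with a regular expression and collections.Counter; it is an illustration, not the library's actual implementation.

```python
from collections import Counter
import re

s2 = 'The fitness gram pacer test is a multi aerobic capacity test - Matt 3'

# lowercase, then keep tokens of at least two word characters,
# so 'a' and '3' are dropped
tokens = re.findall(r'\b\w\w+\b', s2.lower())

# the bag of words: each distinct token and how many times it appears
bag = Counter(tokens)

print(bag)
```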

We have now applied it to all of the sentences, that is, our whole corpus. We can get the dictionary out in column order using the get_feature_names_out method (older versions of scikit-learn called it get_feature_names, which was removed in version 1.2). This method has a generic name, not specific to text, because it’s a property of transformers in general.

We can use a dataframe again to see this more easily. We can put labels on both the index and the column headings.

sentence_df = pd.DataFrame(mat, columns=counts.get_feature_names_out(),index= df['attribution'])
sentence_df

36.5. How can we find the most commonly used word?#

One guess

sentence_df.max()

This is the maximum number of times each word appears in a single “document”. It is also ordered alphabetically by word, not sorted by count, so it does not show the word that appears the most times overall.

To get what we want we need to sum, which by default goes down the columns, giving one total per word. Then we get the location (the word) of the largest total with idxmax.
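The difference between the two approaches is easier to see on a tiny table. The counts below are made up for illustration: 'pizza' has the largest count within a single document, but 'is' has the largest total.

```python
import pandas as pd

# hypothetical count table: rows are documents, columns are words
toy_counts = pd.DataFrame({'hello': [1, 0, 1],
                           'pizza': [0, 2, 0],
                           'is': [1, 1, 1]},
                          index=['doc0', 'doc1', 'doc2'])

per_doc_max = toy_counts.max()   # largest count of each word in any one document
totals = toy_counts.sum()        # total count of each word across all documents
most_common = totals.idxmax()    # the word with the largest total

print(per_doc_max.idxmax(), most_common)
```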

sentence_df.sum().idxmax()

36.6. Distances in text#

We can now use a distance function to calculate how far apart the different sentences are.

euclidean_distances(sentence_df)

This distance is only in terms of the actual words the sentences share. It does not capture anything about the meaning of the words.
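A small sketch of that limitation, using three made-up sentences: the first two mean nearly the same thing, but because "love" and "adore" are different tokens, they are exactly as far apart as the first and third, which are about different topics.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import euclidean_distances

# hypothetical sentences chosen for illustration
toy_sentences = ['I love pizza', 'I adore pizza', 'I love soccer']

toy_mat = CountVectorizer().fit_transform(toy_sentences)
toy_dist = euclidean_distances(toy_mat)

# each pair below differs in the same number of words, so the distances match
print(toy_dist[0, 1], toy_dist[0, 2])
```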

We can make this easier to read by making it a DataFrame.

dist_df = pd.DataFrame(data=euclidean_distances(sentence_df),index =df['attribution'],
                      columns=df['attribution'])
dist_df.head()

Who wrote the sentence most similar to mine?

dist_df[' Dr. Brown'].drop(' Dr. Brown').idxmin()

36.7. Check on Your grade:#

grade_in = pd.read_json('grade-tracker-2022-11-18.json')
skill_level = grade_in['title'].str.split('-').apply(pd.Series).rename(columns={0:'skill',1:'level'})
grade_df_tall = pd.concat([grade_in['state'],skill_level],axis=1)
grade_df_view = grade_df_tall.pivot(index='skill',columns ='level').replace({'OPEN':'','CLOSED':'achieved'})

grade_df_num = grade_df_tall.pivot(index='skill',columns ='level').replace({'OPEN':0,'CLOSED':1})

Then you can use summary statistics to get the number of achievements you have already earned at each level.
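As a sketch of that last step, the block below uses a small made-up table in place of the grade tracker file. The skill names, level names, and the `achieved` column are assumptions for illustration; only the 'title' and 'state' columns and the OPEN/CLOSED values come from the code above.

```python
import pandas as pd

# hypothetical stand-in for the grade tracker file
grade_in = pd.DataFrame({'title': ['python-level1', 'python-level2',
                                   'summarize-level1', 'summarize-level2'],
                         'state': ['CLOSED', 'OPEN', 'CLOSED', 'CLOSED']})

# split 'skill-level' titles into two columns
skill_level = grade_in['title'].str.split('-', expand=True).rename(
    columns={0: 'skill', 1: 'level'})
grade_df_tall = pd.concat([grade_in['state'], skill_level], axis=1)

# 1 if the achievement is earned (CLOSED), 0 otherwise
grade_df_tall['achieved'] = (grade_df_tall['state'] == 'CLOSED').astype(int)

# one row per skill, one column per level
grade_df_num = grade_df_tall.pivot(index='skill', columns='level', values='achieved')

# summing down each column counts the achievements earned at each level
achieved_per_level = grade_df_num.sum()
print(achieved_per_level)
```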

Important

Remember, if your grade is lower than you want right now, this is the minimum grade you can earn. Your grade can go up, and you are likely not locked out of the grade you want. Use office hours to make up level 1.

36.8. Questions After Classroom#

36.8.1. How can this be used for training a classifier?#

To train a classifier, we would also need target variables, but the mat variable we had above can be used as the X for any sklearn estimator object. To train for more complex tasks you would need appropriate data: for example, articles labeled as real or fake to train a fake news classifier (this is provided for a12).
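A minimal sketch of that workflow, assuming a tiny made-up set of labeled sentences (the texts, labels, and the choice of MultinomialNB are illustrative, not the setup for a12). The key detail is that new text must be transformed with the vectorizer that was already fit, so its columns line up with the training features.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# hypothetical labeled documents, made up for illustration
texts = ['I love pizza', 'pizza is great', 'I love sushi',
         'soccer is a sport', 'I play soccer', 'tennis is a sport']
labels = ['food', 'food', 'food', 'sports', 'sports', 'sports']

# bag of words features play the role of X
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# any sklearn estimator can be fit on X; naive Bayes is one common choice for counts
clf = MultinomialNB().fit(X, labels)

# transform (not fit_transform) new text so it uses the same vocabulary
new_X = vectorizer.transform(['pizza and sushi', 'a soccer sport'])
predictions = clf.predict(new_X)
print(predictions)
```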