36. Intro to NLP: representing text data#


36.1. Confidence Intervals review#

import numpy as np

def classification_confint(acc, n):
    """
    Compute the 95% confidence interval for a classification problem.
      acc -- classification accuracy
      n   -- number of observations used to compute the accuracy
    Returns a tuple (lb, ub)
    """
    # half-width of the interval under the normal approximation
    interval = 1.96*np.sqrt(acc*(1-acc)/n)
    lb = max(0, acc - interval)
    ub = min(1.0, acc + interval)
    return (lb, ub)
N = 50
classification_confint(.78,N) , classification_confint(.9,N)
((0.6651767828355258, 0.8948232171644742),
 (0.816844242532462, 0.983155757467538))

These intervals overlap, so with only 50 samples we cannot conclude that the two accuracies are different.

N = 200
classification_confint(.78,N) , classification_confint(.9,N)
((0.722588391417763, 0.8374116085822371),
 (0.8584221212662311, 0.941577878733769))

With more samples the intervals shrink, and at N = 200 they no longer overlap, so the two accuracies are significantly different.

36.2. Text as Data#

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import euclidean_distances
import pandas as pd
sentence_list = ['The class is just starting to feel settled for me. - Dr. Brown',
 'Hello, I like sushi! - ',
 'Why squared  aka the mask - is a computer science student.Data science is fun',
 'Hello my fellow gaymers - Sun Tzu',
 'Soccer is a sport -Obama',
 'Hello, I love pizza - Bear',
 'This class is CSC/DSP 310. - Student',
 'It is 2:21pm -',
 'Pizza conquers all- Beetlejuice',
 'ayyy whaddup wit it - frankie',
 'This is a sentence - George W Bush',
 'Steam is the best place to play videogames change my mind. - Todd Howard',
 'This is a hello -',
 'Hello how are you -',
 'The monkey likes bananas. - A banana',
 'Just type a random sentence - Rosa Parks',
 'I love CSC. - Everyone',
 'The quick brown fox jumps over the lazy dog - Brendan Chadwick',
 'I like computers - David',
 'The fitness gram pacer test is a multi aerobic capacity test - Matt 3',
 'Sally sells seashells by the seashore. - Narrator',
 'I would like to take a nap. - Tom Cruise,']

How can we analyze these? All of the machine learning models we have seen only use numerical features organized into a table with one row per sample and one column per feature.

That’s actually generally true. All ML models require numerical features at some point. The process of converting data that is not numerical and tabular, which is called unstructured, into the structured (tabular) format we require is called feature extraction. There are many, many ways to do that. We’ll see a few over the course of the rest of the semester. Some more advanced models hide the feature extraction by putting it in the same function, but it’s always there.

df = pd.DataFrame(data=[s.split('-') for s in sentence_list],
                 columns = ['sentence','attribution'])

We can make it a dataframe, but we still cannot compute statistics on it, because the values are still raw sentences.

    sentence                                           attribution
0   The class is just starting to feel settled for...  Dr. Brown
1   Hello, I like sushi!
2   Why squared aka the mask                           is a computer science student.Data science is...
3   Hello my fellow gaymers                            Sun Tzu
4   Soccer is a sport                                  Obama
5   Hello, I love pizza                                Bear
6   This class is CSC/DSP 310.                         Student
7   It is 2:21pm
8   Pizza conquers all                                 Beetlejuice
9   ayyy whaddup wit it                                frankie
10  This is a sentence                                 George W Bush
11  Steam is the best place to play videogames cha...  Todd Howard
12  This is a hello
13  Hello how are you
14  The monkey likes bananas.                          A banana
15  Just type a random sentence                        Rosa Parks
16  I love CSC.                                        Everyone
17  The quick brown fox jumps over the lazy dog        Brendan Chadwick
18  I like computers                                   David
19  The fitness gram pacer test is a multi aerobic...  Matt 3
20  Sally sells seashells by the seashore.             Narrator
21  I would like to take a nap.                        Tom Cruise,
s1 = sentence_list[4]
'Soccer is a sport -Obama'

36.3. Terms#

  • document: unit of text we’re analyzing (one sample)

  • token: sequence of characters in some particular document that are grouped together as a useful semantic unit for processing (basically a word)

  • stop words: very common words that carry little meaning on their own, so we often drop them (like a, the, an). Note that which words count as stop words is context dependent

  • dictionary: all of the possible words that a given system knows how to process
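To make these terms concrete, here is a small sketch in plain Python. The stop word set is a tiny made-up list for illustration, not a standard one, and the regular expression is just one simple way to tokenize.

```python
import re

# Hypothetical mini stop word list, for illustration only
STOP_WORDS = {'a', 'an', 'the', 'is'}

# One document (one sample)
document = 'The monkey likes bananas.'

# Tokens: lowercase the text, then pull out runs of word characters
tokens = re.findall(r'\w+', document.lower())

# Drop the stop words to keep the tokens that carry meaning
content_tokens = [t for t in tokens if t not in STOP_WORDS]

# Dictionary: every distinct word we have seen and can process
dictionary = sorted(set(tokens))

print(tokens)          # ['the', 'monkey', 'likes', 'bananas']
print(content_tokens)  # ['monkey', 'likes', 'bananas']
print(dictionary)      # ['bananas', 'likes', 'monkey', 'the']
```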

36.4. Bag of Words Representation#

We’re going to learn a representation called the bag of words. It ignores the order of the words within a document. To do this, we’ll first extract all of the tokens from the documents (tokenize them) and then count how many times each word appears. This will be our numerical representation of the data.
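Before using a library, the idea can be sketched by hand with `collections.Counter`. This is an illustration under simple assumptions, not how sklearn implements it: the `tokenize` helper is made up for this example, and unlike `CountVectorizer` with default settings it keeps one-letter words such as "a".

```python
import re
from collections import Counter

docs = ['Soccer is a sport',
        'The fitness gram pacer test is a multi aerobic capacity test']

def tokenize(text):
    # hypothetical helper: lowercase and split into runs of word characters
    return re.findall(r'\w+', text.lower())

# One bag (word -> count) per document
bags = [Counter(tokenize(d)) for d in docs]

# Shared vocabulary in alphabetical order, one column per word
vocab = sorted(set().union(*bags))

# One row per document, one count per vocabulary word
matrix = [[bag[w] for w in vocab] for bag in bags]

print(vocab)
for row in matrix:
    print(row)

# Word order is ignored: these two documents get the same bag
same_bag = Counter(tokenize('dog bites man')) == Counter(tokenize('man bites dog'))
print(same_bag)
```

The last comparison shows what the representation gives up: two sentences with different meanings can look identical once word order is discarded.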

First we initialize our transformer, then we use the fit_transform method to fit the vectorizer model and apply it to this sentence.

counts = CountVectorizer()
counts.fit_transform([s1])
<1x4 sparse matrix of type '<class 'numpy.int64'>'
	with 4 stored elements in Compressed Sparse Row format>

We see it returns a sparse matrix. A sparse matrix is one that has a lot of 0s in it, so we store only the nonzero values and where they are.

For example

mfull = np.asarray([[1,0,0,0,0],[0,0,1,0,0],[0,0,0,1,0]])

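As a sparse matrix, we could store fewer values by recording only the row, column, and value of each nonzero entry.

[[0,0,1],[1,2,1],[2,3,1]]  # the matrix above as [row, column, value] triples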
[[0, 0, 1], [1, 2, 1], [2, 3, 1]]

So for any matrix where the number of nonzero values is low enough, we can store it more efficiently by tracking the locations and values of the nonzero entries instead of all of the zeros.
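The conversion between the two forms can be sketched with NumPy alone. This illustrates the idea of storing [row, column, value] triples; it is not the compressed sparse row format that the vectorizer actually returns.

```python
import numpy as np

mfull = np.asarray([[1, 0, 0, 0, 0],
                    [0, 0, 1, 0, 0],
                    [0, 0, 0, 1, 0]])

# Locations of the nonzero entries, in row-major order
rows, cols = np.nonzero(mfull)

# Store each nonzero entry as [row, column, value]
triples = [[int(r), int(c), int(mfull[r, c])] for r, c in zip(rows, cols)]
print(triples)  # [[0, 0, 1], [1, 2, 1], [2, 3, 1]]

# Rebuild the dense matrix from the triples to confirm nothing was lost
rebuilt = np.zeros_like(mfull)
for r, c, v in triples:
    rebuilt[r, c] = v

print(np.array_equal(mfull, rebuilt))  # True
print(mfull.size, 'dense values vs', 3 * len(triples), 'stored numbers')
```

For a matrix this small the saving is modest (15 values against 9), but for a bag of words over a large vocabulary almost every entry is zero and the difference becomes large.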

To actually see the values, though, we have to convert the sparse matrix into a regular array.

counts.fit_transform([s1]).toarray()
array([[1, 1, 1, 1]])

With only one sentence it’s all ones, because the vocabulary is just the words of that sentence and each one appears once.

We can make it more interesting by picking a second sentence.

s2 = sentence_list[19]
counts.fit_transform([s1,s2]).toarray()
array([[0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0],
       [1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 2, 1]])

We can also examine attributes of the object.

counts.vocabulary_
{'soccer': 9,
 'is': 4,
 'sport': 10,
 'obama': 7,
 'the': 12,
 'fitness': 2,
 'gram': 3,
 'pacer': 8,
 'test': 11,
 'multi': 6,
 'aerobic': 0,
 'capacity': 1,
 'matt': 5}

We see that fitting creates a mapping from each word to a column index (the values give the order), and this mapping is the parameter of the model. A name ending in _ marks an attribute of the object that is learned when the model is fit.

The indices in the vocabulary_ attribute (aka the dictionary) follow the alphabetical order of the words.
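The pieces of this section can be put together in one runnable sketch. The two sentences are copied from the list above; everything else uses `CountVectorizer` with its default settings, which lowercase the text and drop one-character tokens such as "a" and "3".

```python
from sklearn.feature_extraction.text import CountVectorizer

s1 = 'Soccer is a sport -Obama'
s2 = 'The fitness gram pacer test is a multi aerobic capacity test - Matt 3'

counts = CountVectorizer()
mat = counts.fit_transform([s1, s2]).toarray()

# Sort the words by their column index to recover the column order
ordered = sorted(counts.vocabulary_, key=counts.vocabulary_.get)
print(ordered)

# The column order matches alphabetical order
print(ordered == sorted(ordered))

# 'test' appears twice in the second sentence
print(mat[1, counts.vocabulary_['test']])
```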

Now we can transform the whole dataset:

mat = counts.fit_transform(df['sentence']).toarray()
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 1, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 1, 0]])

From this we can see that the representation is the count of how many times each word appears.

We have now applied it to all of the sentences, our whole corpus. We can get the dictionary out in order using the get_feature_names_out method (it was called get_feature_names in scikit-learn releases before 1.0, and that older name was removed in 1.2). This method has a generic name, not specific to text, because it’s a property of transformers in general.

We can use a dataframe again to see this more easily. We can put labels on both the index and the column headings.

sentence_df = pd.DataFrame(mat, columns=counts.get_feature_names_out(), index=df['attribution'])
36.8. Questions After Classroom#

36.8.1. How can this be used for training a classifier?#

To train a classifier, we would also need target variables, but the mat variable we had above can be used as the X for any sklearn estimator object. To train models for more complex tasks you would need appropriate data: for example, articles labeled as real or fake to train a fake news classifier (this is provided for a12).
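
As a rough sketch of what that looks like, the example below trains a naive Bayes classifier on bag of words counts. The topic labels are invented for this illustration (the class sentences have no real labels), and `MultinomialNB` is just one reasonable choice of estimator for count features.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ['Hello, I like sushi!', 'Hello, I love pizza', 'Pizza conquers all',
        'I love CSC.', 'This class is CSC/DSP 310.', 'I like computers']
# Hypothetical target variable, made up for this example
labels = ['food', 'food', 'food', 'school', 'school', 'school']

# Feature extraction: text -> counts
counts = CountVectorizer()
X = counts.fit_transform(docs)

# Any sklearn estimator can take X; the sparse matrix works directly
clf = MultinomialNB()
clf.fit(X, labels)

# New text must go through the same fitted vectorizer (transform, not fit)
new_doc = ['I love sushi and pizza']
pred = clf.predict(counts.transform(new_doc))
print(pred[0])
```

Note that the new document is passed through `transform` only, so it is encoded with the vocabulary learned from the training documents; words the vectorizer has never seen are ignored.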