# Classification of Text Data

## Next Week

Speakers:

- Monday (Zoom): [Justin White](https://www.linkedin.com/in/thejustinwhite)
- Wednesday (TBD): [Cass Wilkinson Saldana](https://datasparkri.org/our-people#:~:text=CASS%20WILKINSON%20SALDA%C3%91A%2C%20DATA%20ANALYST) from [DataSpark RI](https://datasparkri.org/)
- Friday (Zoom): [Milecia McGregor](https://www.linkedin.com/in/milecia) speaking on Deploying Models

In [1]:
from sklearn.feature_extraction import text
from sklearn.metrics.pairwise import euclidean_distances
from sklearn import datasets
import pandas as pd
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split

ng_X,ng_y = datasets.fetch_20newsgroups(categories =['comp.graphics','sci.crypt'],
                    return_X_y = True)

## News Data

Labels are the topic of the article: 0 = computer graphics, 1 = cryptography.

In [2]:
ng_y[:5]

array([0, 0, 1, 0, 0])

The X is the actual text.

In [3]:
ng_X[0]

"From: robert@cpuserver.acsc.com (Robert Grant)\nSubject: Virtual Reality for X on the CHEAP!\nOrganization: USCACSC, Los Angeles\nLines: 187\nDistribution: world\nReply-To: robert@cpuserver.acsc.com (Robert Grant)\nNNTP-Posting-Host: cpuserver.acsc.com\n\nHi everyone,\n\nI thought that some people may be interested in my VR\nsoftware on these groups:\n\n*******Announcing the release of Multiverse-1.0.2*******\n\nMultiverse is a multi-user, non-immersive, X-Windows based Virtual Reality\nsystem, primarily focused on entertainment/research.\n\nFeatures:\n\n   Client-Server based model, using Berkeley Sockets.\n   No limit to the number of users (apart from performance).\n   Generic clients.\n   Customizable servers.\n   Hierachical Objects (allowing attachment of cameras and light sources).\n   Multiple light sources (ambient, point and spot).\n   Objects can have extension code, to handle unique functionality, easily\n        attached.\n\nFunctionality:\n\n  Client:\n   The client is b

## Count Vectorization

We're going to instantiate the object and fit it to the whole dataset.

In [4]:
count_vec = text.CountVectorizer()
ng_vec = count_vec.fit_transform(ng_X)

```{important}
I changed the following a little bit from class so that both types of transformation use the same test/train split, which makes them easier to compare.

This also helps illustrate when using the fit and transform separately is helpful.
```

In [5]:
ng_X_train, ng_X_test, ng_y_train, ng_y_test = train_test_split(
                                        ng_X, ng_y, random_state=0)

Now, we can use the transformation that we fit to the whole dataset to transform the train and test portions of the data separately.

The `transform` method returns a sparse matrix, and the classifier we use below accepts sparse input directly, so we no longer need the `toarray` method.

In [6]:
ng_vec_train = count_vec.transform(ng_X_train)
ng_vec_test = count_vec.transform(ng_X_test)
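
As a small sketch of why it matters which documents the vectorizer was fit on, the toy example below transforms the same held-out document with two fitted vectorizers. The documents and names are made up for illustration. Fitting on the training documents only is added here for contrast and is not what these notes do.

```python
from sklearn.feature_extraction import text

# Hypothetical toy documents, for illustration only
all_docs = ["red green blue", "green green red", "blue yellow"]
train_docs, test_docs = all_docs[:2], all_docs[2:]

# Fit on everything, as in the notes: every word gets a column
vec_all = text.CountVectorizer().fit(all_docs)
# Fit on the training documents only: "yellow" never gets a column
vec_train = text.CountVectorizer().fit(train_docs)

test_all = vec_all.transform(test_docs).toarray()
test_train = vec_train.transform(test_docs).toarray()

print(test_all)    # four columns, including a count for "yellow"
print(test_train)  # three columns, "yellow" is silently ignored
```

Either way, `transform` reuses the vocabulary learned by `fit`, so the train and test matrices always have the same columns.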

In [7]:
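# densify the single row for pandas and label the rows with the tokens in column order
ex_sample = pd.DataFrame(ng_vec_train[0].toarray().T,
                         index = count_vec.get_feature_names_out(),columns =['count'])

# show the most frequent tokens in this document
ex_sample.sort_values(by='count',ascending=False).head(10)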

In [8]:
clf = MultinomialNB()

In [9]:
clf.fit(ng_vec_train,ng_y_train).score(ng_vec_test,ng_y_test)

0.9830508474576272

This tells us that from the word counts we are able to distinguish between computer graphics articles and cryptography articles very well.

## TF-IDF


We wanted the `TfidfVectorizer`, not the transformer, so that it accepts documents rather than features. We will again instantiate the object and then fit it on the whole dataset.

In [10]:
tfidf = text.TfidfVectorizer()

tfidf.fit(ng_X)

We can see this works, because the code runs, but for completeness, we can also check the input again to compare with the above.

In [11]:
print('\n'.join(tfidf.fit.__doc__.split('\n')[:6]))

Learn vocabulary and idf from training set.

        Parameters
        ----------
        raw_documents : iterable
            An iterable which generates either str, unicode or file objects.


This now takes documents as we wanted. Since we split the data before transforming, we can then apply the new fit using the transform on the train/test splits.

In [12]:
ng_tfidf_train = tfidf.transform(ng_X_train)
ng_tfidf_test = tfidf.transform(ng_X_test)

Since these splits were made before transforming, we can use the same targets we used above.

In [13]:
clf.fit(ng_tfidf_train,ng_y_train).score(ng_tfidf_test,ng_y_test)

0.9288135593220339

## Comparing representations

To start, we will look at one element from each in order to compare them.

In [14]:
ng_tfidf_train[0]

<1x24257 sparse matrix of type '<class 'numpy.float64'>'
	with 84 stored elements in Compressed Sparse Row format>

In [15]:
ng_vec_train[0]

<1x24257 sparse matrix of type '<class 'numpy.int64'>'
	with 84 stored elements in Compressed Sparse Row format>

To start, both have 84 stored elements; since they are two different representations of the same document, that makes sense. We can check a few others as well.

In [16]:
ng_tfidf_train[1]

<1x24257 sparse matrix of type '<class 'numpy.float64'>'
	with 202 stored elements in Compressed Sparse Row format>

In [17]:
ng_vec_train[1]

<1x24257 sparse matrix of type '<class 'numpy.int64'>'
	with 202 stored elements in Compressed Sparse Row format>

In [18]:
(ng_vec_train[4]>0).sum() == (ng_tfidf_train[4]>0).sum()

True

Let's pick out a common word so that the calculation is meaningful and do the tf-idf calculation. To find a common word, we first filter the vocabulary to keep only the words that occur more than 300 times in the training set. We sum down each column of the matrix (one total per word), convert the result to an array, then enumerate the totals so that each one is paired with its column index, and use that index to look up the word whenever its total is over 300. This is a fairly long list, so I chose to print only the first 25. We print each word with its index so we can use the index for the word we choose.

In [19]:
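import numpy as np

# tokens in column order (get_feature_names was replaced by get_feature_names_out)
feature_names = count_vec.get_feature_names_out()
[(feature_names[i],i) for i, n in
         enumerate(np.asarray(ng_vec_train.sum(axis=0))[0])
 if n>300][:25]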

Let's use the word "computer".

In [20]:
# look up the column for "computer" in the fitted vocabulary instead of hard-coding it
computer_idx = count_vec.vocabulary_['computer']
count_vec.get_feature_names_out()[computer_idx]

In [21]:
ng_vec_train[:,computer_idx].toarray()[:10].T

array([[0, 0, 0, 5, 0, 0, 0, 0, 0, 0]])

In [22]:
ng_tfidf_train[:,computer_idx].toarray()[:10].T

array([[0.        , 0.        , 0.        , 0.06907742, 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ]])

So, we can see they have nonzero elements in the same places, meaning that in both representations the column refers to the same word.
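
The same check can be sketched on a toy corpus, where it is easy to compare everything at once. The documents and variable names below are made up for illustration, and both vectorizers are assumed to use their default tokenization.

```python
import numpy as np
from sklearn.feature_extraction import text

# Hypothetical toy corpus, for illustration only
toy_docs = ["computer graphics image", "secret key computer", "image image key"]

toy_count = text.CountVectorizer().fit(toy_docs)
toy_tfidf = text.TfidfVectorizer().fit(toy_docs)

# With the same defaults, the token -> column mapping is identical
same_vocab = toy_count.vocabulary_ == toy_tfidf.vocabulary_

# Nonzero entries sit in the same positions in both matrices
count_mask = toy_count.transform(toy_docs).toarray() > 0
tfidf_mask = toy_tfidf.transform(toy_docs).toarray() > 0
same_pattern = np.array_equal(count_mask, tfidf_mask)

print(same_vocab, same_pattern)
```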

We can compare the untransformed to the count vectorizer:

In [23]:
len(ng_X_train[0].split())

100

In [24]:
ng_vec_train[0].sum()

100

We see that the row sum is the total number of words in the document, counting repeats, not the number of unique words.

In [25]:
ng_tfidf_train[0].sum()

7.558255873996122

The tf-idf matrix, however, is normalized, which makes the sums smaller. The row sums are not all the same, but they are more similar to one another.

In [26]:
import numpy as np
import seaborn as sns

# flatten the (n_documents, 1) matrix of row sums to a 1-D array for plotting
sns.displot(np.asarray(ng_vec_train.sum(axis=1)).ravel(),bins=20)

In [27]:
sns.displot(np.asarray(ng_tfidf_train.sum(axis=1)).ravel(),bins=20)

We can see that the `tf-idf` makes the totals across documents more spread out.
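
A minimal sketch of the row totals on a toy corpus, assuming the default `TfidfVectorizer` settings (which scale each row to unit Euclidean length). The documents and names are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction import text

# Hypothetical toy documents of increasing length, for illustration only
toy_docs = ["key",
            "key secret key secret image",
            "image graphics computer image key secret"]

toy_counts = text.CountVectorizer().fit_transform(toy_docs)
toy_tfidf = text.TfidfVectorizer().fit_transform(toy_docs)

# Count rows sum to the number of words in each document
count_sums = np.asarray(toy_counts.sum(axis=1)).ravel()
# tf-idf rows sum to different values, but on a much smaller scale
tfidf_sums = np.asarray(toy_tfidf.sum(axis=1)).ravel()
# The Euclidean length of every tf-idf row is 1 under the default settings
tfidf_norms = np.sqrt(np.asarray(toy_tfidf.multiply(toy_tfidf).sum(axis=1)).ravel())

print(count_sums, tfidf_sums, tfidf_norms)
```

The row sums still differ between documents because it is the Euclidean length, not the sum, that is fixed at 1.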

When we sum across words, we then get to see how

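From the documentation, we see that the idf is not simply the inverse of the document frequency; it is smoothed and put on a log scale. With $n$ the total number of documents and $df$ the number of documents that contain the term:

$$ idf = \log \frac{1 + n}{1 + df} + 1 $$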

In this implementation, each row is then normalized as well (to unit Euclidean length by default), to keep the values small, so that documents of different sizes are more comparable.  For example, if we return to what we did last week.
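
The formula and the row normalization can be checked by hand on a toy corpus. This is a sketch that assumes the default `TfidfVectorizer` settings (`smooth_idf=True`, `norm='l2'`, raw counts as the term frequency). The documents and variable names are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction import text

# Hypothetical toy corpus, for illustration only
toy_docs = ["computer graphics image", "secret key computer", "image image key"]

counts = text.CountVectorizer().fit_transform(toy_docs).toarray()
n_docs = counts.shape[0]

# Number of documents that contain each term
df = (counts > 0).sum(axis=0)
# Smoothed idf from the formula above (natural log)
idf = np.log((1 + n_docs) / (1 + df)) + 1
# Term count times idf, then scale each row to unit Euclidean length
raw = counts * idf
manual = raw / np.linalg.norm(raw, axis=1, keepdims=True)

# Compare with what sklearn computes
sk = text.TfidfVectorizer().fit_transform(toy_docs).toarray()
print(np.allclose(manual, sk))
```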


## Questions After Class

### What are unique tokens vs total vocabulary?

The total vocabulary is the set of unique tokens across all documents. The number of values stored in the whole matrix is the sum of the number of unique tokens in each document. In that number, any word that appears in more than one document is counted more than once.
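
A small sketch of the difference, using a made-up corpus (the documents and names are illustrative assumptions):

```python
from sklearn.feature_extraction import text

# Hypothetical toy corpus, for illustration only
toy_docs = ["the cat sat", "the cat sat on the mat", "dogs chase cats"]

toy_vec = text.CountVectorizer()
toy_counts = toy_vec.fit_transform(toy_docs)

# Unique tokens across all documents: the number of columns
vocab_size = len(toy_vec.vocabulary_)
# Unique tokens within each document: stored values per row
unique_per_doc = toy_counts.getnnz(axis=1)
# Values stored in the whole sparse matrix
stored_values = toy_counts.nnz

print(vocab_size, unique_per_doc, stored_values)
```

Here "the", "cat", and "sat" each appear in two documents, so they are counted twice in `stored_values` but only once in `vocab_size`.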

### Does it only take into account if the word appears or not appear, or does the number of times the word appears matter? (Multinomial Naïve Bayes)


It takes the counts.
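
As a rough illustration, the sketch below fits `MultinomialNB` on made-up count features and compares two documents that contain the same two words with different counts. The data and names are hypothetical.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical counts of two made-up words (columns) in four documents (rows)
X_toy = np.array([[3, 1], [4, 0], [1, 3], [0, 4]])
y_toy = np.array([0, 0, 1, 1])

nb = MultinomialNB().fit(X_toy, y_toy)

# Both documents contain the same two words; only the counts differ
p_once = nb.predict_proba([[1, 1]])[0]
p_many = nb.predict_proba([[5, 1]])[0]

print(p_once)  # evenly split between the two classes
print(p_many)  # strongly favors class 0
```

If only presence or absence mattered, the two documents would get the same prediction. Repeating the first word pushes the prediction toward the class where that word is more common.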