37. Classification of Text Data#

37.1. Next Week#

Speakers:

from sklearn.feature_extraction import text
from sklearn.metrics.pairwise import euclidean_distances
from sklearn import datasets
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split

ng_X,ng_y = datasets.fetch_20newsgroups(categories =['comp.graphics','sci.crypt'],
                    return_X_y = True)

37.2. News Data#

Labels are the topic of the article: 0 = computer graphics, 1 = cryptography.

ng_y[:5]
array([0, 0, 1, 0, 0])

The X is the actual text.

ng_X[0]
"From: robert@cpuserver.acsc.com (Robert Grant)\nSubject: Virtual Reality for X on the CHEAP!\nOrganization: USCACSC, Los Angeles\nLines: 187\nDistribution: world\nReply-To: robert@cpuserver.acsc.com (Robert Grant)\nNNTP-Posting-Host: cpuserver.acsc.com\n\nHi everyone,\n\nI thought that some people may be interested in my VR\nsoftware on these groups:\n\n*******Announcing the release of Multiverse-1.0.2*******\n\nMultiverse is a multi-user, non-immersive, X-Windows based Virtual Reality\nsystem, primarily focused on entertainment/research.\n\nFeatures:\n\n   Client-Server based model, using Berkeley Sockets.\n   No limit to the number of users (apart from performance).\n   Generic clients.\n   Customizable servers.\n   Hierachical Objects (allowing attachment of cameras and light sources).\n   Multiple light sources (ambient, point and spot).\n   Objects can have extension code, to handle unique functionality, easily\n        attached.\n\nFunctionality:\n\n  Client:\n   The client is built around a 'fast' render loop. Basically it changes things\n   when told to by the server and then renders an image from the user's\n   viewpoint. It also provides the server with information about the user's\n   actions - which can then be communicated to other clients and therefore to\n   other users.\n\n   The client is designed to be generic - in other words you don't need to\n   develop a new client when you want to enter a new world. This means that\n   resources can be spent on enhancing the client software rather than adapting\n   it. The adaptations, as will be explained in a moment, occur in the servers.\n\n   This release of the client software supports the following functionality:\n\n    o Hierarchical Objects (with associated addressing)\n\n    o Multiple Light Sources and Types (Ambient, Point and Spot)\n\n    o User Interface Panels\n\n    o Colour Polygonal Rendering with Phong Shading (optional wireframe for\n\tfaster frame rates)\n\n    o Mouse and Keyboard Input\n\n   (Some people may be disappointed that this software doesn't support the\n   PowerGlove as an input device - this is not because it can't, but because\n   I don't have one! This will, however, be one of the first enhancements!)\n\n  Server(s):\n   This is where customization can take place. The following basic support is\n   provided in this release for potential world server developers:\n\n    o Transparent Client Management\n\n    o Client Message Handling\n\n   This may not sound like much, but it takes away the headache of\naccepting and\n   terminating clients and receiving messages from them - the\napplication writer\n   can work with the assumption that things are happening locally.\n\n   Things get more interesting in the object extension functionality. This is\n   what is provided to allow you to animate your objects:\n\n    o Server Selectable Extension Installation:\n        What this means is that you can decide which objects have extended\n        functionality in your world. 
Basically you call the extension\n        initialisers you want.\n\n    o Event Handler Registration:\n        When you develop extensions for an object you basically write callback\n        functions for the events that you want the object to respond to.\n        (Current events supported: INIT, MOVE, CHANGE, COLLIDE & TERMINATE)\n\n    o Collision Detection Registration:\n        If you want your object to respond to collision events just provide\n        some basic information to the collision detection management software.\n        Your callback will be activated when a collision occurs.\n\n    This software is kept separate from the worldServer applications because\n    the application developer wants to build a library of extended objects\n    from which to choose.\n\n    The following is all you need to make a World Server application:\n\n    o Provide an initWorld function:\n        This is where you choose what object extensions will be supported, plus\n        any initialization you want to do.\n\n    o Provide a positionObject function:\n        This is where you determine where to place a new client.\n\n    o Provide an installWorldObjects function:\n        This is where you load the world (.wld) file for a new client.\n\n    o Provide a getWorldType function:\n        This is where you tell a new client what persona they should have.\n\n    o Provide an animateWorld function:\n        This is where you can go wild! At a minimum you should let the objects\n        move (by calling a move function) and let the server sleep for a bit\n        (to avoid outrunning the clients).\n\n    That's all there is to it! And to prove it here are the line counts for the\n    three world servers I've provided:\n\n        generic - 81 lines\n        dactyl - 270 lines (more complicated collision detection due to the\n                           stairs! Will probably be improved with future\n                           versions)\n        dogfight - 72 lines\n\nLocation:\n\n   This software is located at the following site:\n   ftp.u.washington.edu\n\n   Directory:\n   pub/virtual-worlds\n\n   File:\n   multiverse-1.0.2.tar.Z\n\nFutures:\n\n   Client:\n\n    o Texture mapping.\n\n    o More realistic rendering: i.e. Z-Buffering (or similar), Gouraud shading\n\n    o HMD support.\n\n    o Etc, etc....\n\n   Server:\n\n    o Physical Modelling (gravity, friction etc).\n\n    o Enhanced Object Management/Interaction\n\n    o Etc, etc....\n\n   Both:\n\n    o Improved Comms!!!\n\nI hope this provides people with a good understanding of the Multiverse\nsoftware,\nunfortunately it comes with practically zero documentation, and I'm not sure\nwhether that will ever be able to be rectified! :-(\n\nI hope people enjoy this software and that it is useful in our explorations of\nthe Virtual Universe - I've certainly found fascinating developing it, and I\nwould *LOVE* to add support for the PowerGlove...and an HMD :-)!!\n\nFinally one major disclaimer:\n\nThis is totally amateur code. By that I mean there is no support for this code\nother than what I, out the kindness of my heart, or you, out of pure\ndesperation, provide. I cannot be held responsible for anything good or bad\nthat may happen through the use of this code - USE IT AT YOUR OWN RISK!\n\nDisclaimer over!\n\nOf course if you love it, I would like to here from you. And anyone with\nPOSITIVE contributions/criticisms is also encouraged to contact me. 
Anyone who\nhates it: > /dev/null!\n\n************************************************************************\n*********\nAnd if anyone wants to let me do this for a living: you know where to\nwrite :-)!\n************************************************************************\n*********\n\nThanks,\n\nRobert.\n\nrobert@acsc.com\n^^^^^^^^^^^^^^^\n"

37.3. Count Vectorization#

We’re going to instantiate the object and fit it to the whole dataset.

count_vec = text.CountVectorizer()
ng_vec = count_vec.fit_transform(ng_X)

Important

I changed the following a little bit from class so that we can use the same test/train split for the two different types of transformation so that we can compare them more easily.

This also helps illustrate when using the fit and transform separately is helpful.

ng_X_train, ng_X_test, ng_y_train, ng_y_test = train_test_split(
                                        ng_X, ng_y, random_state=0)

Now, we can use the transformation that we fit to the whole dataset to transform the train and test portions of the data separately.

The transform method also returns the sparse matrix directly so we no longer need the toarray method.

ng_vec_train = count_vec.transform(ng_X_train)
ng_vec_test = count_vec.transform(ng_X_test)
ex_sample = pd.DataFrame(ng_vec_train[0].toarray().T,
                         index=count_vec.get_feature_names_out(),
                         columns=['count'])

ex_sample.sort_values(by='count', ascending=False).head(10)
clf = MultinomialNB()
clf.fit(ng_vec_train,ng_y_train).score(ng_vec_test,ng_y_test)
0.9830508474576272

This tells us that from the word counts alone we are able to distinguish between computer graphics articles and cryptography articles very well.
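To see where the few mistakes are, we can also look at a confusion matrix. This is a sketch using sklearn.metrics.confusion_matrix with the classifier and splits from above.

from sklearn.metrics import confusion_matrix

# rows are true labels (0 = comp.graphics, 1 = sci.crypt), columns are predictions
confusion_matrix(ng_y_test, clf.predict(ng_vec_test))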

37.4. TF-IDF#

We want the TfidfVectorizer, not the TfidfTransformer, because the vectorizer accepts raw documents rather than count features. We will again instantiate the object and then fit it on the whole dataset.
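As a sketch of the distinction (with default settings), a TfidfVectorizer is equivalent to a CountVectorizer followed by a TfidfTransformer, so the transformer version would need the count features we built above rather than the raw documents:

from sklearn.feature_extraction.text import TfidfTransformer

# the transformer takes count features (the output of CountVectorizer),
# while the vectorizer takes the raw documents directly
tfidf_from_counts = TfidfTransformer().fit_transform(ng_vec)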

tfidf = text.TfidfVectorizer()

tfidf.fit(ng_X)
TfidfVectorizer()

We can see this works because the code runs, but for completeness, we can also check the expected input to compare with the above.

print('\n'.join(tfidf.fit.__doc__.split('\n')[:6]))
Learn vocabulary and idf from training set.

        Parameters
        ----------
        raw_documents : iterable
            An iterable which generates either str, unicode or file objects.

This takes raw documents, as we wanted. Since we split the data before transforming, we can apply the fitted vectorizer to the train and test splits using transform.

ng_tfidf_train = tfidf.transform(ng_X_train)
ng_tfidf_test = tfidf.transform(ng_X_test)

Since the splits were made before either transformation, we can use the same targets we used above.

clf.fit(ng_tfidf_train,ng_y_train).score(ng_tfidf_test,ng_y_test)
0.9288135593220339

37.5. Comparing representations#

To start, we will look at one element from each in order to compare them.

ng_tfidf_train[0]
<1x24257 sparse matrix of type '<class 'numpy.float64'>'
	with 84 stored elements in Compressed Sparse Row format>
ng_vec_train[0]
<1x24257 sparse matrix of type '<class 'numpy.int64'>'
	with 84 stored elements in Compressed Sparse Row format>

To start, they both have 84 stored elements. Since these are two different representations of the same document, that makes sense. We can check a few others as well.

ng_tfidf_train[1]
<1x24257 sparse matrix of type '<class 'numpy.float64'>'
	with 202 stored elements in Compressed Sparse Row format>
ng_vec_train[1]
<1x24257 sparse matrix of type '<class 'numpy.int64'>'
	with 202 stored elements in Compressed Sparse Row format>
(ng_vec_train[4]>0).sum() == (ng_tfidf_train[4]>0).sum()
True

Let’s pick out a common word so that the calculation is meaningful and do the tf-idf calculation. To find a common word, we’ll filter the vocabulary to keep only the words that occur at least 300 times in the training set. We sum along the columns of the matrix, convert the result to an array, then enumerate over the sums (pairing each total with its column index) and keep the word if its total is over 300. Since this turns out to be a fairly long list, we only print the first 25, and we print each word with its index so we can use the index for the one we choose.

[(count_vec.get_feature_names_out()[i], i) for i, n in
         enumerate(np.asarray(ng_vec_train.sum(axis=0))[0])
 if n > 300][:25]

Let’s use computer.

computer_idx = 6786
count_vec.get_feature_names_out()[computer_idx]
ng_vec_train[:,computer_idx].toarray()[:10].T
array([[0, 0, 0, 5, 0, 0, 0, 0, 0, 0]])
ng_tfidf_train[:,computer_idx].toarray()[:10].T
array([[0.        , 0.        , 0.        , 0.06907742, 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ]])

So, we can see they have nonzero elements in the same places, meaning that in both representations the columns refer to the same words.
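One quick way to verify this (a sketch using the rows from above) is to compare the sorted nonzero column indices of the same document in each representation:

# the nonzero columns should be identical for a given document
row = 0
count_cols = np.sort(ng_vec_train[row].nonzero()[1])
tfidf_cols = np.sort(ng_tfidf_train[row].nonzero()[1])
np.array_equal(count_cols, tfidf_cols)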

We can compare the untransformed to the count vectorizer:

len(ng_X_train[0].split())
100
ng_vec_train[0].sum()
100

We see that the row sum is a count of the total number of words in the document, not the number of unique words.
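For comparison, a sketch of unique words versus total words for the same document:

# nonzero entries = unique words; the sum = total words
(ng_vec_train[0] > 0).sum(), ng_vec_train[0].sum()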

ng_tfidf_train[0].sum()
7.558255873996122

The tf-idf matrix, however, is normalized, which makes the sums smaller. The row sums are not all the same, but they are much more similar to one another.

sns.displot(np.asarray(ng_vec_train.sum(axis=1)).ravel(), bins=20)

sns.displot(np.asarray(ng_tfidf_train.sum(axis=1)).ravel(), bins=20)

We can see that tf-idf makes the totals across documents much less spread out than the raw counts, whose totals vary with document length.

When we sum across words, we get to see how the total weight of each document is distributed in each representation.
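As a quick numeric check (a sketch using the training matrices from above), we can compare the spread of the per-document totals directly:

# standard deviation of the document totals in each representation
np.asarray(ng_vec_train.sum(axis=1)).std(), np.asarray(ng_tfidf_train.sum(axis=1)).std()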

From the documentation we see that the idf is not exactly the inverse of the document frequency; it is smoothed and rescaled. Here n is the total number of documents and df(t) is the number of documents containing the term t.

\[ \mathrm{idf}(t) = \log \frac{1 + n}{1 + \mathrm{df}(t)} + 1 \]
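We can check this formula for one term (a sketch; it assumes the two vectorizers share the same alphabetical vocabulary ordering, which holds with default settings, and uses the fitted idf_ attribute):

# recompute the smoothed idf for 'computer' and compare with the fitted value
counts_all = count_vec.transform(ng_X)           # counts for the full dataset, which tfidf was fit on
n = counts_all.shape[0]                          # total number of documents
df = (counts_all[:, computer_idx] > 0).sum()     # number of documents containing the term
np.log((1 + n) / (1 + df)) + 1, tfidf.idf_[computer_idx]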

In this implementation, each row is also normalized (to unit Euclidean norm by default) to keep the values small, so that documents of different lengths are more comparable, similar to the scaling we did last week.
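We can confirm the normalization by checking the Euclidean norm of a row (a sketch):

# with the default norm='l2', each tf-idf row has unit Euclidean norm
np.linalg.norm(ng_tfidf_train[0].toarray())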

37.6. Questions After Class#

37.6.1. What are unique tokens vs total vocabulary?#

The total vocabulary is the set of unique tokens across all documents. The number of values stored in the whole sparse matrix is the sum of the number of unique tokens in each document; in that sum, any word that appears in more than one document is counted more than once.
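We can read both numbers off of the fitted vectorizer and the training matrix (a sketch):

# columns = total vocabulary; stored entries = sum of unique tokens per document
len(count_vec.vocabulary_), ng_vec_train.nnz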

37.6.2. Does it only take into account if the word appears or not appear, or does the number of times the word appears matter? (Multinomial Naïve Bayes)#

It uses the counts; Multinomial Naïve Bayes models how many times each word appears, not just whether it appears.
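As a sketch of the difference, we could binarize the counts (keeping only whether each word appears) and compare the scores using the splits from above:

# same classifier, counts vs. presence/absence features
clf_counts = MultinomialNB().fit(ng_vec_train, ng_y_train)
clf_binary = MultinomialNB().fit((ng_vec_train > 0).astype(int), ng_y_train)
clf_counts.score(ng_vec_test, ng_y_test), clf_binary.score((ng_vec_test > 0).astype(int), ng_y_test)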