24. Classifying Text#

from sklearn.feature_extraction import text
from sklearn.metrics.pairwise import euclidean_distances
from sklearn import datasets
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn.model_selection import train_test_split
# edited so I can reuse these later
selected_categories = ['comp.graphics','sci.crypt']
ng_X, ng_y = datasets.fetch_20newsgroups(categories=selected_categories,
                                         return_X_y=True)

from sklearn.metrics import confusion_matrix

24.1. Review of Count Representation#

ex_vocab = ['and', 'are', 'cat', 'cats', 'dogs', 'pets', 'popular', 'videos']
sample = 'Cats and dogs are pets'
rep = [1, 1, 0, 1, 1, 1, 0, 0]
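We can check this by hand with CountVectorizer itself; a minimal sketch, fixing the vocabulary so the columns line up with ex_vocab:

ex_counts = text.CountVectorizer(vocabulary=ex_vocab)
# transform lowercases and tokenizes the sample, then counts each vocabulary word
ex_counts.transform([sample]).toarray()
array([[1, 1, 0, 1, 1, 1, 0, 0]])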

24.2. News Data#

The labels are the topic of the article: 0 = computer graphics, 1 = cryptography.

type(ng_X)
list
len(ng_X)
1179

The X is the actual text:

print(ng_X[0])
From: robert@cpuserver.acsc.com (Robert Grant)
Subject: Virtual Reality for X on the CHEAP!
Organization: USCACSC, Los Angeles
Lines: 187
Distribution: world
Reply-To: robert@cpuserver.acsc.com (Robert Grant)
NNTP-Posting-Host: cpuserver.acsc.com

Hi everyone,

I thought that some people may be interested in my VR
software on these groups:

*******Announcing the release of Multiverse-1.0.2*******

Multiverse is a multi-user, non-immersive, X-Windows based Virtual Reality
system, primarily focused on entertainment/research.

Features:

   Client-Server based model, using Berkeley Sockets.
   No limit to the number of users (apart from performance).
   Generic clients.
   Customizable servers.
   Hierachical Objects (allowing attachment of cameras and light sources).
   Multiple light sources (ambient, point and spot).
   Objects can have extension code, to handle unique functionality, easily
        attached.

Functionality:

  Client:
   The client is built around a 'fast' render loop. Basically it changes things
   when told to by the server and then renders an image from the user's
   viewpoint. It also provides the server with information about the user's
   actions - which can then be communicated to other clients and therefore to
   other users.

   The client is designed to be generic - in other words you don't need to
   develop a new client when you want to enter a new world. This means that
   resources can be spent on enhancing the client software rather than adapting
   it. The adaptations, as will be explained in a moment, occur in the servers.

   This release of the client software supports the following functionality:

    o Hierarchical Objects (with associated addressing)

    o Multiple Light Sources and Types (Ambient, Point and Spot)

    o User Interface Panels

    o Colour Polygonal Rendering with Phong Shading (optional wireframe for
	faster frame rates)

    o Mouse and Keyboard Input

   (Some people may be disappointed that this software doesn't support the
   PowerGlove as an input device - this is not because it can't, but because
   I don't have one! This will, however, be one of the first enhancements!)

  Server(s):
   This is where customization can take place. The following basic support is
   provided in this release for potential world server developers:

    o Transparent Client Management

    o Client Message Handling

   This may not sound like much, but it takes away the headache of
accepting and
   terminating clients and receiving messages from them - the
application writer
   can work with the assumption that things are happening locally.

   Things get more interesting in the object extension functionality. This is
   what is provided to allow you to animate your objects:

    o Server Selectable Extension Installation:
        What this means is that you can decide which objects have extended
        functionality in your world. Basically you call the extension
        initialisers you want.

    o Event Handler Registration:
        When you develop extensions for an object you basically write callback
        functions for the events that you want the object to respond to.
        (Current events supported: INIT, MOVE, CHANGE, COLLIDE & TERMINATE)

    o Collision Detection Registration:
        If you want your object to respond to collision events just provide
        some basic information to the collision detection management software.
        Your callback will be activated when a collision occurs.

    This software is kept separate from the worldServer applications because
    the application developer wants to build a library of extended objects
    from which to choose.

    The following is all you need to make a World Server application:

    o Provide an initWorld function:
        This is where you choose what object extensions will be supported, plus
        any initialization you want to do.

    o Provide a positionObject function:
        This is where you determine where to place a new client.

    o Provide an installWorldObjects function:
        This is where you load the world (.wld) file for a new client.

    o Provide a getWorldType function:
        This is where you tell a new client what persona they should have.

    o Provide an animateWorld function:
        This is where you can go wild! At a minimum you should let the objects
        move (by calling a move function) and let the server sleep for a bit
        (to avoid outrunning the clients).

    That's all there is to it! And to prove it here are the line counts for the
    three world servers I've provided:

        generic - 81 lines
        dactyl - 270 lines (more complicated collision detection due to the
                           stairs! Will probably be improved with future
                           versions)
        dogfight - 72 lines

Location:

   This software is located at the following site:
   ftp.u.washington.edu

   Directory:
   pub/virtual-worlds

   File:
   multiverse-1.0.2.tar.Z

Futures:

   Client:

    o Texture mapping.

    o More realistic rendering: i.e. Z-Buffering (or similar), Gouraud shading

    o HMD support.

    o Etc, etc....

   Server:

    o Physical Modelling (gravity, friction etc).

    o Enhanced Object Management/Interaction

    o Etc, etc....

   Both:

    o Improved Comms!!!

I hope this provides people with a good understanding of the Multiverse
software,
unfortunately it comes with practically zero documentation, and I'm not sure
whether that will ever be able to be rectified! :-(

I hope people enjoy this software and that it is useful in our explorations of
the Virtual Universe - I've certainly found fascinating developing it, and I
would *LOVE* to add support for the PowerGlove...and an HMD :-)!!

Finally one major disclaimer:

This is totally amateur code. By that I mean there is no support for this code
other than what I, out the kindness of my heart, or you, out of pure
desperation, provide. I cannot be held responsible for anything good or bad
that may happen through the use of this code - USE IT AT YOUR OWN RISK!

Disclaimer over!

Of course if you love it, I would like to here from you. And anyone with
POSITIVE contributions/criticisms is also encouraged to contact me. Anyone who
hates it: > /dev/null!

************************************************************************
*********
And if anyone wants to let me do this for a living: you know where to
write :-)!
************************************************************************
*********

Thanks,

Robert.

robert@acsc.com
^^^^^^^^^^^^^^^
ng_y[:3]
array([0, 0, 1])

24.3. Count Vectorization#

We're going to instantiate the object and fit it to the whole dataset.

count_vec = text.CountVectorizer()
ng_vec = count_vec.fit_transform(ng_X)
ng_vec.shape
(1179, 24257)
type(ng_vec)
scipy.sparse._csr.csr_matrix
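Each of the 24257 columns corresponds to one word in the vocabulary learned from the articles. A quick sketch of how to inspect it (the exact words depend on the fetched data):

len(count_vec.vocabulary_)              # one entry per column
count_vec.get_feature_names_out()[:5]   # the first few words, alphabetically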

Now get a classifier ready:

clf = MultinomialNB()

Then train/test split:

ng_vec_train, ng_vec_test, ng_y_train, ng_y_test = train_test_split(ng_vec,ng_y)
clf.fit(ng_vec_train,ng_y_train).score(ng_vec_test,ng_y_test)
0.9830508474576272
clf.__dict__
{'alpha': 1.0,
 'fit_prior': True,
 'class_prior': None,
 'force_alpha': 'warn',
 'n_features_in_': 24257,
 'classes_': array([0, 1]),
 'class_count_': array([438., 446.]),
 'feature_count_': array([[30., 11.,  0., ...,  0.,  0.,  0.],
        [26.,  8.,  2., ...,  1.,  1.,  1.]]),
 'feature_log_prob_': array([[ -8.37071195,  -9.31979251, -11.80469916, ..., -11.80469916,
         -11.80469916, -11.80469916],
        [ -8.84621446,  -9.94482675, -11.04343904, ..., -11.44890414,
         -11.44890414, -11.44890414]]),
 'class_log_prior_': array([-0.70223815, -0.68413811])}
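The feature_log_prob_ attribute is just the log of the Laplace-smoothed per-class word frequencies. A minimal check of that, assuming the default alpha=1:

smoothed = clf.feature_count_ + 1                               # add alpha to every count
recon = np.log(smoothed / smoothed.sum(axis=1, keepdims=True))
np.allclose(recon, clf.feature_log_prob_)                       # should be True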

We can predict on new articles by transforming them and then passing the result to our classifier.

article_vec = count_vec.transform(['this is about cryptography'])
clf.predict(article_vec)
array([1])

We can see that it was confident in that prediction:

clf.predict_proba(article_vec)
array([[0.00502949, 0.99497051]])

The probability would be high the other way with a different sentence:

article_vec = count_vec.transform(['this is about image processing'])
clf.predict(article_vec)
array([0])

If we make something about both topics, we can get less certain predictions:

article_vec = count_vec.transform(['this is about encrypting images'])
clf.predict_proba(article_vec)
array([[0.5753077, 0.4246923]])

24.4. TF-IDF#

This stands for term frequency-inverse document frequency. For document number \(d\) and word number \(t\), with \(n\) total documents:

\[\operatorname{tf-idf(t,d)}=\operatorname{tf(t,d)} \times \operatorname{idf(t)}\]

where:

\[\operatorname{idf}(t) = \log{\frac{1 + n}{1+\operatorname{df}(t)}} + 1\]

and

  • \(\operatorname{df}(t)\) is the number of documents word \(t\) occurs in

  • \(\operatorname{tf(t,d)}\) is the number of times word \(t\) occurs in document \(d\)

Then sklearn also normalizes as follows:

\[v_{norm} = \frac{v}{||v||_2} = \frac{v}{\sqrt{v{_1}^2 + v{_2}^2 + \dots + v{_n}^2}}\]
tfidf = text.TfidfTransformer()
ng_tfidf = tfidf.fit_transform(ng_vec)
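To make the formula concrete, here is a sketch that recomputes TF-IDF by hand on a toy corpus and compares it to TfidfTransformer, assuming the default settings (smooth_idf=True, norm='l2', raw counts for tf):

toy = ['cats and dogs are pets', 'cats are popular', 'dog videos are popular']
toy_counts = text.CountVectorizer().fit_transform(toy).toarray()
n = toy_counts.shape[0]                    # total number of documents
df = (toy_counts > 0).sum(axis=0)          # documents each word occurs in
idf = np.log((1 + n) / (1 + df)) + 1
manual = toy_counts * idf
manual = manual / np.linalg.norm(manual, axis=1, keepdims=True)   # l2 normalize
np.allclose(manual, text.TfidfTransformer().fit_transform(toy_counts).toarray())   # should be True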

24.5. Comparing Representations#

Which representation is better?

To do this, we re-split the data so that the same samples are in the test set for both representations:

ng_vec_train, ng_vec_test, ng_tfidf_train, ng_tfidf_test,ng_y_train, ng_y_test = train_test_split(ng_vec, ng_tfidf, ng_y)

Warning

In class, I used Multinomial NB for both and said verbally that this was a bad choice, because the TF-IDF representation does not meet its assumptions: the values are not counts.

I have switched this to Gaussian NB here.

Now we get two new classifiers:

clf_vec = MultinomialNB()
clf_tfidf = GaussianNB()

and fit and score each:

clf_vec.fit(ng_vec_train,ng_y_train).score(ng_vec_test,ng_y_test)
0.9898305084745763
GaussianNB does not accept sparse input (passing the sparse matrix directly raises "TypeError: A sparse matrix was passed, but dense data is required"), so we convert the TF-IDF matrices to dense arrays first:

clf_tfidf.fit(ng_tfidf_train.toarray(),ng_y_train).score(ng_tfidf_test.toarray(),ng_y_test)

To get a more robust answer, we should use cross-validation and likely a different classifier: Multinomial Naive Bayes is well suited to the count data, but, strictly speaking, not to the normalized TF-IDF values.
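A minimal sketch of that comparison with cross_val_score (GaussianNB needs the TF-IDF matrix densified, which is memory-hungry at this vocabulary size, so treat this as illustrative):

from sklearn.model_selection import cross_val_score
scores_vec = cross_val_score(MultinomialNB(), ng_vec, ng_y, cv=5)
scores_tfidf = cross_val_score(GaussianNB(), ng_tfidf.toarray(), ng_y, cv=5)
scores_vec.mean(), scores_tfidf.mean()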

24.6. Distances in Text Data#

ng_y[:20]
array([0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0])
selected_categories
['comp.graphics', 'sci.crypt']

Next we will group the samples by class and see what the numerical representation gives us:

first_graphics = np.where(ng_y == 0)[0]
first_crypt = np.where(ng_y == 1)[0]

np.where returns the indices where the condition is true.
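For example, on a small array:

np.where(np.array([0, 1, 1, 0]) == 1)[0]
array([1, 2])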

subset_rows = np.concatenate([first_graphics,first_crypt])
subset_rows
array([   0,    1,    3, ..., 1174, 1176, 1178])
sns.heatmap( euclidean_distances(ng_tfidf[subset_rows]))
<Axes: >
[Figure: heatmap of pairwise Euclidean distances between the TF-IDF rows, graphics articles first, then cryptography]

This shows that the graphics articles are close to one another, but the cryptography articles are far apart.

This means that the graphics articles are more similar to one another than they are to the cryptography articles. It also shows that the cryptography articles are more diverse than the graphics articles.
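We can quantify this with average within-class and between-class distances; a sketch using the same distance matrix (the exact values depend on the data):

dists = euclidean_distances(ng_tfidf[subset_rows])
n_graphics = len(first_graphics)
dists[:n_graphics, :n_graphics].mean()    # average graphics-to-graphics distance
dists[n_graphics:, n_graphics:].mean()    # average crypt-to-crypt distance
dists[:n_graphics, n_graphics:].mean()    # average between-class distance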

We could also look at only the test samples, sorted the same way:

test_sorted = np.concatenate([np.where(ng_y_test == 0)[0],np.where(ng_y_test == 1)[0]])
sns.heatmap( euclidean_distances(ng_tfidf_test[test_sorted]))
<Axes: >
[Figure: heatmap of pairwise Euclidean distances among the test samples, sorted by class]
As with fit, GaussianNB's predict needs a dense array:

y_pred_tfidf = clf_tfidf.predict(ng_tfidf_test.toarray())
cols = pd.MultiIndex.from_arrays([['actual class']*len(selected_categories),
                                  selected_categories])
rows = pd.MultiIndex.from_arrays([['predicted class']*len(selected_categories),
                                  selected_categories])

pd.DataFrame(data = confusion_matrix(y_pred_tfidf,ng_y_test),columns = cols,
             index = rows)

It makes sense that more of the errors mislabel cryptography as computer graphics, since the cryptography articles are more diverse and the graphics articles were mostly similar to one another.