Intro to NLP- representing text data - Programming for Data Science (Fall 2025)

from sklearn.feature_extraction import text
from sklearn.metrics.pairwise import euclidean_distances
from sklearn import datasets
import pandas as pd
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import euclidean_distances
import pandas as pd

# data_home is an attempt to make this run on gh
ng_X,ng_y = datasets.fetch_20newsgroups(categories =['comp.graphics','sci.crypt'],
                                    data_home = '.',
                                       return_X_y = True)

All of the machine leanring models we have seen only use numerical features organized into a table with one row per samplea and one column per feature.

That’s actually generally true. ALl ML models require numerical features, at some point. The process of taking data that is not numerical and tabular, which is called unstrucutred, into strucutred (tabular) format we require is called feature extraction. There are many, many ways to do that. We’ll see a few over the course of the rest of the semester. Some more advanced models hide the feature extraction, by putting it in the same function, but it’s always there.

Terms¶

document: unit of text we’re analyzing (one sample)
token: sequence of characters in some particular document that are grouped together as a useful semantic unit for processing (basically a word)
stop words: no meaning, we don’t need them (like a, the, an,). Note that this is context dependent
dictionary: all of the possible words that a given system knows how to process

Bag of Words Representionat¶

We’re going to learn a represetnation called the bag of words. It ignores the order of the words within a document. To do this, we’ll first extract all of the tokens (tokenize) the docuemtns and then count how mnay times each word appears. This will be our numerical representation of the data.

In order to analyze text we need some text, let’s use a small sentence:

sentence = 'Awesome, extra credit!'

Then we initialize our transformer, and use the fit transform method to fit the vectorizer model and apply it to this sentence(Awesome, extra credit!).

counts = CountVectorizer()

counts.fit_transform([sentence])

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 3 stored elements and shape (1, 3)>

We see it returns a sparse matrix. A sparse matrix means that it has a lot of 0s in it and so we only represent the data.

What is a Sparse Matrix?

For example

mfull = np.asarray([[1,0,0,0,0],[0,0,1,0,0],[0,0,0,1,0]])
mfull

array([[1, 0, 0, 0, 0],
       [0, 0, 1, 0, 0],
       [0, 0, 0, 1, 0]])

but as a sparse matrix, we could store fewer values.

[[0,0,1],[1,2,1],[2,3,1]]# the above

[[0, 0, 1], [1, 2, 1], [2, 3, 1]]

So any matrix where the number of total values is low enough, we can store it more efficiently by tracking the locations and values instead of all of the zeros.

To actually see it though we have to cast out of that into a regular array.

counts.fit_transform([sentence]).toarray()

array([[1, 1, 1]])

For only one sentence it’s all ones, because it only has a small vocabulary.

We can make it more interesting, by picking a second sentence

counts.vocabulary_

{'awesome': 0, 'extra': 2, 'credit': 1}

To make it more interesting, we will add another sentence:

sentence_list = [sentence, 'the cow jumped over the moon']

This time we can see more information

mat = counts.fit_transform(sentence_list).toarray()

We see that there a row for each sentence (document) and a column for each word(token).

mat

array([[1, 0, 1, 1, 0, 0, 0, 0],
       [0, 1, 0, 0, 1, 1, 1, 2]])

We can also examine attributes of the object.

Solution to Exercise 1

To get the number of tim each word occurs we sum down the columns (axis=0)and to get the total words (excluding stop words) for each document we sum along the rows (axis=1)

total_occurences_per_word = mat.sum(axis=0)
total_words_per_document = mat.sum(axis=1)

total_occurences_per_word, total_words_per_document

(array([1, 1, 1, 1, 1, 1, 1, 2]), array([3, 6]))

Classifying Text¶

Labels are the topic of the article 0 = computer graphics, 1 = cryptography.

type(ng_X)

list

len(ng_X)

1179

the X is he actual text

print(ng_X[0])

From: robert@cpuserver.acsc.com (Robert Grant)
Subject: Virtual Reality for X on the CHEAP!
Organization: USCACSC, Los Angeles
Lines: 187
Distribution: world
Reply-To: robert@cpuserver.acsc.com (Robert Grant)
NNTP-Posting-Host: cpuserver.acsc.com

Hi everyone,

I thought that some people may be interested in my VR
software on these groups:

*******Announcing the release of Multiverse-1.0.2*******

Multiverse is a multi-user, non-immersive, X-Windows based Virtual Reality
system, primarily focused on entertainment/research.

Features:

   Client-Server based model, using Berkeley Sockets.
   No limit to the number of users (apart from performance).
   Generic clients.
   Customizable servers.
   Hierachical Objects (allowing attachment of cameras and light sources).
   Multiple light sources (ambient, point and spot).
   Objects can have extension code, to handle unique functionality, easily
        attached.

Functionality:

  Client:
   The client is built around a 'fast' render loop. Basically it changes things
   when told to by the server and then renders an image from the user's
   viewpoint. It also provides the server with information about the user's
   actions - which can then be communicated to other clients and therefore to
   other users.

   The client is designed to be generic - in other words you don't need to
   develop a new client when you want to enter a new world. This means that
   resources can be spent on enhancing the client software rather than adapting
   it. The adaptations, as will be explained in a moment, occur in the servers.

   This release of the client software supports the following functionality:

    o Hierarchical Objects (with associated addressing)

    o Multiple Light Sources and Types (Ambient, Point and Spot)

    o User Interface Panels

    o Colour Polygonal Rendering with Phong Shading (optional wireframe for
	faster frame rates)

    o Mouse and Keyboard Input

   (Some people may be disappointed that this software doesn't support the
   PowerGlove as an input device - this is not because it can't, but because
   I don't have one! This will, however, be one of the first enhancements!)

  Server(s):
   This is where customization can take place. The following basic support is
   provided in this release for potential world server developers:

    o Transparent Client Management

    o Client Message Handling

   This may not sound like much, but it takes away the headache of
accepting and
   terminating clients and receiving messages from them - the
application writer
   can work with the assumption that things are happening locally.

   Things get more interesting in the object extension functionality. This is
   what is provided to allow you to animate your objects:

    o Server Selectable Extension Installation:
        What this means is that you can decide which objects have extended
        functionality in your world. Basically you call the extension
        initialisers you want.

    o Event Handler Registration:
        When you develop extensions for an object you basically write callback
        functions for the events that you want the object to respond to.
        (Current events supported: INIT, MOVE, CHANGE, COLLIDE & TERMINATE)

    o Collision Detection Registration:
        If you want your object to respond to collision events just provide
        some basic information to the collision detection management software.
        Your callback will be activated when a collision occurs.

    This software is kept separate from the worldServer applications because
    the application developer wants to build a library of extended objects
    from which to choose.

    The following is all you need to make a World Server application:

    o Provide an initWorld function:
        This is where you choose what object extensions will be supported, plus
        any initialization you want to do.

    o Provide a positionObject function:
        This is where you determine where to place a new client.

    o Provide an installWorldObjects function:
        This is where you load the world (.wld) file for a new client.

    o Provide a getWorldType function:
        This is where you tell a new client what persona they should have.

    o Provide an animateWorld function:
        This is where you can go wild! At a minimum you should let the objects
        move (by calling a move function) and let the server sleep for a bit
        (to avoid outrunning the clients).

    That's all there is to it! And to prove it here are the line counts for the
    three world servers I've provided:

        generic - 81 lines
        dactyl - 270 lines (more complicated collision detection due to the
                           stairs! Will probably be improved with future
                           versions)
        dogfight - 72 lines

Location:

   This software is located at the following site:
   ftp.u.washington.edu

   Directory:
   pub/virtual-worlds

   File:
   multiverse-1.0.2.tar.Z

Futures:

   Client:

    o Texture mapping.

    o More realistic rendering: i.e. Z-Buffering (or similar), Gouraud shading

    o HMD support.

    o Etc, etc....

   Server:

    o Physical Modelling (gravity, friction etc).

    o Enhanced Object Management/Interaction

    o Etc, etc....

   Both:

    o Improved Comms!!!

I hope this provides people with a good understanding of the Multiverse
software,
unfortunately it comes with practically zero documentation, and I'm not sure
whether that will ever be able to be rectified! :-(

I hope people enjoy this software and that it is useful in our explorations of
the Virtual Universe - I've certainly found fascinating developing it, and I
would *LOVE* to add support for the PowerGlove...and an HMD :-)!!

Finally one major disclaimer:

This is totally amateur code. By that I mean there is no support for this code
other than what I, out the kindness of my heart, or you, out of pure
desperation, provide. I cannot be held responsible for anything good or bad
that may happen through the use of this code - USE IT AT YOUR OWN RISK!

Disclaimer over!

Of course if you love it, I would like to here from you. And anyone with
POSITIVE contributions/criticisms is also encouraged to contact me. Anyone who
hates it: > /dev/null!

************************************************************************
*********
And if anyone wants to let me do this for a living: you know where to
write :-)!
************************************************************************
*********

Thanks,

Robert.

robert@acsc.com
^^^^^^^^^^^^^^^

ng_y[:3]

array([0, 0, 1])

Count Vectorization¶

We’re going to instantiate the object and fit it two the whole dataset.

count_vec =CountVectorizer()
ng_vec = count_vec.fit_transform(ng_X)

Next we can look at the data little:

ng_X[:1]

["From: robert@cpuserver.acsc.com (Robert Grant)\nSubject: Virtual Reality for X on the CHEAP!\nOrganization: USCACSC, Los Angeles\nLines: 187\nDistribution: world\nReply-To: robert@cpuserver.acsc.com (Robert Grant)\nNNTP-Posting-Host: cpuserver.acsc.com\n\nHi everyone,\n\nI thought that some people may be interested in my VR\nsoftware on these groups:\n\n*******Announcing the release of Multiverse-1.0.2*******\n\nMultiverse is a multi-user, non-immersive, X-Windows based Virtual Reality\nsystem, primarily focused on entertainment/research.\n\nFeatures:\n\n   Client-Server based model, using Berkeley Sockets.\n   No limit to the number of users (apart from performance).\n   Generic clients.\n   Customizable servers.\n   Hierachical Objects (allowing attachment of cameras and light sources).\n   Multiple light sources (ambient, point and spot).\n   Objects can have extension code, to handle unique functionality, easily\n        attached.\n\nFunctionality:\n\n  Client:\n   The client is built around a 'fast' render loop. Basically it changes things\n   when told to by the server and then renders an image from the user's\n   viewpoint. It also provides the server with information about the user's\n   actions - which can then be communicated to other clients and therefore to\n   other users.\n\n   The client is designed to be generic - in other words you don't need to\n   develop a new client when you want to enter a new world. This means that\n   resources can be spent on enhancing the client software rather than adapting\n   it. The adaptations, as will be explained in a moment, occur in the servers.\n\n   This release of the client software supports the following functionality:\n\n    o Hierarchical Objects (with associated addressing)\n\n    o Multiple Light Sources and Types (Ambient, Point and Spot)\n\n    o User Interface Panels\n\n    o Colour Polygonal Rendering with Phong Shading (optional wireframe for\n\tfaster frame rates)\n\n    o Mouse and Keyboard Input\n\n   (Some people may be disappointed that this software doesn't support the\n   PowerGlove as an input device - this is not because it can't, but because\n   I don't have one! This will, however, be one of the first enhancements!)\n\n  Server(s):\n   This is where customization can take place. The following basic support is\n   provided in this release for potential world server developers:\n\n    o Transparent Client Management\n\n    o Client Message Handling\n\n   This may not sound like much, but it takes away the headache of\naccepting and\n   terminating clients and receiving messages from them - the\napplication writer\n   can work with the assumption that things are happening locally.\n\n   Things get more interesting in the object extension functionality. This is\n   what is provided to allow you to animate your objects:\n\n    o Server Selectable Extension Installation:\n        What this means is that you can decide which objects have extended\n        functionality in your world. Basically you call the extension\n        initialisers you want.\n\n    o Event Handler Registration:\n        When you develop extensions for an object you basically write callback\n        functions for the events that you want the object to respond to.\n        (Current events supported: INIT, MOVE, CHANGE, COLLIDE & TERMINATE)\n\n    o Collision Detection Registration:\n        If you want your object to respond to collision events just provide\n        some basic information to the collision detection management software.\n        Your callback will be activated when a collision occurs.\n\n    This software is kept separate from the worldServer applications because\n    the application developer wants to build a library of extended objects\n    from which to choose.\n\n    The following is all you need to make a World Server application:\n\n    o Provide an initWorld function:\n        This is where you choose what object extensions will be supported, plus\n        any initialization you want to do.\n\n    o Provide a positionObject function:\n        This is where you determine where to place a new client.\n\n    o Provide an installWorldObjects function:\n        This is where you load the world (.wld) file for a new client.\n\n    o Provide a getWorldType function:\n        This is where you tell a new client what persona they should have.\n\n    o Provide an animateWorld function:\n        This is where you can go wild! At a minimum you should let the objects\n        move (by calling a move function) and let the server sleep for a bit\n        (to avoid outrunning the clients).\n\n    That's all there is to it! And to prove it here are the line counts for the\n    three world servers I've provided:\n\n        generic - 81 lines\n        dactyl - 270 lines (more complicated collision detection due to the\n                           stairs! Will probably be improved with future\n                           versions)\n        dogfight - 72 lines\n\nLocation:\n\n   This software is located at the following site:\n   ftp.u.washington.edu\n\n   Directory:\n   pub/virtual-worlds\n\n   File:\n   multiverse-1.0.2.tar.Z\n\nFutures:\n\n   Client:\n\n    o Texture mapping.\n\n    o More realistic rendering: i.e. Z-Buffering (or similar), Gouraud shading\n\n    o HMD support.\n\n    o Etc, etc....\n\n   Server:\n\n    o Physical Modelling (gravity, friction etc).\n\n    o Enhanced Object Management/Interaction\n\n    o Etc, etc....\n\n   Both:\n\n    o Improved Comms!!!\n\nI hope this provides people with a good understanding of the Multiverse\nsoftware,\nunfortunately it comes with practically zero documentation, and I'm not sure\nwhether that will ever be able to be rectified! :-(\n\nI hope people enjoy this software and that it is useful in our explorations of\nthe Virtual Universe - I've certainly found fascinating developing it, and I\nwould *LOVE* to add support for the PowerGlove...and an HMD :-)!!\n\nFinally one major disclaimer:\n\nThis is totally amateur code. By that I mean there is no support for this code\nother than what I, out the kindness of my heart, or you, out of pure\ndesperation, provide. I cannot be held responsible for anything good or bad\nthat may happen through the use of this code - USE IT AT YOUR OWN RISK!\n\nDisclaimer over!\n\nOf course if you love it, I would like to here from you. And anyone with\nPOSITIVE contributions/criticisms is also encouraged to contact me. Anyone who\nhates it: > /dev/null!\n\n************************************************************************\n*********\nAnd if anyone wants to let me do this for a living: you know where to\nwrite :-)!\n************************************************************************\n*********\n\nThanks,\n\nRobert.\n\nrobert@acsc.com\n^^^^^^^^^^^^^^^\n"]

ng_y[:5]

array([0, 0, 1, 0, 0])

Note that this is a very sparse matrix:

ng_vec

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 188291 stored elements and shape (1179, 24257)>

Since the matrix is in total, 1179 rows (the number of documents) and 24257 columns (the number of words), if the matrix was stored normally, we would need to store 28599003 values. The sparse matrix, however, only stores 188291 values. This is 0.6583830911867802% if the values, as in less than 1% of the values!

The sparse matrix gives us a bit more work to do programmatically, but save SO MUCH in resources that it is worth it.

Train /test split¶

Next, we train/test split:

ng_vec_train, ng_vec_test, ng_y_train, ng_y_test = train_test_split(ng_vec,ng_y)

Solution to Exercise 2 #

It is important to transform first, because if not and there are any words in the test documents that are not in the training data it won’t work. We want the feature space for both training and test to be the same or our fit classifier cannot actually evalute the test set.

Fit and eval¶

Now get a classifier ready:

clf = MultinomialNB()

Then, as normal

clf.fit(ng_vec_train, ng_y_train)

and score (and we can also do all of the other scores we have seen for classification)

clf.score(ng_vec_test, ng_y_test)

0.9694915254237289

We can predict on new articles and by transforming and then passing it to our classifierr.

article_vec = count_vec.transform(['this is about cryptography'])
clf.predict(article_vec)

array([1])

We can see that it was confident in that prediction:

clf.predict_proba(article_vec)

array([[0.00597812, 0.99402188]])

It would be high the other way with a different sentence:

article_vec = count_vec.transform(['this is about image proecessing'])
clf.predict(article_vec)

array([0])

If we make something about both topics, we can get less certain predictions:

article_vec = count_vec.transform(['this is about encrypting images'])
clf.predict_proba(article_vec)

array([[0.59444577, 0.40555423]])

TF-IDF¶

This stands for term-frequency inverse document frequency. for a document number $d$ and word number $t$ with $D$ total documents:

\operatorname{tf-idf(t,d)}=\operatorname{tf(t,d)} \times \operatorname{idf(t)}

(1)

where:

\operatorname{idf}(t) = \log{\frac{1 + n}{1+\operatorname{df}(t)}} + 1

(2)

and

$\operatorname{df}(t)$ is the number of documents word $t$ occurs in
$\operatorname{tf(t,d)}$ is te number of times word $t$ occurs in document $d$

then sklearn also normalizes as follows:

v_{norm} = \frac{v}{||v||_2} = \frac{v}{\sqrt{v{_1}^2 + v{_2}^2 + \dots + v{_n}^2}}

(3)

tfidf = text.TfidfTransformer()
ng_tfidf = tfidf.fit_transform(ng_vec)

Other embeddings¶

There are other types of embeddings that are generaly represented usign neural networks. we talked breifly about them.

You can read more in this tutorial on huggingface

We also discussed some biases that embeddings end up encoding as in this paper

Questions¶

On transforming before splitting, does it break both ways (like a word in the training but not in the test) or no?¶

For the actual transform method you have, if whichever one you do second (using transform after fit_transform) has an extra word then you will have the problem.