from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn import datasets
import pandas as pd
import numpy as np
# data_home is an attempt to make this run on gh
ng_X,ng_y = datasets.fetch_20newsgroups(categories =['comp.graphics','sci.crypt'],
                                        data_home = '.',
                                        return_X_y = True)
All of the machine learning models we have seen so far only use numerical features organized into a table with one row per sample and one column per feature.
That’s actually generally true. All ML models require numerical features at some point. The process of turning data that is not numerical and tabular, which is called unstructured, into the structured (tabular) format we require is called feature extraction. There are many, many ways to do that. We’ll see a few over the course of the rest of the semester. Some more advanced models hide the feature extraction by putting it in the same function, but it’s always there.
Terms¶
document: unit of text we’re analyzing (one sample)
token: sequence of characters in some particular document that are grouped together as a useful semantic unit for processing (basically a word)
stop words: words with little meaning on their own, so we don’t need them (like a, the, an); note that which words count as stop words is context dependent (see the sketch after this list)
dictionary: all of the possible words that a given system knows how to process
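For instance, CountVectorizer can drop a built-in English stop word list before counting. A minimal sketch (the sentence here is made up for illustration):
from sklearn.feature_extraction.text import CountVectorizer

# with stop_words='english', words like 'the', 'an', and 'and' are dropped
vec = CountVectorizer(stop_words='english')
vec.fit(['the cow and an awesome moon'])
print(vec.get_feature_names_out())  # ['awesome' 'cow' 'moon']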
Bag of Words Representation¶
We’re going to learn a representation called the bag of words. It ignores the order of the words within a document. To do this, we’ll first extract all of the tokens from (tokenize) the documents and then count how many times each word appears. This will be our numerical representation of the data.
In order to analyze text we need some text; let’s use a small sentence:
sentence = 'Awesome, extra credit!'
Then we initialize our transformer and use the fit_transform method to fit the vectorizer model and apply it to this sentence.
counts = CountVectorizer()
counts.fit_transform([sentence])
<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 3 stored elements and shape (1, 3)>
We see it returns a sparse matrix. A sparse matrix is one with a lot of 0s in it, so only the nonzero values (and their positions) are actually stored.
To actually see it, though, we have to cast it into a regular (dense) array.
counts.fit_transform([sentence]).toarray()
array([[1, 1, 1]])
For only one sentence it’s all ones, because the vocabulary is small and each word appears exactly once.
We can check which column corresponds to which word:
counts.vocabulary_
{'awesome': 0, 'extra': 2, 'credit': 1}
To make it more interesting, we will add another sentence:
sentence_list = [sentence, 'the cow jumped over the moon']
This time we can see more information:
mat = counts.fit_transform(sentence_list).toarray()
We see that there is a row for each sentence (document) and a column for each word (token).
mat
array([[1, 0, 1, 1, 0, 0, 0, 0],
       [0, 1, 0, 0, 1, 1, 1, 2]])
We can also examine attributes of the object.
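For example, we can ask the fitted vectorizer which token each column represents (a small sketch using the counts object fit above; the columns come out in alphabetical order, matching the matrix):
counts.get_feature_names_out()
array(['awesome', 'cow', 'credit', 'extra', 'jumped', 'moon', 'over',
       'the'], dtype=object)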
Fill in the following to make the sums correct:
total_occurences_per_word = mat.sum(axis=??)
total_words_per_document = mat.sum(axis=??)
Solution to Exercise 1
To get the number of times each word occurs we sum down the columns (axis=0) and to get the total words (excluding stop words) for each document we sum along the rows (axis=1).
total_occurences_per_word = mat.sum(axis=0)
total_words_per_document = mat.sum(axis=1)
total_occurences_per_word, total_words_per_document
(array([1, 1, 1, 1, 1, 1, 1, 2]), array([3, 6]))
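As a quick sanity check (a sketch, assuming the mat from above): both axis sums cover every entry of the matrix exactly once, so their grand totals have to agree:
# per-word totals and per-document totals both add up to the total word count
assert total_occurences_per_word.sum() == mat.sum()
assert total_words_per_document.sum() == mat.sum()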
Classifying Text¶
Labels are the topic of the article: 0 = computer graphics, 1 = cryptography.
type(ng_X)
list
len(ng_X)
1179
The X is the actual text:
print(ng_X[0])
From: robert@cpuserver.acsc.com (Robert Grant)
Subject: Virtual Reality for X on the CHEAP!
Organization: USCACSC, Los Angeles
Lines: 187
Distribution: world
Reply-To: robert@cpuserver.acsc.com (Robert Grant)
NNTP-Posting-Host: cpuserver.acsc.com
Hi everyone,
I thought that some people may be interested in my VR
software on these groups:
*******Announcing the release of Multiverse-1.0.2*******
Multiverse is a multi-user, non-immersive, X-Windows based Virtual Reality
system, primarily focused on entertainment/research.
Features:
Client-Server based model, using Berkeley Sockets.
No limit to the number of users (apart from performance).
Generic clients.
Customizable servers.
Hierachical Objects (allowing attachment of cameras and light sources).
Multiple light sources (ambient, point and spot).
Objects can have extension code, to handle unique functionality, easily
attached.
Functionality:
Client:
The client is built around a 'fast' render loop. Basically it changes things
when told to by the server and then renders an image from the user's
viewpoint. It also provides the server with information about the user's
actions - which can then be communicated to other clients and therefore to
other users.
The client is designed to be generic - in other words you don't need to
develop a new client when you want to enter a new world. This means that
resources can be spent on enhancing the client software rather than adapting
it. The adaptations, as will be explained in a moment, occur in the servers.
This release of the client software supports the following functionality:
o Hierarchical Objects (with associated addressing)
o Multiple Light Sources and Types (Ambient, Point and Spot)
o User Interface Panels
o Colour Polygonal Rendering with Phong Shading (optional wireframe for
faster frame rates)
o Mouse and Keyboard Input
(Some people may be disappointed that this software doesn't support the
PowerGlove as an input device - this is not because it can't, but because
I don't have one! This will, however, be one of the first enhancements!)
Server(s):
This is where customization can take place. The following basic support is
provided in this release for potential world server developers:
o Transparent Client Management
o Client Message Handling
This may not sound like much, but it takes away the headache of
accepting and
terminating clients and receiving messages from them - the
application writer
can work with the assumption that things are happening locally.
Things get more interesting in the object extension functionality. This is
what is provided to allow you to animate your objects:
o Server Selectable Extension Installation:
What this means is that you can decide which objects have extended
functionality in your world. Basically you call the extension
initialisers you want.
o Event Handler Registration:
When you develop extensions for an object you basically write callback
functions for the events that you want the object to respond to.
(Current events supported: INIT, MOVE, CHANGE, COLLIDE & TERMINATE)
o Collision Detection Registration:
If you want your object to respond to collision events just provide
some basic information to the collision detection management software.
Your callback will be activated when a collision occurs.
This software is kept separate from the worldServer applications because
the application developer wants to build a library of extended objects
from which to choose.
The following is all you need to make a World Server application:
o Provide an initWorld function:
This is where you choose what object extensions will be supported, plus
any initialization you want to do.
o Provide a positionObject function:
This is where you determine where to place a new client.
o Provide an installWorldObjects function:
This is where you load the world (.wld) file for a new client.
o Provide a getWorldType function:
This is where you tell a new client what persona they should have.
o Provide an animateWorld function:
This is where you can go wild! At a minimum you should let the objects
move (by calling a move function) and let the server sleep for a bit
(to avoid outrunning the clients).
That's all there is to it! And to prove it here are the line counts for the
three world servers I've provided:
generic - 81 lines
dactyl - 270 lines (more complicated collision detection due to the
stairs! Will probably be improved with future
versions)
dogfight - 72 lines
Location:
This software is located at the following site:
ftp.u.washington.edu
Directory:
pub/virtual-worlds
File:
multiverse-1.0.2.tar.Z
Futures:
Client:
o Texture mapping.
o More realistic rendering: i.e. Z-Buffering (or similar), Gouraud shading
o HMD support.
o Etc, etc....
Server:
o Physical Modelling (gravity, friction etc).
o Enhanced Object Management/Interaction
o Etc, etc....
Both:
o Improved Comms!!!
I hope this provides people with a good understanding of the Multiverse
software,
unfortunately it comes with practically zero documentation, and I'm not sure
whether that will ever be able to be rectified! :-(
I hope people enjoy this software and that it is useful in our explorations of
the Virtual Universe - I've certainly found fascinating developing it, and I
would *LOVE* to add support for the PowerGlove...and an HMD :-)!!
Finally one major disclaimer:
This is totally amateur code. By that I mean there is no support for this code
other than what I, out the kindness of my heart, or you, out of pure
desperation, provide. I cannot be held responsible for anything good or bad
that may happen through the use of this code - USE IT AT YOUR OWN RISK!
Disclaimer over!
Of course if you love it, I would like to here from you. And anyone with
POSITIVE contributions/criticisms is also encouraged to contact me. Anyone who
hates it: > /dev/null!
************************************************************************
*********
And if anyone wants to let me do this for a living: you know where to
write :-)!
************************************************************************
*********
Thanks,
Robert.
robert@acsc.com
^^^^^^^^^^^^^^^
ng_y[:3]
array([0, 0, 1])
Count Vectorization¶
We’re going to instantiate the object and fit it to the whole dataset.
count_vec = CountVectorizer()
ng_vec = count_vec.fit_transform(ng_X)
Next we can look at the data a little:
ng_X[:1]["From: robert@cpuserver.acsc.com (Robert Grant)\nSubject: Virtual Reality for X on the CHEAP!\nOrganization: USCACSC, Los Angeles\nLines: 187\nDistribution: world\nReply-To: robert@cpuserver.acsc.com (Robert Grant)\nNNTP-Posting-Host: cpuserver.acsc.com\n\nHi everyone,\n\nI thought that some people may be interested in my VR\nsoftware on these groups:\n\n*******Announcing the release of Multiverse-1.0.2*******\n\nMultiverse is a multi-user, non-immersive, X-Windows based Virtual Reality\nsystem, primarily focused on entertainment/research.\n\nFeatures:\n\n Client-Server based model, using Berkeley Sockets.\n No limit to the number of users (apart from performance).\n Generic clients.\n Customizable servers.\n Hierachical Objects (allowing attachment of cameras and light sources).\n Multiple light sources (ambient, point and spot).\n Objects can have extension code, to handle unique functionality, easily\n attached.\n\nFunctionality:\n\n Client:\n The client is built around a 'fast' render loop. Basically it changes things\n when told to by the server and then renders an image from the user's\n viewpoint. It also provides the server with information about the user's\n actions - which can then be communicated to other clients and therefore to\n other users.\n\n The client is designed to be generic - in other words you don't need to\n develop a new client when you want to enter a new world. This means that\n resources can be spent on enhancing the client software rather than adapting\n it. The adaptations, as will be explained in a moment, occur in the servers.\n\n This release of the client software supports the following functionality:\n\n o Hierarchical Objects (with associated addressing)\n\n o Multiple Light Sources and Types (Ambient, Point and Spot)\n\n o User Interface Panels\n\n o Colour Polygonal Rendering with Phong Shading (optional wireframe for\n\tfaster frame rates)\n\n o Mouse and Keyboard Input\n\n (Some people may be disappointed that this software doesn't support the\n PowerGlove as an input device - this is not because it can't, but because\n I don't have one! This will, however, be one of the first enhancements!)\n\n Server(s):\n This is where customization can take place. The following basic support is\n provided in this release for potential world server developers:\n\n o Transparent Client Management\n\n o Client Message Handling\n\n This may not sound like much, but it takes away the headache of\naccepting and\n terminating clients and receiving messages from them - the\napplication writer\n can work with the assumption that things are happening locally.\n\n Things get more interesting in the object extension functionality. This is\n what is provided to allow you to animate your objects:\n\n o Server Selectable Extension Installation:\n What this means is that you can decide which objects have extended\n functionality in your world. 
Basically you call the extension\n initialisers you want.\n\n o Event Handler Registration:\n When you develop extensions for an object you basically write callback\n functions for the events that you want the object to respond to.\n (Current events supported: INIT, MOVE, CHANGE, COLLIDE & TERMINATE)\n\n o Collision Detection Registration:\n If you want your object to respond to collision events just provide\n some basic information to the collision detection management software.\n Your callback will be activated when a collision occurs.\n\n This software is kept separate from the worldServer applications because\n the application developer wants to build a library of extended objects\n from which to choose.\n\n The following is all you need to make a World Server application:\n\n o Provide an initWorld function:\n This is where you choose what object extensions will be supported, plus\n any initialization you want to do.\n\n o Provide a positionObject function:\n This is where you determine where to place a new client.\n\n o Provide an installWorldObjects function:\n This is where you load the world (.wld) file for a new client.\n\n o Provide a getWorldType function:\n This is where you tell a new client what persona they should have.\n\n o Provide an animateWorld function:\n This is where you can go wild! At a minimum you should let the objects\n move (by calling a move function) and let the server sleep for a bit\n (to avoid outrunning the clients).\n\n That's all there is to it! And to prove it here are the line counts for the\n three world servers I've provided:\n\n generic - 81 lines\n dactyl - 270 lines (more complicated collision detection due to the\n stairs! Will probably be improved with future\n versions)\n dogfight - 72 lines\n\nLocation:\n\n This software is located at the following site:\n ftp.u.washington.edu\n\n Directory:\n pub/virtual-worlds\n\n File:\n multiverse-1.0.2.tar.Z\n\nFutures:\n\n Client:\n\n o Texture mapping.\n\n o More realistic rendering: i.e. Z-Buffering (or similar), Gouraud shading\n\n o HMD support.\n\n o Etc, etc....\n\n Server:\n\n o Physical Modelling (gravity, friction etc).\n\n o Enhanced Object Management/Interaction\n\n o Etc, etc....\n\n Both:\n\n o Improved Comms!!!\n\nI hope this provides people with a good understanding of the Multiverse\nsoftware,\nunfortunately it comes with practically zero documentation, and I'm not sure\nwhether that will ever be able to be rectified! :-(\n\nI hope people enjoy this software and that it is useful in our explorations of\nthe Virtual Universe - I've certainly found fascinating developing it, and I\nwould *LOVE* to add support for the PowerGlove...and an HMD :-)!!\n\nFinally one major disclaimer:\n\nThis is totally amateur code. By that I mean there is no support for this code\nother than what I, out the kindness of my heart, or you, out of pure\ndesperation, provide. I cannot be held responsible for anything good or bad\nthat may happen through the use of this code - USE IT AT YOUR OWN RISK!\n\nDisclaimer over!\n\nOf course if you love it, I would like to here from you. And anyone with\nPOSITIVE contributions/criticisms is also encouraged to contact me. 
Anyone who\nhates it: > /dev/null!\n\n************************************************************************\n*********\nAnd if anyone wants to let me do this for a living: you know where to\nwrite :-)!\n************************************************************************\n*********\n\nThanks,\n\nRobert.\n\nrobert@acsc.com\n^^^^^^^^^^^^^^^\n"]ng_y[:5]array([0, 0, 1, 0, 0])Note that this is a very sparse matrix:
ng_vec
<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 188291 stored elements and shape (1179, 24257)>
The matrix has 1179 rows (the number of documents) and 24257 columns (the number of words in the vocabulary), so if it were stored densely we would need to store 28599003 values. The sparse matrix, however, only stores 188291 values. That is about 0.66% of the values, as in less than 1% of the values!
The sparse matrix gives us a bit more work to do programmatically, but saves SO MUCH in resources that it is worth it.
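We can verify that percentage directly from the sparse matrix’s attributes (a small sketch, assuming the ng_vec from above; nnz is the scipy attribute holding the number of stored values):
n_rows, n_cols = ng_vec.shape
density = ng_vec.nnz / (n_rows * n_cols)
print(f"{ng_vec.nnz} stored out of {n_rows * n_cols} possible values ({density:.2%})")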
Train/test split¶
Next, we train/test split:
ng_vec_train, ng_vec_test, ng_y_train, ng_y_test = train_test_split(ng_vec, ng_y)
Why is it important that we transform the data first and then split the train and test data?
What could go wrong if we for example, used fit_transform on the training data and then tried to use transform on the test data?
Compare the transformations we did earlier for a single sentence and for two sentences.
How are they different?
Solution to Exercise 2
It is important to transform first because the feature space for the training and test sets must be the same, or our fitted classifier cannot actually evaluate the test set. If we fit a vectorizer separately on each split, the two would have different vocabularies and the columns would not line up; and even with transform, any words that appear only in the test documents are simply dropped, because they are not in the fitted vocabulary.
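A small illustration with made-up mini-documents (hypothetical, just to show the behavior): transform keeps the columns learned during the fit and silently drops unseen words:
demo_vec = CountVectorizer()
demo_vec.fit(['the cat sat', 'the dog ran'])  # vocabulary: cat, dog, ran, sat, the
# 'bird' and 'flew' are not in the fitted vocabulary, so only 'the' is counted
demo_vec.transform(['the bird flew the']).toarray()  # array([[0, 0, 0, 0, 2]])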
Fit and eval¶
Now get a classifier ready:
clf = MultinomialNB()
Then, as normal:
clf.fit(ng_vec_train, ng_y_train)
and score (we can also compute all of the other scores we have seen for classification):
clf.score(ng_vec_test, ng_y_test)
0.9898305084745763
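For example, one sketch of those other metrics (assuming the fitted clf and the test split from above):
from sklearn.metrics import classification_report

# per-class precision, recall, and f1 from the test set predictions
y_pred = clf.predict(ng_vec_test)
print(classification_report(ng_y_test, y_pred))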
We can predict on new articles by transforming them and then passing the result to our classifier.
article_vec = count_vec.transform(['this is about cryptography'])
clf.predict(article_vec)
array([1])
We can see that it was confident in that prediction:
clf.predict_proba(article_vec)
array([[0.00477751, 0.99522249]])
It would be confident in the other direction with a different sentence:
article_vec = count_vec.transform(['this is about image processing'])
clf.predict(article_vec)
array([0])
If we make something about both topics, we can get less certain predictions:
article_vec = count_vec.transform(['this is about encrypting images'])
clf.predict_proba(article_vec)
array([[0.64243157, 0.35756843]])
TF-IDF¶
This stands for term frequency–inverse document frequency. For a word (term) $t$ and a document $d$, with $n$ total documents:

$$\text{tf-idf}(t, d) = \text{tf}(t, d) \times \text{idf}(t)$$

where:

$$\text{idf}(t) = \ln\left(\frac{1 + n}{1 + \text{df}(t)}\right) + 1$$

and

$\text{df}(t)$ is the number of documents word $t$ occurs in
$\text{tf}(t, d)$ is the number of times word $t$ occurs in document $d$

then sklearn also normalizes each document’s vector as follows:

$$v_{\text{norm}} = \frac{v}{\lVert v \rVert_2}$$
tfidf = text.TfidfTransformer()
ng_tfidf = tfidf.fit_transform(ng_vec)
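We can check that normalization on the result (a small sketch, assuming the ng_tfidf from above):
# every row (document vector) should have unit L2 norm after the transform
row_norms = np.sqrt(ng_tfidf.multiply(ng_tfidf).sum(axis=1))
print(row_norms[:3])  # each entry should be (approximately) 1.0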
Other embeddings¶
There are other types of embeddings that are generally produced using neural networks. We talked briefly about them.
You can read more in this tutorial on huggingface
We also discussed some biases that embeddings end up encoding as in this paper
Questions¶
On transforming before splitting, does it break both ways (like a word in the training but not in the test) or no?¶
For the actual transform method, the problem shows up when whichever set you process second (using transform after fit_transform) has an extra word: that word gets no column, so it is dropped and its information is lost.
Next week on Tuesday, class will be a problem-solving session, with extra help for the last assignment and planning extensions.