
Glossary

aggregate
to combine data in some way, a function that can produce a customized summary table
more
anonymous function
a function that’s defined on the fly, typically to lighten syntax or return a function within a function. In python, they’re defined with the lambda keyword.
docs
balanced dataset
when the classes are equally or close to equally well represented
BeautifulSoup
a python library used to assist in web scraping; it pulls data from HTML and XML files, which can then be parsed in a variety of ways using different methods.
docs
cast
convert a variable from one type to another. In Python, done by using the constructor method of the object type or the builtin function for builtin types
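As a minimal sketch of casting with builtin constructors (the variable names and values are made up for illustration):

```python
# Cast by calling the constructor of the target type.
count_text = "42"            # a str, for example as read from a file
count = int(count_text)      # str -> int
ratio = float(count)         # int -> float
label = str(ratio)           # float -> str
letters = list("abc")        # str -> list of characters

print(count, ratio, label, letters)
```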
categorical
variable type with discrete outcomes
classification
a type of machine learning where a categorical target variable is predicted from features
class
a value of the target variable
cluster
a group of samples that is similar by some definition
clustering
a type of unsupervised learning that finds groups or clusters among the samples
conditional
a logical control to do something, conditioned on something else, for example if, elif, else
confusion matrix
counts the number of samples of each actual category that were *predicted* to be in each category for every pair of categories. For a two-class (binary) problem, the table has 4 outcomes: true positive, false positive, false negative and true negative
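The four binary outcomes can be counted by hand, as in this plain-Python sketch. The labels are hypothetical, and the rows-are-actual, columns-are-predicted layout is chosen to match the convention sklearn uses.

```python
from collections import Counter

# Hypothetical labels for six samples (1 = positive, 0 = negative).
actual = [1, 0, 1, 1, 0, 0]
predicted = [1, 0, 0, 1, 1, 0]

# Count how often each (actual, predicted) pair occurs.
pair_counts = Counter(zip(actual, predicted))

true_positive = pair_counts[(1, 1)]
false_negative = pair_counts[(1, 0)]
false_positive = pair_counts[(0, 1)]
true_negative = pair_counts[(0, 0)]

# Rows are actual classes, columns are predicted classes.
confusion = [[true_negative, false_positive],
             [false_negative, true_positive]]
print(confusion)
```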
corpus
(NLP) a set of documents for analysis
DataFrame
a data structure provided by pandas for tabular data in python.
data leakage
data from the test set is used in training and then falsely improves the test performance
dictionary
(data type) a mapping array that matches keys to values. doc
(in NLP) all of the possible tokens a model knows
discriminative model
a model that describes the decision rule for labeling a sample as one class or another
display
in jupyter notebooks, an HTML rendering of the output of a cell
document
unit of text for analysis (one sample). Could be one sentence, one paragraph, or an article, depending on the goal
error bars
typically vertical, but sometimes also horizontal lines on a point in a line graph or bar in bar chart that indicate the spread of the samples used to create that point or bar height
false negative
items in the positive class that were predicted in the negative class
items incorrectly predicted as members of the negative class
false negative rate
the percentage of actual positive items that were incorrectly classified
$FNR = \frac{FN}{P}$ for $FN$ false negatives and $P$ actual positive items.
false positive
items in the negative class that were predicted in the positive class
items incorrectly predicted as members of the positive class
feature
an input variable in a prediction algorithm
an independent variable
generalize
to describe previously unseen data; solving problems that are related to, but not present in, the training data
generative model
a model that describes the data and therefore can also be used to generate new data that looks like the training data.
gh
GitHub’s command line tools
git
a version control tool; it is fully open source and always free, and it can be hosted by anyone or used locally without a host.
GitHub
a hosting service for git repositories
hyperparameter
parameters of the learning algorithm that are set by the user and possibly optimized over in an outer loop; in sklearn these are set when instantiating the estimator object
index
(verb) to index into a data structure means to pick out specified items, for example index into a list or index into a data frame. Indexing usually involves square brackets []
(noun) the index of a dataframe is like a column, but it can be used to refer to the rows. It’s the list of names for the rows.
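A short sketch of both senses, assuming pandas is installed; the table contents and names are invented for illustration:

```python
import pandas as pd

# Hypothetical data; the names are used as the row index.
scores = pd.DataFrame(
    {"quiz": [8, 9, 7], "exam": [85, 92, 78]},
    index=["ana", "ben", "cal"],
)

second_item = [10, 20, 30][1]   # (verb) index into a list by position
quiz_column = scores["quiz"]    # (verb) index into a DataFrame by column name
ben_row = scores.loc["ben"]     # (noun) the index names the rows

print(list(scores.index))
```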
interpreter
the translator from human readable python code to something the computer can run. An interpreted language means you can work with python interactively
iterate
to do the same thing to each item in an iterable data structure. Iterating is usually described as iterating over some data structure and typically uses the for keyword
iterable
any object in python that can return its members one at a time. The most common example is a list, but there are others.
docs
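To make iterating and iterables concrete, here is a small sketch with a list, a string, and a dictionary (all values are illustrative):

```python
# Lists, strings, and dictionaries are all iterable.
squares = []
for number in [1, 2, 3]:                    # iterate over a list
    squares.append(number ** 2)

letters = [char for char in "abc"]          # iterate over a string
keys = [key for key in {"x": 1, "y": 2}]    # iterating a dict yields its keys

# iter() and next() show the "one at a time" behavior directly.
iterator = iter([10, 20])
first = next(iterator)
second = next(iterator)
```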
kernel
in the jupyter environment, the kernel is a language specific computational engine
lambda
the keyword used to define an anonymous function; lambda functions are defined with a compact syntax <name> = lambda <parameters>: <body>
docs
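The sketch below compares a named function with an equivalent lambda, then shows the two typical uses: lightening syntax and returning a function from a function. The helper names `add_one` and `make_multiplier` are hypothetical.

```python
# A named function and an equivalent lambda.
def add_one(x):
    return x + 1

add_one_lambda = lambda x: x + 1

# Lambdas are handy as short, throwaway arguments.
words = ["pandas", "git", "sklearn"]
by_length = sorted(words, key=lambda word: len(word))

# A function that returns a function built with lambda.
def make_multiplier(factor):
    return lambda value: value * factor

double = make_multiplier(2)
```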
learning algorithm
an algorithm that finds patterns in data
implemented in the fit method in sklearn
location parameter
parameter of a distribution that controls where it is spatially. e.g. mean in a Gaussian
mask
take a subset using booleans; False values are dropped and True are kept
like multiplying by an array with 0 and 1s in it.
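A minimal numpy sketch of masking, with made-up values. It also shows where the multiplication analogy stops: masking drops the False positions, while multiplying keeps them as zeros.

```python
import numpy as np

values = np.array([3, 8, 1, 9])
mask = values > 2         # boolean array: True, True, False, True

kept = values[mask]       # True positions are kept, False are dropped
zeroed = values * mask    # multiplying treats True as 1 and False as 0
```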
metric
(in ML) a score that measures the quality of a model’s predictions
model
(general) the set of assumptions, the simplification of the world
(statistics) a mathematical simplification of the world in probabilistic terms
numpy array
a type provided by numpy to represent matrices; it is what pd.DataFrame.values (doc) and pd.DataFrame.to_numpy (doc) return
The N-dimensional array (ndarray)
overfitting
when a model fits the training data much better than the test data; it has fit the noise in the training data instead of the core underlying pattern
a model that does not generalize well
parameter
(general programming) all inputs to a function
(in ML) the values that transform a generic function into a specific function
partition
a subset of samples
partitioning
a splitting of samples
PEP 8
Python Enhancement Proposal 8, the Style Guide for Python Code.
pep 8
precision
the percentage of positive predictions that were actually members of the positive class. $P = \frac{TP}{PP}$ for $TP$ true positives and $PP$ positive predictions
For a confusion matrix $C$ laid out as in sklearn: $P = \frac{C_{1,1}}{C_{0,1} + C_{1,1}}$
also called the positive predictive value
prediction
the estimated value of the target variable for a sample based on what was learned
the output of the prediction algorithm
prediction algorithm
an algorithm that takes an input and predicts the output value.
recall
the percentage of the actual positives that were predicted as the positive outcome. $R = \frac{TP}{P}$ for $TP$ true positives and $P$ items in the positive class
For a confusion matrix $C$ laid out as in sklearn: $R = \frac{C_{1,1}}{C_{1,0} + C_{1,1}}$
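As a worked example of the precision, recall, and false negative rate formulas, the sketch below uses a hypothetical 2x2 confusion matrix. It is stored as a nested list with rows for actual classes and columns for predicted classes, so `C[1][1]` is the true positive count.

```python
# Hypothetical confusion matrix: rows = actual, columns = predicted.
C = [[50, 10],   # actual negative: 50 true negatives, 10 false positives
     [5, 35]]    # actual positive: 5 false negatives, 35 true positives

precision = C[1][1] / (C[0][1] + C[1][1])             # TP / positive predictions
recall = C[1][1] / (C[1][0] + C[1][1])                # TP / actual positives
false_negative_rate = C[1][0] / (C[1][0] + C[1][1])   # FN / actual positives

print(round(precision, 3), recall, false_negative_rate)
```

Recall and the false negative rate share the same denominator, so they always sum to 1.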
repository
a project folder with tracking information in it, stored in a hidden .git directory
scale parameter
a parameter of a distribution that controls its width. e.g. variance in a Gaussian
schema
(formal) a description of how a database is set up
(informal) a description of what different columns in a data set mean
Series
a data structure provided by pandas for single columnar data with an index. Subsetting a DataFrame or applying a function to one will often produce a Series
docs
shape
the number of rows and columns of a dataframe or matrix.
side effect
an action that occurs in a function (like printing or writing a file) other than returning a value
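A small sketch contrasting a function without side effects against one that prints; both function names are hypothetical:

```python
def total(values):
    # No side effect: the function only returns a value.
    return sum(values)

def report_total(values):
    # Side effect: printing, with no explicit return value.
    print("total:", sum(values))

returned = total([1, 2, 3])         # the sum is returned to the caller
nothing = report_total([1, 2, 3])   # prints, but returns None
```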
slice
a subset of an iterable item based on the start, stop, and step
implemented by the slice built in
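The start, stop, and step of a slice can be illustrated on a short list, including an explicit `slice` object (the list contents are arbitrary):

```python
letters = ["a", "b", "c", "d", "e", "f"]

first_three = letters[0:3]          # start=0, stop=3 (stop is excluded)
every_other = letters[::2]          # step=2 over the whole list
reversed_letters = letters[::-1]    # a negative step walks backwards

# The bracket syntax builds a slice object; it can also be made explicitly.
middle = slice(1, 5, 2)
picked = letters[middle]            # same as letters[1:5:2]
```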
Split Apply Combine
a paradigm for splitting data into groups using a column, applying some function (aggregation, transformation, or filtration) to each piece, and combining the individual pieces back together into a single table
pandas guide
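A minimal pandas sketch of the three kinds of apply step, assuming a made-up `sales` table with a `store` column to split on:

```python
import pandas as pd

# Hypothetical data: two groups with a numeric column.
sales = pd.DataFrame({
    "store": ["north", "north", "south", "south"],
    "amount": [10, 30, 5, 15],
})

grouped = sales.groupby("store")["amount"]    # split on the store column

# Aggregation: one value per group, combined into a summary.
store_means = grouped.mean()

# Transformation: one value per original row.
centered = grouped.transform(lambda s: s - s.mean())

# Filtration: keep or drop whole groups.
large_stores = sales.groupby("store").filter(
    lambda group: group["amount"].sum() > 30
)
```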
stop words
Words that do not convey important meaning, so we don’t need them (like a, the, an). Note that this is context dependent. These words are removed when transforming text to a numerical representation
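Removing stop words can be sketched in plain Python. Both the tiny stop word list and the whitespace tokenization are simplifying assumptions; real lists are longer and task specific.

```python
# A tiny, hypothetical stop word list.
stop_words = {"a", "an", "the", "is", "of"}

document = "The index of a dataframe is a list of names"
tokens = document.lower().split()    # naive whitespace tokenization

kept_tokens = [token for token in tokens if token not in stop_words]
```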
suffix
additional part of the name that gets added to end of a name in a merge operation
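For instance, when two hypothetical tables both have a `score` column, pandas appends a suffix to each during a merge. The column names and the custom suffixes here are illustrative.

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2], "score": [90, 80]})
right = pd.DataFrame({"id": [1, 2], "score": [85, 75]})

# Both tables have a "score" column, so merge adds a suffix to each copy.
merged_default = left.merge(right, on="id")    # default suffixes "_x" and "_y"
merged_named = left.merge(right, on="id", suffixes=("_quiz", "_exam"))
```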
supervised learning
a type of machine learning that requires both features and target variables at time of training
machine learning with labeled examples
target
the output of a prediction algorithm
also called the dependent variable or label
test accuracy
percentage of predictions that the model predicts correctly, based on held-out (previously unseen) test data
test data
data that was not used in training that is instead used to evaluate the performance of a model
Tidy Data Format
Tidy data is a data format that ensures data is easy to manipulate, model and visualize. The specific rules of Tidy Data are as follows: each variable is a column, each observation is a row, and each type of observational unit is a table.
original paper
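One way to reshape an untidy table is sketched below with pandas `melt`. The wide table is hypothetical: it stores one column per year, so the year variable is spread across column headers instead of being a column itself.

```python
import pandas as pd

# Hypothetical wide table: one column per year.
wide = pd.DataFrame({
    "country": ["A", "B"],
    "2019": [1.0, 3.0],
    "2020": [2.0, 4.0],
})

# Melt so that each variable (country, year, value) is its own column
# and each row is one observation.
tidy = wide.melt(id_vars="country", var_name="year", value_name="value")
```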
token
a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing (typically a word, but more general)
TraceBack
an error message in python that traces back from the line of code that caused the exception through all of the functions that called other functions to reach that line. This is sometimes called tracing back through the stack
training accuracy
percentage of predictions that the model predicts correctly, based on the training data
training algorithm
another name for learning algorithm
training data
data used to fit the model to the specific domain or problem, provided to the learning algorithm
the data used to find patterns and determine parameter values that will work for the problem at hand
transpose
swap the rows and columns of a matrix or dataframe
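A short numpy sketch with an arbitrary 2-by-3 matrix shows how transposing swaps the rows and columns and therefore the shape:

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])    # 2 rows, 3 columns

transposed = matrix.T             # 3 rows, 2 columns

print(matrix.shape, transposed.shape)
```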
true negative
items in the negative class that were predicted in the negative class
items correctly predicted as members of the negative class
true positive
items in the positive class that were predicted in the positive class
items correctly predicted as members of the positive class
unsupervised learning
a type of machine learning that does not use target variables at learning (fit) time.
machine learning from unlabeled examples
Web Scraping
the process of extracting data from a website. In the context of this class, this is usually done using the python library Beautiful Soup and an HTML parser to retrieve specific data.