Class 15: Intro to ML & Modeling
What is a Model?
A model is a simplified representation of some part of the world. A famous quote about models is:
All models are wrong, but some are useful –George Box[^wiki]
You might have seen models in chemistry class, for example a model of an atom:
An atom doesn’t actually look like this, but it is a useful representation to help people learn how atoms function.
In machine learning, the models we use are generally statistical models.
A statistical model is a mathematical model that embodies a set of statistical assumptions concerning the generation of sample data (and similar data from a larger population). A statistical model represents, often in considerably idealized form, the data-generating process.
Types of Models in Machine Learning
Starting from a dataset, we first make an additional designation about how we will use the different variables (columns). We will call most of them the features, which we denote mathematically with \(\mathbf{X}\), and we’ll choose one to be the target (or labels), denoted by \(\mathbf{y}\).
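To make the designation concrete, here is a minimal sketch in plain Python. The dataset, the column names, and the choice of `species` as the target are all made up for illustration; with a real dataset you would make the same split on its actual columns.

```python
# Hypothetical toy dataset: each dict is one sample, each key is a variable (column).
rows = [
    {"petal_length": 1.4, "petal_width": 0.2, "species": "setosa"},
    {"petal_length": 4.7, "petal_width": 1.4, "species": "versicolor"},
    {"petal_length": 6.0, "petal_width": 2.5, "species": "virginica"},
]

target_name = "species"  # the one column we choose as the target
feature_names = [name for name in rows[0] if name != target_name]

# X holds one list of feature values per sample; y holds one target value per sample.
X = [[row[name] for name in feature_names] for row in rows]
y = [row[target_name] for row in rows]

print(feature_names)  # ['petal_length', 'petal_width']
print(X[0], y[0])     # [1.4, 0.2] setosa
```

Note that `X` and `y` stay aligned: the \(i\)th entry of `y` is the target for the \(i\)th row of `X`.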
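The core assumption for just about all machine learning is that there exists some function \(f\) so that for the \(i\)th sample
\[y_i = f(\mathbf{x}_i)\]
where \(\mathbf{x}_i\) is the \(i\)th row of \(\mathbf{X}\) and \(y_i\) is the \(i\)th entry of \(\mathbf{y}\).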
Then with different additional assumptions we get different types of machine learning:
if both features (\(\mathbf{X}\)) and target (\(\mathbf{y}\)) are observed (contained in our dataset), it’s supervised learning
if only the features (\(\mathbf{X}\)) are observed, it’s unsupervised learning
There are more types that we won’t cover much in class, but we mention them here for completeness:
if we have many samples for the features (\(\mathbf{X}\)) but labels for only some of them, it’s semi-supervised learning
if we have only features (\(\mathbf{X}\)) and an oracle (a person or some other labeler) that we can ask for labels, but with a budget so that we can’t get all of them, it’s active learning
if learning involves taking actions and getting rewards, it’s reinforcement learning
Supervised Learning
We’ll focus on supervised learning first. We can take that same core assumption and use it with additional information about our target variable to determine the learning task we are working on.
if \(y_i\) are discrete (e.g., flower species) we are doing classification
if \(y_i\) are continuous (e.g., height) we are doing regression
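The two cases above can be sketched as a small check on the target values. The function `learning_task` and the example targets are hypothetical, and the rule it uses (all floats means continuous) is only a rough heuristic: in practice integer-coded targets can be either discrete classes or counts, so the decision should come from what the target means.

```python
def learning_task(y):
    """Guess the supervised learning task from the target values.

    Rough heuristic (an assumption of this sketch): targets that are all
    floats are treated as continuous; anything else is treated as discrete.
    """
    if all(isinstance(value, float) for value in y):
        return "regression"
    return "classification"


species = ["setosa", "versicolor", "setosa"]  # discrete target
heights = [1.62, 1.75, 1.80]                  # continuous target

print(learning_task(species))  # classification
print(learning_task(heights))  # regression
```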
Machine Learning Pipeline
To do machine learning we start with training data, which we give as input to the learning algorithm. A learning algorithm might be a generic optimization procedure or a specialized procedure for a specific model. The learning algorithm outputs a trained model, or the parameters of the model. When we deploy a model, we pair the fitted model with a prediction algorithm or decision algorithm to evaluate a new sample in the world.
When experimenting and designing, we need testing data to evaluate how well our learning algorithm understood the world. We need to use previously unseen data, because otherwise we can’t tell whether the prediction algorithm is using a rule that the learning algorithm produced or just retrieving the result from a lookup table. This can be thought of as the difference between memorization and understanding.
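As one illustration of these pieces, here is a minimal sketch where the learning algorithm is a least-squares fit of a straight line with a single feature. The function names `fit_line` and `predict` and the training data are made up for this example; it is one possible instance of the pipeline, not the only form a learning algorithm can take.

```python
def fit_line(x_train, y_train):
    """Learning algorithm: least-squares fit of y = slope * x + intercept."""
    n = len(x_train)
    mean_x = sum(x_train) / n
    mean_y = sum(y_train) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(x_train, y_train))
    sxx = sum((x - mean_x) ** 2 for x in x_train)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept  # the parameters of the trained model


def predict(params, x_new):
    """Prediction algorithm: apply the fitted parameters to a new sample."""
    slope, intercept = params
    return slope * x_new + intercept


# Hypothetical training data, generated exactly by y = 2x + 1.
x_train = [0.0, 1.0, 2.0, 3.0]
y_train = [1.0, 3.0, 5.0, 7.0]

params = fit_line(x_train, y_train)  # training: data in, parameters out
print(params)                        # (2.0, 1.0)
print(predict(params, 10.0))         # 21.0
```

Keeping `fit_line` and `predict` separate mirrors the split in the text: training happens once on the training data, while prediction only needs the stored parameters.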
When the model does well on the training data, but not on test data, we say that it does not generalize well.
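The difference between memorization and understanding can be sketched with a toy task. Everything here is hypothetical: the even/odd task, the data, and both predictors. The rule is written by hand to stand in for what a good learning algorithm would ideally produce.

```python
# Hypothetical task: label an integer as "even" or "odd".
train = [(2, "even"), (3, "odd"), (4, "even"), (7, "odd")]
test = [(10, "even"), (11, "odd"), (12, "even"), (15, "odd")]  # previously unseen

# "Memorization": a lookup table of the training samples, with a fixed
# fallback guess for anything it has not seen.
table = dict(train)


def predict_lookup(x):
    return table.get(x, "even")


# "Understanding": a rule that captures how the label is generated.
def predict_rule(x):
    return "even" if x % 2 == 0 else "odd"


def accuracy(predict, data):
    """Fraction of samples in data whose label is predicted correctly."""
    return sum(predict(x) == label for x, label in data) / len(data)


print(accuracy(predict_lookup, train), accuracy(predict_lookup, test))  # 1.0 0.5
print(accuracy(predict_rule, train), accuracy(predict_rule, test))      # 1.0 1.0
```

Both predictors are perfect on the training data, so training data alone cannot tell them apart. Only the test data shows that the lookup table does not generalize.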
Try it yourself
List different machine learning applications you’ve interacted with. For each one, try to figure out what the features would be, what the target would be, and what type of learning it is.
Glossary
Tip
This week we’ve learned a lot of new terms. Contribute definitions below to form a glossary.
Term | Definition |
---|---|
Model | a mathematical representation of assumptions about the world |