Class 15: Intro to ML & Modeling

handwritten notes from class

What is a Model?

A model is a simplified representation of some part of the world. A famous quote about models is:

All models are wrong, but some are useful –George Box[^wiki]

You might have seen models in chemistry class, for example about an atom:

https://upload.wikimedia.org/wikipedia/commons/a/a5/Bohr_atom_model_English.svg

Fig. 1 Brighterorange / CC BY-SA

An atom doesn’t actually look like this, but this is a useful representation to help people learn how atoms function.

In machine learning, the models we use are generally statistical models.

A statistical model is a mathematical model that embodies a set of statistical assumptions concerning the generation of sample data (and similar data from a larger population). A statistical model represents, often in considerably idealized form, the data-generating process.

Wikipedia

Types of Models in Machine Learning

Starting from a dataset, we first make an additional designation about how we will use the different variables (columns). We call most of them the features, denoted mathematically by \(\mathbf{X}\), and we choose one to be the target (or labels), denoted by \(\mathbf{y}\).
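As a minimal sketch of this designation, assume a small made-up table of flower measurements. The column names, values, and the helper `split_features_target` are hypothetical, chosen only for illustration. Separating the feature columns from the target column might look like this:

```python
# Hypothetical toy dataset: each dict is one sample (row), each key is a variable (column).
rows = [
    {"petal_length": 1.4, "petal_width": 0.2, "species": "setosa"},
    {"petal_length": 4.7, "petal_width": 1.4, "species": "versicolor"},
    {"petal_length": 6.0, "petal_width": 2.5, "species": "virginica"},
]


def split_features_target(rows, target_name):
    """Return (X, y): X holds the feature values of each sample, y holds the targets."""
    # Every column except the chosen target is treated as a feature.
    feature_names = [name for name in rows[0] if name != target_name]
    X = [[row[name] for name in feature_names] for row in rows]
    y = [row[target_name] for row in rows]
    return X, y


X, y = split_features_target(rows, "species")
print(X)  # one list of feature values per sample
print(y)  # one label per sample
```

Choosing a different `target_name` changes the learning problem, even though the dataset itself is unchanged.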

The core assumption for just about all machine learning is that there exists some function \(f\) so that for the \(i\)th sample

\[ y_i = f(\mathbf{x}_i) \]

Then with different additional assumptions we get different types of machine learning:

There are more types that we won’t cover much in class, but we mention them here for completeness:

  • if we have many samples for the features (\(\mathbf{X}\)) but labels for only some of them, it’s semi-supervised learning

  • if we have only features (\(\mathbf{X}\)) and an oracle (a person or some other labeler) that we can ask for labels within a budget, so we can’t get all of them, it’s active learning

  • reinforcement learning involves taking actions and getting rewards.

Supervised Learning

We’ll focus on supervised learning first. We can take that same core assumption and combine it with additional information about our target variable to determine which learning task we are working on.

\[ y_i = f(\mathbf{x}_i) \]
  • if \(y_i\) are discrete (e.g., flower species) we are doing classification

  • if \(y_i\) are continuous (e.g., height) we are doing regression
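The distinction can be sketched in code. The function `learning_task` below is a hypothetical helper, and its rule is a deliberately rough assumption: it treats floating-point targets as continuous and everything else as discrete. Real data may need more care, since integer targets can be counts rather than category codes.

```python
def learning_task(y):
    """Guess the supervised learning task from the target values.

    Assumption for this sketch: float targets are continuous; anything else
    (strings, booleans, integers used as category codes) is discrete.
    """
    if all(isinstance(value, float) for value in y):
        return "regression"
    return "classification"


print(learning_task(["setosa", "virginica", "setosa"]))  # classification
print(learning_task([1.62, 1.75, 1.80]))                 # regression
```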

Machine Learning Pipeline

To do machine learning, we start with training data, which we give as input to the learning algorithm. A learning algorithm might be a generic optimization procedure or a specialized procedure for a specific model. The learning algorithm outputs a trained model, or the parameters of the model. When we deploy a model, we pair the fitted model with a prediction algorithm or decision algorithm to evaluate a new sample in the world.
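To make the pieces of the pipeline concrete, here is a minimal sketch that uses a nearest-centroid classifier. This model is chosen only because it is short, and it is not necessarily one we use in class. The names `learn` and `predict` and the training data are hypothetical.

```python
def learn(X_train, y_train):
    """Learning algorithm: compute the mean feature vector (centroid) of each class.

    The returned dictionary is the trained model, i.e. the model's parameters.
    """
    sums, counts = {}, {}
    for x, label in zip(X_train, y_train):
        if label not in sums:
            sums[label] = [0.0] * len(x)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], x)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}


def predict(model, x):
    """Prediction algorithm: return the class whose centroid is closest to x."""
    def squared_distance(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    return min(model, key=lambda label: squared_distance(model[label], x))


# Hypothetical training data: [petal_length, petal_width] with species labels.
X_train = [[1.4, 0.2], [1.5, 0.3], [4.7, 1.4], [4.5, 1.5]]
y_train = ["setosa", "setosa", "versicolor", "versicolor"]

model = learn(X_train, y_train)      # training data -> trained model
print(predict(model, [1.3, 0.25]))   # new sample -> prediction: setosa
```

Note that `learn` runs once on the training data, while `predict` can be applied to as many new samples as we like using only the stored parameters.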

When experimenting and designing, we need testing data to evaluate how well our learning algorithm understood the world. We need to use previously unseen data, because otherwise we can’t tell whether the prediction algorithm is using a rule that the learning algorithm produced or just retrieving the result from a lookup table. This can be thought of as the difference between memorization and understanding.

When the model does well on the training data, but not on test data, we say that it does not generalize well.
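The sketch below illustrates the lookup-table concern on hypothetical data. A "memorizer" stores every training sample and is compared with a simple threshold rule. The threshold of 3.0 is picked by hand for this example and stands in for a rule that a learning algorithm might produce.

```python
def learn_lookup_table(X_train, y_train):
    """A 'memorizer': store every training sample together with its label."""
    return {tuple(x): label for x, label in zip(X_train, y_train)}


def predict_lookup(table, x, default="unknown"):
    """Look the sample up; a sample that was never seen gets the default answer."""
    return table.get(tuple(x), default)


def rule(x):
    """A threshold rule on petal length (threshold assumed for this sketch)."""
    return "setosa" if x[0] < 3.0 else "versicolor"


def accuracy(predict, X, y):
    """Fraction of samples whose prediction matches the true label."""
    correct = sum(predict(x) == label for x, label in zip(X, y))
    return correct / len(y)


# Hypothetical data: one feature (petal length) and two species.
X_train = [[1.4], [1.5], [4.7], [4.5]]
y_train = ["setosa", "setosa", "versicolor", "versicolor"]
X_test = [[1.3], [1.6], [4.9], [4.4]]  # none of these appear in the training data
y_test = ["setosa", "setosa", "versicolor", "versicolor"]

table = learn_lookup_table(X_train, y_train)

train_acc_lookup = accuracy(lambda x: predict_lookup(table, x), X_train, y_train)
test_acc_lookup = accuracy(lambda x: predict_lookup(table, x), X_test, y_test)
test_acc_rule = accuracy(rule, X_test, y_test)

print(train_acc_lookup)  # 1.0: perfect on the data it memorized
print(test_acc_lookup)   # 0.0: no unseen sample is in the table
print(test_acc_rule)     # 1.0: the rule carries over to unseen samples
```

If we had evaluated only on the training data, the memorizer would have looked perfect; only the test data reveals the difference.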

Try it yourself

  1. List different machine learning applications you’ve interacted with and try to figure out what the features would be, what the target would be, and what type of learning it is.

Glossary

Tip

This week we’ve learned a lot of new terms. Contribute definitions below to form a glossary.

| Term | Definition |
|------|------------|
| Model | a mathematical representation of assumptions about the world |