
Nonlinear Regression

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
sns.set_theme(font_scale=2,palette='colorblind')

We will use the same data as Tuesday:

test_samples = 20
diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)
X_train,X_test, y_train,y_test = train_test_split(diabetes_X, diabetes_y ,
                                                  test_size=test_samples,random_state=0)

And retrain a model like we did on Tuesday:

regr_db = linear_model.LinearRegression()
regr_db.fit(X_train, y_train)
LinearRegression()

and again score it

y_pred = regr_db.predict(X_test)
regr_db.score(X_test, y_test)
0.5195333332288746
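
For a regression estimator, score returns the coefficient of determination, $R^2$, so the same value comes from comparing the predictions to the true targets directly (a quick sketch using the r2_score we already imported):

r2_score(y_test, y_pred)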

This is better than Tuesday’s score.

This time it is better just because we used more data to train:

train_samples, _ = X_train.shape
total_samples, _ = diabetes_X.shape
train_samples, total_samples
(422, 442)

Above we passed an integer as the test_size parameter, so we set the number of samples instead of the fraction of the data. We used 20 samples for testing, which is only about 4.52% of the data. That is a lot less than the 25% used before, so with more training data we can get a better model.
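
As a minimal sketch of the two ways to set test_size (the variable names here are only for illustration):

# test_size as a float: hold out 25% of the samples for testing
X_tr, X_te, y_tr, y_te = train_test_split(diabetes_X, diabetes_y, test_size=0.25, random_state=0)
# test_size as an int: hold out exactly 20 samples for testing
X_tr, X_te, y_tr, y_te = train_test_split(diabetes_X, diabetes_y, test_size=20, random_state=0)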

Polynomial Regression

Polynomial regression is still a linear problem. Linear regression solves for the $\beta_i$ for a $d$-dimensional problem.

$$y = \beta_0 + \beta_1 x_1 + \ldots + \beta_d x_d = \beta_0 + \sum_{i=1}^d \beta_i x_i$$

Quadratic regression solves for

$$y = \beta_0 + \sum_{i=1}^d \beta_i x_i + \sum_{i=1}^{d} \sum_{j=i+1}^{d} \beta_{ij} x_i x_j + \sum_{i=1}^d \beta_{ii} x_i^2$$
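
For example, with $d=2$ this expands to the full quadratic model (written out as a sketch):

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{12} x_1 x_2 + \beta_{11} x_1^2 + \beta_{22} x_2^2$$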

This is still a linear problem, because we can create a new $X'$ matrix that has the polynomial values of each feature and solve for more $\beta$ values.

So if our original features are $x_1, x_2, \ldots, x_d$, our new $X'$ will have 3 types of features: the originals ($x_1, x_2, \ldots, x_d$), the squares ($x_1^2, x_2^2, \ldots, x_d^2$), and the interactions ($x_1x_2, x_1x_3, \ldots, x_{d-1}x_d$).

We use a transformer object, which works similarly to the estimators, but does not use targets.

First, we instantiate.

poly = PolynomialFeatures()

Then we can use fit_transform on the training data and transform on the test data:

X2_train =  poly.fit_transform(X_train)
X2_test = poly.transform(X_test)
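
To see exactly which columns were created, we can ask the fitted transformer for its output feature names (a quick sketch; with default settings the inputs are named x0 through x9):

# the constant column comes first, then the 10 original features,
# then the degree-2 terms (squares and interactions)
poly.get_feature_names_out()[:15]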

This changes the shape a lot; now we have many more features:

X2_train.shape
(422, 66)
Solution to Exercise 1

We can break down this total into different types: the original ones ($x_0, x_1, \ldots, x_9$), those squared ($x_0^2, x_1^2, \ldots, x_9^2$), every pair ($x_0x_1, x_0x_2, \ldots, x_7x_9, x_8x_9$), and a constant (so we do not need the intercept separately).


For $d$ original features the new total will be:

$$2d + \sum_{i=1}^{d-1} i + 1$$

so for $d=10$ that is:

$$2 \times 10 + (1+2+3+4+5+6+7+8+9) + 1 = 66$$
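
The same total can be checked in Python (a quick sketch; n_output_features_ is an attribute set when the transformer is fitted):

2*10 + sum(range(1, 10)) + 1  # evaluates to 66
poly.n_output_features_       # also 66 for the diabetes data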

Now we can fit a model and score it

regr_db2 = linear_model.LinearRegression()
regr_db2.fit(X2_train, y_train)
regr_db2.score(X2_test, y_test)
0.549017964337913

And we get even better performance than adding data alone did above.

regr_db2.coef_
array([ 1.98152760e-09, 2.26718871e+01, -2.84321941e+02, 4.77972133e+02, 3.55751429e+02, -1.05551594e+03, 7.64836015e+02, 1.92998441e+02, 1.29510126e+02, 9.95077095e+02, 6.97824322e+01, 1.35199820e+03, 3.45185945e+03, 1.47333609e+02, -6.34976797e+01, -3.44685551e+03, -1.39445637e+03, 5.68431155e+03, 5.24944560e+03, 1.93320706e+03, 1.44078438e+03, -1.71687299e+00, 1.07459375e+03, 1.71895547e+03, 1.01185087e+04, -8.18260464e+03, -2.98856060e+03, -2.75793035e+03, -2.63196845e+03, 4.87268727e+02, 3.31912403e+02, 2.57451083e+03, -5.38594465e+03, 3.88190888e+03, 2.88432660e+03, 4.95180080e+02, 2.98317779e+03, -5.23610008e+02, -6.08039517e+01, 8.85704665e+03, -5.89220694e+03, -3.89995076e+03, -1.62903703e+03, -2.50355665e+03, -2.07339059e+03, 9.56513316e+04, -1.35146919e+05, -8.45241707e+04, -4.18163194e+04, -8.84048804e+04, -6.63039722e+03, 5.00187584e+04, 5.43127248e+04, 2.05139512e+04, 6.78912952e+04, 4.41920280e+03, 2.10042065e+04, 2.61797171e+04, 3.74988567e+04, 5.16395623e+03, 1.12705277e+04, 1.13915240e+04, 4.08084006e+03, 2.05240548e+04, 2.20132713e+03, 1.91176349e+03])
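
There is one learned coefficient per expanded feature; a quick sanity check (a sketch):

regr_db2.coef_.shape  # (66,), matching the 66 columns of X2_train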

Loading a Pretrained Model!

In class, we went over the installation and login from the sharing a model tutorial in preparation for A5.
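
If the packages are not installed yet, a minimal sketch of the setup (assuming a notebook environment where pip is available):

%pip install huggingface_hub skops

Then authenticate, for example by running huggingface-cli login in a terminal and pasting a token from your Hugging Face account, as covered in the tutorial.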

from huggingface_hub import hf_hub_download
import skops.io as sio
# download the saved model file from the Hugging Face Hub repository
hf_hub_download(repo_id="CSC310-fall25/example_decision_tree", filename="model.pkl", local_dir='.')
# load the scikit-learn model back from the downloaded file
dt_loaded = sio.load('model.pkl')
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Cell In[12], line 1
----> 1 from huggingface_hub import hf_hub_download
      2 import skops.io as sio
      3 hf_hub_download(repo_id="CSC310-fall25/example_decision_tree", filename="model.pkl",local_dir='.')

ModuleNotFoundError: No module named 'huggingface_hub'

Once the packages are installed and the model loads, we can use it like any fitted estimator:

dt_loaded.predict(np.asarray([[5,6], [1,3]]))

Questions After Class