
Nonlinear Regression

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
sns.set_theme(font_scale=2,palette='colorblind')

We will use the same data as Tuesday:

test_samples = 20
diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)
X_train,X_test, y_train,y_test = train_test_split(diabetes_X, diabetes_y ,
                                                  test_size=test_samples,random_state=0)

And retrain a model like we did on Tuesday:

regr_db = linear_model.LinearRegression()
regr_db.fit(X_train, y_train)
LinearRegression()

and again score it

y_pred = regr_db.predict(X_test)
regr_db.score(X_test, y_test)
0.5195333332288746
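
For a regression estimator, score returns the coefficient of determination, $R^2$, so the same value comes from comparing the predictions to the true targets directly (a quick sketch using the r2_score we already imported):

r2_score(y_test, y_pred)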

This is better than Tuesday’s score.

This time it is better just because we used more data to train:

train_samples, _ = X_train.shape
total_samples, _ = diabetes_X.shape
train_samples, total_samples
(422, 442)

Above we passed an integer as the test_size parameter, so we set the number of samples instead of the fraction of the data. We used 20 samples for testing, which is only about 4.52% of the data. That is a lot less than the 25% used before, so with more training data we can get a better model.
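
As a minimal sketch of the two ways to set test_size (the variable names here are only for illustration):

# test_size as a float: hold out 25% of the samples for testing
X_tr, X_te, y_tr, y_te = train_test_split(diabetes_X, diabetes_y, test_size=0.25, random_state=0)
# test_size as an int: hold out exactly 20 samples for testing
X_tr, X_te, y_tr, y_te = train_test_split(diabetes_X, diabetes_y, test_size=20, random_state=0)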

Polynomial Regression

Polynomial regression is still a linear problem. Linear regression solves for the $\beta_i$ for a $d$-dimensional problem.

$$y = \beta_0 + \beta_1 x_1 + \ldots + \beta_d x_d = \beta_0 + \sum_{i=1}^d \beta_i x_i$$

Quadratic regression solves for

$$y = \beta_0 + \sum_{i=1}^d \beta_i x_i + \sum_{i=1}^{d} \sum_{j=i+1}^{d} \beta_{ij} x_i x_j + \sum_{i=1}^d \beta_{ii} x_i^2$$
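
For example, with $d=2$ this expands to the full quadratic model (written out as a sketch):

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{12} x_1 x_2 + \beta_{11} x_1^2 + \beta_{22} x_2^2$$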

This is still a linear problem, because we can create a new $X'$ matrix that has the polynomial values of each feature and solve for more $\beta$ values.

So if our original features are $x_1, x_2, \ldots, x_d$, our new $X'$ will have 3 types of features: the originals ($x_1, x_2, \ldots, x_d$), the squares ($x_1^2, x_2^2, \ldots, x_d^2$), and the interactions ($x_1x_2, x_1x_3, \ldots, x_{d-1}x_d$).

We use a transformer object, which works similarly to the estimators, but does not use targets.

First, we instantiate.

poly = PolynomialFeatures()

Then we can use fit_transform on the training data and transform on the test data:

X2_train =  poly.fit_transform(X_train)
X2_test = poly.transform(X_test)
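
To see exactly which columns were created, we can ask the fitted transformer for its output feature names (a quick sketch; with default settings the inputs are named x0 through x9):

# the constant column comes first, then the 10 original features,
# then the degree-2 terms (squares and interactions)
poly.get_feature_names_out()[:15]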

This changes the shape a lot; now we have many more features:

X2_train.shape
(422, 66)
Solution to Exercise 1

We can break down this total into different types: the original ones ($x_0, x_1, \ldots, x_9$), those squared ($x_0^2, x_1^2, \ldots, x_9^2$), every pair ($x_0x_1, x_0x_2, \ldots, x_7x_9, x_8x_9$), and a constant (so we do not need the intercept separately).


For $d$ original features the new total will be:

$$2d + \sum_{i=1}^{d-1} i + 1$$

so for $d=10$ that is:

$$2 \times 10 + (1+2+3+4+5+6+7+8+9) + 1 = 66$$
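
The same total can be checked in Python (a quick sketch; n_output_features_ is an attribute set when the transformer is fitted):

2*10 + sum(range(1, 10)) + 1  # evaluates to 66
poly.n_output_features_       # also 66 for the diabetes data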

Now we can fit a model and score it

regr_db2 = linear_model.LinearRegression()
regr_db2.fit(X2_train, y_train)
regr_db2.score(X2_test, y_test)
0.549017964337913

And we get even better performance than adding data alone did above.

regr_db2.coef_
array([ 1.98152760e-09, 2.26718871e+01, -2.84321941e+02, 4.77972133e+02, 3.55751429e+02, -1.05551594e+03, 7.64836015e+02, 1.92998441e+02, 1.29510126e+02, 9.95077095e+02, 6.97824322e+01, 1.35199820e+03, 3.45185945e+03, 1.47333609e+02, -6.34976797e+01, -3.44685551e+03, -1.39445637e+03, 5.68431155e+03, 5.24944560e+03, 1.93320706e+03, 1.44078438e+03, -1.71687299e+00, 1.07459375e+03, 1.71895547e+03, 1.01185087e+04, -8.18260464e+03, -2.98856060e+03, -2.75793035e+03, -2.63196845e+03, 4.87268727e+02, 3.31912403e+02, 2.57451083e+03, -5.38594465e+03, 3.88190888e+03, 2.88432660e+03, 4.95180080e+02, 2.98317779e+03, -5.23610008e+02, -6.08039517e+01, 8.85704665e+03, -5.89220694e+03, -3.89995076e+03, -1.62903703e+03, -2.50355665e+03, -2.07339059e+03, 9.56513316e+04, -1.35146919e+05, -8.45241707e+04, -4.18163194e+04, -8.84048804e+04, -6.63039722e+03, 5.00187584e+04, 5.43127248e+04, 2.05139512e+04, 6.78912952e+04, 4.41920280e+03, 2.10042065e+04, 2.61797171e+04, 3.74988567e+04, 5.16395623e+03, 1.12705277e+04, 1.13915240e+04, 4.08084006e+03, 2.05240548e+04, 2.20132713e+03, 1.91176349e+03])
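
There is one learned coefficient per expanded feature; a quick sanity check (a sketch):

regr_db2.coef_.shape  # (66,), matching the 66 columns of X2_train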

Loading a Pretrained Model!

In class, we went over the installation and login from the sharing a model tutorial in preparation for A5.
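
If the packages are not installed yet, a minimal sketch of the setup (assuming a notebook environment where pip is available):

%pip install huggingface_hub skops

Then authenticate, for example by running huggingface-cli login in a terminal and pasting a token from your Hugging Face account, as covered in the tutorial.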

from huggingface_hub import hf_hub_download
import skops.io as sio
# download the saved model file from the Hugging Face Hub repository
hf_hub_download(repo_id="CSC310-fall25/example_decision_tree", filename="model.pkl", local_dir='.')
# load the scikit-learn model back from the downloaded file
dt_loaded = sio.load('model.pkl')
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Cell In[12], line 1
----> 1 from huggingface_hub import hf_hub_download
      2 import skops.io as sio
      3 hf_hub_download(repo_id="CSC310-fall25/example_decision_tree", filename="model.pkl",local_dir='.')

ModuleNotFoundError: No module named 'huggingface_hub'

Once the packages are installed and the model loads, we can use it like any fitted estimator:

dt_loaded.predict(np.asarray([[5,6], [1,3]]))

Questions After Class