{ "cells": [ { "cell_type": "markdown", "id": "c63e966d", "metadata": {}, "source": [ "# Linear Regression" ] }, { "cell_type": "code", "execution_count": 1, "id": "05abcb9a", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import seaborn as sns\n", "import matplotlib.pyplot as plt\n", "from sklearn import datasets, linear_model\n", "from sklearn.metrics import mean_squared_error, r2_score\n", "from sklearn.model_selection import train_test_split\n", "import pandas as pd\n", "sns.set_theme(font_scale=2,palette='colorblind')" ] }, { "cell_type": "markdown", "id": "0ccdb53f", "metadata": {}, "source": [ "## Setting up a linear regression" ] }, { "cell_type": "code", "execution_count": 2, "id": "1659e88e", "metadata": {}, "outputs": [], "source": [ "tips = sns.load_dataset(\"tips\")" ] }, { "cell_type": "code", "execution_count": 3, "id": "edcadada", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
\n", "
" ], "text/plain": [ " total_bill tip sex smoker day time size\n", "0 16.99 1.01 Female No Sun Dinner 2" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tips.head(1)" ] }, { "cell_type": "markdown", "id": "99708294", "metadata": {}, "source": [ "We're going to predict **tip** from **total bill** using 80% of the data for training.\n", "This is a regression problem because the target, *tip* is a continuous value,\n", "the problems we've seen so far were all classification, species of iris and the\n", "character in that corners data were both categorical. \n", "\n", "Using linear regression is also a good choice because it makes sense that the tip\n", "would be approximately linearly related to the total bill, most people pick some\n", "percentage of the total bill. If we our prior knowledge was that people\n", "typically tipped with some more complicated function, this would not be a good\n", "model." ] }, { "cell_type": "code", "execution_count": 4, "id": "19db7a6c", "metadata": {}, "outputs": [], "source": [ "# sklearn requires 2D object of features even for 1 feature\n", "tips_X = tips['total_bill'].values\n", "tips_X = tips_X[:,np.newaxis] # add an axis\n", "tips_y = tips['tip']\n", "\n", "tips_X_train,tips_X_test, tips_y_train, tips_y_test = train_test_split(\n", " tips_X,\n", " tips_y,\n", " train_size=.8,\n", " random_state=0)" ] }, { "cell_type": "markdown", "id": "3e07dac2", "metadata": {}, "source": [ "To see what that new bit of code did, we can examine the shapes:" ] }, { "cell_type": "code", "execution_count": 5, "id": "7c8f4928", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(244, 1)" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tips_X.shape" ] }, { "cell_type": "markdown", "id": "b8906d08", "metadata": {}, "source": [ "what we ended up is 2 dimensions (there are two numbers) even though the second\n", "one is 1." 
] }, { "cell_type": "code", "execution_count": 6, "id": "40337d93", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(244,)" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tips['total_bill'].values.shape" ] }, { "cell_type": "markdown", "id": "87be68d6", "metadata": {}, "source": [ "this, without the `newaxis` is one dimension, we can see that because there is\n", "no number after the comma. \n", "\n", "Now that our data is ready, we create the linear regression estimator object" ] }, { "cell_type": "code", "execution_count": 7, "id": "33e323b4", "metadata": {}, "outputs": [], "source": [ "regr = linear_model.LinearRegression()" ] }, { "cell_type": "markdown", "id": "bc0222cf", "metadata": {}, "source": [ "Now we fit the model." ] }, { "cell_type": "code", "execution_count": 8, "id": "900fd9fb", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
LinearRegression()
" ], "text/plain": [ "LinearRegression()" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "regr.fit(tips_X_train,tips_y_train)" ] }, { "cell_type": "markdown", "id": "188b9d7a", "metadata": {}, "source": [ "We can examine the coefficients and intercept." ] }, { "cell_type": "code", "execution_count": 9, "id": "55f904e4", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(array([0.0968534]), 1.0285439454607272)" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "regr.coef_, regr.intercept_" ] }, { "cell_type": "markdown", "id": "fdeacda9", "metadata": {}, "source": [ "These define a line (y = mx+b) coef is the slope.\n", "\n", "\n", "```{important}\n", "This is what our model *predicts* the tip will be based on the past data. It is\n", "important to note that this is not what the tip *should* be by any sort of\n", "virtues. For example, a typical normative rule for tipping is to tip 15% or 20%.\n", "the model we learned, from this data, however is ~%10 + $1. (it's actually\n", "9.68% + $1.028)\n", "```\n", "\n", "To interpret this, we can apply it for a single value. We trained this to\n", "predict the tip from the total bill. So, we can put in any value that's a\n", "plausible total bill and get the predicted tip." ] }, { "cell_type": "code", "execution_count": 10, "id": "11c1fc61", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([2.75059744])" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "my_bill = np.asarray([17.78]).reshape(1,-1)\n", "regr.predict(my_bill)" ] }, { "cell_type": "markdown", "id": "de866f10", "metadata": {}, "source": [ "We can also apply the function, as usual." 
] }, { "cell_type": "code", "execution_count": 11, "id": "2aa9ba5d", "metadata": {}, "outputs": [], "source": [ "tips_y_pred = regr.predict(tips_X_test)" ] }, { "cell_type": "markdown", "id": "1696ff02", "metadata": {}, "source": [ "This gives a vector of values." ] }, { "cell_type": "code", "execution_count": 12, "id": "1073fcd5", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([2.7321953 , 2.79999268, 2.91621676, 1.73073111, 2.60434881,\n", " 1.58545101, 2.76415692, 3.28813383, 2.7864332 , 4.38451435,\n", " 3.47699796, 3.47021823, 2.39127132, 2.28763818, 2.32831661,\n", " 3.97288739, 1.83726986, 2.38449158, 2.84745085, 3.26585755,\n", " 3.93995723, 3.05471713, 2.57819839, 2.48521912, 2.33703342,\n", " 2.61693975, 2.20628132, 3.91477534, 3.4779665 , 2.55592211,\n", " 2.45519457, 2.23727441, 2.52202341, 2.05422148, 2.79999268,\n", " 2.32541101, 2.66827205, 2.02903959, 5.7094689 , 2.57626132,\n", " 1.85954614, 2.23243174, 2.54817383, 3.91961801, 2.26439336,\n", " 2.67214619, 2.79515001, 3.11864037, 2.68183153])" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tips_y_pred" ] }, { "cell_type": "markdown", "id": "191f9628", "metadata": {}, "source": [ "To visualize in more detail, we'll plot the data as black points and the\n", "predictions as blue points. To highlight that this is a perfectly linear\n", "prediction, we'll also add a line for the prediction." ] }, { "cell_type": "code", "execution_count": 13, "id": "199372f5", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "filenames": { "image/png": "/home/runner/work/BrownFall21/BrownFall21/_build/jupyter_execute/notes/2021-10-25_24_1.png" }, "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plt.scatter(tips_X_test,tips_y_test, color='black')\n", "plt.plot(tips_X_test,tips_y_pred, color='blue')\n", "plt.scatter(tips_X_test,tips_y_pred, color='blue')" ] }, { "cell_type": "markdown", "id": "bf38a493", "metadata": {}, "source": [ "## Evaluating Regression - Mean Squared Error\n", "\n", "From the plot, we can see that there is some error for each point, so accuracy\n", "that we've been using, won't work. One idea is to look at how much error there\n", "is in each prediction, we can look at that visually first." ] }, { "cell_type": "code", "execution_count": 14, "id": "7687d3a3", "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "filenames": { "image/png": "/home/runner/work/BrownFall21/BrownFall21/_build/jupyter_execute/notes/2021-10-25_26_0.png" }, "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plt.scatter(tips_X_test, tips_y_test, color='black')\n", "plt.plot(tips_X_test, tips_y_pred, color='blue', linewidth=3)\n", "\n", "# draw vertical lines frome each data point to its predict value\n", "[plt.plot([x,x],[yp,yt], color='red', linewidth=3)\n", " for x, yp, yt in zip(tips_X_test, tips_y_pred,tips_y_test)];" ] }, { "cell_type": "markdown", "id": "706f85d4", "metadata": {}, "source": [ "We can use the average length of these red lines to capture the error. To get\n", "the length, we can take the difference between the prediction and the data for\n", "each point. Some would be positive and others negative, so we will square each\n", "one then take the average." ] }, { "cell_type": "code", "execution_count": 15, "id": "39dcc42e", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.821309064276629" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mean_squared_error(tips_y_test, tips_y_pred)" ] }, { "cell_type": "markdown", "id": "c4a1c597", "metadata": {}, "source": [ "We can get back to the units being dollars, by taking the square root." 
] }, { "cell_type": "code", "execution_count": 16, "id": "f5bdb7d7", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.9062610353957787" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.sqrt(mean_squared_error(tips_y_test, tips_y_pred))" ] }, { "cell_type": "markdown", "id": "18fd6034", "metadata": {}, "source": [ "This is equivalent to using absolute value instead" ] }, { "cell_type": "code", "execution_count": 17, "id": "3c001a04", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.6564074900962107" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.mean(np.abs(tips_y_test - tips_y_pred))" ] }, { "cell_type": "markdown", "id": "94c8bfa0", "metadata": {}, "source": [ "## Evaluating Regression - R2\n", "\n", "We can also use the $R^2$ regression coefficient." ] }, { "cell_type": "code", "execution_count": 18, "id": "54cd4edb", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.5906895098589039" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "r2_score(tips_y_test,tips_y_pred)" ] }, { "cell_type": "markdown", "id": "2629dae8", "metadata": {}, "source": [ "This is a bit harder to interpret, but we can use some additional plots to\n", "visualize.\n", "This code simulates data by randomly picking 20 points, spreading them out\n", "and makes the “predicted” y values by picking a slope of 3. Then I simulated various levels of noise, by sampling noise and multiplying the same noise vector by different scales and adding all of those to a data frame with the column name the r score for if that column of target values was the truth.\n", "\n", "Then I added some columns of y values that were with different slopes and different functions of x. 
These all have the same small amount of noise.\n", "\n", "````{margin}\n", "```{tip}\n", "[Facet Grids](https://seaborn.pydata.org/generated/seaborn.FacetGrid.html) allow more customization than the figure level plotting functions\n", "we have used otherwise, but each of those combines a FacetGrid with a\n", "particular type of plot.\n", "```\n", "````" ] }, { "cell_type": "code", "execution_count": 19, "id": "2584b655", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "filenames": { "image/png": "/home/runner/work/BrownFall21/BrownFall21/_build/jupyter_execute/notes/2021-10-25_36_1.png" }, "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "x = 10*np.random.random(20)\n", "y_pred = 3*x\n", "ex_df = pd.DataFrame(data = x,columns = ['x'])\n", "ex_df['y_pred'] = y_pred\n", "n_levels = range(1,18,2)\n", "# sample 0 mean noise\n", "noise = (np.random.random(20)-.5)*2\n", "# add varying noise levels\n", "for n in n_levels:\n", " # add noise, scaled\n", " y_true = y_pred + n* noise\n", " # compute the r2 in the column name, assign the \"true\" (data) here\n", " ex_df['r2 = '+ str(np.round(r2_score(y_pred,y_true),3))] = y_true\n", "\n", "# add functions\n", "f_x_list = [2*x,3.5*x,.5*x**2, .03*x**3, 10*np.sin(x)+x*3,3*np.log(x**2)]\n", "for fx in f_x_list:\n", " y_true = fx + noise\n", " # compute the r2 in the column name, assign the \"true\" (data) here\n", " ex_df['r2 = '+ str(np.round(r2_score(y_pred,y_true),3))] = y_true \n", "\n", "# melt the data frame for plotting\n", "xy_df = ex_df.melt(id_vars=['x','y_pred'],var_name='rscore',value_name='y')\n", "# create a FacetGrid so that we can add two types of plots per subplot\n", "g = sns.FacetGrid(data = xy_df,col='rscore',col_wrap=3,aspect=1.5,height=3)\n", "g.map(plt.plot, 'x','y_pred',color='k')\n", "g.map(sns.scatterplot, \"x\", \"y\",)" ] }, { "cell_type": "markdown", "id": "f18055c2", "metadata": {}, "source": [ "## Multivariate Regression\n", "\n", "We can also load data from Scikit learn.\n", "\n", "This dataset includes 10 features measured on a given date and an measure of\n", "diabetes disease progression measured one year later. The predictor we can train\n", "with this data might be someting a doctor uses to calculate a patient's risk." 
] }, { "cell_type": "code", "execution_count": 20, "id": "1c5b14d5", "metadata": {}, "outputs": [], "source": [ "diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y = True)" ] }, { "cell_type": "code", "execution_count": 21, "id": "90fe5250", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(442, 10)" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "diabetes_X.shape" ] }, { "cell_type": "code", "execution_count": 22, "id": "7063808b", "metadata": {}, "outputs": [], "source": [ "diabetes_X_train, diabetes_X_test, diabetes_y_train, diabetes_y_test = train_test_split(\n", " diabetes_X, diabetes_y)\n", "regr_diabetes = linear_model.LinearRegression()" ] }, { "cell_type": "code", "execution_count": 23, "id": "b2df03e7", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
LinearRegression()
" ], "text/plain": [ "LinearRegression()" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "regr_diabetes.fit(diabetes_X_train,diabetes_y_train)" ] }, { "cell_type": "markdown", "id": "b4558c3a", "metadata": {}, "source": [ "## What score does linear regression use?" ] }, { "cell_type": "code", "execution_count": 24, "id": "80f9abb3", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.43874612898797793" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "regr_diabetes.score(diabetes_X_test,diabetes_y_test)" ] }, { "cell_type": "code", "execution_count": 25, "id": "ddabb670", "metadata": {}, "outputs": [], "source": [ "diabetes_y_pred = regr_diabetes.predict(diabetes_X_test)" ] }, { "cell_type": "code", "execution_count": 26, "id": "128ec817", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.43874612898797793" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "r2_score(diabetes_y_test,diabetes_y_pred)" ] }, { "cell_type": "code", "execution_count": 27, "id": "ef6d20f0", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3166.4747190611843" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mean_squared_error(diabetes_y_test,diabetes_y_pred)" ] }, { "cell_type": "markdown", "id": "2188a078", "metadata": {}, "source": [ "It uses the R2 score. \n", "\n", "This model predicts what lab measure a patient will have one year in the future\n", "based on lab measures in a given day. Since we see that this is not a very high\n", "r2, we can say that this is not a perfect predictor, but a Doctor, who better\n", "understands the score would have to help interpret the core.\n", "\n", "## Questions After class\n", "\n", "### How I should use these with data most effectively? 
What is the proper use of these methods?\n", "```{toggle}\n", "To solve continuous prediction tasks, like the ones we saw today. The notes\n", "above include more interpretation than we discussed in class, so read carefully\n", "for that.\n", "```\n", "\n", "### Why is it that even when random state is set to 0, my numbers are still a little different from yours and my neighbor's?\n", "```{toggle}\n", "[random state](https://scikit-learn.org/stable/glossary.html#term-random_state)\n", "sets the seed that's used internally and should work to\n", "[control the randomness](https://scikit-learn.org/stable/common_pitfalls.html#randomness)\n", "and produce reproducible results.\n", "If your results are just a little different like that, it could be a rounding\n", "difference; maybe you somehow set a default for display that's different.\n", "\n", "See for example [these options](https://stackoverflow.com/questions/25200609/apply-round-off-setting-to-whole-notebook)\n", "```" ] } ], "metadata": { "jupytext": { "text_representation": { "extension": ".md", "format_name": "myst", "format_version": 0.13, "jupytext_version": "1.10.3" } }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.13" }, "source_map": [ 12, 15, 24, 28, 32, 34, 47, 58, 61, 63, 67, 69, 75, 77, 80, 82, 85, 87, 103, 106, 110, 112, 115, 117, 122, 126, 133, 140, 146, 148, 151, 153, 156, 158, 164, 166, 184, 212, 221, 225, 229, 235, 237, 241, 245, 250, 254, 256 ] }, "nbformat": 4, "nbformat_minor": 5 }