# Overview

This chapter covers regression problems and analysis, introducing us to linear regression, as well as multiple linear regression and gradient descent. By the end of this chapter, you will be able to distinguish between regression and classification problems. You will be able to implement gradient descent in linear regression problems, and also apply it to other model architectures. You will also be able to use linear regression to construct a linear model for data in an *x-y* plane, evaluate the performance of linear models, and use the evaluation to choose the best model. In addition, you will be able to execute feature engineering to create dummy variables for constructing complicated linear models.

# Introduction

In *Chapter 1*, *Fundamentals*, and *Chapter 2*, *Exploratory Data Analysis and Visualization*, we introduced the concept of supervised machine learning in Python and the essential techniques required for loading, cleaning, exploring, and visualizing raw data sources. We discussed the importance of fully understanding the data before moving on to further analysis, as well as how the initial data preparation process can sometimes account for the majority of the time spent on the project as a whole. In particular, we considered correlations among all the variables, finding and addressing missing values, and understanding the shape of data via histograms, bar plots, and density plots. In this chapter, we will delve into the model building process and will construct our first supervised machine learning solution using linear regression.

# Regression and Classification Problems

We discussed two distinct methods, supervised learning and unsupervised learning, in *Chapter 1*, *Fundamentals*. Supervised learning problems aim to map input information to a known output value or label, but there are two further subcategories to consider. Supervised learning problems can be divided into regression and classification problems. Regression problems, which are the subject of this chapter, aim to predict or model continuous values, for example, predicting tomorrow's temperature in degrees Celsius from historical data, or forecasting future sales of a product on the basis of its sales history. In contrast, classification problems, rather than returning a continuous value, predict membership of one of a specified number of classes or categories. The example supervised learning problem in *Chapter 1*, *Fundamentals*, where we wanted to determine or predict whether a hairstyle was from the 1960s or 1980s, is a good example of a supervised classification problem. There, we attempted to predict whether a hairstyle was from one of two distinct groups or classes, class 1 being the 1960s and class 2 being the 1980s. Other classification problems include predicting whether a passenger of the Titanic survived, or the classic MNIST problem (http://yann.lecun.com/exdb/mnist/). (MNIST is a database of 70,000 labeled images of handwritten digits 0 through 9. The task in classifying examples from MNIST is to take one of the 70,000 input images and predict or classify which digit, 0-9, is written in the image. The model must predict the membership of the image in one of 10 different classes.)
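To make the distinction concrete, here is a minimal sketch on hypothetical synthetic data (not any of the chapter's datasets): the same input feature can feed a regression model, which predicts a continuous value, or a classification model, which predicts a class label. The values and the class threshold are invented purely for illustration:

```python
# Contrasting regression and classification on tiny synthetic data
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

X = np.arange(10).reshape(-1, 1)        # a single input feature

# Regression: the target is a continuous value (e.g. a temperature)
y_continuous = 15.0 + 0.5 * X.ravel()
reg = LinearRegression().fit(X, y_continuous)
print(reg.predict([[10]]))              # a continuous prediction, ~20.0

# Classification: the target is membership of a class
# (e.g. class 0 = 1960s hairstyle, class 1 = 1980s hairstyle)
y_class = (X.ravel() >= 5).astype(int)  # two classes, 0 and 1
clf = LogisticRegression().fit(X, y_class)
print(clf.predict([[10]]))              # a class label, here class 1
```

The regression model returns any real number, while the classifier can only return one of the class labels it was trained on.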

## The Machine Learning Workflow

Before we begin with regression problems, we will first look at the six major stages involved in creating any machine learning model, supervised regression or otherwise. These stages are as follows:

- Business understanding
- Data understanding
- Data preparation
- Modeling
- Evaluation
- Deployment

This workflow is described by a well-known open industry standard called CRISP-DM (cross-industry standard process for data mining) and can be viewed as follows:

It is advised that you ensure you are completely confident in your understanding of this pipeline and of what is described in this section, as each of these stages is critical in achieving good model performance as well as meeting the needs of the business. Here, we review the key aspects of each stage.

### Business Understanding

The first stage of any data analysis and modeling project is not to jump straight into the data or model building, but rather to understand why we are analyzing the data and what the impact of our models and conclusions on the business will be. As an individual working with the data, you may not have all the domain knowledge needed for this stage; the solution is to spend time engaging with the stakeholders in the business who know the pain points and business goals. It's very important not to underestimate this stage. Also note from the flowchart that there is feedback between the business understanding and data understanding stages, as well as from the evaluation stage to the business understanding stage. In other words, these are ongoing stages and you should endeavor to continuously discover as much as you can about the business aspects of the problem on which you are working. In the initial work in this phase, you should also formulate a preliminary overall plan for the project.

### Data Understanding

In most real projects, there are multiple potential data sources that may vary over time. This stage is intended to acquire data and understand it enough to choose data for the solution of the problem. This could result in determining a need for more data. The basic steps are to determine what data is available initially and make a data inventory. Then, review the data, which may include reading it into Python and doing an assessment of the data quality; the common issues of missing values, anomalous values, and so on can be uncovered here and discussed with the business team to determine the best actions. Although methods to impute (fill in by means of a calculation) missing values are widely described in popular literature, you should not jump immediately to applying tools to "fix" problems in the data—the goal here is to understand them and review the appropriate actions with the business stakeholders. Note that it may be more appropriate to discard data instances with missing values than to impute them or undertake a process to find the missing values.
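As a small illustration of this assessment step, pandas makes it easy to inventory missing values before deciding, with the business stakeholders, between discarding rows and imputing values. The column names and data below are hypothetical:

```python
# Inventorying missing values, then comparing two possible actions
import numpy as np
import pandas as pd

df = pd.DataFrame({'Year': [1901, 1902, 1903, 1904],
                   'Temp': [18.1, np.nan, 18.4, 18.3]})

# 1. Understand the problem first: where, and how much, is missing?
print(df.isnull().sum())          # Temp has one missing value

# 2a. One possible action: discard incomplete rows
dropped = df.dropna()

# 2b. Another: impute, e.g. with the column mean -- only after review
imputed = df.fillna({'Temp': df['Temp'].mean()})
print(len(dropped), imputed['Temp'].round(2).tolist())
```

Neither action is automatically correct; the point of this stage is to surface the issue and review the options, not to "fix" the data reflexively.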

In addition to the data inventory, a key output of this stage is a report describing the data, what has been found, and expected actions. To get to that output, some EDA (Exploratory Data Analysis) is needed, as was described in *Chapter 2*, *Exploratory Data Analysis and Visualization*.

### Data Preparation

In data preparation, we use the data determined to be appropriate in the prior stage and apply any cleaning and transformations needed for it to be used in modeling. This was the focus of a significant component of *Chapter 1*, *Fundamentals*, and thus will not be the subject of further analysis in this section. It is important, however, that the criticality of the data specification, collection, and cleaning/tidying process is well understood. We cannot expect to produce a high-performing system if the input data is sub-optimal. One common phrase that you should always remember with regard to data quality is *garbage in, garbage out*. If you use poor quality data, you are going to produce poor quality results. In our hairstyle example, we are looking for a sample size at least in the order of hundreds, ideally thousands, that has been correctly labeled as either from the 1960s or the 1980s. We do not want samples that have been incorrectly labeled, or that are not even from either era.

Note that during data preparation, it is entirely possible to discover additional aspects of the data and that additional visualization may be required during the process to get to the dataset for modeling.

### Modeling

The modeling stage comprises two sub-stages: model architecture specification and model training.

- Model architecture specification: In many cases, there are multiple possible model types (such as linear regression, artificial neural networks, or gradient boosting) that may be applicable to the problem at hand. It can therefore be beneficial to investigate more than one model architecture; to compare the candidates in terms of their predictive capability, each must be trained. In more complex projects, the two sub-stages may be iterated.
- Training: The second sub-stage of modeling is training, wherein we use the existing data and known outcomes in a process to "learn" the parameters of the candidate model. Here, we must establish the design and execution of the training process; the details of that will vary, depending on the model architecture chosen and the scale of the input data. For example, for very large datasets, we may have to stream or flow the data through the training process as the data is too large for the computer memory, while for smaller data, we can simply use the data all at once.

### Evaluation

The next stage of the workflow is the evaluation of the model, which yields the final performance metric. This is the mechanism through which we know whether the model is worth publishing, is better than a previous version, or whether it has been effectively translated across programming languages or development environments. We will cover some of these metrics in more detail in *Chapter 7*, *Model Evaluation*, and, as such, this will not be discussed in detail at this stage. Just keep in mind that whatever approach is used, it needs to be capable of consistently reporting and independently measuring the performance of the model against the metric using an appropriate sample from the data.

### Deployment

In a complete data analytics workflow, most models, once developed, need to be deployed in order to be used. Deployment is critical in some applications; for example, a model might underlie a recommendation system on an e-commerce site, and the model has to be redeployed to the web application each time it is updated. Deployment can take many forms, from simply sharing a Jupyter notebook, to automated code updates to a website whenever a commit is made to a master repository. Although important, deployment is beyond the scope of this book and we won't address it much going forward.

Before moving on to regression modeling, let's do some final data preparation exercises. For this purpose, we have created a synthetic dataset of recorded air temperatures from the years 1841 to 2010, which is available in the accompanying code bundle of this book or on GitHub at https://packt.live/2Pu850C. This dataset is composed of values designed to demonstrate the subject matter of this chapter and should not be mistaken for data collected from a scientific study.

## Exercise 3.01: Plotting Data with a Moving Average

As we discussed in *Chapter 1*, *Fundamentals*, and in the preceding section, a thorough understanding of the dataset being used is critical if a high-performing model is to be built. So, with this in mind, let's use this exercise to load, plot, and interrogate the data source:

- Import the **numpy**, **pandas**, and **matplotlib** packages:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
```

- Use the pandas **read_csv** function to load the CSV file containing the **synth_temp.csv** dataset, and then display the first five lines of data:

```python
df = pd.read_csv('../Datasets/synth_temp.csv')
df.head()
```

The output will be as follows:

- For our purposes, we don't want to use all this data, but let's look at how many points there are per year. Create a **print** statement to output the number of points for the years 1841, 1902, and 2010, and make a simple plot of the number of points per year:

```python
# take a quick look at the number of data points per year
print('There are ' + str(len(df.loc[df['Year'] == 1841])) \
      + ' points in 1841\n' + 'and ' \
      + str(len(df.loc[df['Year'] == 2010])) \
      + ' points in 2010\n' + 'and ' \
      + str(len(df.loc[df['Year'] == 1902])) \
      + ' points in 1902')

# seeing there are different numbers of points, let's do a quick chart
fig, ax = plt.subplots()
ax.plot(df['Year'].unique(), \
        [len(df.loc[df['Year'] == i]) for i in df['Year'].unique()])
plt.show()
```

The output will be as follows:

We see varying numbers of points per year. Also note that we don't have the information on exactly when in each year the various points were measured. If that were important, we would want to ask the appropriate business stakeholder if the information could be obtained.

- Let's slice the DataFrame to remove all rows through 1901, as we can see that there is much less data in those years:

```python
# slice 1902 and forward
df = df.loc[df.Year > 1901]
df.head()
```

The output will be as follows:

- Make a quick plot to visualize the data:

```python
# quick plot to understand what we have so far
fig, ax = plt.subplots()
ax.scatter(df.Year, df.RgnAvTemp)
plt.show()
```

The output will be as follows:

- We can see that there is quite a range for each year. Group the data by year and use the **agg** method of the DataFrame to create annual averages. This works around the issue that we have multiple points at unknown dates in each year, but uses all the data:

```python
# roll up by year
df_group_year = (df.groupby('Year').agg('mean')\
                 .rename(columns = {'RgnAvTemp' : 'AvgTemp'}))
print(df_group_year.head())
print(df_group_year.tail())
```

The output will be as follows:

As before, perform a quick visualization, as follows:

```python
# visualize result of averaging over each year
fig, ax = plt.subplots()
ax.scatter(df_group_year.index, df_group_year['AvgTemp'])
plt.show()
```

The data will now appear as follows:

- Given that the data is still noisy, a moving average filter can provide a useful indicator of the overall trend. A moving average filter simply computes the average over the last *N* values and assigns this average to the *N*th sample. Compute the values for a moving average signal for the temperature measurements using a window of 10 years:

```python
window = 10
smoothed_df = \
    pd.DataFrame(df_group_year.AvgTemp.rolling(window).mean())
smoothed_df.columns = ['AvgTemp']
print(smoothed_df.head(14))
print(smoothed_df.tail())
```

We will obtain the following output:

Notice that the first 9 samples are **NaN**, which is because of the size of the moving average filter window. With a window size of 10, a full 10 samples are required to generate the first average, so the first 9 (10 - 1) outputs are **NaN**. There are additional options to the **rolling()** method that can extend the values to the left or right, or allow the early values to be based on fewer points. In this case, we'll just filter them out:

```python
# filter out the NaN values
smoothed_df = smoothed_df[smoothed_df['AvgTemp'].notnull()]

# quick plot to understand what we have so far
fig, ax = plt.subplots()
ax.scatter(smoothed_df.index, smoothed_df['AvgTemp'])
plt.show()
```

The output will be as follows:

- Finally, plot the measurements by year along with the moving average signal:

```python
fig = plt.figure(figsize=(10, 7))
ax = fig.add_axes([1, 1, 1, 1])

# Raw data
raw_plot_data = df[df.Year > 1901]
ax.scatter(raw_plot_data.Year, \
           raw_plot_data.RgnAvTemp, \
           label = 'Raw Data', c = 'blue', s = 1.5)

# Annual averages
annual_plot_data = df_group_year\
                   .filter(items = smoothed_df.index, axis = 0)
ax.scatter(annual_plot_data.index, \
           annual_plot_data.AvgTemp, \
           label = 'Annual average', c = 'k')

# Moving averages
ax.plot(smoothed_df.index, smoothed_df.AvgTemp, \
        c = 'r', linestyle = '--', \
        label = f'{window} year moving average')
ax.set_title('Mean Air Temperature Measurements', fontsize = 16)

# make the ticks include the first and last years
tick_years = [1902] + list(range(1910, 2011, 10))
ax.set_xlabel('Year', fontsize = 14)
ax.set_ylabel(r'Temperature ($^\circ$C)', fontsize = 14)
ax.set_xticks(tick_years)
ax.tick_params(labelsize = 12)
ax.legend(fontsize = 12)
plt.show()
```

The output will be as follows:

- We can improve the plot by focusing on the part we are most interested in, the annual average values, by adjusting the *y* scale. This is an important aspect of most visualizations, in that the scale should be optimized to convey the most information to the reader:

```python
fig = plt.figure(figsize=(10, 7))
ax = fig.add_axes([1, 1, 1, 1])

# Raw data
raw_plot_data = df[df.Year > 1901]
ax.scatter(raw_plot_data.Year, raw_plot_data.RgnAvTemp, \
           label = 'Raw Data', c = 'blue', s = 1.5)

# Annual averages
annual_plot_data = df_group_year\
                   .filter(items = smoothed_df.index, axis = 0)
ax.scatter(annual_plot_data.index, annual_plot_data.AvgTemp, \
           label = 'Annual average', c = 'k')

# Moving averages
ax.plot(smoothed_df.index, smoothed_df.AvgTemp, c = 'r', \
        linestyle = '--', \
        label = f'{window} year moving average')
ax.set_title('Mean Air Temperature Measurements', fontsize = 16)

# make the ticks include the first and last years
tick_years = [1902] + list(range(1910, 2011, 10))
ax.set_xlabel('Year', fontsize = 14)
ax.set_ylabel(r'Temperature ($^\circ$C)', fontsize = 14)
ax.set_ylim(17, 20)
ax.set_xticks(tick_years)
ax.tick_params(labelsize = 12)
ax.legend(fontsize = 12)
plt.show()
```

The final plot should appear as follows:

Looking at *Figure 3.11*, we can immediately make a few interesting observations. First, the temperature remained relatively consistent from the year 1902 to about 1950, after which there is an increasing trend through to the end of the data. Second, there is scatter or noise in the measurements, even after averaging within each year. Third, there appears to be a shift at 1960, which might represent a change in measurement methods or some other factor; we might want to follow up with the business team to understand this more fully.

Finally, note that the moving average values tend to lag the raw data, appearing shifted to the right, during periods in which there are trends. This is a direct result of the default parameters of the **rolling()** method; each moving average value is the average of the 9 points to the left plus the current point.
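The **rolling()** options mentioned earlier can be illustrated on toy data (the values below are chosen purely for illustration). The default produces a trailing window, while **center** and **min_periods** change how the edges are handled:

```python
# Comparing rolling() window options on a tiny series
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

trailing = s.rolling(3).mean()                 # default: current point plus 2 to its left
centered = s.rolling(3, center=True).mean()    # window centered on each point
partial  = s.rolling(3, min_periods=1).mean()  # early values use fewer points

print(trailing.tolist())  # [nan, nan, 2.0, 3.0, 4.0]
print(centered.tolist())  # [nan, 2.0, 3.0, 4.0, nan]
print(partial.tolist())   # [1.0, 1.5, 2.0, 3.0, 4.0]
```

A centered window removes the lag seen in the exercise plot, at the cost of **NaN** values at both ends of the series.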

Note

To access the source code for this specific section, please refer to https://packt.live/316S0o6.

You can also run this example online at https://packt.live/2CmpJPZ. You must execute the entire Notebook in order to get the desired result.

## Activity 3.01: Plotting Data with a Moving Average

For this activity, we have acquired a dataset of weather information from Austin, Texas (**austin_weather.csv**), available in the accompanying source code, and will be looking at the changes in average daily temperature. We will plot a moving average filter for this dataset.

Note

The original dataset can be found here: https://www.kaggle.com/grubenm/austin-weather

The steps to be performed are as follows:

- Import **pandas** and **matplotlib.pyplot**.
- Load the dataset into a pandas DataFrame from the CSV file.
- We only need the **Date** and **TempAvgF** columns; remove all others from the dataset.
- Initially, we will only be interested in the first year's data, so we need to extract that information only. Create a column in the DataFrame for the year value, extract the year value as an integer from the strings in the **Date** column, and assign these values to the **Year** column.
- Repeat this process to extract the month values and store the values as integers in a **Month** column.
- Repeat this process one more time to store the day values as integers in a **Day** column.
- Copy the first year's worth of data to a DataFrame.
- Compute a 20-day moving average filter.
- Plot the raw data and moving average signal, with the *x* axis being the day number in the year. The output should be as follows:

Note

The solution for this activity can be found via this link.

You have learned how to load data from a CSV file, how to remove columns that are not required, how to extract information from text fields containing dates as strings, how to smooth data using a moving average, and how to visualize the results.

# Linear Regression

We will start our investigation into regression models with the selection of a linear model. Linear models, while being a great first choice due to their intuitive nature, can also be very powerful in terms of predictive ability, assuming the dataset contains some degree of linear or polynomial relationship between the input features and output values. The intuitive nature of linear models often arises from the ability to view data as plotted on a graph and observe a trending pattern in the data with, say, the output (the *y*-axis value for the data) trending positively or negatively with the input (the *x*-axis value). The fundamental components of linear regression models are also often learned during high school mathematics classes. You may recall that the equation of a straight line is defined as follows:

*y* = *β*0 + *β*1*x*

Here, *x* is the input value and *y* is the corresponding output or predicted value. The parameters of the model are the slope of the line (the change in the *y* values divided by the change in *x*, also called the gradient), denoted by *β*1 in the equation, and the *y*-intercept value, *β*0, which indicates where the line crosses the *y* axis. With such a model, we can provide values for the *β*0 and *β*1 parameters to construct a linear model.

For example, *y = 1 + 2 * x* has a slope of 2, indicating that the *y* values change at twice the rate of *x*; the line crosses the *y* axis at 1, as you can see in the following diagram:

So, we have an understanding of the parameters that are required to define a straight line, but this isn't really doing anything particularly interesting. We just dictated the parameters of the model to construct a line. What we want to do is take a dataset and construct a model that best describes a dataset. In terms of the previous section, we want to choose the model architecture as a linear model, and then train the model to find the best values of *β*0 and *β*1. As mentioned before, this dataset needs to have something that approximates a linear relationship between the input features and output values for a linear model to be a good choice.

## Least Squares Method

Many of the various techniques used in machine learning actually significantly pre-date the use of machine learning as a description. Some embody elements of statistics, and others have been used in the sciences to "fit" data for a very long time. The least squares method of finding the equation of a straight line that best represents a set of data is one of these, originally created in the early 1800s. The method can be used to illustrate many key ideas of the supervised learning of regression models, and so we'll start with it here.

The least squares method focuses on minimizing the square of the error between the predicted *y* values and the actual *y* values. The idea of minimizing an error is fundamental in machine learning and is the basis for essentially all learning algorithms.

Although simple linear regression using the least squares method can be written down as simple algebraic expressions, most packages (like scikit-learn) will have more general optimization methods "under the hood."
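As a sketch of what such a package does under the hood, the closed-form least squares solution can be computed directly with NumPy. The data here is hypothetical: noise added to a known line, so we can check that the recovered parameters are close to the true ones:

```python
# Closed-form least squares via the normal equations:
# beta = (X^T X)^(-1) X^T y
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 1.0 + 2.0 * x + rng.normal(0, 0.5, size=x.size)  # known line plus noise

# Design matrix: a column of ones for the intercept, then x
X = np.column_stack([np.ones_like(x), x])

# Solve (X^T X) beta = X^T y rather than inverting explicitly
beta = np.linalg.solve(X.T @ X, X.T @ y)
intercept, slope = beta
print(intercept, slope)   # close to the true values 1.0 and 2.0
```

Solving the normal equations works well for small problems like this; production libraries use numerically safer routines (such as QR or SVD factorizations) that give the same answer.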

## The Scikit-Learn Model API

The scikit-learn API uses a similar code pattern irrespective of the type of model being constructed. The general flow is:

- Import the class for the model type you want to use. Here, we will use **from sklearn.linear_model import LinearRegression**.
- Instantiate an instance of the model class. This is where hyperparameters are set. For simple linear regression, we can use the defaults.
- Use the **fit** method with the *x* and *y* data we want to model.
- Inspect the results, get metrics, and then visualize them.
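Applied to tiny synthetic data (a hypothetical perfect line, not the chapter's dataset), the four steps look like this:

```python
# 1. Import the model class
import numpy as np
from sklearn.linear_model import LinearRegression

# 2. Instantiate it (defaults are fine for simple linear regression)
model = LinearRegression()

# 3. Fit it to x and y data; scikit-learn expects x as a 2D array
x = np.arange(10).reshape(-1, 1)
y = 1.0 + 2.0 * x.ravel()
model.fit(x, y)

# 4. Inspect the results and get metrics
print(model.coef_[0], model.intercept_)   # recovered slope and intercept
print(model.score(x, y))                  # r squared; 1.0 for a perfect fit
```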

Let's use this workflow to create a linear regression model in the next exercise.

## Exercise 3.02: Fitting a Linear Model Using the Least Squares Method

In this exercise, we will construct our first linear regression model using the least squares method to visualize the air temperatures over a yearly timeframe and evaluate the performance of the model using evaluation metrics:

Note

We will be using the same **synth_temp.csv** dataset as in *Exercise 3.01: Plotting Data with a Moving Average*.

- Import the **LinearRegression** class from the **linear_model** module of scikit-learn, along with the other packages we need:

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
```

- Load the data. For this exercise, we'll use the same synthetic temperature data as was used previously:

```python
# load the data
df = pd.read_csv('../Datasets/synth_temp.csv')
```

- Repeat the preprocessing of the data from before:

```python
# slice 1902 and forward
df = df.loc[df.Year > 1901]

# roll up by year
df_group_year = df.groupby(['Year']).agg({'RgnAvTemp' : 'mean'})
df_group_year.head(12)

# add the Year column so we can use that in a model
df_group_year['Year'] = df_group_year.index
df_group_year = \
    df_group_year.rename(columns = {'RgnAvTemp' : 'AvTemp'})
df_group_year.head()
```

The data should appear as follows:

- Instantiate the **LinearRegression** class. Then, we can fit the model using our data. Initially, we will just fit the temperature data to the years. In the following code, note that the **fit** method requires the *x* data to be a 2D array and that we are passing only the year. We also need to use the **reshape** method; in the **(-1, 1)** parameters, **-1** means that "the value is inferred from the length of the array and remaining dimensions":

```python
# construct the model and inspect results
linear_model = LinearRegression(fit_intercept = True)
linear_model.fit(df_group_year['Year'].values.reshape((-1, 1)), \
                 df_group_year.AvTemp)
print('model slope = ', linear_model.coef_[0])
print('model intercept = ', linear_model.intercept_)
r2 = linear_model.score(df_group_year['Year']\
                        .values.reshape((-1, 1)), \
                        df_group_year.AvTemp)
print('r squared = ', r2)
```

Note

Refer to the following link for more reading on scikit-learn: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

The output will be as follows:

Note the use of the **score** method, which is a method of the **model** object, to obtain the r2 value. This metric, called the coefficient of determination, is a widely used metric for linear regression. The closer r2 is to 1, the more closely our model predicts the data. There are multiple formulas that can be used to compute r2. Here is an example:

r2 = 1 - Σ(*yi* - *ŷi*)² / Σ(*yi* - *ȳ*)²

where *ŷi* denotes the model's prediction for sample *i* and *ȳ* is the mean of the actual values. From *Figure 3.17*, you can get some understanding of r2 by noting that the numerator sums the errors from the predictions, while the denominator sums the variation of the data from the mean. Thus, r2 increases as the prediction errors get smaller. It's important to emphasize here that r2 is just a measure of "goodness of fit"—in this case, how well a simple straight line fits the given data. In more complex, real-world supervised learning problems, we would use a more robust approach to optimize the model and choose the best/final model. In particular, we generally evaluate the model on data *not* used to train it, because evaluating it on the training data would give an overly optimistic measure of performance. This will be discussed in *Chapter 7*, *Model Evaluation*.

- To visualize the results, we need to pass some data to the **predict** method of the model. A simple way to do that is to just reuse the data we used to fit the model:

```python
# generate predictions for visualization
pred_X = df_group_year.loc[:, 'Year']
pred_Y = linear_model.predict(df_group_year['Year']\
                              .values.reshape((-1, 1)))
```

- Now, we have everything we need to visualize the result:

```python
fig = plt.figure(figsize=(10, 7))
ax = fig.add_axes([1, 1, 1, 1])

# Raw data
raw_plot_data = df[df.Year > 1901]
ax.scatter(raw_plot_data.Year, raw_plot_data.RgnAvTemp, \
           label = 'Raw Data', c = 'red', s = 1.5)

# Annual averages
ax.scatter(df_group_year.Year, df_group_year.AvTemp, \
           label = 'Annual average', c = 'k', s = 10)

# linear fit
ax.plot(pred_X, pred_Y, c = "blue", linestyle = '-.', \
        linewidth = 4, label = 'linear fit')
ax.set_title('Mean Air Temperature Measurements', fontsize = 16)

# make the ticks include the first and last years
tick_years = [1902] + list(range(1910, 2011, 10))
ax.set_xlabel('Year', fontsize = 14)
ax.set_ylabel(r'Temperature ($^\circ$C)', fontsize = 14)
ax.set_ylim(15, 21)
ax.set_xticks(tick_years)
ax.tick_params(labelsize = 12)
ax.legend(fontsize = 12)
plt.show()
```

The output will be as follows:

From *Figure 3.18*, it's evident that a straight line isn't a very good model of the data. We'll return to this issue after an activity.

Note

To access the source code for this specific section, please refer to https://packt.live/2NwANg1.

You can also run this example online at https://packt.live/2Z1qQfT. You must execute the entire Notebook in order to get the desired result.

We have seen how to load in some data, import the **LinearRegression** class from scikit-learn, and use the **fit**, **score**, and **predict** methods to construct a model, look at a performance metric, and then visualize the results. Along the way, we introduced the least squares method, gave some of the mathematical background, and showed how some of the calculations work.

We saw that for our synthetic temperature data, a linear model doesn't fit the data all that well. That's okay. In most cases, it is good practice to generate a baseline model early on in the project to serve as a benchmark against which the performance of more sophisticated models can be compared. So we can consider the linear model we developed here to be a naïve baseline model.

Before continuing, it is important to note that when reporting the performance of machine learning models, the data used to train the model is *not* to be used to evaluate it, as it will give an optimistic view of the model's performance. We will cover the concept of validation, which includes evaluating and reporting model performance, in *Chapter 7*, *Model Evaluation*. For the purpose of this chapter, however, we will use the training data to check the model's performance; just remember that once you have completed *Chapter 7*, *Model Evaluation*, you will know better.
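As a quick sanity check on the metric itself, r2 can be computed directly from its definition and compared with scikit-learn's **score** method. The data below is hypothetical, generated solely for this check:

```python
# Computing r squared by hand and via the score method
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 30).reshape(-1, 1)
y = 3.0 + 1.5 * x.ravel() + rng.normal(0, 1.0, size=30)

model = LinearRegression().fit(x, y)
y_pred = model.predict(x)

# r2 = 1 - (sum of squared prediction errors)
#        / (sum of squared deviations from the mean)
ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, model.score(x, y))  # the two values agree
```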

## Activity 3.02: Linear Regression Using the Least Squares Method

For this activity, we will use the Austin, Texas weather dataset that we used in the previous activity. We will plot a linear regression model using the least squares method for the dataset.

The steps to be performed are as follows:

- Import the requisite packages and classes. Refer to *Exercise 3.02: Fitting a Linear Model Using the Least Squares Method* if necessary.
- Load the data from the CSV file (**austin_weather.csv**).
- Inspect the data (using the **head()** and **tail()** methods). The output for **df.head()** will be as follows:

The output for **df.tail()** will be as follows:

- Drop everything except the **Date** and **TempAvgF** columns.
- Create new **Year**, **Month**, and **Day** columns and populate them by parsing the **Date** column.
- Create a new column for a moving average and populate it with a 20-day moving average of the **TempAvgF** column.
- Slice one complete year of data to use in a model. Ensure the year doesn't have missing data due to the moving average. Also, create a **Day_of_Year** column (it should start at 1).
- Create a scatterplot of the raw data (the original **TempAvgF** column) and overlay it with a line for the 20-day moving average. The plot will be as follows:

- Create a linear regression model using the default parameters, that is, calculate a *y* intercept for the model and do not normalize the data.
- Now fit the model, where the input data is the day number in the year (1 to 365) and the output is the average temperature. Print the parameters of the model and the r2 value.
The results should be as follows:

```
model slope: [0.04304568]
model intercept: 62.23496914044859
model r squared: 0.09549593659736466
```

- Generate predictions from the model using the same *x* data.
- Create a new scatterplot, as before, adding an overlay of the predictions of the model.

Note

The solution for this activity can be found via this link.

Building on the previous exercise, you have seen how to load and use the **LinearRegression** class from scikit-learn and the **fit**, **score**, and **predict** methods. Not surprisingly, a simple linear model that produces a straight line isn't the best model for this data. In later exercises, we will investigate ways in which we might address that.
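The fit/score/predict pattern used in this activity can be sketched as follows. This is a minimal example on synthetic day-of-year data (generated to loosely resemble the results above, not the actual **austin_weather.csv** values):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# synthetic stand-in for one year of average temperatures
rng = np.random.default_rng(42)
day_of_year = np.arange(1, 366).reshape(-1, 1)   # scikit-learn expects a 2D X
temps = 62.0 + 0.043 * day_of_year.ravel() + rng.normal(0, 5, 365)

model = LinearRegression(fit_intercept=True)
model.fit(day_of_year, temps)           # fit: learn the slope and intercept
r2 = model.score(day_of_year, temps)    # score: r squared on the training data
preds = model.predict(day_of_year)      # predict: values for the overlay plot
print('slope:', model.coef_, 'intercept:', model.intercept_, 'r2:', r2)
```

Note the `reshape(-1, 1)`: with a single feature, scikit-learn still requires the input as an *N x 1* array rather than a flat vector.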

You have learned how to load data, structure it for the scikit-learn API, and use the **LinearRegression** class to fit a simple line to the data. It is evident that this is a poor model for this data, so we will explore ways to improve our model, beginning with the next topic, *Linear Regression with Categorical Variables*.

## Linear Regression with Categorical Variables

There is an aspect of the model architecture selection phase that somewhat overlaps the data preparation phase: feature engineering. Broadly, feature engineering involves creating additional features (columns, in our case) and using them in the model to improve its performance. Features may be engineered by transforming existing features (for example, taking the logarithm or square root) or may be generated in some way and added to the dataset. As an example of the latter, we can extract the month, the day of the month, the day of the week, and so on from the date information in a dataset. Although a new feature such as the month could be a numeric value, in the majority of supervised learning cases, simply using the numeric value of such a feature is not best practice. To see why, consider the following: if we code January to December as 1 to 12, a model might give more weight to December since it is 12 times larger than January. Also, when the date changes from December back to January, there would be an artificial step change in the value. Thus, such a feature is considered to be nominal categorical. Nominal categorical variables are features with multiple possible values where the ordering of the values contains no information, and could even be misleading. There are also categorical variables that do have an implied order, which are called ordinal categorical variables. Examples include "tiny," "small," "medium," "large," "extra-large," and "huge."

To handle either type of categorical data in most machine learning models, we still have to convert it to numbers. The general approach to such a conversion is called encoding. A very powerful but easy to understand encoding method is to convert a categorical feature using **one-hot encoding**.

When using one-hot encoding, each possible value of the categorical feature becomes a column. In the column corresponding to a given value, a **1** is entered if that instance of data had the feature at that value, otherwise, a **0** is entered. An example will make this much clearer, as seen in the following figure:

So, by creating these columns and inserting the ones in the appropriate locations, we let the model "know" about the presence of the nominal categorical variable, but don't give extra weight to any given value. In the example in *Figure 3.23*, if we were trying to model dog life expectancy, before using one-hot encoding, we only had diet and weight as predictors. After applying one-hot encoding, we would expect to get a better model, since our intuition would be that some breeds live longer than others, all other factors being equal. In the following exercise, we'll see how to use encoding to leverage the power of linear models to model complex behavior.

Note

There are a number of other possible ways to encode categorical variables; see, for example, *A Comparative Study of Categorical Variable Encoding Techniques for Neural Network Classifiers*: https://pdfs.semanticscholar.org/0c43/fb9cfea23e15166c58e24106ce3605b20229.pdf

In some cases, the best method may depend on the type of model being used. For example, linear regression has a requirement that none of the features are linearly dependent on any others (we will discuss this further later in this chapter). One-hot encoding actually introduces this problem, because the nth category can actually be determined from the other *n-1* categories—intuitively, in *Figure 3.23*, if beagle, boxer, chihuahua, collie, and german shepherd are all 0, then miniature dachshund will be 1 (note that we are assuming that an instance may not have more than one valid category). Thus, in linear regression, we use a slightly different encoding called dummy variables. The only difference between dummy variables and one-hot encoding is that we drop one of the **n** columns to eliminate the dependence.
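To make the distinction concrete, here is a small sketch using pandas' **get_dummies**. The breed values are a hypothetical toy example (not the *Figure 3.23* data); note how **drop_first=True** removes the first category's column to produce dummy variables:

```python
import pandas as pd

# hypothetical toy data for illustration (not the Figure 3.23 dataset)
pets = pd.DataFrame({'breed': ['beagle', 'boxer', 'collie', 'beagle']})

one_hot = pd.get_dummies(pets.breed)                   # one column per category
dummies = pd.get_dummies(pets.breed, drop_first=True)  # drops the 'beagle' column

print(one_hot.columns.tolist())    # ['beagle', 'boxer', 'collie']
print(dummies.columns.tolist())    # ['boxer', 'collie']
```

In the dummy-variable version, a row of all zeros unambiguously means "beagle," so no information is lost by dropping the column, while the linear dependence is removed.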

## Exercise 3.03: Introducing Dummy Variables

In this exercise, we will introduce dummy variables into our linear regression model to improve its performance.

We will be using the same **synth_temp** dataset as was used in the previous exercise:

- Import the required packages and classes:

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
```

- Load the data:

```python
# load data
df = pd.read_csv('../Datasets/synth_temp.csv')
```

- Slice the DataFrame from 1902 onward, and then compute yearly averages:

```python
# slice 1902 and forward
print(df.head())
df = df.loc[df.Year > 1901]
print(df.head())
```

The output will be as follows:

```python
# roll up by year
df_group_year = df.groupby(['Year', 'Region'])\
                  .agg({'RgnAvTemp':'mean'})
"""
note that the .droplevel() method removes the multiindex
added by the .agg() method, to make things simpler
later on in our analysis
"""
print(df_group_year.head(12))
print(df_group_year.tail(12))
```

The data should appear as follows:

- Add a **Year** column using index level **0** (which is in calendar years), and a **Region** column using index level **1**:

```python
# add the region column so we can use that for dummy variables
df_group_year['Region'] = df_group_year.index.get_level_values(1)
# add the Year column so we can use that in a model
df_group_year['Year'] = df_group_year.index.get_level_values(0)
# reset the index on the long axis
df_group_year = df_group_year.droplevel(0, axis = 0)
df_group_year = df_group_year.reset_index(drop = True)
```

- Perhaps the temperature levels or variation differ by region. Let's look at the overall average temperatures for each region:

```python
# inspect data by region
region_temps = df_group_year.groupby('Region').agg({'RgnAvTemp':'mean'})
colors = ['red', 'green', 'blue', 'black', 'lightcoral', \
          'palegreen', 'skyblue', 'lightslategray', 'magenta', \
          'chartreuse', 'lightblue', 'olive']
fig = plt.figure(figsize=(10, 7))
ax = fig.add_axes([1, 1, 1, 1])
ax.bar(region_temps.index, region_temps.RgnAvTemp, \
       color = colors, alpha = 0.5)
ax.set_title('Mean Air Temperature Measurements', fontsize = 16)
ax.set_xlabel('Region', fontsize = 14)
ax.set_ylabel('Temperature ($^\circ$C)', fontsize = 14)
ax.tick_params(labelsize = 12)
plt.show()
```

The result should appear as follows:

We see that, on average, the regions vary from one another by as much as 5 degrees. Thus, it might benefit the model to take the region into account. To do that, we will create dummy variables from the **Region** column.

- Pandas provides the **get_dummies()** function, which we can use for our needs. First, we create a new DataFrame with the new columns; note that they are already populated with zeros and ones. We then concatenate the dummy variable columns to our data (the **Region** column is now redundant for modeling, but we keep it for later plotting):

```python
# convert the categorical variable 'region' to dummy vars
dummy_cols = pd.get_dummies(df_group_year.Region, \
                            drop_first = True)
df_group_year = pd.concat([df_group_year, dummy_cols], axis = 1)
print(df_group_year.head())
print(df_group_year.tail())
```

The result should be as follows:

Note that in the **get_dummies** call, we set the **drop_first = True** parameter to remove one of the columns, as discussed earlier.

- We now create a linear model, as before, using the **Year** column and all the dummy columns:

```python
linear_model = LinearRegression(fit_intercept = True)
linear_model.fit(df_group_year.loc[:, 'Year':'L'], \
                 df_group_year.RgnAvTemp)
r2 = linear_model.score(df_group_year.loc[:, 'Year':'L'], \
                        df_group_year.RgnAvTemp)
print('r squared ', r2)
```

The output will be as follows:

```
r squared 0.7778768442731825
```

- The r2 value is much higher than before, which looks promising. Generate predictions from the DataFrame with the dummy variables, and then visualize everything on a plot:

```python
# construct data to predict from model
pred_X = df_group_year.drop(['RgnAvTemp', 'Region'], axis = 1)
pred_Y = linear_model.predict(pred_X.values)
preds = pd.concat([df_group_year.RgnAvTemp, \
                   df_group_year.Region, \
                   pred_X, pd.Series(pred_Y)], axis = 1)
preds.rename(columns = {0 : 'pred_temp'}, inplace = True)
print(preds.head())
```

The data should appear as follows:

- For plotting, we'll reduce the clutter by sampling from the predictions:

```python
# define a sample of the raw data and predictions
# set a seed so results are repeatable
np.random.seed(42)
plot_data = preds.sample(n = 100)
fig = plt.figure(figsize=(10, 7))
ax = fig.add_axes([1, 1, 1, 1])
# Raw data
raw_plot_data = plot_data
ax.scatter(raw_plot_data.Year, raw_plot_data.RgnAvTemp, \
           label = 'Raw Data', c = 'red', s = 1.5)
# Annual averages
annual_plot_data = df_group_year.groupby('Year').agg('mean')
ax.scatter(annual_plot_data.index, annual_plot_data.RgnAvTemp, \
           label = 'Annual average', c = 'k', s = 10)
```

- Let's also visualize the linear fit results:

```python
fit_data = plot_data
for i in range(len(plot_data.Region.unique())):
    region = plot_data.Region.unique()[i]
    plot_region = fit_data.loc[fit_data.Region == region, :]
    ax.scatter(plot_region.Year, plot_region.pred_temp, \
               edgecolor = colors[i], facecolor = "none", \
               s = 80, label = region)
# draw faint lines connecting the raw to the predicted
for i in fit_data.index:
    ax.plot([fit_data.Year[i], fit_data.Year[i]], \
            [fit_data.pred_temp[i], fit_data.RgnAvTemp[i]], \
            '-', linewidth = 0.1, c = "red")
ax.set_title('Mean Air Temperature Measurements', fontsize = 16)
# make the ticks include the first and last years
tick_years = [1902] + list(range(1910, 2011, 10))
ax.set_xlabel('Year', fontsize = 14)
ax.set_ylabel('Temperature ($^\circ$C)', fontsize = 14)
ax.set_ylim(15, 21)
ax.set_xticks(tick_years)
ax.tick_params(labelsize = 12)
ax.legend(fontsize = 12)
plt.show()
```

Note

For error-free execution, you should run the cell only after writing the code for both *steps 9* and *10*.

The plot should appear as follows:

What we can see is that the model is predicting different levels for different regions, which, while still not wholly following the trend, accounts for much more of the variation than before.

- Let's now finish by plotting just one region to get a feel for how well the model works:

```python
# let's plot just one region
region_B = preds.loc[preds.B == 1, :]
np.random.seed(42)
plot_data = region_B.sample(n = 50)
fig = plt.figure(figsize=(10, 7))
ax = fig.add_axes([1, 1, 1, 1])
# Raw data
ax.scatter(plot_data.Year, plot_data.RgnAvTemp, \
           label = 'Raw Data', c = 'red', s = 1.5)
ax.scatter(plot_data.Year, plot_data.pred_temp, \
           label = "Predictions", facecolor = "none", \
           edgecolor = "blue", s = 80)
```

- Draw faint lines connecting the raw to the predicted values:

```python
for i in plot_data.index:
    ax.plot([plot_data.Year[i], plot_data.Year[i]], \
            [plot_data.pred_temp[i], plot_data.RgnAvTemp[i]], \
            '-', linewidth = 0.1, c = "red")
# make the ticks include the first and last years
tick_years = [1902] + list(range(1910, 2011, 10))
ax.set_xlabel('Year', fontsize = 14)
ax.set_ylabel('Temperature ($^\circ$C)', fontsize = 14)
ax.set_ylim(16, 21)
ax.set_xticks(tick_years)
ax.tick_params(labelsize = 12)
ax.legend(fontsize = 12)
plt.show()
```

Note

For error-free execution, you should run the cell only after writing the code for both *steps 11* and *12*.

The result should be as follows:

Note

To access the source code for this specific section, please refer to https://packt.live/2YogxDt.

You can also run this example online at https://packt.live/311LDCx. You must execute the entire Notebook in order to get the desired result.

We now have an improved model that follows much of the variation in the data. However, we can see that we have still not captured the change in trend that is apparent around 1960. To address that, we'll explore using linear regression to fit a model using powers of the *x* data, known as a polynomial model. First, you will practice using dummy variables with the Austin temperature dataset.

## Activity 3.03: Dummy Variables

For this activity, we will use the Austin, Texas, weather dataset that we used in the previous activity. In this activity, we will use dummy variables to enhance our linear regression model for this dataset.

The steps to be performed are as follows:

- Load the **LinearRegression** class from scikit-learn, along with the **fit**, **score**, and **predict** methods, as well as **pandas** and **matplotlib.pyplot**.
- Load the **austin_weather.csv** dataset, drop all but the **Date** and **TempAvgF** columns, and create **Year**, **Month**, and **Day** columns from the data.
- Create a 20-day moving average column and populate it, and then slice the first complete year of data (days 1 through 365; this will be the year 2015). After slicing, reset the index of the DataFrame (using the pandas **reset_index** method). Now, create a **Day_of_Year** column and populate it (keep in mind that the first day should be **1**, not **0**).
- Plot the raw data and moving average against **Day_of_Year**. The plot should appear as follows:

Now, investigate whether adding the month to the model could improve it. To do this, create a **dummy_vars** DataFrame using the pandas **get_dummies** method on the **Month** column of the DataFrame, and then rename the dummy columns **Jan** through **Dec**. Now, concatenate **dummy_vars** to the DataFrame in a new DataFrame called **df_one_year**. Display the DataFrame and confirm that the dummy columns are present.

- Use a least squares linear regression model and fit the model to the **Day_of_Year** values and the dummy variables to predict **TempAvgF**.
- Get the model parameters and the r2 value. The r2 value should be much larger than in the previous activity.
- Using the **Day_of_Year** values and the dummy variables, predict the temperature for the **df_one_year** data.
- Plot the raw data, the 20-day moving average, and the new prediction. The output will be as follows:

Note

The solution for this activity can be found via this link.

You have learned how to use the scikit-learn API with the **LinearRegression** class to fit data augmented with dummy variables. The **get_dummies** pandas method was used to generate additional variable columns encoding the months to improve the model. A useful property of the new model is that it accounts for the seasonal variation in temperature, as well as any overall trend. However, it is rather piecewise, which may or may not meet the business needs for prediction.
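The core of the activity, augmenting a day-of-year regression with month dummy columns, can be sketched as follows. The temperatures here are generated to mimic a seasonal cycle (they are not the Austin measurements), and the column names follow the activity's conventions:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# synthetic year of daily temperatures with a seasonal pattern
rng = np.random.default_rng(0)
dates = pd.date_range('2015-01-01', periods=365, freq='D')
temps_df = pd.DataFrame({'Day_of_Year': np.arange(1, 366),
                         'Month': dates.month})
temps_df['TempAvgF'] = (60
                        + 20 * np.sin(2 * np.pi * (temps_df.Day_of_Year - 100) / 365)
                        + rng.normal(0, 3, 365))

# encode the month as dummy variables and append them to the features
dummy_vars = pd.get_dummies(temps_df.Month, prefix='M', drop_first=True)
X = pd.concat([temps_df[['Day_of_Year']], dummy_vars], axis=1)

model = LinearRegression().fit(X, temps_df.TempAvgF)
print('r squared:', model.score(X, temps_df.TempAvgF))
```

Because the dummy columns let each month have its own offset, the fit captures the seasonal shape far better than **Day_of_Year** alone, which mirrors the jump in r2 you should see in the activity.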

At this stage, you should be comfortable with basic linear regression as well as using the scikit-learn interface and the **get_dummies** pandas method. We have also touched on a few visualization points and introduced the idea of feature engineering in the context of using dummy variables. We will now move on to polynomial regression, which takes feature engineering in a different direction while still leveraging the power of linear regression.

## Polynomial Models with Linear Regression

Linear regression models are not constrained to straight-line linear models. We can fit some more complicated models using the exact same techniques. In the synthetic temperature data, we can see an upward curve in the trend. Therefore, in addition to any overall (linear in time) trend, there may be a trend related to a positive power of time. If we build a model using integer powers of the independent variable, this is called polynomial regression. For powers up to 2, the equation would be as follows (note that we refer to the order of the polynomial as the highest power, so this is a polynomial of order 2):

*y* = *β*0 + *β*1*x* + *β*2*x*²

The addition of this squared term transforms the trendline from a straight line to one that has curvature. In general, polynomial models can be very powerful to fit given data, but they may not extrapolate very well outside the range of the data. This would be an example of overfitting and is especially true as the order of the polynomial increases.

Therefore, in general, you should make limited use of polynomial regression and keep the order low, unless there is a clear business case or a known underlying model that indicates doing otherwise:

Note that here, we are looking at a simple model where there is only one feature, the variable *x*. In more complex cases, we might have several features. The number of terms in the equation will increase quickly as the number of features increases. To construct a polynomial, regression packages such as scikit-learn offer methods to automatically generate polynomial features (for example, **sklearn.preprocessing.PolynomialFeatures**). Here, we will build a simple polynomial model manually to illustrate the approach.
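As a brief illustration of the automated option mentioned above, scikit-learn's **PolynomialFeatures** generates the power (and cross-term) columns for you. The numbers here are made up purely to show the transformation:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0], [3.0]])                 # one feature, two samples
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)               # columns are x and x**2
print(X_poly)                                # [[2. 4.] [3. 9.]]
```

With more than one input feature, the same call would also produce the cross-terms (such as *x*1*x*2), which is why the number of columns grows quickly with degree and feature count.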

## Exercise 3.04: Polynomial Models with Linear Regression

In order to fit a polynomial model using linear regression, we need to create the features raised to the desired powers. Recall the earlier discussion on feature engineering; we are going to engineer new features, which are the original independent variables raised to a power:

- Beginning with the **synth_temp.csv** data, load the packages and classes, and then preprocess the data as before:

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
# load the data
df = pd.read_csv('../Datasets/synth_temp.csv')
# slice 1902 and forward
df = df.loc[df.Year > 1901]
# roll up by year
df_group_year = df.groupby(['Year']).agg({'RgnAvTemp' : 'mean'})
```

- Now, we add the **Year** column using the index, and then calculate a **Year2** column by raising the **Year** column to the power of 2:

```python
# add the Year column so we can use that in a model
df_group_year['Year'] = df_group_year.index
df_group_year = df_group_year\
                .rename(columns = {'RgnAvTemp' : 'AvTemp'})
# add a Year**2 column to build a polynomial model of degree 2
df_group_year['Year2'] = df_group_year['Year']**2
print(df_group_year.head())
print(df_group_year.tail())
```

The result is as follows:

- Fit the data to the model. This time, we will need to provide two sets of values as the inputs to the model, **Year** and **Year2**, which is equivalent to passing *x* and *x*² to the polynomial equation. As we are providing two columns of data, we do not need to reshape the input data, as it will be provided as an *N x 2* array by default. The target *y* value remains the same:

```python
# construct the model and inspect results
linear_model = LinearRegression(fit_intercept = True)
linear_model.fit(df_group_year.loc[:, ['Year', 'Year2']], \
                 df_group_year.AvTemp)
print('model coefficients = ', linear_model.coef_)
print('model intercept = ', linear_model.intercept_)
r2 = linear_model.score(df_group_year.loc[:, ['Year', 'Year2']], \
                        df_group_year.AvTemp)
print('r squared = ', r2)
```

The output will be as follows:

```
model coefficients = [-1.02981369e+00 2.69257683e-04]
model intercept = 1002.0087338444181
r squared = 0.9313996496373635
```

- The model has improved on the dummy variable method, but let's visualize the results to see whether it is a more reasonable fit. First, generate predictions. Here, we take an additional step to extend the predictions out to the next 10 years to see whether those predictions appear reasonable. In most supervised learning problems, the end goal is to predict values for previously unknown data. As our model simply uses the **Year** and **Year2** variables, we can generate a list of year values for the next 10 years and then square them as before:

```python
# generate predictions for visualization
pred_X = df_group_year.loc[:, ['Year', 'Year2']]
pred_Y = linear_model.predict(pred_X)
# generate predictions for the next 10 years
pred_X_future = pd.DataFrame(list(range(2011, 2021)))\
                .rename(columns = {0 : 'Year'})
pred_X_future['Year2'] = pred_X_future['Year']**2
pred_Y_future = linear_model.predict(pred_X_future)
```

- Now, create a visualization:

```python
fig = plt.figure(figsize=(10, 7))
ax = fig.add_axes([1, 1, 1, 1])
# Raw data
raw_plot_data = df
ax.scatter(raw_plot_data.Year, raw_plot_data.RgnAvTemp, \
           label = 'Raw Data', c = 'red', s = 1.5)
# Annual averages
ax.scatter(df_group_year.Year, df_group_year.AvTemp, \
           label = 'Annual average', c = 'k', s = 10)
# linear fit
ax.plot(pred_X.Year, pred_Y, c = "blue", linestyle = '-.', \
        linewidth = 4, label = 'linear fit')
```

- Visualize the future predictions:

```python
ax.plot(pred_X_future.Year, pred_Y_future, c = "purple", \
        linestyle = '--', linewidth = 4, \
        label = 'future predictions')
ax.set_title('Mean Air Temperature Measurements', fontsize = 16)
# make the ticks include the first and last years
tick_years = [1902] + list(range(1910, 2021, 10))
ax.set_xlabel('Year', fontsize = 14)
ax.set_ylabel('Temperature ($^\circ$C)', fontsize = 14)
ax.set_ylim(15, 21)
ax.set_xticks(tick_years)
ax.tick_params(labelsize = 12)
ax.legend(fontsize = 12)
plt.show()
```

The result is as follows:

Note

To access the source code for this specific section, please refer to https://packt.live/3fSusaR.

You can also run this example online at https://packt.live/2BulmCd. You must execute the entire Notebook in order to get the desired result.

Referring to *Figure 3.36*, we can see the performance benefit of using the polynomial model, with the trendline almost following the 10-year moving average. This is a reasonably good fit given the amount of noise in the yearly average raw data. In such a case, it should not be expected that the model will fit the data perfectly. If our model were to perfectly fit the observed examples, there would be a very strong case for overfitting, leading to poor predictive power on unseen examples. As an example, imagine we fitted a 10th-order polynomial to this data. The resulting model would wiggle up and down and might be moving up or down very steeply at the last data point, which would lead to poor predictions. In this case, using order 2, we see that the future trend seems reasonable, although that remains to be demonstrated.

## Activity 3.04: Feature Engineering with Linear Regression

We have tried a standard linear model as well as a model including dummy variables. In this activity, we will create some periodic features to try and get a better fit for the data. Periodic features are derived from functions that repeat over some range of the independent variable. In *Figure 3.37*, we can see that the data at the beginning of the year is near the same values and, in between, the temperature increases and then decreases. This is intuitively reasonable because we know that in temperate climates, there is an annual temperature cycle. Thus, we might improve the model if we include features that are periodic on a time scale of 1 year. We can construct sine and cosine functions that have the desired behavior.

When fitting a model with engineered periodic features, we face an additional challenge to determine how to line up the periodic cycle of the features to the actual data. You can think of this as a time offset in this case, which we don't know a priori. You could also think of the offset as a hyperparameter—fitting the model does not give us the value so we have to find the best value in some other way. In the following diagram, the needed offset is Δt:

Fortunately, if we use sine and cosine functions for our features, there is a way to address this. It is a mathematical fact that any given sine function at a single period, such as the raw data in *Figure 3.37*, can be expressed as a linear combination of a sine and cosine function of the same period. In the next figure, we show a sine and cosine function, each with a period of 365 days, and the raw data from *Figure 3.37*. We also show a linear combination of the sine and cosine function that matches the raw data very well. Thus, to fit a sine (or cosine) function to our data, we simply need to engineer two features, one as the sine of the time, and the other as the cosine of the time. The linear regression will then find the best coefficients just like any other feature. Note that this implies we know the period:

The last thing we need to know is how to formulate the sine and cosine functions. For this, we can use the NumPy methods **sin** and **cos**. We know we want a period of 1 year, and our data is in days. A sine function with a 365-day period is written as sin(2π*t*/365), where *t* is the time in days.

In Python, the equivalent expression is np.sin(2 * np.pi * t / 365).

Now, let's proceed with the activity to use a periodic function to fit the Austin temperature data. The steps to be performed are as follows:

- Load the packages and classes (**numpy**, **pandas**, **LinearRegression**, and **matplotlib**).
- Perform the preprocessing as before, through to the step where the **Day_of_Year** column is created.
- Add a column for the sine of **Day_of_Year** and another for the cosine of **Day_of_Year**.
- Perform a linear regression of the average temperature versus **Day_of_Year** and the sine and cosine features.
- Print the parameters of the model and the r2 score.
- Generate predictions using the new features.
- Visualize the raw data and the new model. The output will be as follows:

You have by now learned that we can engineer features by applying functions to existing features, such as polynomial functions or periodic functions. You have seen how to construct periodic features using sine and cosine functions to fit an arbitrary sine or cosine curve. In this case, we assumed a period of 365 days, which is reasonable based on the annual weather cycle of temperate regions on Earth. In an actual business case, we may not be certain of the period, or there might be more than one cycle occurring at the same time (such as weekly, monthly, and quarterly cycles in sales). In addition, using functions such as polynomials and sines/cosines can easily lead to overfitting, resulting in very poor extrapolation.
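Under the assumed 365-day period, the sine/cosine feature construction can be sketched like this on generated data (the amplitude, phase, and noise here are illustrative, not the Austin results):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# generated data: an annual cycle with an unknown phase offset plus a trend
rng = np.random.default_rng(1)
day = np.arange(1, 366)
temp = (65 + 0.01 * day
        + 18 * np.sin(2 * np.pi * (day - 110) / 365)
        + rng.normal(0, 2, day.size))

# engineer the two periodic features with an assumed 365-day period
X = pd.DataFrame({'Day_of_Year': day,
                  'sine_Day': np.sin(2 * np.pi * day / 365),
                  'cosine_Day': np.cos(2 * np.pi * day / 365)})
model = LinearRegression().fit(X, temp)

# the sine and cosine coefficients jointly encode amplitude and phase,
# so the regression recovers the offset without us supplying it
amplitude = np.hypot(model.coef_[1], model.coef_[2])
print('r squared:', model.score(X, temp), 'amplitude:', amplitude)
```

Note that the recovered amplitude, the square root of the sum of the squared sine and cosine coefficients, comes out close to the value used to generate the data, even though the phase offset was never given to the model.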

Note

The solution for this activity can be found via this link.

In *Figure 3.41*, we see that the new model smoothly varies during the year, returning at the end of the year to a value near the beginning. Since we already knew that there is an overall trend during the year, this model, because it includes the **Day_of_Year** feature, shows that the temperature at the end of the year is somewhat higher than at the beginning. In terms of feature engineering, we could consider using the date (converting it to, say, an integer) and fitting the model over more than 1 year's worth of data to try and capture longer-term trends.

## Generic Model Training

The least squares method of constructing a linear regression model is a useful and accurate method of training, assuming that the dimensionality of the dataset is low and that the system memory is sufficiently large to be able to manage the dataset.

In recent times, large datasets have become more readily available, with universities, governments, and even some companies releasing large datasets for free online; as such, it may be relatively easy to exceed system memory when using the least squares method of regression modeling. In this situation, we will need to employ a different method of training the algorithm, such as gradient descent, which is not as susceptible to high dimensionality, allows large datasets to be trained, and avoids the use of memory-intensive matrix operations.

Before we look at gradient descent in a little more detail, we will revisit the process of training a model in a more general form, as most training methods, including gradient descent, adhere to this generic process. The following is an overview of the parameter update loop of the model training process:

The training process involves repeatedly exposing the model and its parameters (including hyperparameters) to a set of sample training data and passing the predicted values issued by the model to a specified cost or error function.

The cost function serves two roles: it measures how close the model is to its target values, providing a measure of progress throughout the training process, and, in combination with some of the hyperparameters, it determines how the parameter updates in the "update parameters" block in *Figure 3.42* are calculated. For example, in our linear regression case, the cost function is the mean squared error (MSE).
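Written out, the mean squared error over *N* training samples is (the standard definition; here ŷ denotes the model's prediction):

```latex
J(\beta) = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2
```

Minimizing this quantity with respect to the parameters *β* is precisely the objective of least squares fitting.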

The least squares method, which we showed as a method to build a linear regression model, minimizes the MSE, hence, least squares. We can therefore update our diagram of the training process to the following:

## Gradient Descent

The process of gradient descent can be summarized as a means of updating the parameters of the model proportionally and in response to an error within the system, as defined by the cost function. There are a number of cost functions that can be selected, depending on the type of model being fitted or the problem being solved. We will select the simple, but effective, mean squared error cost function.

Recall that the equation of a straight line can be written as follows:

*y* = *β*0 + *β*1*x*

The following figure is a plot of the cost function, *J*, over ranges of values of *β*0 and *β*1. The optimal set of parameters is the one for which the cost function is at its minimum, and this point is called the global minimum of the cost function. We can make an analogy with trying to find the lowest point in a valley while hiking. Intuitively, wherever we are standing, if we are not at the bottom, then we are on a slope, and to get to the bottom, we would head downhill. Thus, the slope is the gradient, and finding the minimum by heading downhill is gradient descent:

As you can see in the preceding figure, on each training cycle, the parameters *β*0 and *β*1 are updated by moving in the direction of steepest descent (against the gradient), and eventually, the minimum of *J(β)* is found.

Let's look at the gradient descent algorithm in greater detail:

- Gradient descent starts by taking an initial, random guess at the values for all *β*. Note that in some models, this step, called initialization, may be constrained by choosing a particular distribution from which the initial values are sampled. The choice of initialization may be considered a hyperparameter and can affect the final outcome, especially in complicated models such as artificial neural networks. Most methods implemented in Python have good defaults, and it is common to use them.
- A prediction for each of the samples in the training set is made using the random values for *β*, and the cost function *J(β)* is then computed.
- The values for *β* are then updated, making a small adjustment proportional to the error, in an attempt to minimize the error. In general, it is not the best approach to try to move all the way from the current *β* values to the values that would minimize *J(β)*, because the loss surface may not be as smooth as shown in *Figure 3.45*. Most real loss surfaces, even in three dimensions, have multiple peaks and valleys. For more complicated cost function surfaces, there are multiple minima, called local minima, and the lowest of all the minima is the global minimum. Non-convex cost functions may present challenges in finding the global minimum, and a lot of effort in the research community has been devoted to finding ways to do so efficiently. The hyperparameter known as the **learning rate**, denoted by *γ*, is used to adjust the step size on each pass of the training.
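The parameter update in the last step can be written compactly as follows (the standard gradient descent rule, using the chapter's symbols for the learning rate *γ* and the cost *J*):

```latex
\beta_j \leftarrow \beta_j - \gamma \, \frac{\partial J(\beta)}{\partial \beta_j}
```

Each parameter moves a small step against its own gradient component, with the size of the step scaled by the learning rate.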

This process is visualized in the following graph, simplified to two dimensions:

In this simplified case, using a low learning rate leads to Path A, and we get stuck in a local minimum. Path C results from a very high learning rate and does not converge. Path B uses an intermediate learning rate value and converges to the global minimum.

As a general hint for setting the learning rate, start larger, say around 0.1, and if a solution cannot be found, that is, the error is **NaN** or varies wildly, reduce the learning rate by a factor of 10. Once a learning rate is found that allows a relatively smooth decrease of the cost function versus epochs, other hyperparameters, including the learning rate, can be tuned to achieve the best results.
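To illustrate this back-off heuristic, here is a minimal sketch on the simplest possible cost surface, *J(b)* = *b*², whose gradient is 2*b* (a toy function chosen for clarity, not the exercise's cost function):

```python
import numpy as np

def settles(gamma, steps=100):
    # take gradient descent steps on J(b) = b**2, whose gradient is 2 * b
    b = 1.0
    for _ in range(steps):
        b -= gamma * 2 * b
    # "settled" if b is finite and has decreased from its starting value
    return np.isfinite(b) and abs(b) < 1.0

gamma = 10.0              # start deliberately large
while not settles(gamma):
    gamma /= 10           # back off by a factor of 10, as suggested above
print(gamma)
```

On this surface, the loop rejects 10.0 (divergent) and 1.0 (oscillates without shrinking) before accepting 0.1.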

While this process may sound complicated, it isn't anywhere near as scary as it looks. Gradient descent can be summarized as making an initial guess at the values for the parameters, calculating the error of that guess, making small adjustments to the parameters, and repeating the process until the error converges to a minimum value. To reinforce our understanding, let's look at a more concrete example. We will use gradient descent to train the original linear regression model we constructed in *Exercise 3.02: Fitting a Linear Model Using the Least Squares Method*, replacing the least squares method with gradient descent.

## Exercise 3.05: Linear Regression with Gradient Descent

In this exercise, we will implement a gradient descent algorithm manually. We will start with the **synth_temp.csv** data, as before:

- Import the packages and classes as before:

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import r2_score
```

Before we can start the gradient descent process, we need to implement some key functions.

- Write a function to define our linear model. This is where the advantage of using the shortened form of the linear model comes in handy: we can use linear algebra multiplication between the parameters (*β*) and the input values, *x*:

```python
# model function
def h_x(Beta, X):
    # calculate the matrix dot product of X and the Betas
    return np.dot(Beta, X).flatten()
```

- We also need to write a function to evaluate the cost function, *J(β)*:

```python
# cost function
def J_beta(pred, true):
    # mean squared error
    return np.mean((pred - true) ** 2)
```

- Finally, we need to implement the function to update the parameters:

```python
# update function
def update(pred, true, X, gamma):
    # adjustment to Beta, proportional to the error and scaled by gamma
    return gamma * np.sum((true - pred) * X, axis = 1)
```

- Next, load the data, slice from 1902 forward, compute the annual averages, and add the **Year** column:

```python
# load the data
df = pd.read_csv('../Datasets/synth_temp.csv')

# slice 1902 and forward
df = df.loc[df.Year > 1901]

# roll up by year
df_group_year = df.groupby(['Year']).agg({'RgnAvTemp' : 'mean'})

# add the Year column so we can use that in a model
df_group_year['Year'] = df_group_year.index
df_group_year = \
    df_group_year.rename(columns = {'RgnAvTemp' : 'AvTemp'})
```

- Now, we will build the training data. First, we need to scale the data to between **0** and **1** before using gradient descent. Some machine learning algorithms can work well with raw data (such as regular linear regression), but with gradient descent, if the variables have very different scales, then the gradient values will be much larger along some parameter axes than others. Left unscaled, raw data could distort the descent along the cost function surface, skewing the results. Intuitively, in our case, the **Year** data is on the order of thousands, while the **AvTemp** data is on the order of tens; thus, the **Year** variable would dominate in terms of its influence on the parameters. There are a variety of scaling methods used in machine learning. Examples are normalization to a specific range (such as (0, 1) or (-1, 1)) and standardization (where the data is scaled to a mean of 0 and a standard deviation of 1). Here, we will normalize both the **x** and **y** data to the range (0, 1):

```python
# scale the data and add the X0 series
X_min = df_group_year.Year.min()
X_range = df_group_year.Year.max() - df_group_year.Year.min()
Y_min = df_group_year.AvTemp.min()
Y_range = df_group_year.AvTemp.max() - df_group_year.AvTemp.min()
scale_X = (df_group_year.Year - X_min) / X_range
train_X = pd.DataFrame({'X0' : np.ones(df_group_year.shape[0]), \
                        'X1' : scale_X}).transpose()
train_Y = (df_group_year.AvTemp - Y_min) / Y_range
print(train_X.iloc[:, :5])
print(train_Y[:5])
```

The output should appear as follows:

Note that the **train_Y** values are the true values, also called the *ground truth*.

- As we have learned, we need to initialize the parameter values. Note that we use the NumPy **random.seed()** method with a constant value; setting the seed will reproduce the same results every time you run the notebook. This is useful during model development and while exploring hyperparameters, so that you can separate the impact of changes you make from the impact of the random initialization. The **reshape()** method is used to put the data into the correct matrix form:

```python
# initialize Beta and the learning rate gamma
np.random.seed(42)
Beta = np.random.randn(2).reshape((1, 2)) * 0.1
print('initial Beta\n', Beta)
```

The values should look something like the following:

```
initial Beta
 [[ 0.04967142 -0.01382643]]
```

- We also need to set a couple of hyperparameters: the learning rate, gamma, and the maximum number of passes through the training cycle (epochs):

```python
gamma = 0.0005
max_epochs = 100
```

- Make an initial prediction and calculate the error, or cost, of that prediction using the defined **h_x** and **J_beta** functions:

```python
y_pred = h_x(Beta, train_X)
print('Initial cost J(Beta) = ' + str(J_beta(y_pred, train_Y)))
```

The output will be as follows:

```
Initial cost J(Beta) = 0.18849128813354338
```

- We are now ready to use a loop to iterate through the training. Here, we store the **epoch** and **cost** values so that we can visualize them later. We also print out the **cost** every 10 epochs:

```python
epochs = []
costs = []
for epoch in range(max_epochs):
    Beta += update(y_pred, train_Y, train_X, gamma)
    y_pred = h_x(Beta, train_X)
    cost = J_beta(y_pred, train_Y)
    if epoch % 10 == 0:
        print('New cost J(Beta) = ' + str(round(cost, 3)) \
              + ' at epoch ' + str(epoch))
    epochs.append(epoch)
    costs.append(cost)
```

The output will be as follows:

Observe in *Figure 3.48* the relatively rapid decrease in the cost function over the first 20 cycles before the improvement slows. This is a very typical pattern for gradient descent training when the learning rate is at a reasonable value.

- Visualize the training history:

```python
# plot training history
fig = plt.figure(figsize=(10, 7))
ax = fig.add_axes([1, 1, 1, 1])
ax.plot(epochs, costs)
ax.tick_params(labelsize = 14)
ax.set_ylabel('Cost Function J(' + r'$\beta$' + ')', \
              fontsize = 18)
ax.set_xlabel('Epoch', fontsize = 18)
plt.show()
```

The output will be as follows:

In *Figure 3.49*, we can see that the cost function was still decreasing when we stopped the updates. Thus, we could rerun the training with a larger **max_epochs** hyperparameter and see whether the results improved.

- Use the **r2_score** function from **sklearn.metrics**, which we imported earlier, to compute the R-squared value for the model trained using gradient descent:

```python
# calculate the r squared value
r2 = r2_score(train_Y, y_pred)
print('r squared = ', r2)
```

The output should be similar to the following:

```
r squared =  0.5488427996385263
```

Note that you could vary the learning rate and max epochs parameters to see the impact on the training history and the r2 value.

- Now, we generate predictions using the training data so that we can visualize the resulting model. In this cell, we first predict using the scaled training data, since the model coefficients are based on the scaled inputs, and then we scale the results back to "real" values using the **Y_min** and **Y_range** values we saved earlier in the scaling process. Note the use of our model function, **h_x**, to generate the predictions. Also, for convenience, we replace **pred_X** with the original year values for use in visualization:

```python
# generate predictions for visualization
pred_X = train_X

# make predictions
pred_Y = h_x(Beta, pred_X)

# scale predictions back to real values
pred_Y = (pred_Y * Y_range) + Y_min

# replace the X with the original values
pred_X = df_group_year['Year']
```

- One impact of doing the regression with scaled data is that the *β* values are relative to the scaled data, not the original data. In many cases, we would like to obtain the unscaled parameter values. In particular, in linear regression, each unscaled coefficient is interpretable as the change in the dependent variable for a unit change in the variable associated with that parameter. Specifically, in our case, *β*1 is the "slope" of the line and represents the change in average annual temperature for a change of 1 year. We can now calculate the parameters of the unscaled model:

```python
# scale the coefficients back to real values
Beta0 = (Y_min + Y_range * Beta[0, 0] \
         - Y_range * Beta[0, 1] * X_min / X_range)
Beta1 = Y_range * Beta[0, 1] / X_range
```

- Visualize the results:

```python
fig = plt.figure(figsize=(10, 7))
ax = fig.add_axes([1, 1, 1, 1])

# Raw data
raw_plot_data = df
ax.scatter(raw_plot_data.Year, raw_plot_data.RgnAvTemp, \
           label = 'Raw Data', c = 'red', s = 1.5)

# Annual averages
ax.scatter(df_group_year.Year, df_group_year.AvTemp, \
           label = 'Annual average', c = 'k', s = 10)

# linear fit
ax.plot(pred_X, pred_Y, c = "blue", linestyle = '-.', \
        linewidth = 4, label = 'linear fit')
```

- Put the model on the plot:

```python
ax.text(1902, 20, 'Temp = ' + str(round(Beta0, 2)) \
        + ' + ' + str(round(Beta1, 4)) + ' * Year', \
        fontsize = 16, backgroundcolor = 'white')
ax.set_title('Mean Air Temperature Measurements', fontsize = 16)

# make the ticks include the first and last years
tick_years = [1902] + list(range(1910, 2011, 10))
ax.set_xlabel('Year', fontsize = 14)
ax.set_ylabel('Temperature ($^\circ$C)', fontsize = 14)
ax.set_ylim(15, 21)
ax.set_xticks(tick_years)
ax.tick_params(labelsize = 12)
ax.legend(fontsize = 12)
plt.show()
```

Note

For error-free execution, you should run the cell only after writing the code for both *steps 15* and *16*.

The output will be as follows:

Note

To access the source code for this specific section, please refer to https://packt.live/3diOR76.

You can also run this example online at https://packt.live/2YWvviZ. You must execute the entire Notebook in order to get the desired result.

You have just trained your first model with gradient descent. This is an important step as this simple tool can be used to construct more complicated models such as logistic regression and neural network models. We must first, however, note one important observation: the r-squared value produced by the gradient descent model is not as high as the least squares model, and the equation of the line is different.

That being said, there are many more options available to modify the gradient descent process, including different types of gradient descent algorithms and more advanced uses of learning rate and the way the data is supplied during training. These modifications fall outside the scope of this book, as an entire book could be written on the gradient descent process and methods for improving performance. With enough experimentation, we would be able to match the two results to any arbitrary level of precision, but in this case, that would not be an effective use of time.

In this exercise, we implemented gradient descent directly; however, we would not typically use this implementation, but instead leverage an existing, highly optimized package. The scikit-learn method of gradient descent contains a number of optimizations and can be used in only a few lines of code. The following is from the scikit-learn documentation (refer to https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html): "SGD stands for Stochastic Gradient Descent: the gradient of the loss is estimated each sample at a time and the model is updated along the way with a decreasing strength schedule (aka learning rate)."

In our examples so far, we have used all the data directly in the linear regression method or used all the data for every update in our implementation by means of gradient descent. However, with gradient descent, we have total control over when to update our estimates of the parameters.
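For example, we could update the parameters after each small batch of samples rather than after the full dataset or a single sample. The following is a hedged sketch of such mini-batch updates on invented data; the **batch_size** value is an illustrative choice, not something taken from the exercises:

```python
import numpy as np

# Sketch of mini-batch gradient descent on synthetic data (the data,
# batch_size, and learning rate here are invented for illustration).
rng = np.random.RandomState(42)
X = np.column_stack([np.ones(200), rng.uniform(size=200)])  # intercept + feature
y = 1.0 + 3.0 * X[:, 1] + 0.05 * rng.randn(200)

beta = np.zeros(2)
gamma, batch_size = 0.1, 20

for epoch in range(200):
    order = rng.permutation(len(y))            # shuffle each epoch
    for start in range(0, len(y), batch_size):
        idx = order[start:start + batch_size]
        pred = X[idx] @ beta
        grad = (2 / len(idx)) * X[idx].T @ (pred - y[idx])
        beta -= gamma * grad                   # update after every mini-batch

print(beta)  # close to the true parameters [1.0, 3.0]
```

The batch size trades off gradient noise against update frequency; updating per sample, as stochastic gradient descent does, corresponds to a batch size of 1.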

## Exercise 3.06: Optimizing Gradient Descent

In this exercise, we will use the scikit-learn module **SGDRegressor**, which utilizes stochastic gradient descent to train models.

In this exercise, we start with the **synth_temp.csv** data, as before:

- Import the packages and classes as before, adding **SGDRegressor**:

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import r2_score
from sklearn.linear_model import SGDRegressor
```

- Load the data and carry out the same preprocessing and scaling as before:

```python
# load the data
df = pd.read_csv('../Datasets/synth_temp.csv')

# slice 1902 and forward
df = df.loc[df.Year > 1901]

# roll up by year
df_group_year = df.groupby(['Year']).agg({'RgnAvTemp' : 'mean'})

# add the Year column so we can use that in a model
df_group_year['Year'] = df_group_year.index
df_group_year = df_group_year\
    .rename(columns = {'RgnAvTemp' : 'AvTemp'})

# scale the data
X_min = df_group_year.Year.min()
X_range = df_group_year.Year.max() - df_group_year.Year.min()
Y_min = df_group_year.AvTemp.min()
Y_range = df_group_year.AvTemp.max() - df_group_year.AvTemp.min()
scale_X = (df_group_year.Year - X_min) / X_range
train_X = scale_X.ravel()
train_Y = ((df_group_year.AvTemp - Y_min) / Y_range).ravel()
```

- We instantiate the model by calling **SGDRegressor** and passing the hyperparameters. Here, we set the NumPy **random.seed** method and, as we are not passing a seed to **SGDRegressor**, it uses the NumPy random generator:

```python
# create the model object
np.random.seed(42)
model = SGDRegressor(loss = 'squared_loss', max_iter = 100, \
                     learning_rate = 'constant', eta0 = 0.0005, \
                     tol = 0.00009, penalty = 'none')
```

- We fit the model by calling the **fit** method of the model object:

```python
# fit the model
model.fit(train_X.reshape((-1, 1)), train_Y)
```

The output should be as follows, echoing the parameters used in the call:

- We now want to retrieve the coefficients from the model and rescale them as in the previous exercise so that we can compare the results directly:

```python
Beta0 = (Y_min + Y_range * model.intercept_[0] \
         - Y_range * model.coef_[0] * X_min / X_range)
Beta1 = Y_range * model.coef_[0] / X_range
print(Beta0)
print(Beta1)
```

The output should be as follows:

```
-0.5798539884018439
0.009587734834970016
```

- As before, we now generate predictions and then use the **r2_score** function to calculate r2. Note that, since we are using a scikit-learn model, we call the **predict** method on the model object to return predictions:

```python
# generate predictions
pred_X = df_group_year['Year']
pred_Y = model.predict(train_X.reshape((-1, 1)))

# calculate the r squared value
r2 = r2_score(train_Y, pred_Y)
print('r squared = ', r2)
```

The result will be something similar to the following:

```
r squared =  0.5436475116024911
```

- Finally, we rescale the predictions back to actual temperatures for visualization:

```python
# scale predictions back to real values
pred_Y = (pred_Y * Y_range) + Y_min
```

- Now, visualize the results:

```python
fig = plt.figure(figsize=(10, 7))
ax = fig.add_axes([1, 1, 1, 1])

# Raw data
raw_plot_data = df
ax.scatter(raw_plot_data.Year, raw_plot_data.RgnAvTemp, \
           label = 'Raw Data', c = 'red', s = 1.5)

# Annual averages
ax.scatter(df_group_year.Year, df_group_year.AvTemp, \
           label = 'Annual average', c = 'k', s = 10)

# linear fit
ax.plot(pred_X, pred_Y, c = "blue", linestyle = '-.', \
        linewidth = 4, label = 'linear fit')

# put the model on the plot
ax.text(1902, 20, 'Temp = ' + str(round(Beta0, 2)) + ' + ' \
        + str(round(Beta1, 4)) + ' * Year', fontsize = 16)
ax.set_title('Mean Air Temperature Measurements', fontsize = 16)

# make the ticks include the first and last years
tick_years = [1902] + list(range(1910, 2011, 10))
ax.set_xlabel('Year', fontsize = 14)
ax.set_ylabel('Temperature ($^\circ$C)', fontsize = 14)
ax.set_ylim(15, 21)
ax.set_xticks(tick_years)
ax.tick_params(labelsize = 12)
ax.legend(fontsize = 12)
plt.show()
```

The output will be as follows:

Note

To access the source code for this specific section, please refer to https://packt.live/2zWIadm.

You can also run this example online at https://packt.live/3eqhroj. You must execute the entire Notebook in order to get the desired result.

Compare this graph to the one constructed using the manual implementation of gradient descent. Notice the similarities: this gives us confidence that both implementations of gradient descent are correct. However, we now have all the power of the scikit-learn implementation of SGD at our disposal. For more complex problems, this could be critical to success. For example, you may have noticed that **SGDRegressor** supports regularization methods, but we did not use them. Some regularization methods add an adjustment to the cost function equation to apply a penalty (a factor to make a parameter smaller) to parameter values that are large in comparison to others. There are also methods specific to certain models (for example, artificial neural networks have several additional regularization approaches that can be used). One important use of regularization is to reduce overfitting, which we have touched on but which has not been present in the simple models we have used so far. Further discussion relating to regularization can be found in *Chapter 6*, *Ensemble Modeling*.
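As a brief illustration of a regularization penalty at work, the following sketch (synthetic data; the **alpha** values are arbitrary choices for illustration) compares a nearly unpenalized **SGDRegressor** fit with a heavily L2-penalized one:

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

# Synthetic data: y is roughly 2 * x plus a little noise
rng = np.random.RandomState(42)
X = rng.uniform(size=(100, 1))
y = 2.0 * X.ravel() + 0.1 * rng.randn(100)

# alpha scales the L2 penalty added to the cost function;
# tol=None disables early stopping so both models train fully
weak = SGDRegressor(alpha=1e-6, tol=None, max_iter=2000, \
                    random_state=42).fit(X, y)
strong = SGDRegressor(alpha=10.0, tol=None, max_iter=2000, \
                      random_state=42).fit(X, y)

print(weak.coef_[0])    # near the true slope of 2
print(strong.coef_[0])  # shrunk toward zero by the penalty
```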

## Activity 3.05: Gradient Descent

In this activity, we will implement the same model as *Activity 3.02: Linear Regression Using the Least Squares Method*; however, we will use the gradient descent process.

The steps to be performed are as follows:

- Import the modules and classes; in this case:

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import r2_score
from sklearn.linear_model import SGDRegressor
```

- Load the data (**austin_weather.csv**) and preprocess it up to the point of creating the **Day_of_Year** column and slicing one full year (2015).
- Create scaled *X* and *Y* data, scaling between 0 and 1 in each case.
- Instantiate a model using **SGDRegressor**. Remember to set the NumPy **random.seed()** method.
- Fit the model.
- Extract the rescaled model coefficients, **Theta0** and **Theta1**, and print them.
- Generate predictions using the scaled data, use the **r2_score** method to get the r2 value of the fit, and then print out r2.
- Rescale the predictions to use for plotting.
- Create a visualization with the raw data, the 20-day moving averages, and the new linear fit line. Include the model equation on the chart.

The output should be as follows:

Note

The solution for this activity can be found via this link.

By now, you should be comfortable using the scikit-learn **SGDRegressor** interface as well as understanding the basic process of gradient descent. You should also have some idea as to when using SGD is preferred over, say, standard linear regression.

We have now covered data smoothing and simple linear regression, have developed a gradient descent algorithm by hand, and used the scikit-learn **SGDRegressor** interface to apply gradient descent to linear regression. You've seen how to do some feature engineering with dummy variables, polynomial features, and sine/cosine features. You may have noticed along the way that most of the code is used to prepare the data and visualize results. This is not unusual for machine learning—understanding and working with data is generally the largest task, and critical to success. We'll now move on to another application of regression, multiple linear regression.

# Multiple Linear Regression

We have already covered regular linear regression, as well as linear regression with polynomial and other terms, and considered training them with both the least squares method and gradient descent. This section of the chapter considers an additional type of linear regression: multiple linear regression, where more than one variable (or feature) is used to construct the model. In fact, we have already used multiple linear regression without calling it as such—when we added dummy variables, and again when we added the sine and cosine terms, we were fitting multiple *x* variables to predict the single *y* variable.

Let's consider a simple example of where multiple linear regression naturally arises as a modeling solution. Suppose you were shown the following chart, which is the total annual earnings of a hypothetical tech worker over a long career. You can see that over time, their pay increased, but there are some odd jumps and changes in the data slope:

You might guess that this worker changed jobs from time to time, causing some of the jumps. However, suppose that in the data we are given, each year's compensation is the product of the average hours worked per week that year and an hourly rate. Instead of a simple linear model of income versus year, we could build a multiple linear model using year, rate, and hours per week to predict the totals. In this hypothetical case, using multiple linear regression rather than simple linear regression results in the following chart:

The red circles appear to be a more satisfying model than the simple blue line. There are still features in the data not explained—perhaps there were bonuses or retirement fund matches in some years, and perhaps there are other things we are unaware of in relation to the data. Nonetheless, it makes sense to use multiple *x* variables in such a case.

Before moving on, let's cover a few details. We will be using the **corr()** method in pandas, which takes a DataFrame and calculates the pairwise correlation among all the variables. There are two key things that we will be investigating in the following exercise. First, when executing regression with multiple *x* variables, if any of the variables are highly correlated, this can cause issues with the model. This problem is known as **multicollinearity** and can cause the coefficient estimates to be unstable to small changes in the data or the model. In an extreme case, where one variable is actually a linear combination of other variables, the model can become singular; in some methods, the coefficient for one of the linearly dependent variables may be returned as **Inf**, or other errors may result. Secondly, we would also like to know whether the variables will have any impact on predictions. Let's solve an exercise to understand this better.
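Here is a tiny synthetic sketch of the multicollinearity problem (all the numbers are invented for illustration): when two features differ by only one part in a million, a tiny change in one observation produces an enormous swing in the fitted coefficients:

```python
import numpy as np

# x2 is an almost exact copy of x1: the design matrix is nearly singular
x1 = np.array([0.0, 1.0, 2.0, 3.0])
x2 = x1 + np.array([0.0, 1e-6, 0.0, 0.0])
X = np.column_stack([x1, x2])

y = x1.copy()                         # fits y = 1*x1 + 0*x2 exactly
coef_a, *_ = np.linalg.lstsq(X, y, rcond=None)

y_b = y.copy()
y_b[1] += 1e-3                        # nudge a single observation slightly
coef_b, *_ = np.linalg.lstsq(X, y_b, rcond=None)

print(coef_a)  # close to [1, 0]
print(coef_b)  # close to [-999, 1000]: a dramatic coefficient swing
```

This instability of the individual coefficients, despite an almost unchanged fit to the data, is exactly why highly correlated predictors are a concern.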

## Exercise 3.07: Multiple Linear Regression

For this exercise, we will use a UCI dataset that has the power output of a combined cycle power plant and several possible explanatory variables:

Note

The original data and description are available from https://archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant

Alternatively, you can find the data in our repository here: https://packt.live/2Pu850C

- Load the modules and classes; note that we are adding seaborn here for use in some visualizations:

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
```

- Load and inspect the data:

Note

The triple quotes (**"""**) shown in the code snippet below are used to denote the start and end points of a multi-line code comment. Comments are added to code to help explain specific bits of logic.

```python
"""
load and inspect data
from the description file
(https://archive.ics.uci.edu/ml/machine-learning-databases/00294/)
the variables are:
note: some var names are incorrect in the description file
Ambient Temperature (AT)
Ambient Pressure (AP)
Relative Humidity (RH)
Exhaust Vacuum (V)
and the dependent variable is
net hourly electrical energy output (PE)
"""
power_data = pd.read_csv\
    ('../Datasets/combined_cycle_power_plant.csv')
print(power_data.shape)
print(power_data.head())
missings = power_data.isnull().sum()
print(missings)
```

The result should appear as follows:

- Since we have not used this data before, let's do some very quick EDA. First, we'll look at the correlation among all the variables:

```python
"""
quick EDA
correlation analysis
"""
corr = power_data.corr()

# mask for the heatmap in seaborn
mask = [[1 if j < i else 0 \
         for j in range(corr.shape[0])] \
        for i in range(corr.shape[1])]

fig, ax = plt.subplots(figsize = (10, 7))

"""
plot the correlation matrix as a heatmap
blanking out the upper triangle (duplicates)
"""
sns.heatmap(corr, cmap = 'jet_r', square = True, linewidths = 0.5, \
            center = 0, annot = True, mask = mask, \
            annot_kws = {"size" : 12}, \
            xticklabels = power_data.columns, \
            yticklabels = power_data.columns)
plt.show()
```

The chart should appear as follows:

In the preceding data, the strongest correlation among the *x* variables is the **0.84** between **V** and **AT**. Hence, there should be no severe multicollinearity here.

We would also like some indication that the *x* variables will have an impact on the thing we are trying to predict, **PE**. In the last row of the chart, we can see a significant correlation between **PE** and all the other variables, which is a good indicator that they will all be valuable in the model. If there were a very large number of features, we might be interested in dropping variables that are less important, in terms of reducing noise in the model. However, these correlation coefficients are only pairwise, and even if we saw a low correlation, more work would be required to justify removing a variable.

Regarding the visualization, we are using the seaborn **heatmap** function to generate the plot. Because the correlation values are symmetric, the correlation between, say, **AT** and **V** is the same as that between **V** and **AT**. Therefore, one triangle of the grid would mirror the other, so the **heatmap** method provides a way to blank any squares we want. This is accomplished with the **mask** variable in the call, which takes a matrix the same shape as the correlation matrix and blanks any squares corresponding to **False** values in the **mask**. We used a nested list comprehension to put the **False** values (integer **0**) into the **mask**.

Finally, note that the values along the diagonal are all **1**; by definition, a variable is perfectly correlated with itself. You might see examples where these squares are used to plot the variable distributions or other valuable information, as well as using the blanked triangle squares for additional information, such as the plots in the next step.

- Use the seaborn **pairplot** function to visualize the pairwise relationships among all the variables. This information augments the correlation plot and, as noted, is sometimes combined with it in a single grid:

```python
# look at the pairwise variable relationships
plot_grid = sns.pairplot(power_data)
```

The result should appear as follows:

By default, this chart shows the scatterplot of each variable against every other variable, with the distribution of each variable along the diagonal. Note that the upper-right triangle is the mirror image of the lower-left triangle, with the axes flipped on each chart.

Along the diagonal, we see the distributions of all the variables. We can see that the **RH** variable is skewed to the left; we could consider applying a transform to that column, such as **numpy.log(power_data['RH'])** or **numpy.sqrt(power_data['RH'])**. For this exercise, we will leave it as is.

The other thing we can observe from this chart is in the scatterplots along the bottom row; note that **AT**, which has the largest negative correlation to **PE**, shows a clear negative trend of **PE** versus **AT**, which intuitively makes sense. As we move to the right, the correlations become weaker, and that is consistent with the scatterplots. In the third chart, **PE** versus **AP**, we can see some indication of the positive correlation.

- Now, we structure the data for the linear regression model, fit the model, and get the predictions and the r2 value. This is done in the same way as in the previous exercises:

```python
# structure data
X_train = power_data.drop('PE', axis = 1)
Y_train = power_data['PE']

# fit the model
model = LinearRegression()
model.fit(X_train, Y_train)

# get predictions
Y_pred = model.predict(X_train)
r2 = model.score(X_train, Y_train)
print('model coefficients ' + str(model.coef_))
print('r2 value ' + str(round(r2, 3)))
```

The output should appear as follows:

The relatively high r2 value is a good sign that the model may be effective at predicting **PE**.

- With multiple linear regression, we can't easily visualize the results as a plot of the predicted variable versus a single *x* variable. However, a very powerful visualization that can be used in almost any situation is simply to plot the predicted values against the true values. It's best to make this plot symmetric (the axis limits should be the same for *x* and *y*) for ease of interpretation: perfect predictions would then lie along the diagonal. We add a diagonal line to aid visual interpretation:

```python
fig, ax = plt.subplots(figsize=(10, 10))

# set some limits
PE_range = max(power_data.PE) - min(power_data.PE)
plot_range = [min(power_data.PE) - 0.05 * PE_range, \
              max(power_data.PE) + 0.05 * PE_range]
ax.scatter(Y_train, Y_pred)
ax.set_xlim(plot_range)
ax.set_ylim(plot_range)
ax.set_xlabel('Actual PE value', fontsize = 14)
ax.set_ylabel('Predicted PE value', fontsize = 14)
ax.plot(plot_range, plot_range, c = "black")
plt.show()
```

The result is as follows:

*Figure 3.60* indicates that the majority of predictions are along the diagonal. There are a few values that are predicted significantly higher than most, and we might want to investigate those particular points. In addition, at the highest values, the prediction tends to be too low on average, which may indicate that there is some feature engineering or other data we need in the model. Recall that we did not transform any of the variables; that would be useful to try here to see whether results subsequently improved. We will not pursue this further here.

Note

To access the source code for this specific section, please refer to https://packt.live/2CwfIzZ.

You can also run this example online at https://packt.live/37UzZuK. You must execute the entire Notebook in order to get the desired result.

We have seen that multiple linear regression is a powerful addition to the toolset and a very easy extension of methods we have already mastered. In fact, multiple linear regression can often perform as well as, or better than, more complex models on regression problems. Although we are not covering it here, a benefit of using multiple linear regression over, say, an artificial neural network is that the coefficients of the multiple linear regression model can be interpreted as estimates of the effect of each *x* variable on the predicted variable; there are times when that interpretation is extremely valuable.
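As a small illustration of that interpretability (synthetic, noise-free data with effects we choose ourselves), the fitted coefficients recover the per-unit effect of each feature:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with known effects: y = 5 + 3*x1 - 2*x2
rng = np.random.RandomState(0)
X = rng.uniform(size=(200, 2))
y = 5.0 + 3.0 * X[:, 0] - 2.0 * X[:, 1]

model = LinearRegression().fit(X, y)
# each coefficient estimates the change in y per unit change in its
# feature, holding the other features fixed
print(model.coef_, model.intercept_)  # recovers [3, -2] and 5
```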

# Summary

In this chapter, we took our first big leap into constructing machine learning models and making predictions with labeled datasets. We began our analysis by looking at a variety of different ways to construct linear models, starting with the precise least squares method, which is very good when modeling small amounts of data that can be processed using the available computer memory. The performance of linear models can be improved using dummy variables, which we created from categorical variables, adding additional features and context to the model. We then used linear regression analysis with a polynomial model to further improve performance, fitting a more natural curve to the dataset, and we investigated other non-linear feature engineering with the addition of sine and cosine series as predictors.

As a generalization of explicit linear regression, we implemented the gradient descent algorithm which, as we noted, while not as precise as the least squares method (for a given number of iterations or epochs), is able to process arbitrarily large datasets and larger numbers of variables. In addition, generalized gradient descent introduces a number of other parameters, so-called hyperparameters, that we, the data scientists, can optimize to improve model performance. We deferred further investigation of model optimization to *Chapter 7*, *Model Evaluation*.

Now that we have a sound understanding of linear regression, we will look at autoregression models in depth in the next chapter.