Overview

This chapter will teach you how to implement autoregression as a method for forecasting values that depend on past values. By the end of this chapter, you will be able to create an autoregression model and construct time series regression models using autoregression. You will be fully equipped to use autoregression to model datasets and predict future values.

# Introduction

In the previous chapter, we studied the different methods used to construct linear regression models. We learned how to use the least squares method to develop linear models. We made use of dummy variables to improve the performance of these linear models. We also performed linear regression analysis with a polynomial model to improve the model's performance. Next, we implemented the gradient descent algorithm, which handles large datasets and large numbers of variables with ease.

In this chapter, we will develop autoregression models. Autoregression is a special type of regression that predicts future values based on previous values in the series.

# Autoregression Models

Autoregression models are classical or "standard" modeling methods used on time series data (that is, any dataset that changes with time) and can complement the linear regression techniques covered previously. Autoregression models are often used for forecasting in the economics and finance industries because they work well with univariate time series (where there are no *x* variables other than time) and with very large datasets (such as streaming data or high-frequency sensor data), where the linear algebra operations of standard regression might run into memory or performance issues. The "auto" part of autoregression refers to the fact that these models leverage the correlation of a time series with itself in the past; hence, autoregression. In addition, many systems do not have an associated causal model; the time series data is said to be stochastic. An example is stock price data over time. Although many attempts have been made, and continue to be made, to develop predictive causal models of stock market behavior, very few have been successful. Thus, we can treat a given stock symbol's price over time as a stochastic series and use autoregression to attempt to model it.
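To make the "regression against itself" idea concrete, here is a minimal sketch (with an assumed first-order coefficient of 0.9, chosen purely for illustration) that simulates an AR(1) series, where each value is a weighted copy of the previous value plus random noise:

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulate an AR(1) process: y[t] = phi * y[t-1] + noise.
# phi is an illustrative coefficient; |phi| < 1 keeps the series stable.
phi = 0.9
n = 500
y = np.zeros(n)
for t in range(1, n):
    y[t] = phi * y[t - 1] + rng.normal()

# The lag-1 correlation of the series with itself recovers roughly phi,
# which is exactly the structure an autoregression model exploits.
lag1_corr = np.corrcoef(y[:-1], y[1:])[0, 1]
print(round(lag1_corr, 2))
```

Rerunning this with smaller values of `phi` shows the self-correlation decaying faster with lag; real series such as stock prices are noisier but follow the same principle.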

Note

To illustrate autoregression, we will use the S&P daily closing prices from 1986 to 2018, which are available in the repository associated with this book (https://packt.live/2w3ZkDw).

The original dataset can be found here:

https://www.kaggle.com/pdquant/sp500-daily-19862018

A graphical view of this data is shown in the following figure:

The main principle behind autoregression models is that, given enough previous observations, a reasonable prediction for the future can be made; that is, we essentially construct a model using the dataset as a regression against itself, using past values as the predictors. A key factor in choosing an autoregression model is that there is enough correlation between future values and past values at specific lag times. Lag time refers to how far into the past the model uses data to predict the future.
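The lagged-correlation idea can be checked directly in pandas: **Series.autocorr(lag=k)** returns the Pearson correlation between a series and itself shifted *k* periods into the past. The values below are made up purely for illustration:

```python
import pandas as pd

# A small made-up series that mostly trends upward
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 5.5, 5.0, 6.0, 7.0])

# Correlation with the value 1 period back vs. 3 periods back;
# nearer past values are typically more informative
print(round(s.autocorr(lag=1), 3))
print(round(s.autocorr(lag=3), 3))
```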

## Exercise 4.01: Creating an Autoregression Model

In this exercise, we will use autoregression and attempt to predict the S&P 500 closing price 1 year into the future:

Note

This exercise requires an earlier version of pandas; ensure that you downgrade the version of pandas using the following command:

**pip install pandas==0.24.2**

- Import the necessary packages and classes. In this exercise, we introduce the **statsmodels** package, which includes a wide range of statistical and modeling functions, including autoregression. If you have not previously installed **statsmodels**, use the following command from a terminal prompt:

```shell
conda install -c anaconda statsmodels
```

If you are not using Anaconda (or Miniconda) and instead install via **pip**, use the following command:

```shell
pip install -U statsmodels
```

Once **statsmodels** is installed on your system, load the following:

```python
import pandas as pd
import numpy as np
from statsmodels.tsa.ar_model import AR
from statsmodels.graphics.tsaplots import plot_acf
import matplotlib.pyplot as plt
```

- Load the S&P 500 data (**spx.csv**) and convert the **date** column into a **datetime** data type:

```python
df = pd.read_csv('../Datasets/spx.csv')
df['date'] = pd.to_datetime(df['date'])
print(df.head())
print(df.tail())
```

We'll get the following output:

If you look carefully at the data in *Figure 4.2*, you might see that some data is missing; for example, there is no data on **1/4/1986** and **1/5/1986**. These are weekend dates during which the market is closed. An autoregression model, especially one built from a lot of data, will not be sensitive to the missing values, for at least two reasons. First, since the model moves forward one period at a time and, in higher-order models, uses multiple past values, the prediction will be less sensitive to missing values than a model based on, say, a single lag value. Second, in a case such as this one, where most missing values are periodic (Saturdays repeat every 7 days, Sundays repeat every 7 days, and so on), a model built on a lot of data will automatically account for those days. However, as we will see, the uncertainty of predicting far beyond the end of existing data can become large for autoregression models.

- Plot the raw dataset versus the **date** data type:

```python
fig, ax = plt.subplots(figsize = (10, 7))
ax.plot(df.date, df.close)
ax.set_title('S&P 500 Daily Closing Price', fontsize = 16)
ax.set_ylabel('Price ($)', fontsize = 14)
ax.tick_params(axis = 'both', labelsize = 12)
plt.show()
```

The output will be as follows:

- Before constructing an autoregression model, we should first check whether the data can be used as a regression against itself. As we noted earlier, the success of an autoregression model depends on the ability to use past values in a linear model to predict future values. That means the future values should be strongly correlated to past values. We can investigate this using the **statsmodels** **plot_acf** function, which plots the autocorrelation function. We mentioned before that how far back we look into the past is called the lag; we will override the default maximum value of **plot_acf** and plot lags from 0 days to 4,000 days:

```python
max_lag = 4000
fig, ax = plt.subplots(figsize = (10, 7))
# alpha = 0.1 plots a 90% confidence interval
acf_plot = plot_acf(x = df.close, ax = ax, lags = max_lag, \
                    use_vlines = False, alpha = 0.1, \
                    title = 'Autocorrelation of SPX vs. lag')
ax.grid(True)
ax.text(1000, 0.01, '90% confidence interval')
ax.set_xlabel('Lag', fontsize = 14)
ax.tick_params(axis = 'both', labelsize = 12)
plt.show()
```

The result should appear as follows:

We can understand this chart as follows. First, by definition, at 0 lag a series is perfectly correlated to itself, so the **autocorrelation function** (**ACF**) value is 1.0. Then, we see that as we increase the lag, the series is less and less correlated, meaning each data point farther back in time carries less information about the value we are predicting. This is very typical of stochastic or random series (note that periodic series will have peaks and valleys in the ACF plot; we will see this later). Also, by choosing the alpha value in the function call, we plotted a 90% confidence interval, shaded in blue. The meaning of this interval is that any lag outside the interval is considered statistically significant; in other words, the correlation is statistically valid. To build an autoregression model successfully, we must have ACF values outside the confidence interval. In this case, we have a fairly high correlation from 0 to some number of days, which we can use in the model (more on that shortly). Finally, we can see that at lags above about 2,750 days, the correlation is negative; that is, the past values predict the opposite of the future. However, in this case, those long-term negative lags are not very significant.

- To get some intuition for the ACF results, let's choose a fairly short lag of 100 days and plot both the original data and the lagged data on one chart. We can do that using the **pandas** **shift()** method:

```python
spx_shift_100 = df.copy()
spx_shift_100['close'] = df.close.shift(100)
fig, ax = plt.subplots(figsize = (10, 7))
ax.plot(df.date, df.close, c = "blue")
ax.plot(spx_shift_100.date, spx_shift_100.close, c = "red")
plt.show()
```

The output will be as follows:

In *Figure 4.5*, the red line indicates the values from 100 days ago, as compared to the blue line, which indicates the actual (present) values for a given date. We see that during periods when the value is increasing, the past values are below the actual values, and when the value is decreasing, the opposite is true. This makes intuitive sense. Importantly, during large portions of the time period covered, the vertical space between the two curves looks roughly constant. That means, intuitively, that the relationship of the past to the present is roughly similar over time. If you think about these curves for some time, you will also see the limitation of autoregression: the predictions will always look like the recent history, so when things change, the predictions will be less accurate until the model "catches up" with the new behavior.

- There is one more way in which we can visualize the correlation we are analyzing. In the multiple linear regression case, we introduced a chart where we plotted the predicted values against the actual values; perfect prediction was along the diagonal. Similarly, instead of plotting the lagged values and the actual values against time, we can plot the lagged values against the actual values. Let's see what this looks like:

```python
print(spx_shift_100.head(), '\n', spx_shift_100.tail())
fig, ax = plt.subplots(figsize = (7, 7))
ax.scatter(df.loc[100:, 'close'], spx_shift_100.loc[100:, 'close'])
ax.set_xlim(0, 3000)
ax.set_ylim(0, 3000)
plt.show()
```

The output will be as follows:

```
        date    close
0 1986-01-02      NaN
1 1986-01-03      NaN
2 1986-01-06      NaN
3 1986-01-07      NaN
4 1986-01-08      NaN
            date    close
8187  2018-06-25  2823.81
8188  2018-06-26  2821.98
8189  2018-06-27  2762.13
8190  2018-06-28  2648.94
8191  2018-06-29  2695.14
```

The plot will be as follows:

*Figure 4.6* shows that most of the values with a lag of 100 days are along a diagonal line, meaning that the relationship between the lagged values and the present values is similar across all actual values.

- We could create plots like *Figure 4.6* across a range of lag times to try to understand which lags would be useful in a model and which would not. Before we do that, it would be useful to have the values of the ACF from *Figure 4.4*, so that we can relate what we see in the lag plots to the correlation function value. The **plot_acf** function of **statsmodels** is based upon an underlying **numpy** function, **correlate**. We can use that to get the values shown in *Figure 4.4*:

```python
"""
the statsmodels plot_acf is based upon the numpy correlate
function, so we can generate the actual values for
illustration and so we can label some later plots

the standard presentation of an acf plot has the value at
lag 0 == 1; the correlate function returns unscaled
values, so we get the first value for scaling to 1

the values to be tested in the function must have
the mean of the un-shifted series subtracted from
both series
"""
corr0 = np.correlate(df.close[0: ] - df.close.mean(), \
                     df.close[0: ] - df.close.mean(), \
                     mode = 'valid')
corrs = [np.correlate(df.close[:(df.close.shape[0] - i)] \
                      - df.close.mean(), df.close[i: ] \
                      - df.close.mean(), mode = 'valid')
         for i in range(max_lag)] / corr0
```

Note that we subtract the mean of the underlying series (**df.close**) from every value used in the **correlate** function. This is a mathematical subtlety needed to be consistent with the output of the **plot_acf** function.

- Now, rather than creating many charts by hand at different lag values, let's create a function to generate a grid of plots that we can use with various lag ranges, numbers of plots, and so on. We will pass the **df.close** series to the function, along with the preceding **corrs** values, and with parameters to control the plots:

```python
"""
utility function to plot out a range of
plots depicting self-correlation
"""
def plot_lag_grid(series, corrs, axis_min, axis_max, \
                  num_plots, total_lag, n_rows, n_cols):
    lag_step = int(total_lag / num_plots)
    fig = plt.figure(figsize = (18, 16))
    for i in range(num_plots):
        corr = corrs[lag_step * i]
        ax = fig.add_subplot(n_rows, n_cols, i + 1)
        ax.scatter(series, series.shift(lag_step * i))
        ax.set_xlim(axis_min, axis_max)
        ax.set_ylim(axis_min, axis_max)
        ax.set_title('lag = ' + str(lag_step * i))
        ax.text(axis_min + 0.05 * (axis_max - axis_min), \
                axis_max - 0.05 * (axis_max - axis_min), \
                'correlation = ' + str(round(corr[0], 3)))
    fig.tight_layout()
    plt.show()
```

- We are now prepared to get a deeper understanding of how the correlation relates to the lag plots. We'll call the function:

```python
"""
create a grid to see how well the data at increasing
lags correlates to the original data

'perfect' correlation will appear as a diagonal line;
the farther from the line, the poorer the correlation
"""
plot_lag_grid(df.close, corrs, df.close.min(), df.close.max(), \
              num_plots = 16, total_lag = 480, \
              n_rows = 4, n_cols = 4)
```

This will produce the following:

In *Figure 4.7*, we can see a gradual degradation in the lag chart appearance in direct relation to the ACF value shown on each plot. This gives us the idea that using longer lags will add noise to the model; the noise looks fairly large by lag 60, and not too large at lag 30. Now, we can use another **statsmodels** function to develop a model and see how that fits in with our intuition.

- The **statsmodels** **AR** function, along with the associated **fit** method, builds an autoregression model. Used with the defaults, it will determine the maximum lag, return all the parameters from lag 1 to the maximum lag, and allow us to predict both in the range of the existing data and into the future:

```python
"""
statsmodels AR function builds an autoregression model
using all the defaults, it will determine the max lag
and provide all the model coefficients
"""
model = AR(df.close)
model_fit = model.fit()

# model_fit now contains all the model information
max_lag = model_fit.k_ar

"""
note that by using defaults, the maximum lag is
computed as round(12*(nobs/100.)**(1/4.))
see https://www.statsmodels.org/devel/generated/statsmodels.tsa.ar_model.AR.fit.html#statsmodels.tsa.ar_model.AR.fit
"""
print('Max Lag: ' + str(max_lag))
print('Coefficients: \n' + str(model_fit.params))
```

The output will be as follows:
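As a quick sanity check, the default maximum-lag rule quoted in the code comment, round(12*(nobs/100.)**(1/4.)), can be evaluated directly for this dataset (assuming 8,192 daily rows, the row count shown by **df.tail()** earlier):

```python
# Default maximum-lag rule used by AR.fit():
# maxlag = round(12 * (nobs / 100.) ** (1 / 4.))
nobs = 8192  # approximate number of rows in the S&P 500 dataset

default_max_lag = round(12 * (nobs / 100.) ** (1 / 4.))
print(default_max_lag)
```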

Note that there are 36 lag coefficients, one for each weight, plus one constant; the function determined that the maximum lag to use in the model is 36 days. It is easy to make predictions and visualize the result:

```python
# how far into the future we want to predict
max_forecast = 365

# generate predictions from the model
pred_close = pd.DataFrame({'pred_close': \
                           model_fit.predict(start = max_lag, \
                                             end = df.shape[0] \
                                             + max_forecast - 1)})

# attach the dates for visualization (indexing with labels beyond
# df's index relies on pandas 0.24 behavior; hence the downgrade)
pred_close['date'] = df.loc[pred_close.index, 'date'].reindex()
pred_close.loc[(max(df.index) + 1):, 'date'] = \
    pd.to_datetime([max(df.date) \
                    + pd.Timedelta(days = i) \
                    for i in range(1, max_forecast + 1)])

"""
visualize the predictions overlaid on the real data,
as well as the extrapolation into the future
"""
fig, ax = plt.subplots(figsize = (10, 7))
ax.plot(df.date, df.close, c = "blue", linewidth = 4, \
        label = 'actual SPX close')
ax.plot(pred_close.loc[0 : len(df.close), 'date'], \
        pred_close.loc[0 : len(df.close), 'pred_close'], \
        c = "yellow", linewidth = 0.5, \
        label = 'predicted SPX close')
ax.plot(pred_close.loc[len(df.close):, 'date'], \
        pred_close.loc[len(df.close):, 'pred_close'], \
        c = "red", linewidth = 2, label = 'forecast SPX close')
ax.set_xlabel('Date', fontsize = 14)
ax.tick_params(axis = 'both', labelsize = 12)
ax.legend()
plt.show()
```

The result will look as follows:

Note that the predictions do an excellent job of following the dataset and that, after the dataset ends, the forecast is relatively linear. This is expected: the model is a linear combination of previous samples, and beyond the last actual data point, each prediction is based on earlier predictions, each of which carries some error, so the forecasts become less and less certain.
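The growing uncertainty can be quantified for the simplest case. For an AR(1) model with coefficient *phi* and unit noise variance (an illustrative assumption; the fitted model here has 36 lags, but the principle is the same), the error variance of a recursive *h*-step forecast accumulates one discounted noise term per step:

```python
# Error variance of a recursive h-step-ahead AR(1) forecast,
# assuming unit noise variance: var(h) = sum of phi**(2*i) for i < h.
def forecast_error_variance(h, phi):
    return sum(phi ** (2 * i) for i in range(h))

phi = 0.95  # illustrative persistence coefficient
print(forecast_error_variance(1, phi))   # one step ahead: just the noise
print(forecast_error_variance(30, phi))  # thirty steps ahead: much larger
```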

- The fit is quite good. We can use the same approach as for linear regression to compare the predictions versus the actual values. We have to do a little work to deal with the fact that the predicted values don't start until the maximum lag beyond the start date of the original data; we need at least the maximum lag number of past values to predict the next one, in this case, 36 values. The result is that we need to offset the index of the datasets we want to compare, which we did not have to do in the linear regression case:

```python
# compare predicted vs. actual
fig, ax = plt.subplots(figsize = (10, 7))
ax.scatter(df.loc[max_lag:(df.shape[0] - 1), 'close'], \
           pred_close.loc[max_lag:(df.shape[0] - 1), 'pred_close'])
ax.tick_params(axis = 'both', labelsize = 12)
ax.set_xlabel('SPX actual value', fontsize = 14)
ax.set_ylabel('SPX predicted value', fontsize = 14)
plt.show()
```

This provides the following plot:

In *Figure 4.10*, it appears as if the predictions are good across all values. We can, however, dig a little deeper by looking at the residuals, which are the differences between the actual and predicted values.

- Calculate the residuals using the same approach as used previously to account for the date offsets:

```python
fig, ax = plt.subplots(figsize = (10, 7))
residuals = pd.DataFrame({'date' : (df.loc[max_lag:\
                                    (df.shape[0] - 1), 'date']),
                          'residual' : df.loc[max_lag:\
                                       (df.shape[0] - 1), 'close'] \
                                       - pred_close.loc\
                                       [max_lag:(df.shape[0] - 1), \
                                        'pred_close']})
ax.scatter(residuals.date, residuals.residual)
ax.tick_params(axis = 'both', labelsize = 12)
ax.set_xlabel('Date', fontsize = 14)
ax.set_ylabel('residual (' + r'$SPX_{act} - SPX_{pred}$' \
              + ')', fontsize = 14)
plt.show()
```

This produces the following plot:

*Figure 4.11* shows that the residuals are spread uniformly around 0, meaning that there appears to be minimal bias in the model, although they seem to increase somewhat over time. This latter characteristic isn't automatically an indication of an issue; it is better to view this data as a percentage of the actual value since, intuitively, the residual of an equally accurate model grows larger as the value gets larger. We can convert the data into a percentage and look at that.

- Calculate the percentage residuals simply by dividing by the actual value (and multiplying by 100). Note that percentage values have issues if the actual values are near zero but, in this case, that is not an issue:

```python
fig, ax = plt.subplots(figsize = (10, 7))
pct_residuals = pd.DataFrame({'date' : residuals.date, \
                              'pct_residual' : 100 \
                              * residuals.residual \
                              / df.loc[max_lag:(df.shape[0] - 1), \
                                       'close']})
ax.scatter(pct_residuals.date, pct_residuals.pct_residual)
ax.tick_params(axis = 'both', labelsize = 12)
ax.set_xlabel('Date', fontsize = 14)
ax.set_ylabel('% residual 100 *(' \
              + r'$SPX_{act} - SPX_{pred}$' + ') / ' \
              + r'$SPX_{act}$', fontsize = 14)
plt.show()
```

The output will be as follows:

Note

To access the source code for this specific section, please refer to https://packt.live/3eAi6DG.

You can also run this example online at https://packt.live/2Z0uEh4. You must execute the entire Notebook in order to get the desired result.

Now that the exercise is successfully completed, upgrade the version of pandas so that you can continue to run the exercises and activities in the rest of the book smoothly. To upgrade pandas, run:

**pip install pandas==1.0.3**

We now see that the percent error is very similar across the entire period, with a few periods where it gets larger. Excluding one outlier in 1987, most of the values are within 10 percent, and the vast majority are within 5 percent. We would conclude that this seems to be a fairly good model. However, we should reserve judgment until we see future results, that is, predictions beyond the data we used to fit the model. We saw in *Figure 4.9* that the future prediction was a fairly linear upward trend after the end of the data.

We'll leave it as an exercise for you to obtain more recent S&P 500 closing data and compare it with the one-year prediction from this model. Also, we must stress that fully qualifying such a model requires separating out training, validation, and test data and performing tests such as cross-validation, which will be covered in *Chapter 6*, *Ensemble Modeling*. As a teaser, the worst error over the year after the data was about 20%, the average error was 1.4%, and the error at the very end of the forecast period was 0.8%. We must *stress* that it is very challenging to predict stock markets, and it might be fortuitous that the forecast period was somewhat uniformly positive.
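Error summaries like the ones quoted above only need the absolute percent errors once you have aligned actual and forecast values; a small helper sketch follows (the arrays below are purely illustrative numbers, not the book's data):

```python
import numpy as np

def percent_errors(actual, forecast):
    """Return (worst, mean) absolute percent error of a forecast."""
    pct = 100 * np.abs(actual - forecast) / actual
    return pct.max(), pct.mean()

# Hypothetical actual closes and model forecasts for four days
actual = np.array([2700.0, 2750.0, 2800.0, 2600.0])
forecast = np.array([2710.0, 2700.0, 2900.0, 2750.0])

worst, avg = percent_errors(actual, forecast)
print(round(worst, 2), round(avg, 2))
```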

From this exercise using an autoregression model, we can see that there is significant predictive power in using these models even when there is missing data. The autoregression model shown for the S&P 500 dataset was able to effectively provide predictions within the range of observed samples. However, outside of this range, when predicting future values for which no measurements have been taken, the predictive power may be somewhat limited. In this particular case, the future predictions seem reasonable. Let's now do an exercise that may be more challenging for this type of model.

## Activity 4.01: Autoregression Model Based on Periodic Data

In this activity, we will use an autoregression model to fit the Austin weather dataset and predict future values. This data has different characteristics from the stock market data and will illustrate some of the challenges associated with applying autoregression.

The steps to be performed are as follows:

- Import the packages and classes needed. As in the stock market exercise, we need **pandas**, **numpy**, the **AR** function from **statsmodels.tsa.ar_model**, the **plot_acf** function from **statsmodels.graphics.tsaplots**, and, of course, **matplotlib.pyplot**.
- Load the Austin weather data (**austin_weather.csv**) and convert the **Date** column to **datetime** as before.
- Plot the complete set of average temperature values (**df.TempAvgF**) with **Date** on the *x* axis.

The output should be as follows:

- Construct an autocorrelation plot to see whether the average temperature can be used with an autoregression model. Consider how this plot differs from the one in *Exercise 4.01, Creating an Autoregression Model*, and why.

The plot should be as follows:

- Extract the actual ACF values using the **numpy.correlate()** function, as in *Exercise 4.01, Creating an Autoregression Model*.
- Use the same **plot_lag_grid** function to investigate the correlation versus the lag plots at various lag values. Consider the fact that the raw data obviously repeats on a period of around 365 days, but also that an autoregression model might not be effective at such long lags. Look at both short and long lags to understand the data.

The output for short lags will be as follows:

The output for longer lags will be as follows:

- Use the **statsmodels** **AR** function and the **model.fit()** method to model the data. Get the maximum model lag from the fit result. Print the coefficients. How many terms are used (by default) in this case?
- Use a maximum forecast period of 365 days (ask yourself why this is reasonable) and generate predictions from the model. Using the same methods as before, match the correct dates to the predictions so that we can visualize them on the same plot as the raw data.
- Plot the predictions (including the 365-day forecast), as well as the original dataset.

The results should look as follows:

We can see from *Figure 4.18* that, as in the stock market case, the predictions in the range of the data used to fit the model are good. However, the future predictions don't seem nearly as good. They start on a reasonable trend, but then level off to a value that clearly isn't correct for some of the forecast period. This is a great example of the limits of autoregression models; they may be very effective for hard-to-predict series over short time periods, but the long term may be significantly in error. The only way to rigorously evaluate such a model before trusting it in production is to use the methods in *Chapter 6*, *Ensemble Modeling*.
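The peaks and valleys that periodic data produces in the ACF are easy to verify on a synthetic signal; in the sketch below, a pure sinusoid stands in for the seasonal temperature pattern:

```python
import numpy as np

# A purely periodic series: period of 50 samples, 20 full cycles
period = 50
t = np.arange(1000)
y = np.sin(2 * np.pi * t / period)

def acf(series, lag):
    """Sample autocorrelation at a single lag."""
    return np.corrcoef(series[:-lag], series[lag:])[0, 1]

# In phase at a full period, out of phase at a half period
print(round(acf(y, period), 3))       # near +1
print(round(acf(y, period // 2), 3))  # near -1
```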

Note

The solution for this activity can be found via this link.

In this section, we've used a variety of tools such as autocorrelation, lag plots, and autoregression to build a predictive model for a time series. Such models are fast to build, can work well on univariate data (where we have no *x* values other than time), and can provide good short-range predictions.

We have explored autoregression models as an alternative to linear regression models for time series data. Autoregression models can be very useful for univariate time series where we don't have any underlying model or obvious predictors we can use for the *x* variables. Autoregression is most widely used in time series, and particularly in economic or financial modeling, where we consider the series to be stochastic or a random variable. We have seen that, in certain cases, autoregression models can be extremely powerful, but that these models can be limited in cases where there is periodic or other non-constant behavior over time. Fundamentally, this limitation is due to autoregression being a linear combination of past values, so the future predictions are always reflective of the recent past. In general, autoregression models are most useful for relatively short-term predictions, although the definition of short term can be quite relative.
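That "linear combination of past values" can be written out directly. A minimal sketch with made-up coefficients (not the values fitted in the exercise):

```python
import numpy as np

# An AR(p) one-step forecast is just:
# y_hat = const + w1*y[t-1] + w2*y[t-2] + ... + wp*y[t-p]
const = 0.5
weights = np.array([0.6, 0.3, 0.1])    # lag-1, lag-2, lag-3 weights
history = np.array([10.0, 11.0, 12.0]) # oldest ... newest observation

# The newest value pairs with the lag-1 weight, so reverse the history
y_hat = const + np.dot(weights, history[::-1])
print(y_hat)  # 0.5 + 0.6*12 + 0.3*11 + 0.1*10 = 12.0
```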

# Summary

In this chapter, we have investigated the use of autoregression models, which predict future values based on the temporal behavior of prior data in the series. Using autoregression modeling, we were able to accurately model the closing price of the S&P 500 over the years 1986 to 2018 and forecast a year into the future. On the other hand, the performance of autoregression modeling in predicting the annually periodic temperature data for Austin, Texas, was more limited.

Now that we have experience with regression problems, we will turn our attention to classification problems in the next chapter.