
4. Autoregression

Overview

This chapter will teach you how to implement autoregression as a method for forecasting values based on past values. By the end of this chapter, you will be able to create an autoregression model and construct time series regression models using autoregression. You will be fully equipped to use autoregressors to model datasets and predict future values.

Introduction

In the previous chapter, we studied the different methods used to construct linear regression models. We learned how to use the least squares method to develop linear models. We made use of dummy variables to improve the performance of these linear models. We also performed linear regression analysis with a polynomial model to improve the model's performance. Next, we implemented the gradient descent algorithm, which handles large datasets and large numbers of variables with ease.

In this chapter, we will be developing autoregression models. Autoregression is a special type of regression that can be used to predict future values based on previous values of the same series.

Autoregression Models

Autoregression models are classical or "standard" modeling methods used on time series data (that is, any dataset that changes with time) and can complement the linear regression techniques covered previously. Autoregression models are often used for forecasting in the economics and finance industries because they work well with univariate time series (where there are no x variables other than time) and with very large datasets (such as streaming data or high-frequency sensor data) for which the linear algebra operations of standard regression might run into memory or performance issues. The "auto" part of autoregression refers to the fact that these models leverage the correlation of a time series with its own past values, hence autoregression. In addition, many systems do not have an associated causal model; such time series data is said to be stochastic. An example is stock price data over time. Although many attempts have been made, and continue to be made, to develop predictive causal models of stock market behavior, very few have been successful. Thus, we can treat a given stock symbol's price over time as a stochastic series and use autoregression to attempt to model it.

Note

To illustrate autoregression, we will use the S&P 500 daily closing prices from 1986 to 2018, which are available in the repository associated with this book (https://packt.live/2w3ZkDw).

The original dataset can be found here:

https://www.kaggle.com/pdquant/sp500-daily-19862018

A graphical view of this data is shown in the following figure:

Figure 4.1: S&P 500 Daily Closing Price

The main principle behind autoregression models is that, given enough previous observations, a reasonable prediction for the future can be made; that is, we are essentially constructing a model using the dataset as a regression against itself, with past values as the predictors. A key factor in choosing an autoregression model is that there is sufficient correlation between future values and past values at specific lag times. Lag time refers to how far into the past the model uses data to predict the future. This idea is sketched in code in the following example.
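To make this concrete, an AR(p) model predicts the next value as a linear combination of the previous p values: y_t = c + w_1*y_(t-1) + ... + w_p*y_(t-p) + error. The following is a minimal sketch of that idea, fitting the weights by ordinary least squares on manually constructed lag columns. The function and variable names are illustrative only, and the exercise that follows uses the statsmodels package rather than this manual approach:

import numpy as np
import pandas as pd

def fit_ar_least_squares(series, p):
    """Fit y_t = c + w_1*y_(t-1) + ... + w_p*y_(t-p) by ordinary least squares."""
    y = np.asarray(series, dtype=float)
    # build one column per lag: column k holds the values shifted back by (k + 1) steps
    X = np.column_stack([y[p - k - 1 : len(y) - k - 1] for k in range(p)])
    X = np.column_stack([np.ones(len(X)), X])  # prepend a column of ones for the constant
    coeffs, _, _, _ = np.linalg.lstsq(X, y[p:], rcond=None)
    return coeffs  # [c, w_1, ..., w_p]

For example, fit_ar_least_squares(df.close, 3) would fit an AR(3) model to the closing-price series loaded in the exercise below.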

Exercise 4.01: Creating an Autoregression Model

In this exercise, we will use autoregression and attempt to predict the S&P 500 closing price 1 year into the future:

Note

This exercise requires an earlier version of pandas. Ensure that you downgrade the version of pandas using the following command:

pip install pandas==0.24.2

  1. Import the necessary packages and classes. In this exercise, we introduce the statsmodels package, which includes a wide range of statistical and modeling functions, including autoregression. If you have not previously installed statsmodels, use the following command from a terminal prompt:

    conda install -c anaconda statsmodels

    If you are not using Anaconda (or Miniconda) and instead install via pip, use the following command:

    pip install -U statsmodels

    Once statsmodels is installed on your system, load the following:

    import pandas as pd

    import numpy as np

    from statsmodels.tsa.ar_model import AR

    from statsmodels.graphics.tsaplots import plot_acf

    import matplotlib.pyplot as plt

  2. Load the S&P 500 data (spx.csv) and convert the date column into a datetime data type:

    df = pd.read_csv('../Datasets/spx.csv')

    df['date'] = pd.to_datetime(df['date'])

    print(df.head())

    print(df.tail())

    We'll get the following output:

    Figure 4.2: S&P 500 historical data

    If you look carefully at the data in Figure 4.2, you might see that some data is missing; for example, there is no data for 1/4/1986 and 1/5/1986. These are weekend dates on which the market is closed. An autoregression model, especially one built from a lot of data, will not be sensitive to the missing values for at least two reasons. First, since the model moves forward one period at a time and, in higher-order models, uses multiple past values, the prediction will be less sensitive to missing values than a model based on, say, a single lag value. Second, in a case such as this, where most missing values are periodic (Saturdays repeat every 7 days, Sundays repeat every 7 days, and so on), if we have a lot of data, the model will automatically account for those days; a quick way to confirm the pattern of gaps is sketched below. However, as we will see, the uncertainty of predicting far beyond the end of the existing data can become large for autoregression models.
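    As a quick, optional check (the column name date matches the DataFrame loaded above; the exact counts will depend on the data), we can count the calendar gaps between consecutive rows. One-day gaps are consecutive trading days, 3-day gaps span weekends, and larger gaps correspond to market holidays:

    # distribution of calendar-day gaps between consecutive rows
    gap_days = df['date'].diff().dt.days.value_counts()
    print(gap_days)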

  3. Plot the raw dataset versus the date data type:

    fig, ax = plt.subplots(figsize = (10, 7))

    ax.plot(df.date, df.close)

    ax.set_title('S&P 500 Daily Closing Price', fontsize = 16)

    ax.set_ylabel('Price ($)', fontsize = 14)

    ax.tick_params(axis = 'both', labelsize = 12)

    plt.show()

    The output will be as follows:

    Figure 4.3: Plot of the S&P 500 closing prices

  4. Before constructing an autoregression model, we should first check whether the data can be used as a regression against itself. As we noted earlier, the success of an autoregression model depends on the ability to use past values in a linear model to predict future values. That means the future values should be strongly correlated with past values. We can investigate this using the statsmodels plot_acf function (which plots the autocorrelation function). We mentioned before that how far back we look into the past is called the lag; we will override the default maximum value of plot_acf and plot lags from 0 days to 4,000 days:

    max_lag = 4000

    fig, ax = plt.subplots(figsize = (10, 7))

    acf_plot = plot_acf(x = df.close, ax = ax, lags = max_lag, \

                        use_vlines = False, alpha = 0.9, \

                        title = 'Autocorrelation of SPX vs. lag')

    ax.grid(True)

    ax.text(1000, 0.01, '90% confidence interval')

    ax.set_xlabel('Lag', fontsize = 14)

    ax.tick_params(axis = 'both', labelsize = 12)

    plt.show()

    The result should appear as follows:

    Figure 4.4: Autocorrelation plot of the S&P 500 closing price versus lag (days)

    We can understand this chart as follows. First, by definition, at 0 lag a series is perfectly correlated to itself, so the autocorrelation function (ACF) value is 1.0. Then, we see that as we increase the lag, the series is less and less correlated, meaning each data point farther back in time carries less information about the value we are predicting. This is very typical of stochastic or random series (note that periodic series will have peaks and valleys in the ACF plot; we will see this later). Also, through the alpha argument in the function call, we plotted a confidence interval, shaded in blue and labeled on the plot as the 90% confidence interval. The meaning of this interval is that any lag outside the interval is considered statistically significant; in other words, the correlation there is statistically valid. To build an autoregression model, we must have ACF values outside the confidence interval to be successful. In this case, we have a fairly high correlation from 0 out to some number of days, which we can use in the model (more on that shortly). Finally, we can see that at lags above about 2,750 days, the correlation is negative; that is, the past values predict the opposite of the future. However, in this case, those long-term negative lags are not very significant. A quick single-lag sanity check on these values is sketched below.
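    As an optional sanity check on the ACF plot, the Pearson correlation of the series with a lagged copy of itself should roughly track the ACF curve at that lag; pandas provides the Series.autocorr() method for this. The values won't match the plot exactly, because plot_acf uses the mean and variance of the full series rather than only the overlapping portions:

    # correlation of the closing price with itself 100 trading days earlier
    print(df.close.autocorr(lag = 100))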

  5. To get some intuition for the ACF results, let's choose a fairly short lag of 100 days and plot both the original data and the lagged data on one chart. We can do that using the pandas shift() method:

    spx_shift_100 = df.copy()

    spx_shift_100['close'] = df.close.shift(100)

    fig, ax = plt.subplots(figsize = (10, 7))

    ax.plot(df.date, df.close, c = "blue")

    ax.plot(spx_shift_100.date, spx_shift_100.close, c = "red")

    plt.show()

    The output will be as follows:

    Figure 4.5: S&P 500 closing price (blue) and with a lag of 100 days (red)

    In Figure 4.5, the red line indicates the values from 100 days ago, as compared to the blue line, which indicates actual (present) values for a given date. We see that during periods when the value is increasing, the past values are below the actual, and when the value is decreasing, the opposite is true. This makes intuitive sense. Importantly, during large portions of the time period covered, the vertical space between the two curves looks about constant. That means, intuitively, that the relationship of the past to the present is roughly similar. If you think about these curves for some time, you will also see the limitation of autoregression—the predictions will always look like the recent history, so when things change, the predictions will be less accurate until the model "catches up" with the new behavior.

  6. There is one more way in which we can visualize the correlation we are analyzing. In the multiple linear regression case, we introduced a chart where we plotted the predicted values against the actual values; perfect prediction fell along the diagonal. Similarly, instead of plotting the lagged values and the actual values against time, we can plot the lagged values against the actual values. Let's see what this looks like:

    print(spx_shift_100.head(), '\n', spx_shift_100.tail())

    fig, ax = plt.subplots(figsize = (7, 7))

    ax.scatter(df.loc[100:, 'close'], spx_shift_100.loc[100:, 'close'])

    ax.set_xlim(0, 3000)

    ax.set_ylim(0, 3000)

    plt.show()

    The output will be as follows:

                date    close
    0     1986-01-02      NaN
    1     1986-01-03      NaN
    2     1986-01-06      NaN
    3     1986-01-07      NaN
    4     1986-01-08      NaN
                date    close
    8187  2018-06-25  2823.81
    8188  2018-06-26  2821.98
    8189  2018-06-27  2762.13
    8190  2018-06-28  2648.94
    8191  2018-06-29  2695.14

    The plot will be as follows:

    Figure 4.6: S&P closing value with a lag of 100 days versus actual values

    Figure 4.6 shows that most of the values with a lag of 100 days are along a diagonal line, meaning that the relationship between the lagged values and the present values is similar across all actual values.

  7. We could create plots like Figure 4.6 across a range of lag times to try to understand which lags would be useful in a model and which would not. Before we do that, it would be useful to have the ACF values from Figure 4.4, so we could relate what we see in the lag plots to the correlation function value. The plot_acf function of statsmodels is based upon an underlying numpy function, correlate. We can use that to reproduce the values shown in Figure 4.4:

    """

    the statsmodels plot_acf is based upon the numpy correlate

    function, so we can generate the actual values for

    illustration and so we can label some later plots

    the standard presentation of an acf plot has the value at

    lag 0 == 1; the correlate function returns unscaled

    values so we get the first value for scaling to 1

    the values to be tested in the function must have

    the mean of the un-shifted series subtracted from

    both series

    """

    corr0 = np.correlate(df.close[0: ] - df.close.mean(), \

                         df.close[0: ] - df.close.mean(), \

                         mode = 'valid')

    corrs = [np.correlate(df.close[:(df.close.shape[0] - i)] \

                          - df.close.mean(), df.close[i: ] \

                          - df.close.mean(), mode = 'valid')

            for i in range(max_lag)] / corr0

    Note that we subtract the mean of the underlying series (df.close) from every value used in the correlate function. This is a mathematical subtlety needed to be consistent with the output of the plot_acf function. A quick cross-check against the library's own ACF values is sketched below.
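    As an optional cross-check (assuming your statsmodels version exposes statsmodels.tsa.stattools.acf with an nlags argument), the manually computed values should closely track the library's own ACF values:

    from statsmodels.tsa.stattools import acf
    lib_acf = acf(df.close, nlags = 10)  # ACF values for lags 0 through 10
    print(lib_acf[:5])
    print(np.ravel(corrs[:5]))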

  8. Now, rather than creating many charts by hand at different lag values, let's create a function to generate a grid of plots that we can use with various lag ranges, numbers of plots, and so on. We will pass the df.close series to the function, along with the preceding corrs values, and with parameters to control the plots:

    """

    utility function to plot out a range of

    plots depicting self-correlation

    """

    def plot_lag_grid(series, corrs, axis_min, axis_max, \

                      num_plots, total_lag, n_rows, n_cols):

        lag_step = int(total_lag / num_plots)

        fig = plt.figure(figsize = (18, 16))

        for i in range(num_plots):

            corr = corrs[lag_step * i]

            ax = fig.add_subplot(n_rows, n_cols, i + 1)

            ax.scatter(series, series.shift(lag_step * i))

            ax.set_xlim(axis_min, axis_max)

            ax.set_ylim(axis_min, axis_max)

            ax.set_title('lag = ' + str(lag_step * i))

            ax.text(axis_min + 0.05 * (axis_max - axis_min), \

                    axis_max - 0.05 * (axis_max - axis_min), \

                    'correlation = ' + str(round(corr[0], 3)))

        fig.tight_layout()

        plt.show()

  9. We are now prepared to get a deeper understanding of how the correlation relates to the lag plots. We'll call the function:

    """

    create a grid to see how well the data at increasing

    lags correlates to the original data

    'perfect' correlation will appear as a diagonal line

    the farther from the line, the poorer the correlation

    """

    plot_lag_grid(df.close, corrs, df.close.min(), df.close.max(), \

                  num_plots = 16, total_lag = 480, \

                  n_rows = 4, n_cols = 4)

    This will produce the following:

    Figure 4.7: Lag plots at various values

    In Figure 4.7, we can see a gradual degradation in the appearance of the lag charts in direct relation to the ACF value shown on each plot. This suggests that using longer lags will add noise to the model; the noise already looks fairly large by lag 60, but not too large at lag 30 (a simple way to quantify where the correlation weakens is sketched below). Now, we can use another statsmodels function to develop a model and see how it fits in with our idea.
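    One simple, optional way to quantify where the correlation starts to weaken is to find the first lag at which the manually computed ACF drops below a chosen threshold; the 0.9 value here is an arbitrary assumption, not a rule:

    threshold = 0.9
    # np.argmax returns the first index where the condition is True
    # (note: it returns 0 if the ACF never drops below the threshold)
    first_weak_lag = int(np.argmax(np.ravel(corrs) < threshold))
    print('ACF first drops below', threshold, 'at lag', first_weak_lag)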

  10. The statsmodels function, AR, along with the associated fit method, builds an autoregression model. Used with the defaults, it will determine the max lag, return all the parameters from lag 1 to the max lag, and allow us to predict both in the range of the existing data and into the future:

    """

    statsmodels AR function builds an autoregression model

    using all the defaults, it will determine the max lag

    and provide all the model coefficients

    """

    model = AR(df.close)

    model_fit = model.fit()

    # model fit now contains all the model information

    max_lag = model_fit.k_ar

    """

    note that by using defaults, the maximum lag is

    computed as round(12*(nobs/100.)**(1/4.))

    see https://www.statsmodels.org/devel/generated/statsmodels.tsa.ar_model.AR.fit.html#statsmodels.tsa.ar_model.AR.fit

    """

    print('Max Lag: ' + str(max_lag))

    print('Coefficients: \n' + str(model_fit.params))

    The output will be as follows:

    Figure 4.8: Lag coefficients from the statsmodels AR function

    Note that there are 36 lag coefficients (one weight per lag) plus one constant; the function determined that the maximum lag to use in the model is 36 days. It is easy to make predictions and visualize the result:

    # how far into the future we want to predict

    max_forecast = 365

    # generate predictions from the model

    pred_close = pd.DataFrame({'pred_close': \

                               model_fit.predict(start = max_lag, \

                                                 end = df.shape[0] \

                                                 + max_forecast - 1)})

    # attach the dates for visualization

    pred_close['date'] = df.loc[pred_close.index, 'date'].reindex()

    pred_close.loc[(max(df.index) + 1):, 'date'] = \

    pd.to_datetime([max(df.date) \

                   + pd.Timedelta(days = i) \

                   for i in range(1, max_forecast + 1)])

    """

    visualize the predictions overlaid on the real data

    as well as the extrapolation to the future

    """

    fig, ax = plt.subplots(figsize = (10, 7))

    ax.plot(df.date, df.close, c = "blue", linewidth = 4, \

            label = 'actual SPX close')

    ax.plot(pred_close.loc[0 : len(df.close), 'date'], \

            pred_close.loc[0 : len(df.close), 'pred_close'], \

            c = "yellow", linewidth = 0.5, \

            label = 'predicted SPX close')

    ax.plot(pred_close.loc[len(df.close):, 'date'], \

            pred_close.loc[len(df.close):, 'pred_close'], \

            c = "red", linewidth = 2, label = 'forecast SPX close')

    ax.set_xlabel('Date', fontsize = 14)

    ax.tick_params(axis = 'both', labelsize = 12)

    ax.legend()

    plt.show()

    The result will look as follows:

    Figure 4.9: S&P 500 closing values, predicted values, and forecasted (future) values from the autoregression model with order 36

    Note that the predictions do an excellent job of following the dataset, and that after the dataset has ended, the forecast is relatively linear. This is to be expected: the model is a linear combination of the previous samples, so once we forecast beyond the end of the data, each prediction is based more and more on earlier predictions, each of which carries some error, and the forecast becomes less and less certain. A worked check of how a single one-step prediction is formed from the coefficients is sketched below.
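    To make the linear-combination point concrete, we can reconstruct a single in-sample, one-step-ahead prediction by hand from the fitted coefficients. This assumes the default trend setting, so that the first entry of model_fit.params is the constant; the row index 5000 is an arbitrary choice for illustration:

    params = np.asarray(model_fit.params)      # [constant, w_1, ..., w_max_lag]
    t = 5000                                   # an arbitrary in-sample row
    window = df.close.values[t - max_lag : t]  # the max_lag values before row t
    manual_pred = params[0] + np.dot(params[1:], window[::-1])
    # should (nearly) match the model's in-sample prediction for row t
    print(manual_pred, pred_close.loc[t, 'pred_close'])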

  11. The fit is quite good. We can use the same approach as for linear regression to compare the predictions versus the actual values. We have to do a little work to deal with the fact that the predicted values don't start until the maximum lag beyond the start date of the original data—we need at least the maximum lag number of past values to predict the next one—in this case, 36 values. The result is that we need to offset the index of the datasets we want to compare, which we did not have to do in the linear regression case:

    # compare predicted vs. actual

    fig, ax = plt.subplots(figsize = (10, 7))

    ax.scatter(df.loc[max_lag:(df.shape[0] - 1), 'close'], \

               pred_close.loc[max_lag:(df.shape[0] - 1), 'pred_close'])

    ax.tick_params(axis = 'both', labelsize = 12)

    ax.set_xlabel('SPX actual value', fontsize = 14)

    ax.set_ylabel('SPX predicted value', fontsize = 14)

    plt.show()

    This provides the following plot:

    Figure 4.10: Predicted S&P 500 closing values versus the actual values

    In Figure 4.10, it appears as if the predictions are good across all values. We can, however, dig a little deeper by looking at the residuals, which are the differences between the actual and predicted values:

  12. Calculate the residuals using the same approach as used previously, to account for the date offsets:

    fig, ax = plt.subplots(figsize = (10, 7))

    residuals = pd.DataFrame({'date' : (df.loc[max_lag:\

                                       (df.shape[0] - 1), 'date']),

                              'residual' : df.loc[max_lag:\

                                          (df.shape[0] - 1), 'close'] \

                                           - pred_close.loc\

                                           [max_lag:(df.shape[0] - 1), \

                                           'pred_close']})

    ax.scatter(residuals.date, residuals.residual)

    ax.tick_params(axis = 'both', labelsize = 12)

    ax.set_xlabel('Date', fontsize = 14)

    ax.set_ylabel('residual (' + r'$SPX_{act} - SPX_{pred}$' \

                  + ')', fontsize = 14)

    plt.show()

    This produces the following plot:

    Figure 4.11: Residual values versus time for the autoregression model of the S&P 500 closing price

    Figure 4.11 shows that the residuals are spread fairly uniformly around 0, meaning that there appears to be minimal bias in the model, although they seem to increase somewhat over time. This latter characteristic isn't automatically an indication of an issue; it is better to view the residuals as a percentage of the actual value because, intuitively, the residuals of an equally accurate model will be larger as the values get larger. We can convert the data into percentages and look at that.

  13. Calculate the percentage residuals simply by dividing by the actual values (and multiplying by 100). Note that percentage values have issues if the actual values are near zero but, in this case, that is not an issue:

    fig, ax = plt.subplots(figsize = (10, 7))

    pct_residuals = pd.DataFrame({'date' : residuals.date, \

                                  'pct_residual' : 100 \

                                  * residuals.residual \

                                  / df.loc[max_lag:(df.shape[0] - 1), \

                                                    'close']})

    ax.scatter(pct_residuals.date, pct_residuals.pct_residual)

    ax.tick_params(axis = 'both', labelsize = 12)

    ax.set_xlabel('Date', fontsize = 14)

    ax.set_ylabel('% residual 100 *(' \

                  + r'$SPX_{act} - SPX_{pred}$' + ') / ' \

                  + r'$SPX_{act}$', fontsize = 14)

    plt.show()

    The output will be as follows:

Figure 4.12: Residual values as a percent of actual value versus time

Note

To access the source code for this specific section, please refer to https://packt.live/3eAi6DG.

You can also run this example online at https://packt.live/2Z0uEh4. You must execute the entire Notebook in order to get the desired result.

Now that the exercise is successfully completed, upgrade the version of pandas so that the exercises and activities in the rest of the book continue to run smoothly. To upgrade pandas, run:

pip install pandas==1.0.3

We now see that the percent error is very similar across the entire period, with a few periods where it gets larger. Excluding one outlier in 1987, most of the values are within 10 percent, and the vast majority are within 5 percent. We would conclude that this seems to be a fairly good model. However, we should reserve judgment until we see future results, meaning predictions beyond the data we used to fit the model. We saw in Figure 4.9 that the future prediction was a fairly linear upward trend after the end of the data.

We'll leave it as an exercise for you to obtain more recent S&P 500 closing data and compare it with the one-year forecast from this model; a sketch of how such a comparison might be coded follows. Also, we must stress that fully qualifying such a model requires separating out training, validation, and test data and performing tests such as cross-validation, which will be covered in Chapter 6, Ensemble Modeling. As a teaser, the worst error over the year after the data was about 20%, the average error was 1.4%, and the error at the very end of the forecast period was 0.8%. We must stress that it is very challenging to predict stock markets, and it might be fortuitous that the forecast period was somewhat uniformly positive.
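Below is a rough sketch of how such a comparison might be coded once newer data is in hand. The file name spx_recent.csv is hypothetical (any CSV with the same date and close columns would do), and the code assumes the df and pred_close objects from the exercise are still in memory:

# hypothetical file containing closing prices after June 2018
recent = pd.read_csv('../Datasets/spx_recent.csv', parse_dates = ['date'])
# keep only the forecast rows (dates beyond the end of the original data)
forecast = pred_close.loc[len(df.close):, ['date', 'pred_close']]
# align forecasts with actuals on matching dates
merged = forecast.merge(recent, on = 'date', how = 'inner')
pct_error = 100 * (merged.close - merged.pred_close).abs() / merged.close
print('worst % error:', pct_error.max(), 'average % error:', pct_error.mean())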

From this exercise using an autoregression model, we can see that there is significant predictive power in using these models even when there is missing data. The autoregression model shown for the S&P 500 dataset was able to effectively provide predictions within the range of observed samples. However, outside of this range, when predicting future values for which no measurements have been taken, the predictive power may be somewhat limited. In this particular case, the future predictions seem reasonable. Let's now do an exercise that may be more challenging for this type of model.

Activity 4.01: Autoregression Model Based on Periodic Data

In this activity, we will now use an autoregression model to fit the Austin weather dataset and predict future values. This data has different characteristics to the stock market data and will illustrate some of the challenges associated with applying autoregression.

The steps to be performed are as follows:

  1. Import the packages and classes needed. As in the stock market exercise, we need pandas, numpy, the AR function from statsmodels.tsa.ar_model, the plot_acf function from statsmodels.graphics.tsaplots, and, of course, matplotlib.pyplot.
  2. Load the Austin weather data (austin_weather.csv) and convert the Date column to datetime as before.
  3. Plot the complete set of average temperature values (df.TempAvgF) with Date on the x axis.

    The output should be as follows:

    Figure 4.13: Plot of Austin temperature data over several years

  4. Construct an autocorrelation plot to see whether the average temperature can be used with an autoregression model. Consider how this plot is different from Exercise 4.01, Creating an Autoregression Model and why.

    The plot should be as follows:

    Figure 4.14: Autocorrelation versus lag (days)

  5. Extract the actual ACF values using the numpy.correlate() function as in Exercise 4.01, Creating an Autoregression Model.
  6. Use the same plot_lag_grid function to investigate the lag plots at various lag values and relate them to the correlation values. Consider the fact that the raw data obviously repeats with a period of around 365 days, but also that an autoregression model might not be effective at such long lags. Look at both short and long lags to understand the data.

    The output for short lags will be as follows:

    Figure 4.15: Lag plots with short lags

    The output for longer lags will be as follows:

    Figure 4.16: Lag plots with longer lags

  7. Use the statsmodels AR function and the model.fit() method to model the data. Get the maximum model lag from the model.fit() method. Print the coefficients. How many terms are used (by default) in this case?
  8. Use a maximum forecast period of 365 days (ask—why is this reasonable?) and generate predictions from the model. Using the same methods as before, match the correct dates to the predictions so that we can visualize them on the same plot as the raw data.
  9. Plot the predictions (including the 365-day forecast), as well as the original dataset.

    The results should look as follows:

Figure 4.17: Austin temperature data, in-data predictions, and out-of-data forecast

Figure 4.18: Detail of predictions near the end of the data

We can see from Figure 4.18 that, as in the stock market case, the predictions in the range of the data used to fit the model are good. However, the future predictions don't seem nearly as good. They start on a reasonable trend, but then level off to a value that clearly isn't correct for some of the forecast period. This is a great example of the limits of autoregression models; they may be very effective for hard-to-predict series over short time periods, but the long term may be significantly in error. The only way to rigorously evaluate such a model before trusting it in production is to use the methods in Chapter 6, Ensemble Modeling.

Note

The solution for this activity can be found via this link.

In this section, we've used a variety of tools such as autocorrelation, lag plots, and autoregression to build a predictive model for a time series. Such models are fast to build, can work well on univariate data (where we have no x values other than time), and can provide good short-range predictions.

We have explored autoregression models as an alternative to linear regression models for time series data. Autoregression models can be very useful for univariate time series where we don't have any underlying model or obvious predictors to use as x variables. Autoregression is most widely used in time series analysis, and particularly in economic or financial modeling, where we consider the series to be stochastic, or a random variable. We have seen that, in certain cases, autoregression models can be extremely powerful, but that these models can be limited in cases where there are periodic or other non-constant behaviors over time. Fundamentally, this limitation is due to autoregression being a linear combination of past values, so the future predictions always reflect the recent past. In general, autoregression models are most useful for relatively short-term predictions, although the definition of short term can be quite relative.

Summary

In this chapter, we have investigated the use of autoregression models, which predict future values based on the temporal behavior of prior data in the series. Using autoregression modeling, we were able to accurately model the closing price of the S&P 500 over the years 1986 to 2018 and a year into the future. On the other hand, the performance of autoregression modeling to predict annually periodic temperature data for Austin, Texas, seemed more limited.

Now that we have experience with regression problems, we will turn our attention to classification problems in the next chapter.