Regression Simulation

Abstract

A California specialty food client wants to expand their offerings with a dairy division, but first they need an understanding of production trends. They have asked us to create a regression model for Milk, Cottage Cheese, or Ice Cream production in California.

Let's load our packages and the California dairy production dataset, then look at the data frame to get an idea of the data we loaded.

In [2]:
# Load necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as ss
import statsmodels.tsa.seasonal as sts
from statsmodels.tsa.stattools import adfuller
import statsmodels.graphics.tsaplots as splt
from scipy.stats import zscore
import sklearn.linear_model as lm
# Note: this is the legacy ARIMA API; newer statsmodels versions
# provide statsmodels.tsa.arima.model.ARIMA instead.
from statsmodels.tsa.arima_model import ARIMA
from statsmodels.tsa.arima_model import ARIMAResults
In [3]:
#Load dataset
fileName = 'CADairyProduction.csv'  # assumed file name/path for the dataset
df = pd.read_csv(fileName, header = 'infer')
In [4]:
#print top 5 rows and shape
print(df.head())
print(df.shape)
   Year Month  Cotagecheese.Prod  Icecream.Prod  Milk.Prod  N.CA.Fat.Price  \
0  1995   Jan              4.370         51.595      2.112          0.9803   
1  1995   Feb              3.695         56.086      1.932          0.8924   
2  1995   Mar              4.538         68.453      2.162          0.8924   
3  1995   Apr              4.280         65.722      2.130          0.8967   
4  1995   May              4.470         73.730      2.227          0.8967   

   Month.Count  monthNumSqred  monthNumCubed  
0            1              1              1  
1            2              4              8  
2            3              9             27  
3            4             16             64  
4            5             25            125  
(228, 9)
In [5]:
#print columns
print(df.columns)
Index(['Year', 'Month', 'Cotagecheese.Prod', 'Icecream.Prod', 'Milk.Prod',
       'N.CA.Fat.Price', 'Month.Count', 'monthNumSqred', 'monthNumCubed'],
      dtype='object')

We should look at the value counts of our Year column to determine which years we are working with, so that we can set the index to the matching date range.

In [6]:
#print unique year and value counts
print(df['Year'].unique())
print(df['Year'].value_counts())
[1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009
 2010 2011 2012 2013]
2013    12
2003    12
1996    12
1997    12
1998    12
1999    12
2000    12
2001    12
2002    12
2004    12
2012    12
2005    12
2006    12
2007    12
2008    12
2009    12
2010    12
2011    12
1995    12
Name: Year, dtype: int64

Looking at the unique years and their value counts, we can see that the data runs from 1995 to 2013 and that every year includes 12 months of data. The dataset is already sorted, so the date range for the index needs to be 1/1/1995 to 12/31/2013. Let's set the data frame index to that date range with a monthly frequency.

In [7]:
#set index of data frame to date range 1/1/1995-12/31/2013
df.index = pd.date_range(start = '1/1/1995', end = '12/31/2013', freq = 'M') 

Let's take another look at our data frame now.

In [8]:
#print top 5 rows and shape
print(df.head())
print(df.shape)
            Year Month  Cotagecheese.Prod  Icecream.Prod  Milk.Prod  \
1995-01-31  1995   Jan              4.370         51.595      2.112   
1995-02-28  1995   Feb              3.695         56.086      1.932   
1995-03-31  1995   Mar              4.538         68.453      2.162   
1995-04-30  1995   Apr              4.280         65.722      2.130   
1995-05-31  1995   May              4.470         73.730      2.227   

            N.CA.Fat.Price  Month.Count  monthNumSqred  monthNumCubed  
1995-01-31          0.9803            1              1              1  
1995-02-28          0.8924            2              4              8  
1995-03-31          0.8924            3              9             27  
1995-04-30          0.8967            4             16             64  
1995-05-31          0.8967            5             25            125  
(228, 9)

Now that our date index has been set, let's plot the time series of our three California dairy products.

In [9]:
fig, (ax1, ax2, ax3) = plt.subplots(3, 1) 
df['Cotagecheese.Prod'].plot(ax = ax1) 
df['Icecream.Prod'].plot(ax = ax2) 
df['Milk.Prod'].plot(ax = ax3) 
ax1.set_ylabel('Cottage Cheese') 
ax2.set_ylabel('Ice Cream') 
ax3.set_ylabel('Milk') 
ax3.set_xlabel('Date') 
ax1.set_title('California production time series') 
plt.tight_layout() 

These three production time series look interesting, but let's focus on milk production for our analysis: cottage cheese production appears to be in decline and would not be a good avenue for our client to pursue, and ice cream is highly seasonal. The milk series is not stationary; it increased gradually before reaching a plateau where it currently fluctuates. Since the seasonality of the milk series appears to be multiplicative, let's create a log of the milk production column.

In [10]:
df['Milk_log'] = np.log(df['Milk.Prod'] ) 
df['Milk_log'].plot() 
plt.title('Time series plot of Log of California Milk production') 
plt.ylabel('Value') 
plt.xlabel('Date') 
df.columns 
Out[10]:
Index(['Year', 'Month', 'Cotagecheese.Prod', 'Icecream.Prod', 'Milk.Prod',
       'N.CA.Fat.Price', 'Month.Count', 'monthNumSqred', 'monthNumCubed',
       'Milk_log'],
      dtype='object')
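
As an aside, one way to sanity-check the multiplicative-seasonality assumption is to decompose the raw milk series with a multiplicative model and confirm that the log of its seasonal component roughly matches the additive seasonal component of the logged series. A minimal sketch, reusing the sts alias imported above:

# Sketch: multiplicative decomposition of the raw Milk series.
# If seasonality is truly multiplicative, the log of this seasonal component
# should roughly match the additive seasonal component of Milk_log.
mult_decomp = sts.seasonal_decompose(df['Milk.Prod'], model = 'multiplicative')
mult_decomp.seasonal.plot()
plt.title('Multiplicative seasonal component of raw Milk production')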

Decomposition

Now let's create a function that will decompose a time series into its seasonal, trend, and residual (STL) components and plot the output. The function will then return a data frame holding the seasonal, trend, and residual components.

In [11]:
#function to STL decompose given Time series
def decomp_ts(ts, model = 'additive'):
    res = sts.seasonal_decompose(ts, model = model) 
    res.plot() 
    return(pd.DataFrame({'resid': res.resid, 
                         'trend': res.trend, 
                         'seasonal': res.seasonal},
                       index = ts.index) ) 

Now let's decompose our Milk_log column, plot the output, and print the head and tail of the resulting decomposition data frame.

In [12]:
milk_decomp = decomp_ts(df['Milk_log']) 
print(milk_decomp.head(12)) #print top 12 rows
print(milk_decomp.tail(12)) #print bottom 12 rows
               resid     trend  seasonal
1995-01-31       NaN       NaN  0.009169
1995-02-28       NaN       NaN -0.063646
1995-03-31       NaN       NaN  0.046570
1995-04-30       NaN       NaN  0.026694
1995-05-31       NaN       NaN  0.053586
1995-06-30       NaN       NaN  0.004263
1995-07-31  0.026259  0.746283  0.008616
1995-08-31  0.016510  0.747581  0.002307
1995-09-30  0.014267  0.749151 -0.039741
1995-10-31  0.008047  0.750003 -0.006162
1995-11-30 -0.003504  0.750755 -0.039215
1995-12-31 -0.011135  0.751219 -0.002441
               resid     trend  seasonal
2013-01-31  0.002499  1.230179  0.009169
2013-02-28  0.006582  1.229856 -0.063646
2013-03-31  0.024035  1.231219  0.046570
2013-04-30  0.028554  1.231778  0.026694
2013-05-31  0.027831  1.232307  0.053586
2013-06-30  0.012240  1.233112  0.004263
2013-07-31       NaN       NaN  0.008616
2013-08-31       NaN       NaN  0.002307
2013-09-30       NaN       NaN -0.039741
2013-10-31       NaN       NaN -0.006162
2013-11-30       NaN       NaN -0.039215
2013-12-31       NaN       NaN -0.002441

Looking at the decomposition we can see that there is a significant seasonal component to milk production. We can also see a clear trend, and the residual appears stationary, though with somewhat increased variance near the end. Let's write a function that conducts the Dickey-Fuller test for stationarity, then run it on the decomposed milk residual, skipping the first 6 and last 6 rows since they are NaNs.

In [13]:
#Conduct Dickey-Fuller test and print key statistics
def DF_Test(ts):
    stationary = adfuller(ts) 
    ## Print the results
    print('D-F statistic = ' + str(stationary[0])) 
    print('p-value = ' + str(stationary[1]))
    print('number of lags used = ' + str(stationary[2]))
    print('Critical value at 5% confidence = ' + str(stationary[4]['5%']))
    print('Critical value at 10% confidence = ' + str(stationary[4]['10%'])) 
In [14]:
DF_Test(milk_decomp.resid[6:-6]) 
D-F statistic = -8.28093568269
p-value = 4.52179222386e-13
number of lags used = 7
Critical value at 5% confidence = -2.87553798678
Critical value at 10% confidence = -2.57423108081

Given the D-F statistic and p-value above, we can reject the null hypothesis that the residual has a unit root; the residual is stationary.
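
To make the decision rule explicit, here is a minimal sketch of the comparison (reusing adfuller, imported above; the 0.05 threshold is the conventional choice):

# Sketch: explicit Dickey-Fuller decision rule at the 5% level.
stat, p_value = adfuller(milk_decomp.resid[6:-6])[:2]
if p_value < 0.05:
    print('Reject the unit-root null: the residual looks stationary')
else:
    print('Cannot reject the unit-root null: the residual may be non-stationary')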

Now let's calculate and plot the autocorrelation function and partial autocorrelation function of our decomposed milk residual.

In [15]:
splt.plot_acf(milk_decomp.resid[6:-6], lags = 40) 
splt.plot_pacf(milk_decomp.resid[6:-6], lags = 40) 

Looking at the ACF plot above, besides lag 0 there appears to be one more lag showing significance. As a starting point, let's fit an ARIMA model of order (2,0,2) to the decomposed milk residual. First let's create a function that fits an ARIMA model of a given order, prints the model summary, and returns the fitted model.

In [16]:
#function to fit ARIMA model, return fit model and print model summary
def model_ARIMA(ts, order): 
    model = ARIMA(ts, order = order) 
    model_fit = model.fit(disp=0, method='mle', trend='nc') 
    print(model_fit.summary()) 
    return(model_fit)
In [17]:
arima_milk = model_ARIMA(milk_decomp.resid[6:-6], order = (2,0,2)) 
                              ARMA Model Results                              
==============================================================================
Dep. Variable:                  resid   No. Observations:                  216
Model:                     ARMA(2, 2)   Log Likelihood                 659.096
Method:                           mle   S.D. of innovations              0.011
Date:                Sat, 16 Jun 2018   AIC                          -1308.192
Time:                        01:19:28   BIC                          -1291.316
Sample:                    07-31-1995   HQIC                         -1301.374
                         - 06-30-2013                                         
===============================================================================
                  coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------
ar.L1.resid     1.5079      0.081     18.516      0.000       1.348       1.667
ar.L2.resid    -0.6937      0.075     -9.216      0.000      -0.841      -0.546
ma.L1.resid    -1.1349      0.110    -10.306      0.000      -1.351      -0.919
ma.L2.resid     0.1560      0.109      1.434      0.153      -0.057       0.369
                                    Roots                                    
=============================================================================
                  Real          Imaginary           Modulus         Frequency
-----------------------------------------------------------------------------
AR.1            1.0868           -0.5102j            1.2006           -0.0699
AR.2            1.0868           +0.5102j            1.2006            0.0699
MA.1            1.0257           +0.0000j            1.0257            0.0000
MA.2            6.2503           +0.0000j            6.2503            0.0000
-----------------------------------------------------------------------------

The model above does not look bad; however, the MA.L2 coefficient has a rather high p-value and a confidence interval that includes 0, so we cannot conclude that this term is helping the model. Let's create a function below that steps through ARIMA orders on the decomposed milk residual up to given maximum values of p, d, and q, keeping track of the model with the best (lowest) BIC. The function will return the best BIC, the best order, and the best model.

In [18]:
#Function Fits ARIMA models and returns the BIC, the Order of the model and the fit model
def model_ARIMA_2(ts, order):
    model = ARIMA(ts, order = order) 
    model_fit = model.fit(disp=0, method='mle', trend='nc')
    BIC = model_fit.bic  # extract the BIC of the fitted model
    print('Testing model of order: ' + str(order) + ' with BIC = ' + str(BIC))
    return(BIC, order, model_fit)

#Function to find the best-BIC ARIMA model by looping over orders up to the given max p, d, q
def step_ARIMA(resid, p_max, d_max, q_max):
    best_BIC = float('inf')
    best_order = None
    best_model = None
    #loop through candidate orders
    for p in range(p_max + 1):
        for d in range(d_max + 1):
            for q in range(q_max + 1):
                if(p > 0 or q > 0):
                    #try to fit a model of the given order; skip it if the fit fails
                    try:
                        order = (p, d, q)
                        BIC, order, model = model_ARIMA_2(resid, order)
                        #keep this model if its BIC beats the best so far
                        if(BIC < best_BIC):
                            best_model = model
                            best_BIC = BIC
                            best_order = order
                    except Exception:
                        pass
    return(best_BIC, best_order, best_model)

Now let's run the function above on our decomposed milk residual with a maximum p of 4, d of 0, and q of 4.

In [19]:
#Determine best ARIMA model of given max orders and print best Model
BIC, order, model = step_ARIMA(milk_decomp.resid[6:-6], 4, 0, 4)    
print('***************************************')
print('Best model with BIC = ' + str(BIC) + ' and with order '+ str(order))
Testing model of order: (0, 0, 1) with BIC = -1259.2535158
Testing model of order: (0, 0, 2) with BIC = -1263.38952582
Testing model of order: (0, 0, 3) with BIC = -1261.65898902
Testing model of order: (0, 0, 4) with BIC = -1259.6340335
Testing model of order: (1, 0, 0) with BIC = -1266.33787122
Testing model of order: (1, 0, 1) with BIC = -1262.70759988
Testing model of order: (1, 0, 2) with BIC = -1259.39567819
Testing model of order: (2, 0, 0) with BIC = -1263.78487217
Testing model of order: (2, 0, 1) with BIC = -1294.74005217
Testing model of order: (2, 0, 2) with BIC = -1291.31560337
Testing model of order: (3, 0, 0) with BIC = -1265.06323554
Testing model of order: (3, 0, 1) with BIC = -1291.29082633
Testing model of order: (4, 0, 0) with BIC = -1267.27307533
Testing model of order: (4, 0, 1) with BIC = -1285.96385021
Testing model of order: (4, 0, 3) with BIC = -1283.60993067
***************************************
Best model with BIC = -1294.74005217 and with order (2, 0, 1)

The best ARIMA model for the STL decomposition residual of the Milk production time series is of order (2,0,1). Let's go ahead and run that model below and look at the summary statistics.

In [20]:
arima_milk = model_ARIMA(milk_decomp.resid[6:-6], order = (2,0,1)) 
                              ARMA Model Results                              
==============================================================================
Dep. Variable:                  resid   No. Observations:                  216
Model:                     ARMA(2, 1)   Log Likelihood                 658.121
Method:                           mle   S.D. of innovations              0.011
Date:                Sat, 16 Jun 2018   AIC                          -1308.241
Time:                        01:19:31   BIC                          -1294.740
Sample:                    07-31-1995   HQIC                         -1302.787
                         - 06-30-2013                                         
===============================================================================
                  coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------
ar.L1.resid     1.4072      0.055     25.587      0.000       1.299       1.515
ar.L2.resid    -0.6030      0.055    -10.956      0.000      -0.711      -0.495
ma.L1.resid    -0.9787      0.014    -67.772      0.000      -1.007      -0.950
                                    Roots                                    
=============================================================================
                  Real          Imaginary           Modulus         Frequency
-----------------------------------------------------------------------------
AR.1            1.1669           -0.5449j            1.2878           -0.0695
AR.2            1.1669           +0.5449j            1.2878            0.0695
MA.1            1.0218           +0.0000j            1.0218            0.0000
-----------------------------------------------------------------------------

The model above has a better BIC than our original model, and all of its coefficients have p-values that are essentially 0.
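
Before moving on, one could also check that this model's own residuals look like white noise; if the ARIMA(2,0,1) fit has captured the remaining structure, their ACF should show no significant lags. A minimal sketch, reusing the splt alias imported above:

# Sketch: if the ARIMA(2,0,1) model has captured the structure,
# its residuals should show no significant autocorrelation.
splt.plot_acf(arima_milk.resid, lags = 40)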

Now let's normalize the time features of our data frame and one-hot encode the month column.

In [21]:
## Normalize the time features
df.loc[:, ['Month.Count', 'monthNumSqred', 'monthNumCubed']] = df.loc[:, ['Month.Count', 'monthNumSqred', 'monthNumCubed']].apply(zscore)

## Create dummy variables for the months
dummies = pd.get_dummies(df.loc[:, 'Month'])
df[list(dummies.columns)] = dummies

## Print the head of the data frame to look at the dummy variables.
df.head(12)
Out[21]:
Year Month Cotagecheese.Prod Icecream.Prod Milk.Prod N.CA.Fat.Price Month.Count monthNumSqred monthNumCubed Milk_log ... Dec Feb Jan Jul Jun Mar May Nov Oct Sep
1995-01-31 1995 Jan 4.370 51.595 2.112 0.9803 -1.724471 -1.120740 -0.883649 0.747635 ... 0 0 1 0 0 0 0 0 0 0
1995-02-28 1995 Feb 3.695 56.086 1.932 0.8924 -1.709277 -1.120548 -0.883646 0.658556 ... 0 1 0 0 0 0 0 0 0 0
1995-03-31 1995 Mar 4.538 68.453 2.162 0.8924 -1.694084 -1.120226 -0.883641 0.771034 ... 0 0 0 0 0 1 0 0 0 0
1995-04-30 1995 Apr 4.280 65.722 2.130 0.8967 -1.678890 -1.119776 -0.883630 0.756122 ... 0 0 0 0 0 0 0 0 0 0
1995-05-31 1995 May 4.470 73.730 2.227 0.8967 -1.663696 -1.119198 -0.883612 0.800655 ... 0 0 0 0 0 0 1 0 0 0
1995-06-30 1995 Jun 4.238 77.994 2.124 0.9160 -1.648503 -1.118491 -0.883585 0.753301 ... 0 0 0 0 1 0 0 0 0 0
1995-07-31 1995 Jul 4.377 81.475 2.184 0.9160 -1.633309 -1.117656 -0.883547 0.781158 ... 0 0 0 1 0 0 0 0 0 0
1995-08-31 1995 Aug 4.368 74.981 2.152 0.8934 -1.618116 -1.116692 -0.883497 0.766398 ... 0 0 0 0 0 0 0 0 0 0
1995-09-30 1995 Sep 3.917 61.530 2.062 0.8934 -1.602922 -1.115600 -0.883433 0.723676 ... 0 0 0 0 0 0 0 0 0 1
1995-10-31 1995 Oct 4.078 60.022 2.121 0.9434 -1.587729 -1.114379 -0.883353 0.751888 ... 0 0 0 0 0 0 0 0 1 0
1995-11-30 1995 Nov 3.611 52.772 2.030 0.9434 -1.572535 -1.113029 -0.883255 0.708036 ... 0 0 0 0 0 0 0 1 0 0
1995-12-31 1995 Dec 3.591 50.850 2.091 1.0811 -1.557341 -1.111551 -0.883138 0.737642 ... 1 0 0 0 0 0 0 0 0 0

12 rows × 22 columns

Prediction Model

Now let's fit our linear regression model to the data. The target variable is the log of milk production; the features are the normalized month and month-squared counts along with the month dummies. We will predict scores for the data from 1/1/1995-12/31/2012 and save them to the data frame in a column named scores, leaving 2013 out to predict on later. Finally, we will calculate the residuals of those predictions by subtracting the actual Milk_log values from the predicted scores, saving them in a column named resids.

In [22]:
#Fit a linear model 
X = df.loc[:'2012-12-31', ['Month.Count', 'monthNumSqred', 'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']].values
Y = df.loc[:'2012-12-31', 'Milk_log'].values
lm_mod = lm.LinearRegression(fit_intercept = False)
mod_fit = lm_mod.fit(X, Y)

##Make predictions and calculate residual and save to DF
df.loc[:'2012-12-31', 'scores'] = mod_fit.predict(X)
df.loc[:'2012-12-31', 'resids'] = df.loc[:'2012-12-31', 'scores'] - df.loc[:'2012-12-31', 'Milk_log']

Now let's define a function that accepts a data frame and a column name. It plots the actual values of that column in red and the predictions for that column (saved in the data frame under 'scores') in blue, so we can visually assess how well the model is predicting.

In [23]:
#Plot time series model with predicted numbers
def plot_mod_fit(df, col):
    fig = plt.figure(figsize=(8, 5)) 
    ax = fig.gca()   
    df.loc[:, col].plot(color = 'r', ax = ax) 
    df.loc[:, 'scores'].plot(ax = ax) 
    ax.set_title('Actual ' + col + ' vs. the predicted values') 
    ax.set_xlabel('Date') 
    ax.set_ylabel(col) 
    ax.legend() 

Now let's plot our Milk Log actual and predicted time series.

In [24]:
plot_mod_fit(df, 'Milk_log') 

ARIMA Adjustment

We can see that our linear regression already has decent predictive power. Let's look at the time series, histogram, and Q-Q plot of the residuals. We'll create two functions to assist: the first plots a time series with a given title; the second plots a histogram and Q-Q plot of a given series with a given label.

In [25]:
#Function to Plot given time series with given title
def plot_ts(ts, title):
    ts.plot() 
    plt.title(title)
    plt.xlabel('Date')
    plt.ylabel('Value')
    
#Function to plot histogram and QQ plot of given time series with given label
def dist_ts(ts, lab = '', bins = 40):
    ## Setup a figure with two subplots side by side
    f, (ax1, ax2) = plt.subplots(1, 2, figsize=(7, 3))
    ## Plot the histogram with labels
    ts.hist(ax = ax1, bins = bins, alpha = 0.5)
    ax1.set_xlabel('Value')
    ax1.set_ylabel('Frequency')
    ax1.set_title('Histogram of ' + lab)
    ## Plot the q-q plot on the other axes
    ss.probplot(ts, plot = ax2)    
In [26]:
#Plot TS, Histogram and QQ Plot of residual
plot_ts(df.loc[:'2012-12-31', 'resids'], title = 'Residual time series of log milk production')
dist_ts(df.loc[:'2012-12-31', 'resids'], '\n residual of trend and seasonal model')

We can see that our residual appears to be approximately normally distributed.
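
For a more formal check than the histogram and Q-Q plot, one could run a normality test on the residuals; a minimal sketch using scipy.stats (imported above as ss), with the caveat that such tests can be overly sensitive at this sample size:

# Sketch: D'Agostino-Pearson normality test on the regression residuals.
# A large p-value means we cannot reject normality.
stat, p_value = ss.normaltest(df.loc[:'2012-12-31', 'resids'])
print('Normality test statistic = ' + str(stat))
print('p-value = ' + str(p_value))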

Now let's conduct the Dickey-Fuller test to determine whether our fitted residual is stationary.

In [27]:
#Conduct Dickey-Fuller test
DF_Test(df.loc[:'2012-12-31', 'resids'])
D-F statistic = -4.6350226603
p-value = 0.000111272763226
number of lags used = 0
Critical value at 5% confidence = -2.8750788801
Critical value at 10% confidence = -2.57398611682

Given the D-F statistic and p-value above, we can again reject the null hypothesis of a unit root; the fitted residual is stationary.

Now let's determine what the best ARIMA model will be for our model's residual. We will use the function we created earlier and pass maximum order values of (3,1,3).

In [28]:
#Determine best ARIMA model for residual with maximum orders of (3,1,3)
BIC, order, model_fit = step_ARIMA(df.loc[:'2012-12-31', 'resids'], 3, 1, 3)    
print('Best order = ' + str(order) + ' best BIC = ' + str(BIC))
Testing model of order: (0, 0, 1) with BIC = -1096.13364648
Testing model of order: (0, 0, 2) with BIC = -1142.13511204
Testing model of order: (0, 0, 3) with BIC = -1173.34408866
Testing model of order: (0, 1, 1) with BIC = -1188.08354814
Testing model of order: (0, 1, 2) with BIC = -1183.64648287
Testing model of order: (0, 1, 3) with BIC = -1179.59314818
Testing model of order: (1, 0, 0) with BIC = -1207.21677011
Testing model of order: (1, 0, 1) with BIC = -1202.20090144
Testing model of order: (1, 0, 2) with BIC = -1196.93114897
Testing model of order: (1, 0, 3) with BIC = -1191.57028642
Testing model of order: (1, 1, 0) with BIC = -1187.68115621
Testing model of order: (1, 1, 1) with BIC = -1193.45741591
Testing model of order: (2, 0, 0) with BIC = -1202.2178462
Testing model of order: (2, 0, 1) with BIC = -1196.87634315
Testing model of order: (2, 0, 2) with BIC = -1191.55798224
Testing model of order: (2, 0, 3) with BIC = -1187.26002034
Testing model of order: (2, 1, 0) with BIC = -1182.94536293
Testing model of order: (2, 1, 1) with BIC = -1188.55444998
Testing model of order: (2, 1, 2) with BIC = -1182.92311621
Testing model of order: (3, 0, 0) with BIC = -1196.91622471
Testing model of order: (3, 0, 1) with BIC = -1191.54218996
Testing model of order: (3, 0, 2) with BIC = -1195.25924454
Testing model of order: (3, 0, 3) with BIC = -1183.40859384
Testing model of order: (3, 1, 0) with BIC = -1178.44532204
Testing model of order: (3, 1, 1) with BIC = -1183.22397813
Testing model of order: (3, 1, 2) with BIC = -1178.46070177
Best order = (1, 0, 0) best BIC = -1207.21677011

Our best model is of order (1,0,0). Let's fit an ARIMA model of order (1,0,0) to the residual of our predictions over 1/1/1995-12/31/2012.

In [29]:
#Our best ARIMA model for the Linear Regression residual
arima_remainder = model_ARIMA(df.loc[:'2012-12-31', 'resids'], order = (1, 0, 0))
                              ARMA Model Results                              
==============================================================================
Dep. Variable:                 resids   No. Observations:                  216
Model:                     ARMA(1, 0)   Log Likelihood                 608.984
Method:                           mle   S.D. of innovations              0.014
Date:                Sat, 16 Jun 2018   AIC                          -1213.967
Time:                        01:19:33   BIC                          -1207.217
Sample:                    01-31-1995   HQIC                         -1211.240
                         - 12-31-2012                                         
================================================================================
                   coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------
ar.L1.resids     0.8339      0.038     22.144      0.000       0.760       0.908
                                    Roots                                    
=============================================================================
                  Real          Imaginary           Modulus         Frequency
-----------------------------------------------------------------------------
AR.1            1.1992           +0.0000j            1.1992            0.0000
-----------------------------------------------------------------------------

Now that our ARIMA model has been fit, let's predict the ARIMA residual adjustment for our linear regression model over 1/1/2013-12/31/2013 and save it to a variable called model_prediction. We will also create a 12-step forecast with alpha = 0.05 to compute the 95% confidence intervals of the ARIMA predictions. We will print the forecast values, their standard errors, and their confidence intervals, then plot the forecast with its 95% confidence interval.

In [30]:
start_index = len(df.loc[:'2012-12-31', 'resids']) 
end_index = start_index + 11 
model_prediction = arima_remainder.predict(start=start_index, end=end_index) 
print(model_prediction) 
2013-01-31    0.007042
2013-02-28    0.005873
2013-03-31    0.004897
2013-04-30    0.004084
2013-05-31    0.003405
2013-06-30    0.002840
2013-07-31    0.002368
2013-08-31    0.001975
2013-09-30    0.001647
2013-10-31    0.001373
2013-11-30    0.001145
2013-12-31    0.000955
Freq: M, dtype: float64
In [31]:
#returns forecast, stderr, confidence intervals for forecast
f, stderr, ci = arima_remainder.forecast(12, alpha =.05) 
print("Forecast: \n", f) 
print("\nStandard Error: \n", stderr) 
print("\nConfidence Intervals:\n ", ci) 
Forecast: 
 [ 0.00704236  0.00587261  0.00489715  0.00408372  0.0034054   0.00283975
  0.00236806  0.00197472  0.00164671  0.00137319  0.0011451   0.00095489]

Standard Error: 
 [ 0.01439234  0.01873983  0.02124489  0.02282538  0.0238628   0.02455839
  0.0250307   0.02535395  0.02557632  0.02572983  0.02583603  0.02590963]

Confidence Intervals:
  [[-0.02116611  0.03525084]
 [-0.03085678  0.04260199]
 [-0.03674207  0.04653637]
 [-0.0406532   0.04882064]
 [-0.04336483  0.05017563]
 [-0.04529381  0.05097331]
 [-0.04669121  0.05142733]
 [-0.04771811  0.05166754]
 [-0.04848196  0.05177538]
 [-0.04905634  0.05180272]
 [-0.04949259  0.05178279]
 [-0.04982705  0.05173684]]
In [32]:
arima_remainder.plot_predict(start=start_index, end=end_index, alpha = .05, plot_insample = False) 

Our confidence intervals are reasonably small: they start at roughly (-0.02, 0.03) and grow to about (-0.05, 0.05). They clearly widen the further we predict into the future, showing that the model becomes less and less confident at longer horizons.
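
To quantify how quickly the intervals widen, here is a quick sketch computing the width of the interval at each forecast step from the ci array returned above:

# Sketch: width of the 95% confidence interval at each forecast horizon.
ci_width = ci[:, 1] - ci[:, 0]
for i, w in enumerate(ci_width, start = 1):
    print('Step ' + str(i) + ': CI width = ' + str(round(w, 4)))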

Now that our ARIMA residual adjustment has been computed, let's predict 1/1/2013-12/31/2013 using our previously fit linear model and adjust it by the ARIMA model's prediction. Because resids were defined as predicted minus actual, we subtract the ARIMA-predicted residual from the linear model's output. We will save these adjusted predictions under the scores column of our data frame.

In [33]:
## Set X for prediction
X = df.loc['2013-1-31':, ['Month.Count', 'monthNumSqred', 'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']].values
## Make predictions and adjust using ARIMA predictions
df.loc['2013-1-31':, 'scores'] =  mod_fit.predict(X) - model_prediction

Now let's take a look at our predicted log milk production values versus actual log milk production values for 2013.

In [34]:
print(df.loc['2013-1-31':, ['scores','Milk_log']]) # print predictions and actual values for 2013
              scores  Milk_log
2013-01-31  1.244953  1.241846
2013-02-28  1.171487  1.172792
2013-03-31  1.281060  1.301825
2013-04-30  1.261283  1.287026
2013-05-31  1.289228  1.313724
2013-06-30  1.240911  1.249615
2013-07-31  1.245774  1.215803
2013-08-31  1.239419  1.223775
2013-09-30  1.197346  1.162213
2013-10-31  1.230883  1.205372
2013-11-30  1.197702  1.186318
2013-12-31  1.234229  1.245019

Final Predictive Model

Finally, let's take a look at our finished time series model of California log milk production, with actual values in red and predicted values in blue.

In [35]:
plot_mod_fit(df, 'Milk_log') 

The model has done a fairly good job of predicting.
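
To put a number on that, here is a minimal sketch computing the root mean squared error of the 2013 hold-out predictions on the log scale, using only columns already in the data frame:

# Sketch: RMSE of the adjusted 2013 predictions against the actual values.
errors = df.loc['2013-1-31':, 'scores'] - df.loc['2013-1-31':, 'Milk_log']
rmse = np.sqrt(np.mean(np.square(errors)))
print('2013 hold-out RMSE (log scale) = ' + str(rmse))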