A California specialty food client wants to expand their offerings with a dairy division but first they need an understanding of the production trends. They have asked us to create a regression model for either Milk, Cottage Cheese or Ice Cream productions in California.
Let's load our packages and the California dairy production dataset. Then let's look at our data frame and get an idea of the data we loaded.
# Load necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as ss
import statsmodels.tsa.seasonal as sts
from statsmodels.tsa.stattools import adfuller
import statsmodels.graphics.tsaplots as splt
from scipy.stats import zscore
import sklearn.linear_model as lm
from statsmodels.tsa.arima_model import ARIMA
from statsmodels.tsa.arima_model import ARIMAResults
#Load dataset
df = pd.read_csv(fileName, header = 'infer')
#print top 5 rows and shape
print(df.head())
print(df.shape)
#print columns
print(df.columns)
We should look at the value counts of our Year column to determine the years we will be working with, and so we can set our index as the date range.
#print unique year and value counts
print(df['Year'].unique())
print(df['Year'].value_counts())
Looking at the unique Years and Year value counts we can see that our years go from 1995 to 2013 and all years include 12 months of data. Our dataset is already sorted. With this information we can determine that the date range for the index needs to be 1/1/1995 to 12/31/2013. Now let's set our index for the dataframe as our determined date range, with a monthly frequency.
#set index of data frame to date range 1/1/1995-12/31/2013
df.index = pd.date_range(start = '1/1/1995', end = '12/31/2013', freq = 'M')
Let's take another look at our data frame now.
#print top 5 rows and shape
print(df.head())
print(df.shape)
Now that our date index has been set let's plot the time series plots of our three California dairy productions.
fig, (ax1, ax2, ax3) = plt.subplots(3, 1)
df['Cotagecheese.Prod'].plot(ax = ax1)
df['Icecream.Prod'].plot(ax = ax2)
df['Milk.Prod'].plot(ax = ax3)
ax1.set_ylabel('Cottage Cheese')
ax2.set_ylabel('Ice Cream')
ax3.set_ylabel('Milk')
ax3.set_xlabel('Date')
ax1.set_title('California production time series')
plt.tight_layout()
These 3 production time serieses look interesting but let's focus on the milk production for our analysis since cottage cheese production seems to be in decline and would not be a good avenue for our client to pursue, and ice cream seems very seasonal. The milk time series is not stationary, it seems to have gradually increased before reaching a peak where it is currently fluctuating. Let's create a log of the milk production column since the seasonality of the Milk production time series seems to be multiplicative.
df['Milk_log'] = np.log(df['Milk.Prod'] )
df['Milk_log'].plot()
plt.title('Time series plot of Log of California Milk production')
plt.ylabel('Value')
plt.xlabel('Date')
df.columns
Now let's create a function that will decompse a time series into a Seasonal, Trend, Residual (STL) model and plot the outputs. The function will then return a data frame with the Seasonal, Trend, and Residual components.
#function to STL decompose given Time series
def decomp_ts(ts, model = 'additive'):
res = sts.seasonal_decompose(ts, model = model)
res.plot()
return(pd.DataFrame({'resid': res.resid,
'trend': res.trend,
'seasonal': res.seasonal},
index = ts.index) )
Now let's decompse our Milk Log column, plot the outputs, and print the head and tail of our resulting STL model dataframe.
milk_decomp = decomp_ts(df['Milk_log'])
print(milk_decomp.head(12)) #print top 12 rows
print(milk_decomp.tail(12)) #print bottow 12 rows
Looking at the decomposed model we can see that there is a significant seasonal component involved with the milk production. We can see a clear trend, and we can see our residual seems stationary; however, there is some increased variance at the end of the residual. Let's conduct the Dicky Fuller Test on the residual to determine if it is stationary. Let's create a function that will conduct the Dicky Fuller test. Then let's conduct the test on our decomposed milk residual skipping the first 6 and last 6 rows since they are NaNs.
#Conduct Dicky Fuller test and return key Stats
def DF_Test(ts):
stationary = adfuller(ts)
## Print the results
print('D-F statistic = ' + str(stationary[0]))
print('p-value = ' + str(stationary[1]))
print('number of lags used = ' + str(stationary[2]))
print('Critical value at 5% confidence = ' + str(stationary[4]['5%']))
print('Critical value at 10% confidence = ' + str(stationary[4]['10%']))
DF_Test(milk_decomp.resid[6:-6])
Looking at the above DF statistic and p-value we can reject the null hypothesis that the residual is not stationary.
Now let's calculate and plot the autocorrelation function and partial correlation function for our decomposed milk residual.
splt.plot_acf(milk_decomp.resid[6:-6], lags = 40)
splt.plot_pacf(milk_decomp.resid[6:-6], lags = 40)
Looking at the above ACF plot, it looks like besides 0 there is 1 more lag that shows significance. Let's create an ARIMA model of the decomposed milk residual of order (2,0,2). Let's first create a function that will fit our ARIMA model with a given order and return the fit model, along with printing the model's summary.
#function to fit ARIMA model, return fit model and print model summary
def model_ARIMA(ts, order):
model = ARIMA(ts, order = order)
model_fit = model.fit(disp=0, method='mle', trend='nc')
print(model_fit.summary())
return(model_fit)
arima_milk = model_ARIMA(milk_decomp.resid[6:-6], order = (2,0,2))
This model above does not look bad however the MA 2 residual has a rather high p value and has a confidence interval that includes 0. We cannot reject the null hypothesis that this variable is not helping the model. Let's go ahead and create a function below that will step through our decomposed milk residual with a given maximum value for P,D,Q and find the ARIMA model with the best BIC. The function will then return the best BIC, best order and the best model.
#Function Fits ARIMA models and returns the BIC, the Order of the model and the fit model
def model_ARIMA_2(ts, order):
model = ARIMA(ts, order = order)
model_fit = model.fit(disp=0, method='mle', trend='nc')
BIC = ARIMAResults.bic(model_fit)
print('Testing model of order: ' + str(order) + ' with BIC = ' + str(BIC))
return(BIC, order, model_fit)
#Function to find the best bic of ARIMA model by passing max P, D, Q values and looping through them keeping track of best BIC
def step_ARIMA(resid, p_max, d_max, q_max):
best_BIC = 9999999999999999.0
#loop through passed P, D, Q
for p in range(p_max + 1):
for d in range(d_max + 1):
for q in range(q_max + 1):
if(p > 0 or q > 0):
#try to fit model of given order pass if not possible
try:
order = (p, d, q)
BIC, order, model = model_ARIMA_2(resid, order)
#determine if BIC is better than previous BICs if so save it as best model
if(BIC < best_BIC):
best_model = model
best_BIC = BIC
best_order = order
except:
pass
return(best_BIC, best_order, best_model)
Now let's run the function above on our decomposed milk residual with maximum P of 4, D of 0 and Q of 4.
#Determine best ARIMA model of given max orders and print best Model
BIC, order, model = step_ARIMA(milk_decomp.resid[6:-6], 4, 0, 4)
print('***************************************')
print('Best model with BIC = ' + str(BIC) + ' and with order '+ str(order))
The best ARIMA model for the STL decomposition residual of the Milk production time series is of order (2,0,1). Let's go ahead and run that model below and look at the summary statistics.
arima_milk = model_ARIMA(milk_decomp.resid[6:-6], order = (2,0,1))
The model above has a better BIC value than our original model and all of our model features have p values that are essentially 0.
Now let's normalize the time features of our data frame, and let's one hot encode the month column.
## Normalize the time features
df.loc[:, ['Month.Count', 'monthNumSqred', 'monthNumCubed']] = df.loc[:, ['Month.Count', 'monthNumSqred', 'monthNumCubed']].apply(zscore)
## Create dummy variables for the months
dummies = pd.get_dummies(df.loc[:, 'Month'])
df[list(dummies.columns)] = dummies
## Print the head of the data frame to look at the dummy variables.
df.head(12)
Now let's go ahead and fit our linear regression model to the data. We will pass the Log of Milk production as our target variable. We will pass our Normalized month and monthsquared counts along with our dummied months as our features. We will then predict the scores for our data from 1/1/1995-12/31/2012 and save it to the data frame under a column named scores. We will leave out 2013 to predict on later. Finally we will calculate the residuals of our predictions, from 1/1/1995-12/31/2012, by subtracting the actual milk log values from our predicted scores and then we will save it to our data frame under a column named resids.
#Fit a linear model
X = df.loc[:'2012-12-31', ['Month.Count', 'monthNumSqred', 'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']].values
Y = df.loc[:'2012-12-31', 'Milk_log'].values
lm_mod = lm.LinearRegression(fit_intercept = False)
mod_fit = lm_mod.fit(X, Y)
##Make predictions and calculate residual and save to DF
df.loc[:'2012-12-31', 'scores'] = mod_fit.predict(X)
df.loc[:'2012-12-31', 'resids'] = df.loc[:'2012-12-31', 'scores'] - df.loc[:'2012-12-31', 'Milk_log']
Now let's define a function that will accept a data frame and a given column. It will then plot a time series of the actual values of the given column in red and it will then plot the predictions for that column, saved in the df under 'scores', in blue. We can use this function to visually see how our model is predicting.
#Plot time series model with predicted numbers
def plot_mod_fit(df, col):
fig = plt.figure(figsize=(8, 5))
ax = fig.gca()
df.loc[:, col].plot(color = 'r', ax = ax)
df.loc[:, 'scores'].plot(ax = ax)
ax.set_title('Actual ' + col + ' vs. the predicted values')
ax.set_xlabel('Date')
ax.set_ylabel(col)
ax.legend()
Now let's plot our Milk Log actual and predicted time series.
plot_mod_fit(df, 'Milk_log')
We can see that our linear regression has decent predictive power already. Let's take a look at the time series, the QQ plot, and histogram of our residual. Let's create two functions to assist us in this. Our first function will plot a time series with a given title. Our second function will print a histogram and qq plot of a given column with a given label.
#Function to Plot given time series with given title
def plot_ts(ts, title):
ts.plot()
plt.title(title)
plt.xlabel('Date')
plt.ylabel('Value')
#Function to plot histogram and QQ plot of given time series with given label
def dist_ts(ts, lab = '', bins = 40):
## Setup a figure with two subplots side by side
f, (ax1, ax2) = plt.subplots(1, 2, figsize=(7, 3))
## Plot the histogram with labels
ts.hist(ax = ax1, bins = bins, alpha = 0.5)
ax1.set_xlabel('Value')
ax1.set_ylabel('Frequency')
ax1.set_title('Histogram of ' + lab)
## Plot the q-q plot on the other axes
ss.probplot(ts, plot = ax2)
#Plot TS, Histogram and QQ Plot of residual
plot_ts(df.loc[:'2012-12-31', 'resids'], title = 'Residual time series of log milk production')
dist_ts(df.loc[:'2012-12-31', 'resids'], '\n residual of trend and seasonal model')
We can see that our residual seems to be mostly normally distriubted.
Now let's conduct the Dicky Fuller test to determine if our fitted residual is stationary.
#Conduct Dicky Fuller Test
DF_Test(df.loc[:'2012-12-31', 'resids'])
Looking at the above DF statistic and p-value we can reject the null hypothesis that the residual is not stationary.
Now let's determine what the best ARIMA model will be for our model's residual. We will use the function we created earlier and pass maximum order values of (3,1,3).
#Determine best ARIMA model for residual with maximum orders of (3,1,3)
BIC, order, model_fit = step_ARIMA(df.loc[:'2012-12-31', 'resids'], 3, 1, 3)
print('Best order = ' + str(order) + ' best BIC = ' + str(BIC))
Our best model is that of order (1,0,0). Let's create an ARIMA model of order (1,0,0) for residual of our prediction from 1/1/1995-12/31/2012.
#Our best ARIMA model for the Linear Regression residual
arima_remainder = model_ARIMA(df.loc[:'2012-12-31', 'resids'], order = (1, 0, 0))
Now that our ARIMA model has been fit let's predict the ARIMA residual adjustment for our linear regression model. We will predict from 1/1/2013-12/31/2013 and save it to a variable called model_prediction. We will also create a forecast of 12 with an alpha of .05 to compute the 95% confidence interval of our ARIMA predictions. We will print our Forecast numbers, their standard errors and their confidence intervals. Then we will plot our 95% confidence intervals for our Forecast.
start_index = len(df.loc[:'2012-12-31', 'resids'])
end_index = start_index + 11
model_prediction = arima_remainder.predict(start=start_index, end=end_index)
print(model_prediction)
#returns forecast, stderr, confidence intervals for forecast
f, stderr, ci = arima_remainder.forecast(12, alpha =.05)
print("Forecast: \n", f)
print("\nStandard Error: \n", stderr)
print("\nConfidence Intervals:\n ", ci)
arima_remainder.plot_predict(start=start_index, end=end_index, alpha = .05, plot_insample = False)
Our confidence intervals are reasonably small as they start from -.02 to .03 and get as large as -.05 to .05. Our confidence intervals are clearly getting larger the further we predict into the future. This shows that our model gets less and less confident the further into the future it is predicting.
Now that our ARIMA residual adjustment has been computed let's predict 1/1/2013-12/31/2013 using our previously fit linear model, and then adjust it by our ARIMA models prediction. We will save these predictions under the scores column of our data frame.
## Set X for prediction
X = df.loc['2013-1-31':, ['Month.Count', 'monthNumSqred', 'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']].values
## Make predictions and adjust using ARIMA predictions
df.loc['2013-1-31':, 'scores'] = mod_fit.predict(X) - model_prediction
Now let's take a look at our predicted log milk production values versus actual log milk production values for 2013.
print(df.loc['2013-1-31':, ['scores','Milk_log']]) # print predictions and actual values for 2013
Finally let's take a look at our finished time series model of California Log Milk production with actual values in red and our predicted values in blue.
plot_mod_fit(df, 'Milk_log')
The model has done a fairly good job of predicting.