King County Housing Price Predictions

Abstract

Problem

A King County construction company is looking to purchase properties to update and resell. They want underpriced houses that they can renovate and sell for a quick profit, and they are interested in which property characteristics affect price the most. They are also looking for a model that takes the characteristics deemed most important as input and outputs the expected price of a house. They will use this model to find underpriced houses to purchase, and then to determine which changes to a house will produce the best return on their flipping project.

We will be utilizing the Kaggle house sales prediction dataset which can be found at: https://www.kaggle.com/harlfoxem/housesalesprediction/data

Conclusion

We began with some data preparation and exploration. In this step we found some interesting features and created new ones that we thought would be useful, including whether a house is a new renovation, a new build, whether it has been renovated at all, and whether it has a basement. After our data was prepared and scaled and our features from the data exploration were selected, we fit a regular OLS model to our training data. We then conducted backwards stepwise regression to find a model with better features. After a few more alterations we found that a model using the following features gave us the best results: ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'waterfront', 'lat', 'long', 'sqft_living15', 'newReno', 'newBuild', 'hasBsmnt', 'view1', 'view2', 'view3', 'view4', 'cond3', 'cond4', 'cond5', 'grade10', 'grade11', 'grade12', 'grade13', 'grade4', 'grade5', 'grade6', 'grade7', 'grade9', 'zip1', 'zip2', 'zip3', 'zip4'].

We used this best-feature list to attempt ridge, lasso, and elastic net regression to see if we could improve on our OLS model. We then constructed an SVD model and a PCR model using the same feature list. We predicted with all of the above models on our test set and calculated the sum of squared residuals (SSR) for each model on the test set to compare them, along with either the model's R-squared or AIC values. Our best model turned out to be the regular OLS model with the features above selected as training input.

Our best regression formula is: -0.226187 + 1.315281zip1 + 1.063486zip2 + 0.822946grade13 + 0.773717waterfront + 0.730778zip4 + 0.707111grade12 + 0.659465zip3 + 0.585843grade11 + 0.558247cond5 + 0.52146view4 + 0.485521grade10 + 0.392712cond4 + 0.377873view3 + 0.360219lat + 0.292872grade9 + 0.272636view1 + 0.266681newReno + 0.259037cond3 + 0.256832sqft_living + 0.241129view2 + 0.115955sqft_living15 + 0.088957hasBsmnt + 0.080396floors + 0.054347bathrooms + 0.039591sqft_lot - 0.015802long - 0.024311bedrooms - 0.185233newBuild - 0.280656grade7 - 0.588332grade6 - 0.845429grade5 - 0.981769grade4.

This formula gives the predicted scaled log price of a house from the features above, and the coefficients show the effect of each feature on the scaled log price. For instance, the feature with the largest positive effect on price is being in zipcode 1, the Medina area. For our builders, the most important insight is that the most effective price predictors are whether the house is in a high-priced zipcode, its King County grade, whether it is waterfront, its condition, its view rating, its latitude, whether it is a new renovation, and its square feet of living space. This confirms the age-old adage: location, location, location. Our builders will therefore want to find houses with good intangibles, including zipcode, waterfront status, and latitude, that are not in ideal condition and have a low King County grade. They can then purchase and renovate the property, and predict their effective return on the house after renovation, given estimates of its new King County grade, condition, view rating, and other updated features per the regression formula above.
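Below is a minimal sketch of how the regression formula above could be applied in code. The coefficient dictionary is abbreviated (the remaining terms come straight from the formula), and the example house is a hypothetical input; feature values must be on the same scale used in training (z-scored numerics and 0/1 dummies).

In [ ]:
# Minimal sketch: apply the regression formula above to a dict of scaled feature values.
# The coefficient dict is abbreviated and the example input below is hypothetical.
coefs = {
    'intercept': -0.226187, 'zip1': 1.315281, 'waterfront': 0.773717,
    'grade10': 0.485521, 'cond4': 0.392712, 'sqft_living': 0.256832,
    # ... remaining coefficients from the formula above
}

def predict_scaled_log_price(house, coefs=coefs):
    """Return the predicted scaled log price for a dict of scaled feature values."""
    return coefs['intercept'] + sum(coefs.get(k, 0) * v for k, v in house.items())

# Hypothetical waterfront, grade-10, condition-4 house in zipcode 98039 (zip1)
example = {'zip1': 1, 'waterfront': 1, 'grade10': 1, 'cond4': 1, 'sqft_living': 0.5}
print(predict_scaled_log_price(example))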

In [2]:
# Load necessary libraries
import pandas as pd
import numpy as np
import os
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import scale
import math
from scipy.stats import kstest
from scipy.stats import zscore
import statsmodels.formula.api as sm

Auxiliary Functions

Now we will create some auxiliary functions that will help us along the way with our data exploration.

In [3]:
%matplotlib inline
#plot empirical cumulative distributions of two samples
def plot_cums(dist1, dist2): 
    ## sort the first data sample:
    data_sorted = np.sort(dist1)
    # calculate the proportional values of samples
    p = 1. * np.arange(len(dist1)) / (len(dist1) - 1)

    # Now plot as line and scatter plot. 
    plt.plot(data_sorted, p)
    plt.scatter(data_sorted, p, color = 'red')
    
    # sort the second data sample and recompute the proportions
    # so the two samples do not need to have the same length:
    data_sorted = np.sort(dist2)
    p = 1. * np.arange(len(dist2)) / (len(dist2) - 1)
    
    # And the second plot
    plt.plot(data_sorted, p, color = 'yellow')
    plt.scatter(data_sorted, p, color = 'green')

#conduct KS test
def ks_test(data, dist = 'norm'):
    ks_stat, pvalue = kstest(data, dist) 
    print('KS-statistic = ' + str(ks_stat)) 
    print('P-value = ' + str(pvalue)) 
In [4]:
#Create Box Plot
def plotBox(column, depVar, df, x=9, rotation = 0):
    fig = plt.figure(figsize=(x, 9)) 
    ax = fig.gca() 
    df.loc[:,[depVar, column]].boxplot(by =column, ax = ax) 
    ax.set_title('Box plot of ' + depVar + ' vs. ' + column) 
    ax.set_ylabel(depVar)
    ax.tick_params(axis = 'x', labelrotation = rotation)
In [5]:
#Function used by Bootstrap function 
#plots Distributions with 95% CI.
def plot_dists(a, b, nbins, a_label='pop_A', b_label='pop_B', depVar = 'variable', p=5):
    # Create a single sequence of bins to be shared across both
    #distribution plots for visualization consistency.
    combined = pd.concat([a, b])
    breaks = np.linspace(
        combined.min(), 
        combined.max(), 
        num=nbins+1)

    plt.subplot(2, 1, 1) 
    plt.hist(a, bins=breaks, alpha = .5)
    plt.axvline(a.mean(), linewidth = 4) 
    plt.axvline(np.percentile(a, p/2.), color='red', linewidth=3) 
    plt.axvline(np.percentile(a, 100-p/2.), color='red', linewidth=3) 
    plt.title(a_label) 
    plt.xlabel(depVar) 
    
    plt.subplot(2, 1, 2) 
    plt.hist(b, bins=breaks, alpha = .5) 
    plt.axvline(b.mean(), linewidth = 4) 
    plt.axvline(np.percentile(b, p/2.), color='red', linewidth=3) 
    plt.axvline(np.percentile(b, 100-p/2.), color='red', linewidth=3) 
    plt.title(b_label) 
    plt.xlabel(depVar) 
    
    plt.tight_layout() 
    
#Function used by Bootstrap  diff function 
#plots histogram with 95% CI.
def plot_hist(x, a, b, depVar = 'variable', p=5):
    plt.hist(x, alpha = .5)
    plt.axvline(x.mean(), linewidth = 4)
    plt.axvline(np.percentile(x, p/2.), color='red', linewidth=3)
    plt.axvline(np.percentile(x, 100-p/2.), color='red', linewidth=3)
    plt.title(str(a) + " - " + str(b))
    plt.xlabel(depVar + " Mean Differences")
In [6]:
#Bootstrap
def bootStrap(df,column, choice1, choice2, depVar, n_replicas = 1000):
    a = df[df[column] == choice1] 
    b = df[df[column] == choice2] 

    a_bootstrap_means = pd.Series([
        a.sample(frac=1, replace=True)[depVar].mean()
        for i in range(n_replicas)]) 

    b_bootstrap_means = pd.Series([
        b.sample(frac=1, replace=True)[depVar].mean()
        for i in range(n_replicas)]) 

    plot_dists(a = a_bootstrap_means, b = b_bootstrap_means, 
               nbins=80, a_label= choice1, b_label=choice2, depVar = depVar) 
In [7]:
#Bootstrap Mean Diffs
def bootStrapDiffs(df, column, choice1, choice2, depVar, n_replicas = 1000):
    diffs = [] 
    for i in range(n_replicas):
        sample = df.sample(frac=1.0, replace=True)
        a_sample_mean = sample[sample[column] == choice1][depVar].mean() 
        b_sample_mean = sample[sample[column] == choice2][depVar].mean() 
        diffs.append(a_sample_mean - b_sample_mean) 
    diffs = pd.Series(diffs) 

    plot_hist(x = diffs, a = choice1, b = choice2, depVar = depVar) 

Data Preparation

Now let's load our data and take a look at it.

In [8]:
os.chdir(path) #path is assumed to be set to the directory containing kc_house_data.csv
df = pd.read_csv('kc_house_data.csv', header = 'infer')
In [9]:
pd.set_option('display.max_columns', None)#display all columns
print(df.head())#print top 5 rows
print(df.tail())#print bottom 5 rows
           id             date     price  bedrooms  bathrooms  sqft_living  \
0  7129300520  20141013T000000  221900.0         3       1.00         1180   
1  6414100192  20141209T000000  538000.0         3       2.25         2570   
2  5631500400  20150225T000000  180000.0         2       1.00          770   
3  2487200875  20141209T000000  604000.0         4       3.00         1960   
4  1954400510  20150218T000000  510000.0         3       2.00         1680   

   sqft_lot  floors  waterfront  view  condition  grade  sqft_above  \
0      5650     1.0           0     0          3      7        1180   
1      7242     2.0           0     0          3      7        2170   
2     10000     1.0           0     0          3      6         770   
3      5000     1.0           0     0          5      7        1050   
4      8080     1.0           0     0          3      8        1680   

   sqft_basement  yr_built  yr_renovated  zipcode      lat     long  \
0              0      1955             0    98178  47.5112 -122.257   
1            400      1951          1991    98125  47.7210 -122.319   
2              0      1933             0    98028  47.7379 -122.233   
3            910      1965             0    98136  47.5208 -122.393   
4              0      1987             0    98074  47.6168 -122.045   

   sqft_living15  sqft_lot15  
0           1340        5650  
1           1690        7639  
2           2720        8062  
3           1360        5000  
4           1800        7503  
               id             date     price  bedrooms  bathrooms  \
21608   263000018  20140521T000000  360000.0         3       2.50   
21609  6600060120  20150223T000000  400000.0         4       2.50   
21610  1523300141  20140623T000000  402101.0         2       0.75   
21611   291310100  20150116T000000  400000.0         3       2.50   
21612  1523300157  20141015T000000  325000.0         2       0.75   

       sqft_living  sqft_lot  floors  waterfront  view  condition  grade  \
21608         1530      1131     3.0           0     0          3      8   
21609         2310      5813     2.0           0     0          3      8   
21610         1020      1350     2.0           0     0          3      7   
21611         1600      2388     2.0           0     0          3      8   
21612         1020      1076     2.0           0     0          3      7   

       sqft_above  sqft_basement  yr_built  yr_renovated  zipcode      lat  \
21608        1530              0      2009             0    98103  47.6993   
21609        2310              0      2014             0    98146  47.5107   
21610        1020              0      2009             0    98144  47.5944   
21611        1600              0      2004             0    98027  47.5345   
21612        1020              0      2008             0    98144  47.5941   

          long  sqft_living15  sqft_lot15  
21608 -122.346           1530        1509  
21609 -122.362           1830        7200  
21610 -122.299           1020        2007  
21611 -122.069           1410        1287  
21612 -122.299           1020        1357  
In [10]:
print(df.shape) 
print(df.dtypes) 
(21613, 21)
id                 int64
date              object
price            float64
bedrooms           int64
bathrooms        float64
sqft_living        int64
sqft_lot           int64
floors           float64
waterfront         int64
view               int64
condition          int64
grade              int64
sqft_above         int64
sqft_basement      int64
yr_built           int64
yr_renovated       int64
zipcode            int64
lat              float64
long             float64
sqft_living15      int64
sqft_lot15         int64
dtype: object

Let's set our date to a date type and create a month and year column. This will help us get a better idea of the data we have.

In [11]:
df['date'] = pd.to_datetime(df['date']) #turn date column to date type
df['month'] = pd.DatetimeIndex(df.loc[:,'date']).month #Create month column
df['year'] = pd.DatetimeIndex(df.loc[:,'date']).year #create year column
In [12]:
#print unique values of number of unique values for our features
print('Dates \n', df['date'].nunique())
print('Years \n', df['year'].unique())
print('Months \n', df['month'].unique())
print('Bedrooms \n', df['bedrooms'].unique())
print('Bathrooms \n', df['bathrooms'].unique())
print('Floors \n', df['floors'].unique())
print('Waterfront \n', df['waterfront'].unique())
print('Views \n', df['view'].unique())
print('Condition \n', df['condition'].unique())
print('Grades \n', df['grade'].unique())
print('Years Built \n', df['yr_built'].unique())
print('Years Renovated \n', df['yr_renovated'].unique())
print('Zip Codes \n', df['zipcode'].unique())
print('Latitude \n', df['lat'].nunique())
print('Longitude \n', df['long'].nunique())
print('Square Foot Living \n', df['sqft_living'].nunique())
print('Square Foot Lot \n', df['sqft_lot'].nunique())
print('Square Foot Living 15\n', df['sqft_living15'].nunique())
print('Square Foot Lot 15\n', df['sqft_lot15'].nunique())
print('Square Foot Above \n', df['sqft_above'].nunique())
print('Square Foot Basement \n', df['sqft_basement'].nunique())
Dates 
 372
Years 
 [2014 2015]
Months 
 [10 12  2  5  6  1  4  3  7  8 11  9]
Bedrooms 
 [ 3  2  4  5  1  6  7  0  8  9 11 10 33]
Bathrooms 
 [ 1.    2.25  3.    2.    4.5   1.5   2.5   1.75  2.75  3.25  4.    3.5
  0.75  4.75  5.    4.25  3.75  0.    1.25  5.25  6.    0.5   5.5   6.75
  5.75  8.    7.5   7.75  6.25  6.5 ]
Floors 
 [ 1.   2.   1.5  3.   2.5  3.5]
Waterfront 
 [0 1]
Views 
 [0 3 4 2 1]
Condition 
 [3 5 4 1 2]
Grades 
 [ 7  6  8 11  9  5 10 12  4  3 13  1]
Years Built 
 [1955 1951 1933 1965 1987 2001 1995 1963 1960 2003 1942 1927 1977 1900 1979
 1994 1916 1921 1969 1947 1968 1985 1941 1915 1909 1948 2005 1929 1981 1930
 1904 1996 2000 1984 2014 1922 1959 1966 1953 1950 2008 1991 1954 1973 1925
 1989 1972 1986 1956 2002 1992 1964 1952 1961 2006 1988 1962 1939 1946 1967
 1975 1980 1910 1983 1978 1905 1971 2010 1945 1924 1990 1914 1926 2004 1923
 2007 1976 1949 1999 1901 1993 1920 1997 1943 1957 1940 1918 1928 1974 1911
 1936 1937 1982 1908 1931 1998 1913 2013 1907 1958 2012 1912 2011 1917 1932
 1944 1902 2009 1903 1970 2015 1934 1938 1919 1906 1935]
Years Renovated 
 [   0 1991 2002 2010 1999 1992 2013 1994 1978 2005 2008 2003 1984 1954 2014
 2011 1974 1983 1945 1990 1988 1957 1977 1981 1995 2000 1998 1970 1989 2004
 1986 2009 2007 1987 1973 2006 1985 2001 1980 1971 1979 1997 1950 1969 1948
 2015 1968 2012 1963 1951 1993 1962 1996 1972 1953 1955 1982 1956 1940 1976
 1946 1975 1958 1964 1959 1960 1967 1965 1934 1944]
Zip Codes 
 [98178 98125 98028 98136 98074 98053 98003 98198 98146 98038 98007 98115
 98107 98126 98019 98103 98002 98133 98040 98092 98030 98119 98112 98052
 98027 98117 98058 98001 98056 98166 98023 98070 98148 98105 98042 98008
 98059 98122 98144 98004 98005 98034 98075 98116 98010 98118 98199 98032
 98045 98102 98077 98108 98168 98177 98065 98029 98006 98109 98022 98033
 98155 98024 98011 98031 98106 98072 98188 98014 98055 98039]
Latitude 
 5034
Longitude 
 752
Square Foot Living 
 1038
Square Foot Lot 
 9782
Square Foot Living 15
 777
Square Foot Lot 15
 8689
Square Foot Above 
 946
Square Foot Basement 
 306

We can see that we have about one year's worth of data, 372 days, spanning 2014-2015 with entries for every month. We do not have a sample spanning a long enough period to do useful time series analysis. We have houses built from 1900 to 2015, and renovation years ranging from never renovated to renovations between 1934 and 2015. Our waterfront variable is binary, so we should decode it for our data exploration. Our view, condition, and grade features look like they should be decoded as well. It also seems there are some houses with more than 10 bedrooms. Let's check whether these are typos, especially the 33-bedroom house.

In [13]:
print(df.loc[df['bedrooms'] >10,:] )
               id       date     price  bedrooms  bathrooms  sqft_living  \
8757   1773100755 2014-08-21  520000.0        11       3.00         3000   
15870  2402100895 2014-06-25  640000.0        33       1.75         1620   

       sqft_lot  floors  waterfront  view  condition  grade  sqft_above  \
8757       4960     2.0           0     0          3      7        2400   
15870      6000     1.0           0     0          5      7        1040   

       sqft_basement  yr_built  yr_renovated  zipcode      lat     long  \
8757             600      1918          1999    98106  47.5560 -122.363   
15870            580      1947             0    98103  47.6878 -122.331   

       sqft_living15  sqft_lot15  month  year  
8757            1420        4960      8  2014  
15870           1330        4700      6  2014  

Most of our data looks clean and makes sense, except for the entry with 33 bedrooms, 1.75 bathrooms, and 1620 square feet of living space. This is likely a typo and was probably meant to be 3 bedrooms; however, since we cannot be sure what the correct number should be, let's drop this row.

In [14]:
#drop 33 bedroom row
toDrop = df.loc[df['bedrooms'] == 33,'bedrooms'].index.tolist()
df.drop(toDrop, axis = 0, inplace = True)
df.drop('id', axis = 1, inplace = True)

Now let's create some new binary features that may be useful to us: one for whether a house has been renovated at all; one for whether it is a new renovation, meaning any renovation later than the average renovation year; one for whether it is a new build, meaning any house built after the average build year; and one for whether the house has a basement. We will also create a column to decode our waterfront feature.

Let's start by finding our average year built and year renovated.

In [15]:
print('Year Built Min: ', df['yr_built'].min()) #print min year built
print('Year Built Max: ', df['yr_built'].max()) #print max year built
print('Year Built Mean: ', np.mean(df['yr_built'])) #print mean year built
print('Year Renovated Min: ', df.loc[df['yr_renovated'] > 0,'yr_renovated'].min()) #min year renovated excluding not reno
print('Year Renovated Max: ', df['yr_renovated'].max()) #max year renovated (zeros do not affect the max)
print('Year Renovated Mean: ', (np.mean(df.loc[df['yr_renovated'] > 0,'yr_renovated']))) #mean year renovated excluding not reno
Year Built Min:  1900
Year Built Max:  2015
Year Built Mean:  1971.0062465297058
Year Renovated Min:  1934
Year Renovated Max:  2015
Year Renovated Mean:  1995.8271334792123
In [16]:
#create new features
df.loc[df.loc[:,'yr_renovated'] > 0 , "renovated"] = 1
df.loc[df.loc[:,'yr_renovated'] == 0, "renovated"] = 0

df.loc[df.loc[:,'yr_renovated'] > 1995 , "newReno"] = 1
df.loc[df.loc[:,'yr_renovated'] < 1996, "newReno"] = 0

df.loc[df.loc[:,'yr_built'] > 1970 , "newBuild"] = 1
df.loc[df.loc[:,'yr_built'] < 1971, "newBuild"] = 0

df.loc[df.loc[:,'sqft_basement'] > 0 , "hasBsmnt"] = 1
df.loc[df.loc[:,'sqft_basement'] == 0, "hasBsmnt"] = 0

df.loc[df.loc[:,'waterfront'] > 0 , "wfName"] = 'Wtrfrnt'
df.loc[df.loc[:,'waterfront'] == 0, "wfName"] = 'noWtrFrnt'
In [17]:
#Set our new features as ints
def setTypes(dataF, i, varType):
     dataF.loc[:,i] = dataF.loc[:,i].astype(varType)

setTypes(df,'renovated', 'int64')
setTypes(df,'newReno', 'int64')
setTypes(df,'newBuild', 'int64')
setTypes(df,'hasBsmnt', 'int64')

Now let's create a log price column, and let's create a z-scaled column of our price and log price columns.

In [18]:
def getLog(price):
     return math.log(price)
In [19]:
df['log_price'] = df['price'].apply(getLog)
In [20]:
#Standardize our price and log price and save to dfs
df['priceS'] = scale(df.loc[:,'price'])
df['logPS']= scale(df.loc[:,'log_price'])

Now let's decode our view, condition, and grade columns, then let's one hot encode them.

In [21]:
#Decode columns
df.loc[df['view'] == 0, 'viewName'] = 'view0'
df.loc[df['view'] == 1, 'viewName'] = 'view1'
df.loc[df['view'] == 2, 'viewName'] = 'view2'
df.loc[df['view'] == 3, 'viewName'] = 'view3'
df.loc[df['view'] == 4, 'viewName'] = 'view4'

df.loc[df['condition'] == 1, 'condName'] = 'cond1'
df.loc[df['condition'] == 2, 'condName'] = 'cond2'
df.loc[df['condition'] == 3, 'condName'] = 'cond3'
df.loc[df['condition'] == 4, 'condName'] = 'cond4'
df.loc[df['condition'] == 5, 'condName'] = 'cond5'

df.loc[df['grade'] == 1, 'gradeName'] = 'grade1'
df.loc[df['grade'] == 3, 'gradeName'] = 'grade3'
df.loc[df['grade'] == 4, 'gradeName'] = 'grade4'
df.loc[df['grade'] == 5, 'gradeName'] = 'grade5'
df.loc[df['grade'] == 6, 'gradeName'] = 'grade6'
df.loc[df['grade'] == 7, 'gradeName'] = 'grade7'
df.loc[df['grade'] == 8, 'gradeName'] = 'grade8'
df.loc[df['grade'] == 9, 'gradeName'] = 'grade9'
df.loc[df['grade'] == 10, 'gradeName'] = 'grade10'
df.loc[df['grade'] == 11, 'gradeName'] = 'grade11'
df.loc[df['grade'] == 12, 'gradeName'] = 'grade12'
df.loc[df['grade'] == 13, 'gradeName'] = 'grade13'
In [22]:
#Get Dummies 
dummies = pd.get_dummies(df.loc[:, 'viewName'], drop_first = True)
df[list(dummies.columns)] = dummies
dummies = pd.get_dummies(df.loc[:, 'condName'], drop_first = True)
df[list(dummies.columns)] = dummies
dummies = pd.get_dummies(df.loc[:, 'gradeName'], drop_first = True)
df[list(dummies.columns)] = dummies

Let's reset our data frames index and take a look at our data frame now.

In [23]:
df.reset_index(drop = True, inplace = True)
In [24]:
print(df.head())
        date     price  bedrooms  bathrooms  sqft_living  sqft_lot  floors  \
0 2014-10-13  221900.0         3       1.00         1180      5650     1.0   
1 2014-12-09  538000.0         3       2.25         2570      7242     2.0   
2 2015-02-25  180000.0         2       1.00          770     10000     1.0   
3 2014-12-09  604000.0         4       3.00         1960      5000     1.0   
4 2015-02-18  510000.0         3       2.00         1680      8080     1.0   

   waterfront  view  condition  grade  sqft_above  sqft_basement  yr_built  \
0           0     0          3      7        1180              0      1955   
1           0     0          3      7        2170            400      1951   
2           0     0          3      6         770              0      1933   
3           0     0          5      7        1050            910      1965   
4           0     0          3      8        1680              0      1987   

   yr_renovated  zipcode      lat     long  sqft_living15  sqft_lot15  month  \
0             0    98178  47.5112 -122.257           1340        5650     10   
1          1991    98125  47.7210 -122.319           1690        7639     12   
2             0    98028  47.7379 -122.233           2720        8062      2   
3             0    98136  47.5208 -122.393           1360        5000     12   
4             0    98074  47.6168 -122.045           1800        7503      2   

   year  renovated  newReno  newBuild  hasBsmnt     wfName  log_price  \
0  2014          0        0         0         0  noWtrFrnt  12.309982   
1  2014          1        0         0         1  noWtrFrnt  13.195614   
2  2015          0        0         0         0  noWtrFrnt  12.100712   
3  2014          0        0         0         1  noWtrFrnt  13.311329   
4  2015          0        0         1         0  noWtrFrnt  13.142166   

     priceS     logPS viewName condName gradeName  view1  view2  view3  view4  \
0 -0.866686 -1.400889    view0    cond3    grade7      0      0      0      0   
1 -0.005675  0.280648    view0    cond3    grade7      0      0      0      0   
2 -0.980816 -1.798227    view0    cond3    grade6      0      0      0      0   
3  0.174099  0.500356    view0    cond5    grade7      0      0      0      0   
4 -0.081943  0.179168    view0    cond3    grade8      0      0      0      0   

   cond2  cond3  cond4  cond5  grade10  grade11  grade12  grade13  grade3  \
0      0      1      0      0        0        0        0        0       0   
1      0      1      0      0        0        0        0        0       0   
2      0      1      0      0        0        0        0        0       0   
3      0      0      0      1        0        0        0        0       0   
4      0      1      0      0        0        0        0        0       0   

   grade4  grade5  grade6  grade7  grade8  grade9  
0       0       0       0       1       0       0  
1       0       0       0       1       0       0  
2       0       0       1       0       0       0  
3       0       0       0       1       0       0  
4       0       0       0       0       1       0  
In [25]:
print(df.shape)
print(df.dtypes)
(21612, 52)
date             datetime64[ns]
price                   float64
bedrooms                  int64
bathrooms               float64
sqft_living               int64
sqft_lot                  int64
floors                  float64
waterfront                int64
view                      int64
condition                 int64
grade                     int64
sqft_above                int64
sqft_basement             int64
yr_built                  int64
yr_renovated              int64
zipcode                   int64
lat                     float64
long                    float64
sqft_living15             int64
sqft_lot15                int64
month                     int64
year                      int64
renovated                 int64
newReno                   int64
newBuild                  int64
hasBsmnt                  int64
wfName                   object
log_price               float64
priceS                  float64
logPS                   float64
viewName                 object
condName                 object
gradeName                object
view1                     uint8
view2                     uint8
view3                     uint8
view4                     uint8
cond2                     uint8
cond3                     uint8
cond4                     uint8
cond5                     uint8
grade10                   uint8
grade11                   uint8
grade12                   uint8
grade13                   uint8
grade3                    uint8
grade4                    uint8
grade5                    uint8
grade6                    uint8
grade7                    uint8
grade8                    uint8
grade9                    uint8
dtype: object

Our data seems to be prepared. Let's take a look and decide if we will be using our price or log price as our dependent variable.

Data Exploration

Let's begin by plotting our scaled price and log price distribution and then let's KS test them versus a normal distribution.

In [26]:
#Plot our standardize price columns
plot_cums(df['priceS'], df['logPS'])  
In [27]:
ks_test(df['priceS'])  
KS-statistic = 0.145756411619
P-value = 0.0
In [28]:
ks_test(df['logPS'])  
KS-statistic = 0.0259997450665
P-value = 4.08737268131e-13

Neither distribution is normal; however, the scaled log price has a smaller KS statistic and a larger p-value, meaning it is closer to normal.

Let's look at the histogram of our log price and price.

In [29]:
df['log_price'].plot.hist(bins = 40)
plt.xlabel('log price')
plt.title('Log Price Histogram')
Out[29]:
Text(0.5,1,'Log Price Histogram')
In [30]:
df['price'].plot.hist(bins = 100)
plt.xlabel('price')
plt.title('Price Histogram')
Out[30]:
Text(0.5,1,'Price Histogram')

We can see that the house prices are fairly normally distributed when looking at the log price histogram. Our price histogram shows that most of our houses fall around \$500,000, with the majority in the range of \$0-1,000,000 and some going out to the \$8 million range. We will use the scaled log of price since it is more normal and will likely produce better results.

Our scaled log of price ranges from -3.46 to 5.33. This translates to a price range of \$75,000 to \$7.7 million. A scaled log price of 0 is equal to \$445,000.
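As a quick check on those figures, here is a small helper that inverts the scaling back to dollars. This is a sketch, assuming the mean and population standard deviation of our log_price column (the population standard deviation is what sklearn's scale uses).

In [ ]:
# Sketch: invert the z-scaling of log price back to dollars.
log_mean = df['log_price'].mean()
log_std = np.std(df['log_price'].values)  # population std, matching sklearn's scale()

def scaled_log_to_price(z):
    """Convert a scaled log price back to a dollar price."""
    return math.exp(z * log_std + log_mean)

# Roughly $75,000, $445,000, and $7.7 million respectively
print(scaled_log_to_price(df['logPS'].min()),
      scaled_log_to_price(0),
      scaled_log_to_price(df['logPS'].max()))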

Let's pick some columns that we think may have an effect on price and let's create a pair plot.

In [31]:
#Pairplot of given columns
num_cols = ["logPS", "bedrooms", "bathrooms", "sqft_living", "sqft_lot",  "grade", "zipcode", "yr_renovated", "yr_built"] 
sns.pairplot(df.loc[:, num_cols], 
             diag_kind="kde", 
             size=2)
Out[31]:
<seaborn.axisgrid.PairGrid at 0x2b5ce694f60>

Now let's create some box plots and scatter plots of the features that seem to affect the price the most based on our pairplot above.

Let's start with a boxplot of the number of bedrooms vs log price scaled.

In [32]:
plotBox('bedrooms', 'logPS', df)

It seems that bedroom count does in fact have an effect on price, as price gradually increases with the number of bedrooms; however, the increase in price tapers off once the bedroom count passes 5.

Let's look at the boxplot of the number of bathrooms in relation to house prices next.

In [33]:
plotBox('bathrooms', 'logPS', df, rotation = 90)

This boxplot shows a clear trend between bathroom count and price with an almost linear increase in price with the increase in bathroom count.

Now let's look at the scatter plot of Square foot living space and price.

In [34]:
ax = plt.figure(figsize=(9, 6)).gca() 
df.plot.scatter(x = 'sqft_living', y = 'logPS', ax = ax)
ax.set_title('Sqft Living vs Log Price Scaled') 
ax.set_ylabel('Log Price Scaled')
ax.set_xlabel('Square Foot Living')
Out[34]:
Text(0.5,0,'Square Foot Living')

As we can see, square feet of living space clearly has an impact on price, with a well-defined pattern of larger houses being more expensive.

Let's add a waterfront hue on the scatter plot. This will show us if a waterfront house has an impact on price.

In [35]:
sns.lmplot(x = 'sqft_living', y = 'logPS', 
           data = df, 
           hue = "wfName", 
           palette="seismic", 
           scatter_kws={'alpha':0.3, 
                        },
           fit_reg = False)
plt.xlabel('Square Foot Living')
plt.ylabel('House Log Price Scaled')
plt.title('SQFT Living vs House Price with Hue by Waterfront\n')
Out[35]:
Text(0.5,1,'SQFT Living vs House Price with Hue by Waterfront\n')

It does seem like being a waterfront house makes a difference as most of the expensive houses by square foot living are colored red, which indicates a waterfront house.

Now let's look at the King county house grade vs price.

In [36]:
plotBox('grade', 'logPS', df)

This looks to be a very good indicator of price. House prices steadily increase as the grade increases, with a nearly linear increase from grades 6 through 13.

Let's see if the year the house was renovated plays a factor on price.

In [37]:
plotBox('yr_renovated', 'logPS', df, x = 12, rotation = 90)

It seems that houses renovated after 1978 do see a slight increase in price over non-renovated houses.

Let's look at the year built vs price next.

In [38]:
plotBox('yr_built', 'logPS', df, x = 18, rotation = 90)

The year built surprisingly does not seem to affect the price.

Let's take a look at how zipcode affects the price.

In [39]:
plotBox('zipcode', 'logPS', df, x = 18, rotation = 90)

The prices are pretty spread out when it comes to zipcode. A few zipcodes are clearly more expensive than others. Zipcode 98039 is the most expensive by far, with the lower quartile of its scaled log price above 2. The second most expensive zipcode is 98004, with a median near 2 and an upper quartile above 2. The third most expensive zipcode is 98040, whose upper quartile just crosses a scaled log price of 2. 98112 is the fourth most expensive zipcode, with a very similar median and upper quartile to 98040. The rest of the zipcodes are significantly lower.

The most expensive zipcode, 98039, covers the Medina area. 98004 covers the area right next to Medina, including Clyde Hill and Bellevue. 98040 covers Mercer Island, and 98112 covers the Montlake area. Surprisingly, the top two most expensive areas are on the Eastside, while Mercer Island and Montlake, which are in very close proximity, are the most expensive in the Seattle area. Unsurprisingly, Medina is the most expensive zipcode, as it is where some very successful CEOs are known to live.

Let's create dummy variables for the top 4 zipcodes; later we will see whether this gives us better results than simply using the zipcode feature in our regression models.

In [40]:
#Decode zipcode
df['zipTop'] = 'otherZip'
df.loc[df['zipcode'] == 98039, 'zipTop'] = 'zip1'
df.loc[df['zipcode'] == 98004, 'zipTop'] = 'zip2'
df.loc[df['zipcode'] == 98040, 'zipTop'] = 'zip3'
df.loc[df['zipcode'] == 98112, 'zipTop'] = 'zip4'

#encode zipTop
dummies = pd.get_dummies(df.loc[:, 'zipTop'], drop_first = True)
df[list(dummies.columns)] = dummies

Now let's keep going with our data exploration.

Let's see if being a waterfront property seems to have an effect on price.

In [41]:
plotBox('waterfront', 'logPS', df)

Although there is some overlap, it seems that waterfront and non-waterfront house prices are drawn from different distributions.

Let's bootstrap this as well to see if this provides a more clear result.

In [42]:
bootStrap(df, 'waterfront', 0, 1, 'logPS')

Looking at the above graph, our 95% confidence intervals do not overlap at all; in fact there is a large gap between the two distributions' plots. It is pretty clear that being a waterfront house does affect price.

Now let's bootstrap the mean difference in price between waterfront houses and not.

In [43]:
bootStrapDiffs(df, 'waterfront', 0, 1, 'logPS')
In [44]:
bootStrapDiffs(df, 'waterfront', 0, 1, 'price')

Above we can see that the difference in mean scaled log price between non-waterfront and waterfront houses is about -2, meaning non-waterfront houses in general had a mean scaled log price roughly 2 lower than waterfront houses. This is a price difference of just over \$1.1 million, a significant difference in mean house prices.

Now let's take a look at a plot for newly renovated houses versus not renovated or older renovated houses.

In [45]:
plotBox('newReno', 'logPS', df)

These distributions look fairly similar; however, whether the house was newly renovated does seem to have an effect on price, as the newly renovated houses' lower quartile is about the same as the older renovated houses' mean.

Let's bootstrap this as well to see if this provides a more clear result.

In [46]:
bootStrap(df, 'newReno', 0, 1, 'logPS')

Looking at the above graph, our 95% confidence intervals do not overlap at all; in fact there is a large gap between the two distributions' plots. It does seem that being newly renovated affects price.

Let's bootstrap the mean difference in log price of newly renovated houses and not.

In [47]:
bootStrapDiffs(df, 'newReno', 0, 1, 'logPS')
In [48]:
bootStrapDiffs(df, 'newReno', 0, 1, 'price')

Above we can see that the difference in mean scaled log price between houses without a new renovation and newly renovated houses is about -0.73, meaning houses that were not newly renovated in general had a mean scaled log price roughly 0.73 lower than newly renovated houses. This is a mean price difference of around \$280,000.

Now let's take a look at a plot for newly built houses versus older houses.

In [49]:
plotBox('newBuild', 'logPS', df)

These distributions seem to be very similar. Whether the house was newly built or not does not seem to have an effect on price.

Let's bootstrap this as well to see if this provides a more clear result.

In [50]:
bootStrap(df, 'newBuild', 0, 1, 'logPS')

Looking at the above graph, our 95% confidence intervals do not overlap at all; in fact there is a large gap between the two distributions' plots. It actually seems that being newly built does affect price.

Let's also bootstrap their mean differences.

In [51]:
bootStrapDiffs(df, 'newBuild', 0, 1, 'logPS')
In [52]:
bootStrapDiffs(df, 'newBuild', 0, 1, 'price')

Above we can see that the difference in mean scaled log price between older and newly built houses is about -0.26, meaning older houses in general had a mean scaled log price roughly 0.26 lower than newer houses. This is a price difference of about \$65,000.

Now let's take a look at a plot for houses with basements vs houses without.

In [53]:
plotBox('hasBsmnt', 'logPS', df)

These distributions seem to be very similar. Whether the house has a basement or not does not seem to have an effect on price.

Let's bootstrap this as well to see if this provides a more clear result.

In [54]:
bootStrap(df, 'hasBsmnt', 0, 1, 'logPS')

Looking at the above graph, our 95% confidence intervals do not overlap at all; in fact there is a large gap between the two distributions' plots. It actually seems that having a basement does affect price.

Let's also bootstrap their mean differences.

In [55]:
bootStrapDiffs(df, 'hasBsmnt', 0, 1, 'logPS')
In [56]:
bootStrapDiffs(df, 'hasBsmnt', 0, 1, 'price')

Above we can see that the difference in mean scaled log price between houses without basements and houses with basements is about -0.43, meaning houses without basements in general had a mean scaled log price roughly 0.43 lower than houses with basements. This is a price difference of about \$135,000.

We've gained some valuable insight on these features. It seems so far that the features that affect price the most are whether it's waterfront, its King County grade, its bathroom count, its zipcode, and whether it was newly renovated.

Let's look at the correlation of our features with price.

In [57]:
house_corr = df[["logPS", "bedrooms", "bathrooms", 'condition', 'floors','lat','long','sqft_above', 'sqft_basement', "sqft_living", "sqft_living15", "sqft_lot", "sqft_lot15", 'view', "waterfront", "grade", "zipcode", "yr_renovated", "yr_built", "renovated", 'newReno', 'newBuild', 'hasBsmnt']].corr()
In [58]:
fig = plt.figure(figsize=(16,12)) # define plot area
ax = fig.gca()
ax.set_title('Correlation matrix for numeric features')
sns.heatmap(round(house_corr,2), ax = ax, linewidths=0.25,vmax=1.0, square=True, cmap="RdYlGn", linecolor='k', annot=True)
Out[58]:
<matplotlib.axes._subplots.AxesSubplot at 0x2b5d47fc550>

As we can see, the features most correlated with price are Square Feet Living, Grade, Square Feet Living15, Square Feet Above, and Bathrooms, and, on the less correlated end, Bedrooms, Latitude, View, Square Feet Basement, Floors, Has Basement, and Waterfront.

Let's plot a pairplot of just these features, with a hue for whether the house is waterfront or not.

In [59]:
num_cols = ['logPS', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_living15', 'sqft_above', 'grade', 'lat', 'sqft_basement', 'view', 'floors', 'hasBsmnt', 'wfName'] 
sns.pairplot(df.loc[:, num_cols], hue='wfName', 
             palette="seismic", diag_kind="kde", 
             size=2)
Out[59]:
<seaborn.axisgrid.PairGrid at 0x2b5d4825908>

This is a very illuminating plot; we seem to have found the features that affect price the most. The top row shows how price relates to each feature as that feature increases. We can also clearly see that, for a given feature value, waterfront houses generally have a higher price.

Let's take a look at a facet grid plot. Our rows will be the King County grade and our columns the view rating. We will plot scatter plots of square foot living against price, with a hue for each house's waterfront status.

In [60]:
g = sns.FacetGrid(df, 
                  col='view', 
                  row='grade',  
                  hue='wfName', 
                  palette='Set1') 
g = (g.map(plt.scatter, 'sqft_living', 'logPS').add_legend())  

Looking at the above facet grid we can see that most houses fall between grades 5 and 11. We can also see that waterfront houses generally have higher view ratings, with most waterfront houses rated 4 and generally 3 or higher. Prices clearly increase as the grade increases, and houses with higher view ratings sell at the upper end of a given grade's price range. We can also see that as a house's grade increases, its average square feet of living space increases as well.

Now that we seem to have found some interesting relationships let's create some linear regression models of our data and see if our model is any good at predicting.

Let's take a quick look at our final original data frame first.

In [61]:
df.head()
Out[61]:
date price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition grade sqft_above sqft_basement yr_built yr_renovated zipcode lat long sqft_living15 sqft_lot15 month year renovated newReno newBuild hasBsmnt wfName log_price priceS logPS viewName condName gradeName view1 view2 view3 view4 cond2 cond3 cond4 cond5 grade10 grade11 grade12 grade13 grade3 grade4 grade5 grade6 grade7 grade8 grade9 zipTop zip1 zip2 zip3 zip4
0 2014-10-13 221900.0 3 1.00 1180 5650 1.0 0 0 3 7 1180 0 1955 0 98178 47.5112 -122.257 1340 5650 10 2014 0 0 0 0 noWtrFrnt 12.309982 -0.866686 -1.400889 view0 cond3 grade7 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 otherZip 0 0 0 0
1 2014-12-09 538000.0 3 2.25 2570 7242 2.0 0 0 3 7 2170 400 1951 1991 98125 47.7210 -122.319 1690 7639 12 2014 1 0 0 1 noWtrFrnt 13.195614 -0.005675 0.280648 view0 cond3 grade7 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 otherZip 0 0 0 0
2 2015-02-25 180000.0 2 1.00 770 10000 1.0 0 0 3 6 770 0 1933 0 98028 47.7379 -122.233 2720 8062 2 2015 0 0 0 0 noWtrFrnt 12.100712 -0.980816 -1.798227 view0 cond3 grade6 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 otherZip 0 0 0 0
3 2014-12-09 604000.0 4 3.00 1960 5000 1.0 0 0 5 7 1050 910 1965 0 98136 47.5208 -122.393 1360 5000 12 2014 0 0 0 1 noWtrFrnt 13.311329 0.174099 0.500356 view0 cond5 grade7 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 otherZip 0 0 0 0
4 2015-02-18 510000.0 3 2.00 1680 8080 1.0 0 0 3 8 1680 0 1987 0 98074 47.6168 -122.045 1800 7503 2 2015 0 0 1 0 noWtrFrnt 13.142166 -0.081943 0.179168 view0 cond3 grade8 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 otherZip 0 0 0 0
In [62]:
df.columns
Out[62]:
Index(['date', 'price', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot',
       'floors', 'waterfront', 'view', 'condition', 'grade', 'sqft_above',
       'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode', 'lat', 'long',
       'sqft_living15', 'sqft_lot15', 'month', 'year', 'renovated', 'newReno',
       'newBuild', 'hasBsmnt', 'wfName', 'log_price', 'priceS', 'logPS',
       'viewName', 'condName', 'gradeName', 'view1', 'view2', 'view3', 'view4',
       'cond2', 'cond3', 'cond4', 'cond5', 'grade10', 'grade11', 'grade12',
       'grade13', 'grade3', 'grade4', 'grade5', 'grade6', 'grade7', 'grade8',
       'grade9', 'zipTop', 'zip1', 'zip2', 'zip3', 'zip4'],
      dtype='object')

Models

Let's create a testing and training df to use. We will drop our excess price columns. We will also drop sqft above and sqft basement, as they are collinear with sqft living. We will drop month, date, and year, since there is not enough data for these to be useful factors. We will drop view, condition, grade, and their respective name features, as we have those dummied. And we will remove year built and year renovated, since our renovated, newReno, and newBuild features provide the same information in dummied form. We will then z-normalize all of our numerical variables and add an intercept column. Finally, we will take a look at our prepared data frame.

In [63]:
#Create new df
dfReady = df.copy(deep = True)
#drop columns
dfReady.drop(['date','price', 'priceS', 'log_price','sqft_above', 'sqft_basement', 'month', 'year', 'view', 'condition', 'grade', 'viewName', 'condName', 'gradeName', 'yr_built', 'yr_renovated', 'wfName', 'zipTop'],axis=1,inplace = True)
#z normalize variables
dfReady.loc[:,['bedrooms','bathrooms','sqft_living','sqft_lot','floors','zipcode','lat','long','sqft_living15','sqft_lot15']] = df.loc[:,['bedrooms','bathrooms','sqft_living','sqft_lot','floors','zipcode','lat','long','sqft_living15','sqft_lot15']].apply(zscore)
#add intercept column
dfReady['intercept'] = 1
print(dfReady.shape)
dfReady.head()
(21612, 40)
Out[63]:
bedrooms bathrooms sqft_living sqft_lot floors waterfront zipcode lat long sqft_living15 sqft_lot15 renovated newReno newBuild hasBsmnt logPS view1 view2 view3 view4 cond2 cond3 cond4 cond5 grade10 grade11 grade12 grade13 grade3 grade4 grade5 grade6 grade7 grade8 grade9 zip1 zip2 zip3 zip4 intercept
0 -0.406924 -1.447460 -0.979841 -0.228326 -0.915466 0 1.870139 -0.352528 -0.306115 -0.943398 -0.260724 0 0 0 0 -1.400889 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1
1 -0.406924 0.175582 0.533602 -0.189891 0.936460 0 0.879573 1.161607 -0.746375 -0.432730 -0.187877 1 0 0 1 0.280648 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1
2 -1.508293 -1.447460 -1.426252 -0.123306 -0.915466 0 -0.933350 1.283575 -0.135692 1.070093 -0.172385 0 0 0 0 -1.798227 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1
3 0.694446 1.149406 -0.130571 -0.244019 -0.915466 0 1.085163 -0.283244 -1.271845 -0.914217 -0.284530 0 0 0 1 0.500356 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1
4 -0.406924 -0.149027 -0.435437 -0.169660 -0.915466 0 -0.073613 0.409591 1.199288 -0.272234 -0.192858 0 0 1 0 0.179168 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1
In [64]:
dfReady.dtypes
Out[64]:
bedrooms         float64
bathrooms        float64
sqft_living      float64
sqft_lot         float64
floors           float64
waterfront         int64
zipcode          float64
lat              float64
long             float64
sqft_living15    float64
sqft_lot15       float64
renovated          int64
newReno            int64
newBuild           int64
hasBsmnt           int64
logPS            float64
view1              uint8
view2              uint8
view3              uint8
view4              uint8
cond2              uint8
cond3              uint8
cond4              uint8
cond5              uint8
grade10            uint8
grade11            uint8
grade12            uint8
grade13            uint8
grade3             uint8
grade4             uint8
grade5             uint8
grade6             uint8
grade7             uint8
grade8             uint8
grade9             uint8
zip1               uint8
zip2               uint8
zip3               uint8
zip4               uint8
intercept          int64
dtype: object

Now let's split our ready data frame into a training and test set, with an 80/20 split.

In [65]:
from sklearn.model_selection import train_test_split
#train/test split
train, test = train_test_split(dfReady, test_size = .2) 
print(train.shape)
print(test.shape)
(17289, 40)
(4323, 40)

Now let's create our Xtrain, yTrain, Xtest and yTest data frames.

In [66]:
Xtest = test.loc[:, test.drop(['logPS'], axis = 1).columns]
yTest = test.loc[:, 'logPS']
Xtrain = train.loc[:, train.drop(['logPS'], axis = 1).columns]
yTrain = train.loc[:, 'logPS']

Earlier we created dummy columns for our top 4 zipcodes; however, we also kept the zipcode column, as we were not sure which would prove more useful for our model. Now we will create one list of X columns that keeps zipcode but drops the dummy columns, and another that keeps the dummy columns but drops zipcode. We do this so we do not pass collinear features to our model.

In [67]:
xCatsZipOnly = list(Xtrain.drop(['zip1','zip2','zip3','zip4'], axis = 1).columns) #Drops dummy column keeps zipcode
xCatsZipTop = list(Xtrain.drop(['zipcode'], axis = 1).columns) #drops zipcode keeps dummy columns

OLS Models

Now let's fit our first model to all features and the zipcode.

In [68]:
ols_model = sm.OLS(yTrain, Xtrain[xCatsZipOnly])

results = ols_model.fit()
n_points = Xtrain.shape[0]
y_output = yTrain

print('RMSE: {}'.format(np.sqrt(results.mse_model)))

# Get most of the linear regression statistics we are interested in:
print(results.summary())

# Plot a histogram of the residuals
sns.distplot(results.resid, hist=True)
plt.xlabel('Residual')
plt.ylabel('Frequency')
plt.title('Residual Histogram')
RMSE: 19.76486465255466
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  logPS   R-squared:                       0.767
Model:                            OLS   Adj. R-squared:                  0.766
Method:                 Least Squares   F-statistic:                     1668.
Date:                Sat, 23 Jun 2018   Prob (F-statistic):               0.00
Time:                        17:14:53   Log-Likelihood:                -11968.
No. Observations:               17289   AIC:                         2.401e+04
Df Residuals:                   17254   BIC:                         2.428e+04
Df Model:                          34                                         
Covariance Type:            nonrobust                                         
=================================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
bedrooms         -0.0362      0.005     -7.241      0.000      -0.046      -0.026
bathrooms         0.0585      0.007      8.604      0.000       0.045       0.072
sqft_living       0.2702      0.009     30.508      0.000       0.253       0.288
sqft_lot          0.0429      0.006      7.519      0.000       0.032       0.054
floors            0.0835      0.005     16.036      0.000       0.073       0.094
waterfront        0.7868      0.054     14.476      0.000       0.680       0.893
zipcode          -0.0618      0.005    -12.925      0.000      -0.071      -0.052
lat               0.3754      0.004     94.576      0.000       0.368       0.383
long             -0.0494      0.005     -9.971      0.000      -0.059      -0.040
sqft_living15     0.1320      0.006     20.630      0.000       0.119       0.145
sqft_lot15       -0.0139      0.005     -2.547      0.011      -0.025      -0.003
renovated         0.0664      0.028      2.336      0.019       0.011       0.122
newReno           0.2703      0.036      7.431      0.000       0.199       0.342
newBuild         -0.2518      0.011    -22.236      0.000      -0.274      -0.230
hasBsmnt          0.0935      0.009     10.134      0.000       0.075       0.112
view1             0.3175      0.030     10.432      0.000       0.258       0.377
view2             0.2487      0.018     13.601      0.000       0.213       0.285
view3             0.3415      0.026     13.334      0.000       0.291       0.392
view4             0.4460      0.039     11.293      0.000       0.369       0.523
cond2             0.0478      0.107      0.445      0.656      -0.162       0.258
cond3             0.2656      0.100      2.663      0.008       0.070       0.461
cond4             0.4177      0.100      4.185      0.000       0.222       0.613
cond5             0.5822      0.100      5.799      0.000       0.385       0.779
grade10           1.7937      0.495      3.622      0.000       0.823       2.764
grade11           1.9231      0.496      3.876      0.000       0.951       2.895
grade12           2.0644      0.499      4.137      0.000       1.086       3.043
grade13           2.1990      0.518      4.242      0.000       1.183       3.215
grade3            1.3826      0.601      2.299      0.022       0.204       2.561
grade4            0.2236      0.504      0.443      0.657      -0.765       1.212
grade5            0.3776      0.495      0.763      0.445      -0.592       1.348
grade6            0.6418      0.494      1.298      0.194      -0.327       1.611
grade7            0.9472      0.495      1.915      0.055      -0.022       1.917
grade8            1.2479      0.495      2.523      0.012       0.278       2.218
grade9            1.5707      0.495      3.174      0.002       0.601       2.541
intercept        -1.4179      0.484     -2.927      0.003      -2.368      -0.468
==============================================================================
Omnibus:                      251.244   Durbin-Watson:                   2.001
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              480.527
Skew:                          -0.000   Prob(JB):                    4.52e-105
Kurtosis:                       3.817   Cond. No.                         877.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Out[68]:
Text(0.5,1,'Residual Histogram')

Now let's predict on our test set and calculate the SSR of our model.

In [69]:
y_pred = results.predict(Xtest[xCatsZipOnly])
In [70]:
compareOLS = pd.DataFrame({'Actual': yTest, 'Pred': y_pred})
compareOLS['resid'] = compareOLS['Actual'] - compareOLS['Pred']
compareOLS['resid2'] = compareOLS['resid'] ** 2
print('SSR: ', compareOLS.resid2.sum())

SSR = compareOLS.resid2.sum()
SST = np.sum(np.square(yTest - np.mean(yTest)))
R2 = 1.0 - (SSR / SST)
print('R-squared = {}'.format(R2))

compareOLS.head()
SSR:  1019.22365798
R-squared = 0.7622637004211023
Out[70]:
Actual Pred resid resid2
6054 -2.144398 -1.122823 -1.021575 1.043616
2029 -0.676353 -1.296547 0.620194 0.384640
12300 0.888498 -0.126171 1.014669 1.029553
8062 -1.598180 -1.524343 -0.073837 0.005452
12346 -0.330182 -0.471712 0.141530 0.020031

This is not a bad model already, with an R-squared value of .76 and an SSR of about 1019 on the test set.

Now let's see if our model would be better using the dummied zipcode features instead of the actual zipcodes.

In [71]:
ols_model = sm.OLS(yTrain, Xtrain[xCatsZipTop])

results = ols_model.fit()
n_points = Xtrain.shape[0]
y_output = yTrain

print('RMSE: {}'.format(np.sqrt(results.mse_model)))

# Get most of the linear regression statistics we are interested in:
print(results.summary())

# Plot a histogram of the residuals
sns.distplot(results.resid, hist=True)
plt.xlabel('Residual')
plt.ylabel('Frequency')
plt.title('Residual Histogram')
RMSE: 19.257899321264812
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  logPS   R-squared:                       0.792
Model:                            OLS   Adj. R-squared:                  0.792
Method:                 Least Squares   F-statistic:                     1776.
Date:                Sat, 23 Jun 2018   Prob (F-statistic):               0.00
Time:                        17:14:54   Log-Likelihood:                -10971.
No. Observations:               17289   AIC:                         2.202e+04
Df Residuals:                   17251   BIC:                         2.231e+04
Df Model:                          37                                         
Covariance Type:            nonrobust                                         
=================================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
bedrooms         -0.0315      0.005     -6.679      0.000      -0.041      -0.022
bathrooms         0.0511      0.006      7.954      0.000       0.038       0.064
sqft_living       0.2696      0.008     32.230      0.000       0.253       0.286
sqft_lot          0.0475      0.005      8.816      0.000       0.037       0.058
floors            0.0782      0.005     15.933      0.000       0.069       0.088
waterfront        0.7906      0.051     15.393      0.000       0.690       0.891
lat               0.3607      0.004     98.619      0.000       0.354       0.368
long             -0.0143      0.004     -3.348      0.001      -0.023      -0.006
sqft_living15     0.1117      0.006     18.463      0.000       0.100       0.124
sqft_lot15       -0.0119      0.005     -2.293      0.022      -0.022      -0.002
renovated         0.0470      0.027      1.750      0.080      -0.006       0.100
newReno           0.2428      0.034      7.069      0.000       0.175       0.310
newBuild         -0.1863      0.011    -17.476      0.000      -0.207      -0.165
hasBsmnt          0.0825      0.009      9.460      0.000       0.065       0.100
view1             0.2713      0.029      9.432      0.000       0.215       0.328
view2             0.2598      0.017     15.064      0.000       0.226       0.294
view3             0.3772      0.024     15.592      0.000       0.330       0.425
view4             0.4969      0.037     13.320      0.000       0.424       0.570
cond2             0.0891      0.101      0.880      0.379      -0.109       0.288
cond3             0.3143      0.094      3.337      0.001       0.130       0.499
cond4             0.4540      0.094      4.817      0.000       0.269       0.639
cond5             0.6272      0.095      6.616      0.000       0.441       0.813
grade10           1.6646      0.468      3.560      0.000       0.748       2.581
grade11           1.7365      0.468      3.708      0.000       0.818       2.655
grade12           1.8586      0.471      3.945      0.000       0.935       2.782
grade13           1.9126      0.490      3.907      0.000       0.953       2.872
grade3            1.2931      0.568      2.278      0.023       0.180       2.406
grade4            0.1911      0.476      0.402      0.688      -0.742       1.124
grade5            0.3330      0.467      0.713      0.476      -0.583       1.249
grade6            0.5858      0.467      1.255      0.210      -0.329       1.501
grade7            0.8977      0.467      1.923      0.055      -0.017       1.813
grade8            1.1835      0.467      2.534      0.011       0.268       2.099
grade9            1.4750      0.467      3.157      0.002       0.559       2.391
zip1              1.3003      0.077     16.779      0.000       1.148       1.452
zip2              1.0561      0.029     35.823      0.000       0.998       1.114
zip3              0.6384      0.031     20.289      0.000       0.577       0.700
zip4              0.7354      0.032     22.834      0.000       0.672       0.799
intercept        -1.4614      0.457     -3.195      0.001      -2.358      -0.565
==============================================================================
Omnibus:                      336.593   Durbin-Watson:                   1.999
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              634.862
Skew:                          -0.126   Prob(JB):                    1.39e-138
Kurtosis:                       3.904   Cond. No.                         859.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Out[71]:
Text(0.5,1,'Residual Histogram')
In [72]:
y_pred = results.predict(Xtest[xCatsZipTop])
In [73]:
compareOLS = pd.DataFrame({'Actual': yTest, 'Pred': y_pred})
compareOLS['resid'] = compareOLS['Actual'] - compareOLS['Pred']
compareOLS['resid2'] = compareOLS['resid'] ** 2
print('SSR: ', compareOLS.resid2.sum())

SSR = compareOLS.resid2.sum()
SST = np.sum(np.square(yTest - np.mean(yTest)))
R2 = 1.0 - (SSR / SST)
print('R-squared = {}'.format(R2))

compareOLS.head()
SSR:  908.814891103
R-squared = 0.7880168032596412
Out[73]:
Actual Pred resid resid2
6054 -2.144398 -1.150333 -0.994065 0.988165
2029 -0.676353 -1.231522 0.555169 0.308213
12300 0.888498 0.416083 0.472415 0.223176
8062 -1.598180 -1.614967 0.016787 0.000282
12346 -0.330182 -0.422568 0.092387 0.008535

This model is clearly better than our original. Our test R-squared has increased to .788, our SSR has dropped to 909, the AIC has dropped, and our RMSE has dropped from 20 to 19.

Going forward we will use the dummied zipcode columns rather than the raw zipcode column.

Now let's conduct a backward stepwise regression to see if we can find a better model.
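For reference, the selection criterion used here is Akaike's information criterion, AIC = 2k − 2·ln(L̂), where k is the number of fitted parameters and L̂ is the model's maximized likelihood; lower AIC is better, and the 2k term penalizes features that do not improve the fit enough to justify their inclusion.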

In [74]:
def backward_selected(X, Y):
    # Start with all factors and intercept
    possible_factors = list(X.columns)
        
    best_aic = sm.OLS(Y, X).fit().aic
    current_aic = best_aic
    
    
    nochange = False
    #loops through all columns
    for i in range(len(X.columns)):
        #makes sure a change has been made, if no change model is done
        if nochange == False:
            nochange = True
            to_try_remove = possible_factors
            aic_candidates = []
            
            for candidate in to_try_remove:
                columns = possible_factors.copy()
                columns.remove(candidate)
                
                aic = sm.OLS(Y,X[columns]).fit().aic

                # Append tuple of the form (aic, response)
                aic_candidates.append((aic, candidate))

            # Sort all the pairs by the first entry of tuple (default of sort() method)
            aic_candidates.sort()
            #get best new AIC and feature removed
            best_new_aic, best_candidate = aic_candidates.pop(0)

            # Now check if we have something better:
            if best_new_aic < current_aic:
                # Remove the best candidate's name from possible_factors
                # set the current AIC to best AIC
                # reset no change flag
                possible_factors.remove(best_candidate)       
                current_aic = best_new_aic
                nochange = False
            
            
        # Now we repeat the process with all the remaining candidate columns

    # Here is the final formula!
    formula = "{} ~ {} + 1".format('logPS', ' + '.join(possible_factors))
    print(formula)
    # Get the model object
    model = sm.OLS(Y, X[possible_factors]).fit()
    return possible_factors, model #return best features list and model
In [75]:
#apply backwards stepwise regression
bestXCols, backwards_model = backward_selected(Xtrain[xCatsZipTop], yTrain)

print('\nAdjusted R-Squared: {}'.format(backwards_model.rsquared_adj)) 
print('AIC: {}'.format(backwards_model.aic)) 
logPS ~ bedrooms + bathrooms + sqft_living + sqft_lot + floors + waterfront + lat + long + sqft_living15 + sqft_lot15 + renovated + newReno + newBuild + hasBsmnt + view1 + view2 + view3 + view4 + cond3 + cond4 + cond5 + grade10 + grade11 + grade12 + grade13 + grade3 + grade5 + grade6 + grade7 + grade8 + grade9 + zip1 + zip2 + zip3 + zip4 + intercept + 1

Adjusted R-Squared: 0.7916691170640058
AIC: 22015.546023861156

Let's refit the model on this best feature list to look at it in more detail and to use it to predict our test house prices.

In [76]:
ols_model = sm.OLS(yTrain,Xtrain[bestXCols]) 

results = ols_model.fit() 

print('RMSE: {}'.format(np.sqrt(results.mse_model)))

# Get most of the linear regression statistics we are interested in:
print(results.summary())

# Plot a histogram of the residuals
sns.distplot(results.resid, hist=True)
plt.xlabel('Residual')
plt.ylabel('Frequency')
plt.title('Residual Histogram')
RMSE: 19.800316230301597
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  logPS   R-squared:                       0.792
Model:                            OLS   Adj. R-squared:                  0.792
Method:                 Least Squares   F-statistic:                     1878.
Date:                Sat, 23 Jun 2018   Prob (F-statistic):               0.00
Time:                        17:14:56   Log-Likelihood:                -10972.
No. Observations:               17289   AIC:                         2.202e+04
Df Residuals:                   17253   BIC:                         2.229e+04
Df Model:                          35                                         
Covariance Type:            nonrobust                                         
=================================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
bedrooms         -0.0315      0.005     -6.671      0.000      -0.041      -0.022
bathrooms         0.0511      0.006      7.963      0.000       0.039       0.064
sqft_living       0.2696      0.008     32.227      0.000       0.253       0.286
sqft_lot          0.0475      0.005      8.818      0.000       0.037       0.058
floors            0.0782      0.005     15.924      0.000       0.069       0.088
waterfront        0.7902      0.051     15.386      0.000       0.690       0.891
lat               0.3606      0.004     98.618      0.000       0.353       0.368
long             -0.0143      0.004     -3.350      0.001      -0.023      -0.006
sqft_living15     0.1115      0.006     18.446      0.000       0.100       0.123
sqft_lot15       -0.0118      0.005     -2.286      0.022      -0.022      -0.002
renovated         0.0472      0.027      1.758      0.079      -0.005       0.100
newReno           0.2426      0.034      7.063      0.000       0.175       0.310
newBuild         -0.1863      0.011    -17.469      0.000      -0.207      -0.165
hasBsmnt          0.0826      0.009      9.466      0.000       0.065       0.100
view1             0.2714      0.029      9.436      0.000       0.215       0.328
view2             0.2599      0.017     15.068      0.000       0.226       0.294
view3             0.3772      0.024     15.594      0.000       0.330       0.425
view4             0.4968      0.037     13.318      0.000       0.424       0.570
cond3             0.2392      0.036      6.560      0.000       0.168       0.311
cond4             0.3789      0.037     10.334      0.000       0.307       0.451
cond5             0.5521      0.038     14.479      0.000       0.477       0.627
grade10           1.4862      0.099     14.996      0.000       1.292       1.680
grade11           1.5582      0.103     15.184      0.000       1.357       1.759
grade12           1.6804      0.115     14.674      0.000       1.456       1.905
grade13           1.7344      0.175      9.903      0.000       1.391       2.078
grade3            1.1214      0.337      3.325      0.001       0.460       1.782
grade5            0.1524      0.101      1.508      0.132      -0.046       0.350
grade6            0.4070      0.097      4.216      0.000       0.218       0.596
grade7            0.7191      0.096      7.461      0.000       0.530       0.908
grade8            1.0048      0.097     10.375      0.000       0.815       1.195
grade9            1.2965      0.098     13.268      0.000       1.105       1.488
zip1              1.3003      0.077     16.779      0.000       1.148       1.452
zip2              1.0558      0.029     35.816      0.000       0.998       1.114
zip3              0.6385      0.031     20.292      0.000       0.577       0.700
zip4              0.7352      0.032     22.828      0.000       0.672       0.798
intercept        -1.2077      0.100    -12.112      0.000      -1.403      -1.012
==============================================================================
Omnibus:                      337.948   Durbin-Watson:                   1.999
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              637.425
Skew:                          -0.127   Prob(JB):                    3.85e-139
Kurtosis:                       3.906   Cond. No.                         195.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Out[76]:
Text(0.5,1,'Residual Histogram')
In [77]:
y_pred = results.predict(Xtest[bestXCols]) 
In [78]:
compareOLS = pd.DataFrame({'Actual': yTest, 'Pred': y_pred}) 
compareOLS['resid'] = compareOLS['Actual'] - compareOLS['Pred'] #Find resid
compareOLS['resid2'] = compareOLS['resid'] ** 2 #Resid squared
print('SSR: ', compareOLS.resid2.sum()) #Sum SSR

SSR = compareOLS.resid2.sum()
SST = np.sum(np.square(yTest - np.mean(yTest)))
R2 = 1.0 - (SSR / SST)
print('R-squared = {}'.format(R2))

compareOLS.head()
SSR:  909.030359783
R-squared = 0.7879665446866322
Out[78]:
Actual Pred resid resid2
6054 -2.144398 -1.150399 -0.993999 0.988035
2029 -0.676353 -1.231154 0.554801 0.307804
12300 0.888498 0.416259 0.472239 0.223010
8062 -1.598180 -1.614813 0.016633 0.000277
12346 -0.330182 -0.422441 0.092260 0.008512

Our backwards selected model has removed two features (cond2 and grade4). This improves on our model: the AIC, SSR, RMSE, and R-squared are essentially unchanged, but we are using fewer features, which helps prevent overfitting. However, renovated still has a p-value over our .05 cutoff, and sqft_lot15, grade3, and grade8 look like good candidates to drop as well to simplify the model.

Let's remove these four features from the model (dropping grade3 and grade8 also folds them into the baseline grade category) and see if this helps.

In [79]:
#Checks if feature in list and removes
if 'grade3' in bestXCols:
    bestXCols.remove('grade3')
if 'renovated' in bestXCols:
    bestXCols.remove('renovated')
if 'sqft_lot15' in bestXCols:
    bestXCols.remove('sqft_lot15')
if 'grade8' in bestXCols:
    bestXCols.remove('grade8')    
In [80]:
ols_model = sm.OLS(yTrain,Xtrain[bestXCols]) 

results = ols_model.fit() 

print('RMSE: {}'.format(np.sqrt(results.mse_model)))

# Get most of the linear regression statistics we are interested in:
print(results.summary())

# Plot a histogram of the residuals
sns.distplot(results.resid, hist=True)
plt.xlabel('Residual')
plt.ylabel('Frequency')
plt.title('Residual Histogram')
RMSE: 21.020261123980823
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  logPS   R-squared:                       0.791
Model:                            OLS   Adj. R-squared:                  0.790
Method:                 Least Squares   F-statistic:                     2103.
Date:                Sat, 23 Jun 2018   Prob (F-statistic):               0.00
Time:                        17:14:57   Log-Likelihood:                -11030.
No. Observations:               17289   AIC:                         2.212e+04
Df Residuals:                   17257   BIC:                         2.237e+04
Df Model:                          31                                         
Covariance Type:            nonrobust                                         
=================================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
bedrooms         -0.0292      0.005     -6.196      0.000      -0.038      -0.020
bathrooms         0.0526      0.006      8.172      0.000       0.040       0.065
sqft_living       0.2723      0.008     32.522      0.000       0.256       0.289
sqft_lot          0.0389      0.004     10.050      0.000       0.031       0.046
floors            0.0798      0.005     16.240      0.000       0.070       0.089
waterfront        0.7950      0.051     15.476      0.000       0.694       0.896
lat               0.3618      0.004     98.689      0.000       0.355       0.369
long             -0.0165      0.004     -3.860      0.000      -0.025      -0.008
sqft_living15     0.1115      0.006     18.400      0.000       0.100       0.123
newReno           0.2889      0.023     12.396      0.000       0.243       0.335
newBuild         -0.1840      0.011    -17.464      0.000      -0.205      -0.163
hasBsmnt          0.0830      0.009      9.486      0.000       0.066       0.100
view1             0.2700      0.029      9.358      0.000       0.213       0.327
view2             0.2601      0.017     15.032      0.000       0.226       0.294
view3             0.3782      0.024     15.599      0.000       0.331       0.426
view4             0.4996      0.037     13.358      0.000       0.426       0.573
cond3             0.2723      0.036      7.488      0.000       0.201       0.344
cond4             0.4112      0.037     11.246      0.000       0.340       0.483
cond5             0.5850      0.038     15.375      0.000       0.510       0.660
grade10           0.4777      0.019     24.573      0.000       0.440       0.516
grade11           0.5439      0.032     17.240      0.000       0.482       0.606
grade12           0.6581      0.059     11.148      0.000       0.542       0.774
grade13           0.7051      0.145      4.872      0.000       0.421       0.989
grade5           -0.8316      0.035    -23.593      0.000      -0.901      -0.762
grade6           -0.5818      0.016    -36.251      0.000      -0.613      -0.550
grade7           -0.2760      0.010    -27.971      0.000      -0.295      -0.257
grade9            0.2914      0.013     22.212      0.000       0.266       0.317
zip1              1.3004      0.078     16.727      0.000       1.148       1.453
zip2              1.0579      0.030     35.770      0.000       1.000       1.116
zip3              0.6409      0.032     20.307      0.000       0.579       0.703
zip4              0.7387      0.032     22.864      0.000       0.675       0.802
intercept        -0.2428      0.037     -6.514      0.000      -0.316      -0.170
==============================================================================
Omnibus:                      392.805   Durbin-Watson:                   2.001
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              764.731
Skew:                          -0.150   Prob(JB):                    8.72e-167
Kurtosis:                       3.986   Cond. No.                         76.3
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Out[80]:
Text(0.5,1,'Residual Histogram')
In [81]:
y_pred = results.predict(Xtest[bestXCols]) 
In [82]:
compareOLS = pd.DataFrame({'Actual': yTest, 'Pred': y_pred}) 
compareOLS['resid'] = compareOLS['Actual'] - compareOLS['Pred'] #Create resid
compareOLS['resid2'] = compareOLS['resid'] ** 2 #create resid squared
print('SSR: ', compareOLS.resid2.sum()) #sum of square resids

SSR = compareOLS.resid2.sum()
SST = np.sum(np.square(yTest - np.mean(yTest)))
R2 = 1.0 - (SSR / SST)
print('R-squared = {}'.format(R2))

compareOLS.head()
SSR:  914.314873556
R-squared = 0.7867339195020727
Out[82]:
Actual Pred resid resid2
6054 -2.144398 -1.146880 -0.997518 0.995042
2029 -0.676353 -1.180939 0.504586 0.254607
12300 0.888498 0.423323 0.465174 0.216387
8062 -1.598180 -1.614362 0.016182 0.000262
12346 -0.330182 -0.435060 0.104879 0.011000

Looking at our model, the test R-squared and SSR are only marginally worse than the backwards selected model and the AIC is slightly higher; however, all of our p-values are now below the .05 cutoff with a simpler feature set, so we will treat this as our best model so far. We seem to have found our best features. Let's use them to fit ridge, lasso, and elastic-net regression models and predict with each to see if they provide better results. We will train these models using only this best feature list.

Ridge, Lasso, and Elastic-Net Regression Models

First let's fit the ridge regression model and predict using it.
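A note on the statsmodels API used below: OLS.fit_regularized with method='elastic_net' minimizes (approximately, per the statsmodels documentation) 0.5·RSS/n + alpha·[(1 − L1_wt)·‖β‖²₂/2 + L1_wt·‖β‖₁]. So L1_wt = 0 gives ridge regression, L1_wt = 1 gives the lasso, and intermediate values give an elastic net, while alpha controls the overall penalty strength.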

In [83]:
ols_model = sm.OLS(yTrain, Xtrain[bestXCols])

results = ols_model.fit_regularized(method ='elastic_net', alpha=0.000247875, L1_wt = 0) 
print(results.params) #Print coefficients
[-0.03120652  0.05426292  0.27895471  0.03833521  0.07960361  0.7629218
  0.36229179 -0.01799991  0.11396278  0.2874783  -0.18442388  0.08103086
  0.2655781   0.25777727  0.37328969  0.5053338   0.23758251  0.3767641
  0.54793728  0.46462173  0.52248695  0.60367078  0.48787155 -0.80778573
 -0.57389448 -0.27239588  0.28490927  1.16702595  1.03825069  0.62739769
  0.72232205 -0.20611307]
In [84]:
y_pred = results.predict(Xtest[bestXCols]) #predict on Xtest
In [85]:
compareOLS = pd.DataFrame({'Actual': yTest, 'Pred': y_pred}) 
compareOLS['resid'] = compareOLS['Actual'] - compareOLS['Pred'] #Create resid
compareOLS['resid2'] = compareOLS['resid'] ** 2 #create resid squared
print('SSR: ', compareOLS.resid2.sum()) #sum of square resids

SSR = compareOLS.resid2.sum()
SST = np.sum(np.square(yTest - np.mean(yTest)))
R2 = 1.0 - (SSR / SST)
print('R-squared = {}'.format(R2))

compareOLS.head()
SSR:  915.266392683
R-squared = 0.7865119754424086
Out[85]:
Actual Pred resid resid2
6054 -2.144398 -1.148122 -0.996276 0.992566
2029 -0.676353 -1.184966 0.508613 0.258687
12300 0.888498 0.413271 0.475227 0.225841
8062 -1.598180 -1.616102 0.017922 0.000321
12346 -0.330182 -0.443093 0.112911 0.012749

Now let's fit the lasso regression model and predict with it.

In [86]:
ols_model = sm.OLS(yTrain, Xtrain[bestXCols])

results = ols_model.fit_regularized(method ='elastic_net', alpha=0.000247875, L1_wt = 1) 
print(results.params) #Print coefficients and features
bedrooms        -0.033032
bathrooms        0.055021
sqft_living      0.288573
sqft_lot         0.036923
floors           0.078565
waterfront       0.753079
lat              0.362797
long            -0.019207
sqft_living15    0.114450
newReno          0.280637
newBuild        -0.183638
hasBsmnt         0.076842
view1            0.250847
view2            0.254465
view3            0.367611
view4            0.505622
cond3            0.056203
cond4            0.195473
cond5            0.365336
grade10          0.443845
grade11          0.489095
grade12          0.541142
grade13          0.000000
grade5          -0.816536
grade6          -0.576567
grade7          -0.274742
grade9           0.271523
zip1             1.210335
zip2             1.039525
zip3             0.620356
zip4             0.715637
intercept       -0.018950
dtype: float64
In [87]:
y_pred = results.predict(Xtest[bestXCols]) 
In [88]:
compareOLS = pd.DataFrame({'Actual': yTest, 'Pred': y_pred}) 
compareOLS['resid'] = compareOLS['Actual'] - compareOLS['Pred'] #Create resid
compareOLS['resid2'] = compareOLS['resid'] ** 2 #create resid squared
print('SSR: ', compareOLS.resid2.sum()) #sum of square resids

SSR = compareOLS.resid2.sum()
SST = np.sum(np.square(yTest - np.mean(yTest)))
R2 = 1.0 - (SSR / SST)
print('R-squared = {}'.format(R2))

compareOLS.head()
SSR:  918.869215982
R-squared = 0.7856716084901911
Out[88]:
Actual Pred resid resid2
6054 -2.144398 -1.157557 -0.986841 0.973856
2029 -0.676353 -1.192430 0.516077 0.266335
12300 0.888498 0.405834 0.482664 0.232964
8062 -1.598180 -1.623470 0.025290 0.000640
12346 -0.330182 -0.447519 0.117337 0.013768

Now let's fit the elastic-net regression model and predict using it.

In [89]:
ols_model = sm.OLS(yTrain, Xtrain[bestXCols])

results = ols_model.fit_regularized(method ='elastic_net', alpha=0.000247875, L1_wt = .5) 
print(results.params) #Print coefficients and features
bedrooms        -0.032854
bathrooms        0.054810
sqft_living      0.286986
sqft_lot         0.037080
floors           0.078698
waterfront       0.753009
lat              0.362704
long            -0.018881
sqft_living15    0.113761
newReno          0.284297
newBuild        -0.183905
hasBsmnt         0.077628
view1            0.257575
view2            0.256800
view3            0.371209
view4            0.509494
cond3            0.063947
cond4            0.203606
cond5            0.374091
grade10          0.447424
grade11          0.496728
grade12          0.560250
grade13          0.000000
grade5          -0.820721
grade6          -0.579437
grade7          -0.276423
grade9           0.273238
zip1             1.197119
zip2             1.038305
zip3             0.622292
zip4             0.717655
intercept       -0.027132
dtype: float64
In [90]:
y_pred = results.predict(Xtest[bestXCols])
In [91]:
compareOLS = pd.DataFrame({'Actual': yTest, 'Pred': y_pred}) 
compareOLS['resid'] = compareOLS['Actual'] - compareOLS['Pred'] #Create resid
compareOLS['resid2'] = compareOLS['resid'] ** 2 #create resid squared
print('SSR: ', compareOLS.resid2.sum()) #sum of square resids

SSR = compareOLS.resid2.sum()
SST = np.sum(np.square(yTest - np.mean(yTest)))
R2 = 1.0 - (SSR / SST)
print('R-squared = {}'.format(R2))

compareOLS.head()
SSR:  918.763583147
R-squared = 0.7856962475958664
Out[91]:
Actual Pred resid resid2
6054 -2.144398 -1.158277 -0.986121 0.972435
2029 -0.676353 -1.192504 0.516151 0.266411
12300 0.888498 0.406382 0.482115 0.232435
8062 -1.598180 -1.623679 0.025499 0.000650
12346 -0.330182 -0.445950 0.115769 0.013402

Looking at the three models above, ridge, lasso, and elastic-net regression have not produced a significant improvement over our OLS model. Ridge regression gives the best of the three, with the lowest SSR and highest R-squared on the test predictions, but none of them beats our feature-selected OLS model, which had an SSR of 914 and an R-squared of .787.
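The fixed alpha of 0.000247875 above was not tuned. As an exploratory sketch (not part of the original analysis), one could sweep alpha and L1_wt and score each fit the same way, reusing the sm / np imports and the train / test splits defined above; in practice the scoring should use a validation split or cross-validation rather than the test set:

best = None
for alpha in [1e-4, 2.5e-4, 1e-3, 1e-2]:      # candidate penalty strengths (illustrative)
    for l1_wt in [0.0, 0.5, 1.0]:              # ridge, elastic-net, lasso
        fit = sm.OLS(yTrain, Xtrain[bestXCols]).fit_regularized(
            method='elastic_net', alpha=alpha, L1_wt=l1_wt)
        pred = fit.predict(Xtest[bestXCols])
        ssr = np.sum(np.square(yTest - pred))  # same SSR metric used above
        if best is None or ssr < best[0]:
            best = (ssr, alpha, l1_wt)
print('Best (SSR, alpha, L1_wt):', best)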

SVD Model

Let's go ahead and create an SVD model and see if it performs better on our predictions.

In [92]:
trainVars = train.loc[:,xCatsZipTop] 
M = trainVars.values 
logPS = train.loc[:,'logPS'] 

testVars = test.loc[:,xCatsZipTop] 
Mtest = testVars.values 
logPStest = test.loc[:,'logPS'] 

print('1st 5 rows of Data Frame: ')
print(trainVars.head())
print('\n1st 5 rows in matrix Form: ')
print(M[0:5, :])
1st 5 rows of Data Frame: 
       bedrooms  bathrooms  sqft_living  sqft_lot    floors  waterfront  \
12063  0.694446  -1.447460    -0.653199 -0.182214  0.010497           0   
10280  0.694446   0.824798    -0.424549 -0.217704  0.010497           0   
16752 -0.406924   0.500190    -0.424549 -0.239891  0.936460           0   
15565 -0.406924   0.500190     1.480865  2.138185  0.936460           0   
5415   1.795815   0.824798     1.241327 -0.221470  0.936460           0   

            lat      long  sqft_living15  sqft_lot15  renovated  newReno  \
12063  1.067064 -1.023312      -0.739131   -0.190771          0        0   
10280  0.875812 -0.391327      -0.243053   -0.244609          0        0   
16752 -2.036263 -0.945201       0.822055   -0.283614          0        0   
15565 -1.230841  1.618244       1.055503    3.664324          0        0   
5415   1.064899  0.801634       1.624533   -0.250322          0        0   

       newBuild  hasBsmnt  view1  view2  view3  view4  cond2  cond3  cond4  \
12063         0         0      0      0      0      0      0      1      0   
10280         0         1      0      0      0      0      0      1      0   
16752         1         0      0      0      0      0      0      1      0   
15565         1         0      0      0      0      0      0      1      0   
5415          1         0      0      0      0      0      0      1      0   

       cond5  grade10  grade11  grade12  grade13  grade3  grade4  grade5  \
12063      0        0        0        0        0       0       0       0   
10280      0        0        0        0        0       0       0       0   
16752      0        0        0        0        0       0       0       0   
15565      0        0        0        0        0       0       0       0   
5415       0        0        0        0        0       0       0       0   

       grade6  grade7  grade8  grade9  zip1  zip2  zip3  zip4  intercept  
12063       0       1       0       0     0     0     0     0          1  
10280       0       1       0       0     0     0     0     0          1  
16752       0       0       1       0     0     0     0     0          1  
15565       0       0       0       1     0     0     0     0          1  
5415        0       0       0       1     0     0     0     0          1  

1st 5 rows in matrix Form: 
[[ 0.69444556 -1.44745951 -0.6531988  -0.18221401  0.01049699  0.
   1.06706366 -1.0233119  -0.73913051 -0.19077077  0.          0.          0.
   0.          0.          0.          0.          0.          0.          1.
   0.          0.          0.          0.          0.          0.          0.
   0.          0.          0.          1.          0.          0.          0.
   0.          0.          0.          1.        ]
 [ 0.69444556  0.82479808 -0.42454912 -0.21770371  0.01049699  0.
   0.87581213 -0.39132673 -0.24305299 -0.24460879  0.          0.          0.
   1.          0.          0.          0.          0.          0.          1.
   0.          0.          0.          0.          0.          0.          0.
   0.          0.          0.          1.          0.          0.          0.
   0.          0.          0.          1.        ]
 [-0.40692359  0.50018986 -0.42454912 -0.23989081  0.93645991  0.
  -2.03626301 -0.94520138  0.82205462 -0.28361389  0.          0.          1.
   0.          0.          0.          0.          0.          0.          1.
   0.          0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          1.          0.          0.
   0.          0.          0.          1.        ]
 [-0.40692359  0.50018986  1.48086489  2.13818485  0.93645991  0.
  -1.23084148  1.61824409  1.05550287  3.66432446  0.          0.          1.
   0.          0.          0.          0.          0.          0.          1.
   0.          0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          1.          0.
   0.          0.          0.          1.        ]
 [ 1.79581472  0.82479808  1.24132712 -0.22146996  0.93645991  0.
   1.06489854  0.80163404  1.62453297 -0.25032222  0.          0.          1.
   0.          0.          0.          0.          0.          0.          1.
   0.          0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          1.          0.
   0.          0.          0.          1.        ]]
In [93]:
beta_coeffs, resids, rank, s = np.linalg.lstsq(M,logPS) 

print('beta coefficients: {}'.format(beta_coeffs))

print('\nsingular values: {}'.format(s))
beta coefficients: [-0.03150877  0.05107818  0.26959431  0.04747887  0.07822763  0.79064703
  0.36067408 -0.01428583  0.11166028 -0.01185408  0.04699324  0.24281385
 -0.18634131  0.08252309  0.2713139   0.25979988  0.37716561  0.49688733
  0.08908352  0.31430699  0.45397178  0.62719715  1.66463344  1.7365487
  1.85858182  1.91263893  1.29314406  0.19110228  0.33295394  0.5857501
  0.89774777  1.18345659  1.47504917  1.30029633  1.05612681  0.63839059
  0.73544178 -1.46143787]

singular values: [ 242.92972946  198.6507655   172.18666864  135.614026    130.24598275
  117.41042174   98.17319785   77.988911     73.10366372   70.76964913
   66.55162581   55.25528523   50.69382878   46.66128389   42.9904901
   40.626568     37.82278648   28.82665219   26.97284263   26.12956792
   19.25142138   17.65301899   17.16999832   15.82928567   15.59277364
   14.47534676   14.06420708   13.48279275   12.97486471   11.16820027
    8.27923073    7.94229809    5.91632966    4.89566306    3.31905759
    2.22609501    1.47018791    0.28270003]
In [94]:
house_predictions = np.dot(Mtest, beta_coeffs) + np.mean(logPStest) 
house_resids = house_predictions - logPStest 

# Plot the residuals vs the predicted score (logPS)
plt.subplot(1, 2, 1)
plt.plot(house_predictions, house_resids, '.')
plt.ylabel('residuals')
plt.xlabel('score(logPS)')
plt.title('Residuals vs Score')

# Histogram of residuals
plt.subplot(1, 2, 2)
plt.hist(house_resids)
plt.title('Histogram of Residuals')

SSR = np.sum(np.square(house_resids))
SST = np.sum(np.square(logPStest - np.mean(logPStest)))

print('SSR: {}'.format(SSR))
R2 = 1.0 - (SSR / SST)
print('R-squared = {}'.format(R2))
SSR: 909.1878640651952
R-squared = 0.7879298064447362

This model has provided very similar results to our best models: its SSR (909) and R-squared (.788) on the test set are essentially the same as the dummied-zipcode OLS model and marginally better than the feature-selected OLS, although it uses the larger feature set. So far our feature-selected OLS remains our preferred model for its parsimony, followed closely by this SVD model and then ridge regression.
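For reference, np.linalg.lstsq solves this least-squares problem via the singular value decomposition; the same coefficients can be recovered explicitly from the SVD with the pseudo-inverse (a sketch, reusing M, logPS, and beta_coeffs from above):

U, s_vals, Vt = np.linalg.svd(M, full_matrices=False)
beta_svd = Vt.T @ ((U.T @ logPS) / s_vals)   # V · diag(1/s) · U^T · y; all singular values here are non-zero
print(np.allclose(beta_svd, beta_coeffs))    # matches the lstsq solution

The singular values printed by lstsq above are the same s recovered by this decomposition.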

Now let's see if using Principal component regression will provide us with a better model.

Principal Component Regression

In [95]:
from sklearn.decomposition import PCA

#PCA
target_label = 'logPS'
pca = PCA() #set model
pca_result = pca.fit_transform(Xtrain) 

column_names = ['pc' + str(ix+1) for ix in range(Xtrain.shape[1])] 
pca_df = pd.DataFrame(data = pca_result, columns=column_names) 
pca_df[target_label] = yTrain 
In [96]:
plt.plot(pca.explained_variance_) 
plt.title('Explained variance by Principal Component Num')
plt.xlabel('Principal Component')
plt.ylabel('Explained Variance')
Out[96]:
Text(0,0.5,'Explained Variance')
In [97]:
#get percentage of explained variance for the first n principal components
total_var = pca.explained_variance_.sum()
for n in range(1, 18):
    label = "1 Component: " if n == 1 else "{} Components: ".format(n)
    print(label, pca.explained_variance_[0:n].sum() / total_var)
1 Component:  0.299598734925
2 Components:  0.454207545844
3 Components:  0.56567468678
4 Components:  0.650266078242
5 Components:  0.724357621979
6 Components:  0.771964371108
7 Components:  0.807988873463
8 Components:  0.838809933557
9 Components:  0.865143761461
10 Components:  0.88965486365
11 Components:  0.911348077917
12 Components:  0.926377453316
13 Components:  0.939939052748
14 Components:  0.951094024461
15 Components:  0.961285837804
16 Components:  0.969373754559
17 Components:  0.976367929049

We can see from above that 15 components is a reasonable choice: they explain 96.1% of the variance in the scaled features, and adding further components increases the explained variance only minimally.
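The same choice can be made programmatically from the cumulative explained-variance curve (a sketch; the 0.96 threshold is illustrative and reproduces the choice of 15 components here):

cum_var = np.cumsum(pca.explained_variance_) / pca.explained_variance_.sum()
n_components = int(np.searchsorted(cum_var, 0.96)) + 1   # first n reaching the threshold
print(n_components, cum_var[n_components - 1])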

In [98]:
# Perform linear regression with the first N columns
n = 15
formula_start = target_label + ' ~ ' #create formula start for OLS
formula_terms = ['pc' + str(x+1) for x in range(n)] #Gather pc's to add to formula
formula_end = ' + '.join(formula_terms) #prepare end of formula
formula_final = formula_start + formula_end #final formula to use in OLS

pcr_model = sm.ols(formula = formula_final, data=pca_df) 

results = pcr_model.fit()

# Get most of the linear regression statistics we are interested in:
print(results.summary())

# Plot a histogram of the residuals
sns.distplot(results.resid, hist=True)
plt.xlabel('Residual')
plt.ylabel('Frequency')
plt.title('Residual Histogram')
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  logPS   R-squared:                       0.001
Model:                            OLS   Adj. R-squared:                 -0.000
Method:                 Least Squares   F-statistic:                    0.9277
Date:                Sat, 23 Jun 2018   Prob (F-statistic):              0.532
Time:                        17:14:58   Log-Likelihood:                -19759.
No. Observations:               13865   AIC:                         3.955e+04
Df Residuals:                   13849   BIC:                         3.967e+04
Df Model:                          15                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -0.0285      0.009     -3.330      0.001      -0.045      -0.012
pc1           -0.0023      0.005     -0.513      0.608      -0.011       0.007
pc2           -0.0032      0.006     -0.520      0.603      -0.015       0.009
pc3            0.0101      0.007      1.366      0.172      -0.004       0.025
pc4            0.0024      0.009      0.284      0.776      -0.014       0.019
pc5           -0.0017      0.009     -0.182      0.855      -0.020       0.016
pc6           -0.0147      0.011     -1.293      0.196      -0.037       0.008
pc7            0.0140      0.013      1.067      0.286      -0.012       0.040
pc8           -0.0007      0.014     -0.048      0.962      -0.028       0.027
pc9            0.0216      0.015      1.400      0.161      -0.009       0.052
pc10          -0.0111      0.016     -0.697      0.486      -0.042       0.020
pc11           0.0109      0.016      0.678      0.498      -0.021       0.042
pc12           0.0082      0.020      0.406      0.684      -0.031       0.048
pc13          -0.0199      0.021     -0.933      0.351      -0.062       0.022
pc14           0.0355      0.023      1.512      0.131      -0.011       0.081
pc15          -0.0386      0.025     -1.562      0.118      -0.087       0.010
==============================================================================
Omnibus:                      541.810   Durbin-Watson:                   1.974
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              697.765
Skew:                           0.422   Prob(JB):                    3.04e-152
Kurtosis:                       3.703   Cond. No.                         5.47
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Out[98]:
Text(0.5,1,'Residual Histogram')

The PCR model has a much higher AIC value than our previous models and is probably not as good at predicting. Since it also cannot show us which individual feature has what effect on price, and it is our worst model so far, we will not use it for prediction.

Out of all of our models, the feature-selected OLS model with the high p-value features removed is our best model.

Our best model was created using the following features: [zip1, zip2, grade13, waterfront, zip4, grade12, zip3, grade11, cond5, view4, grade10, cond4, view3, lat, grade9, view1, newReno, cond3, sqft_living, view2, sqft_living15, hasBsmnt, floors, bathrooms, sqft_lot, long, bedrooms, newBuild, grade7, grade6, grade5, grade4]

Our best regression formula is: -0.226187 + 1.315281zip1 + 1.063486zip2 + 0.822946grade13 + 0.773717waterfront + 0.730778zip4 + 0.707111grade12 + 0.659465zip3 + 0.585843grade11 + 0.558247cond5 + 0.52146view4 + 0.485521grade10 + 0.392712cond4 + 0.377873view3 + 0.360219lat + 0.292872grade9 + 0.272636view1 + 0.266681newReno + 0.259037cond3 + 0.256832sqft_living + 0.241129view2 + 0.115955sqft_living15 + 0.088957hasBsmnt + 0.080396floors + 0.054347bathrooms + 0.039591sqft_lot - 0.015802long - 0.024311bedrooms - 0.185233newBuild - 0.280656grade7 - 0.588332grade6 - 0.845429grade5 - 0.981769grade4.

This formula provides the predicted scaled log price of a house given its features above. It appears that the most effective price predictors are whether the house is in a high-priced zipcode, whether it is waterfront, whether it has a high King County grade, its view rating, its latitude, whether it is a new renovation, and its square feet of living space.
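As an illustration of how the builders might use the selected model (a sketch, not part of the original analysis): refit the final OLS on the best feature list and score a hypothetical candidate house, with every input supplied on the same scaled / dummied basis used for training. The specific values below are placeholders.

final = sm.OLS(yTrain, Xtrain[bestXCols]).fit()              # the selected model
candidate = pd.DataFrame([dict.fromkeys(bestXCols, 0.0)])    # start with every feature at 0
# hypothetical scaled values / dummy indicators for the house being evaluated
candidate.loc[0, ['sqft_living', 'lat', 'grade7', 'cond3', 'intercept']] = [0.5, 0.36, 1.0, 1.0, 1.0]
print(final.predict(candidate[bestXCols]))                   # predicted scaled log price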