A King County construction company is looking to purchase properties to update and resell. They are looking for underpriced houses that they can renovate and sell for a quick profit. They want to know which property characteristics affect price the most, and they want a model that takes the characteristics deemed most important as input and outputs the expected price of a house. They will use this model to find underpriced houses to purchase, and then to determine which changes to a house will produce the best return on their flipping project.
We will be utilizing the Kaggle house sales prediction dataset which can be found at: https://www.kaggle.com/harlfoxem/housesalesprediction/data
We began with data preparation and exploration. In this step we examined the existing features and created some new ones that we thought would be useful, including whether a house is a recent renovation, a new build, whether it has been renovated at all, and whether it has a basement.

After the data was prepared and scaled and the features from our exploration were selected, we fit a regular OLS model to the training data. We then conducted backward stepwise regression to find a model with better features. After a few more alterations we found that a model using the following features gave us the best results: ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'waterfront', 'lat', 'long', 'sqft_living15', 'newReno', 'newBuild', 'hasBsmnt', 'view1', 'view2', 'view3', 'view4', 'cond3', 'cond4', 'cond5', 'grade10', 'grade11', 'grade12', 'grade13', 'grade4', 'grade5', 'grade6', 'grade7', 'grade9', 'zip1', 'zip2', 'zip3', 'zip4'].

Using this best-feature list, we tried ridge, lasso, and elastic net regression to see if we could improve on the OLS model. We then built an SVD model and a PCR model with the same feature list. We predicted with each of these models on the test set and compared them using the sum of squared residuals (SSR) on the test set, along with each model's R-squared or AIC. Our best model turned out to be the regular OLS model trained on the features listed above.

Our best regression formula is: -0.226187 + 1.315281zip1 + 1.063486zip2 + 0.822946grade13 + 0.773717waterfront + 0.730778zip4 + 0.707111grade12 + 0.659465zip3 + 0.585843grade11 + 0.558247cond5 + 0.52146view4 + 0.485521grade10 + 0.392712cond4 + 0.377873view3 + 0.360219lat + 0.292872grade9 + 0.272636view1 + 0.266681newReno + 0.259037cond3 + 0.256832sqft_living + 0.241129view2 + 0.115955sqft_living15 + 0.088957hasBsmnt + 0.080396floors + 0.054347bathrooms + 0.039591sqft_lot - 0.015802long - 0.024311bedrooms - 0.185233newBuild - 0.280656grade7 - 0.588332grade6 - 0.845429grade5 - 0.981769grade4.

This formula gives the predicted scaled log price of a house from its features, and the coefficients show the effect of each feature on the scaled log price. For instance, the feature with the largest positive effect on price is being in zipcode 1, the Medina area.

For our builders, the most important insight is that the most effective price predictors are whether the house is in a high-priced zipcode, its King County grade, whether it is on the waterfront, its condition, its view rating, its latitude, whether it is a recent renovation, and its square feet of living space. This confirms the age-old adage: location, location, location. Our builders will therefore want to find houses with good intangibles, namely zipcode, waterfront status, and latitude, that are not in ideal condition and that have a low King County grade. They can then purchase and renovate the property, and predict their effective return after renovation by plugging the house's new King County grade, condition, view rating, and other updated features into the regression formula above.
# Load necessary libraries
import pandas as pd
import numpy as np
import os
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import scale
import math
from scipy.stats import kstest
from scipy.stats import zscore
import statsmodels.api as sm
import statsmodels.formula.api as smf
Now we will create some auxiliary functions that will help us along the way with our data exploration.
%matplotlib inline
#plot distributions
def plot_cums(dist1, dist2):
    ## sort the first data sample:
    data_sorted = np.sort(dist1)
    # calculate the proportional values of samples
    p = 1. * np.arange(len(dist1)) / (len(dist1) - 1)
    # Now plot as line and scatter plot.
    plt.plot(data_sorted, p)
    plt.scatter(data_sorted, p, color = 'red')
    # sort the second data sample:
    data_sorted = np.sort(dist2)
    # And the second plot
    plt.plot(data_sorted, p, color = 'yellow')
    plt.scatter(data_sorted, p, color = 'green')
#conduct KS test
def ks_test(data, dist = 'norm'):
    ks_stat, pvalue = kstest(data, dist)
    print('KS-statistic = ' + str(ks_stat))
    print('P-value = ' + str(pvalue))
#Create Box Plot
def plotBox(column, depVar, df, x=9, rotation = 0):
    fig = plt.figure(figsize=(x, 9))
    ax = fig.gca()
    df.loc[:,[depVar, column]].boxplot(by = column, ax = ax)
    ax.set_title('Box plot of ' + depVar + ' vs. ' + column)
    ax.set_ylabel(depVar)
    ax.tick_params(axis = 'x', labelrotation = rotation)
#Function used by Bootstrap function
#plots Distributions with 95% CI.
def plot_dists(a, b, nbins, a_label='pop_A', b_label='pop_B', depVar = 'variable', p=5):
    # Create a single sequence of bins to be shared across both
    # distribution plots for visualization consistency.
    combined = pd.concat([a, b])
    breaks = np.linspace(combined.min(), combined.max(), num=nbins+1)
    plt.subplot(2, 1, 1)
    plt.hist(a, bins=breaks, alpha = .5)
    plt.axvline(a.mean(), linewidth = 4)
    plt.axvline(np.percentile(a, p/2.), color='red', linewidth=3)
    plt.axvline(np.percentile(a, 100-p/2.), color='red', linewidth=3)
    plt.title(a_label)
    plt.xlabel(depVar)
    plt.subplot(2, 1, 2)
    plt.hist(b, bins=breaks, alpha = .5)
    plt.axvline(b.mean(), linewidth = 4)
    plt.axvline(np.percentile(b, p/2.), color='red', linewidth=3)
    plt.axvline(np.percentile(b, 100-p/2.), color='red', linewidth=3)
    plt.title(b_label)
    plt.xlabel(depVar)
    plt.tight_layout()
#Function used by Bootstrap diff function
#plots histogram with 95% CI.
def plot_hist(x, a, b, depVar = 'variable', p=5):
    plt.hist(x, alpha = .5)
    plt.axvline(x.mean(), linewidth = 4)
    plt.axvline(np.percentile(x, p/2.), color='red', linewidth=3)
    plt.axvline(np.percentile(x, 100-p/2.), color='red', linewidth=3)
    plt.title(str(a) + " - " + str(b))
    plt.xlabel(depVar + " Mean Differences")
#Bootstrap
def bootStrap(df, column, choice1, choice2, depVar, n_replicas = 1000):
    a = df[df[column] == choice1]
    b = df[df[column] == choice2]
    a_bootstrap_means = pd.Series([
        a.sample(frac=1, replace=True)[depVar].mean()
        for i in range(n_replicas)])
    b_bootstrap_means = pd.Series([
        b.sample(frac=1, replace=True)[depVar].mean()
        for i in range(n_replicas)])
    plot_dists(a = a_bootstrap_means, b = b_bootstrap_means,
               nbins=80, a_label= choice1, b_label=choice2, depVar = depVar)
#Bootstrap Mean Diffs
def bootStrapDiffs(df, column, choice1, choice2, depVar, n_replicas = 1000):
    diffs = []
    for i in range(n_replicas):
        sample = df.sample(frac=1.0, replace=True)
        a_sample_mean = sample[sample[column] == choice1][depVar].mean()
        b_sample_mean = sample[sample[column] == choice2][depVar].mean()
        diffs.append(a_sample_mean - b_sample_mean)
    diffs = pd.Series(diffs)
    plot_hist(x = diffs, a = choice1, b = choice2, depVar = depVar)
Now let's load our data and take a look at it.
os.chdir(path) # set path to the directory containing kc_house_data.csv before running
df = pd.read_csv('kc_house_data.csv', header = 'infer')
pd.set_option('display.max_columns', None)#display all columns
print(df.head())#print top 5 rows
print(df.tail())#print bottom 5 rows
print(df.shape)
print(df.dtypes)
Let's convert the date column to a datetime type and create month and year columns. This will give us a better idea of the time span our data covers.
df['date'] = pd.to_datetime(df['date']) #turn date column to date type
df['month'] = pd.DatetimeIndex(df.loc[:,'date']).month #Create month column
df['year'] = pd.DatetimeIndex(df.loc[:,'date']).year #create year column
#print unique values, or counts of unique values, for our features
print('Dates \n', df['date'].nunique())
print('Years \n', df['year'].unique())
print('Months \n', df['month'].unique())
print('Bedrooms \n', df['bedrooms'].unique())
print('Bathrooms \n', df['bathrooms'].unique())
print('Floors \n', df['floors'].unique())
print('Waterfront \n', df['waterfront'].unique())
print('Views \n', df['view'].unique())
print('Condition \n', df['condition'].unique())
print('Grades \n', df['grade'].unique())
print('Years Built \n', df['yr_built'].unique())
print('Years Renovated \n', df['yr_renovated'].unique())
print('Zip Codes \n', df['zipcode'].unique())
print('Latitude \n', df['lat'].nunique())
print('Longitude \n', df['long'].nunique())
print('Square Foot Living \n', df['sqft_living'].nunique())
print('Square Foot Lot \n', df['sqft_lot'].nunique())
print('Square Foot Living 15\n', df['sqft_living15'].nunique())
print('Square Foot Lot 15\n', df['sqft_lot15'].nunique())
print('Square Foot Above \n', df['sqft_above'].nunique())
print('Square Foot Basement \n', df['sqft_basement'].nunique())
We can see that we have about one year's worth of data (372 unique sale dates) spanning 2014-2015, with entries for every month. The sample does not span a long enough period for useful time series analysis. We have houses built from 1900 to 2015, and renovation years ranging from 0 (never renovated) up to renovations between 1934 and 2015. Our waterfront variable is binary, so we should decode it for our data exploration, and our view, condition, and grade features look like they should be decoded as well. It also seems there are some houses with more than 10 bedrooms. Let's check whether these are typos, especially the 33-bedroom house.
print(df.loc[df['bedrooms'] >10,:] )
Most of our data looks clean and makes sense, except for the entry with 33 bedrooms, 1.75 bathrooms, and 1,620 square feet of living space. This is almost certainly a typo, most likely meant to be 3 bedrooms; however, since we cannot be sure what the correct number should be, let's drop this row.
#drop 33 bedroom row
toDrop = df.loc[df['bedrooms'] == 33,'bedrooms'].index.tolist()
df.drop(toDrop, axis = 0, inplace = True)
df.drop('id', axis = 1, inplace = True)
Now let's create some new binary features that may be useful to us: one for whether a house has been renovated at all; one for whether it is a new renovation, defined as any renovation year later than the mean renovation year; one for whether it is a new build, defined as any house built after the mean build year; and one for whether the house has a basement. We will also create a column that decodes our waterfront feature.
Let's start by finding our average year built and year renovated.
print('Year Built Min: ', df['yr_built'].min()) #print min year built
print('Year Built Max: ', df['yr_built'].max()) #print max year built
print('Year Built Mean: ', np.mean(df['yr_built'])) #print mean year built
print('Year Renovated Min: ', df.loc[df['yr_renovated'] > 0,'yr_renovated'].min()) #min year renovated excluding not reno
print('Year Renovated Max: ', df['yr_renovated'].max()) #max year renovated excluding not reno
print('Year Renovated Mean: ', (np.mean(df.loc[df['yr_renovated'] > 0,'yr_renovated']))) #mean year renovated excluding not reno
#create new features
df.loc[df.loc[:,'yr_renovated'] > 0 , "renovated"] = 1
df.loc[df.loc[:,'yr_renovated'] == 0, "renovated"] = 0
df.loc[df.loc[:,'yr_renovated'] > 1995 , "newReno"] = 1
df.loc[df.loc[:,'yr_renovated'] < 1996, "newReno"] = 0
df.loc[df.loc[:,'yr_built'] > 1970 , "newBuild"] = 1
df.loc[df.loc[:,'yr_built'] < 1971, "newBuild"] = 0
df.loc[df.loc[:,'sqft_basement'] > 0 , "hasBsmnt"] = 1
df.loc[df.loc[:,'sqft_basement'] == 0, "hasBsmnt"] = 0
df.loc[df.loc[:,'waterfront'] > 0 , "wfName"] = 'Wtrfrnt'
df.loc[df.loc[:,'waterfront'] == 0, "wfName"] = 'noWtrFrnt'
#Set our new features as ints
def setTypes(dataF, i, varType):
    dataF.loc[:,i] = dataF.loc[:,i].astype(varType)
setTypes(df,'renovated', 'int64')
setTypes(df,'newReno', 'int64')
setTypes(df,'newBuild', 'int64')
setTypes(df,'hasBsmnt', 'int64')
Now let's create a log price column, and let's create a z-scaled column of our price and log price columns.
def getLog(price):
    return math.log(price)
df['log_price'] = df['price'].apply(getLog)
#Standardize our price and log price and save to dfs
df['priceS'] = scale(df.loc[:,'price'])
df['logPS']= scale(df.loc[:,'log_price'])
Now let's decode our view, condition, and grade columns, then let's one hot encode them.
#Decode columns
df.loc[df['view'] == 0, 'viewName'] = 'view0'
df.loc[df['view'] == 1, 'viewName'] = 'view1'
df.loc[df['view'] == 2, 'viewName'] = 'view2'
df.loc[df['view'] == 3, 'viewName'] = 'view3'
df.loc[df['view'] == 4, 'viewName'] = 'view4'
df.loc[df['condition'] == 1, 'condName'] = 'cond1'
df.loc[df['condition'] == 2, 'condName'] = 'cond2'
df.loc[df['condition'] == 3, 'condName'] = 'cond3'
df.loc[df['condition'] == 4, 'condName'] = 'cond4'
df.loc[df['condition'] == 5, 'condName'] = 'cond5'
df.loc[df['grade'] == 1, 'gradeName'] = 'grade1'
df.loc[df['grade'] == 3, 'gradeName'] = 'grade3'
df.loc[df['grade'] == 4, 'gradeName'] = 'grade4'
df.loc[df['grade'] == 5, 'gradeName'] = 'grade5'
df.loc[df['grade'] == 6, 'gradeName'] = 'grade6'
df.loc[df['grade'] == 7, 'gradeName'] = 'grade7'
df.loc[df['grade'] == 8, 'gradeName'] = 'grade8'
df.loc[df['grade'] == 9, 'gradeName'] = 'grade9'
df.loc[df['grade'] == 10, 'gradeName'] = 'grade10'
df.loc[df['grade'] == 11, 'gradeName'] = 'grade11'
df.loc[df['grade'] == 12, 'gradeName'] = 'grade12'
df.loc[df['grade'] == 13, 'gradeName'] = 'grade13'
#Get Dummies
dummies = pd.get_dummies(df.loc[:, 'viewName'], drop_first = True)
df[list(dummies.columns)] = dummies
dummies = pd.get_dummies(df.loc[:, 'condName'], drop_first = True)
df[list(dummies.columns)] = dummies
dummies = pd.get_dummies(df.loc[:, 'gradeName'], drop_first = True)
df[list(dummies.columns)] = dummies
Let's reset our data frame's index and take a look at our data frame now.
df.reset_index(drop = True, inplace = True)
print(df.head())
print(df.shape)
print(df.dtypes)
Our data seems to be prepared. Let's take a look and decide if we will be using our price or log price as our dependent variable.
Let's begin by plotting our scaled price and log price distribution and then let's KS test them versus a normal distribution.
#Plot our standardize price columns
plot_cums(df['priceS'], df['logPS'])
ks_test(df['priceS'])
ks_test(df['logPS'])
Neither distribution is normal; however, the scaled log price has a smaller KS statistic and a larger p-value, meaning it is closer to normal.
Let's look at the histogram of our log price and price.
df['log_price'].plot.hist(bins = 40)
plt.xlabel('log price')
plt.title('Log Price Histogram')
plt.show()
df['price'].plot.hist(bins = 100)
plt.xlabel('price')
plt.title('Price Histogram')
plt.show()
We can see that house prices are fairly normally distributed in the log price histogram. The price histogram shows that most houses cluster around \$500,000, with the majority falling between \$0 and \$1,000,000 and a few going out to the \$8 million range. We will use the scaled log of price since it is closer to normal and will likely produce better results.
Our scaled log of price ranges from -3.46 to 5.33. This translates to a price range of \$75,000 to \$7.7 million. A scaled log price of 0 is equal to \$445,000.
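As a quick sanity check, we can convert between scaled log price and dollars by inverting the z-scaling and then the log. The sketch below assumes the log_price and logPS columns created above; note that sklearn's scale() uses the population standard deviation, so pandas' default std() is adjusted with ddof = 0 to match.
#Sketch: convert a scaled log price back to dollars (assumes df['log_price'] and df['logPS'] from above)
mu = df['log_price'].mean()
sigma = df['log_price'].std(ddof = 0) #population std, to match sklearn's scale()
def logPS_to_price(logps):
    return math.exp(logps * sigma + mu)
print(logPS_to_price(0))     #roughly $445,000
print(logPS_to_price(-3.46)) #roughly the $75,000 minimum
print(logPS_to_price(5.33))  #roughly the $7.7 million maximum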
Let's pick some columns that we think may have an effect on price and let's create a pair plot.
#Pairplot of given columns
num_cols = ["logPS", "bedrooms", "bathrooms", "sqft_living", "sqft_lot", "grade", "zipcode", "yr_renovated", "yr_built"]
sns.pairplot(df.loc[:, num_cols],
             diag_kind="kde",
             size=2)
Now let's create some box plots and scatter plots of the features that seem to affect the price the most based on our pairplot above.
Let's start with a boxplot of the number of bedrooms vs log price scaled.
plotBox('bedrooms', 'logPS', df)
It seems that the number of bedrooms does have an effect on price: price gradually increases with bedroom count, although the increase tapers off once the count rises past 5.
Let's look at the boxplot of the number of bathrooms in relation to house prices next.
plotBox('bathrooms', 'logPS', df, rotation = 90)
This boxplot shows a clear trend between bathroom count and price with an almost linear increase in price with the increase in bathroom count.
Now let's look at the scatter plot of Square foot living space and price.
ax = plt.figure(figsize=(9, 6)).gca()
df.plot.scatter(x = 'sqft_living', y = 'logPS', ax = ax)
ax.set_title('Sqft Living vs Log Price Scaled')
ax.set_ylabel('Log Price Scaled')
ax.set_xlabel('Square Foot Living')
As we can see, square feet of living space clearly has an impact on price: there is a well-defined pattern of larger houses being more expensive.
Let's add a waterfront hue on the scatter plot. This will show us if a waterfront house has an impact on price.
sns.lmplot(x = 'sqft_living', y = 'logPS',
           data = df,
           hue = "wfName",
           palette = "seismic",
           scatter_kws = {'alpha': 0.3},
           fit_reg = False)
plt.xlabel('Square Foot Living')
plt.ylabel('House Log Price Scaled')
plt.title('SQFT Living vs House Price with Hue by Waterfront\n')
It does seem like being a waterfront house makes a difference as most of the expensive houses by square foot living are colored red, which indicates a waterfront house.
Now let's look at the King county house grade vs price.
plotBox('grade', 'logPS', df)
This looks to be a very good indicator of price. House prices steadily increase as the grade increases, with a nearly linear increase from grade 6 through 13.
Let's see if the year the house was renovated plays a factor on price.
plotBox('yr_renovated', 'logPS', df, x = 12, rotation = 90)
It seems that houses renovated after 1978 do see a slight increase in price over non-renovated houses.
Let's look at the year built vs price next.
plotBox('yr_built', 'logPS', df, x = 18, rotation = 90)
The year built surprisingly does not seem to affect the price.
Let's take a look at how zipcode affects the price.
plotBox('zipcode', 'logPS', df, x = 18, rotation = 90)
The prices are fairly spread out when it comes to zipcode, but a few zipcodes are clearly more expensive than others. Zipcode 98039 is the most expensive by far, with the lower quartile of its scaled log price above 2. The second most expensive zipcode is 98004, with a median near 2 and an upper quartile above 2. The third most expensive is 98040, whose upper quartile just crosses a scaled log price of 2, and 98112 is the fourth most expensive, with a median and upper quartile very similar to 98040. The rest of the zipcodes are significantly lower.
The most expensive zipcode, 98039, covers the Medina area. 98004 covers the area right next to Medina, including Clyde Hill and Bellevue. 98040 covers Mercer Island, and 98112 covers the Montlake area. Interestingly, the top two most expensive areas are on the Eastside, while Mercer Island and Montlake, which are in very close proximity to each other, are the most expensive on the Seattle side. Unsurprisingly, Medina is the most expensive zipcode, as some very successful CEOs are known to live there.
Let's create dummy variables for the top 4 zipcodes; later we will see if this provides better results than simply using the zipcode feature in our regression models.
#Decode zipcode
df['zipTop'] = 'otherZip'
df.loc[df['zipcode'] == 98039, 'zipTop'] = 'zip1'
df.loc[df['zipcode'] == 98004, 'zipTop'] = 'zip2'
df.loc[df['zipcode'] == 98040, 'zipTop'] = 'zip3'
df.loc[df['zipcode'] == 98112, 'zipTop'] = 'zip4'
#encode zipTop
dummies = pd.get_dummies(df.loc[:, 'zipTop'], drop_first = True)
df[list(dummies.columns)] = dummies
Now let's keep going with our data exploration.
Let's see if being a waterfront property seems to have an effect on price.
plotBox('waterfront', 'logPS', df)
Although there is some overlap, it seems that waterfront and non-waterfront house prices are drawn from different distributions.
Let's bootstrap this as well to see if this provides a more clear result.
bootStrap(df, 'waterfront', 0, 1, 'logPS')
Looking at the above graph, our 95% confidence intervals do not overlap at all; in fact there is a huge gap between the two distributions. It is pretty clear that being a waterfront house does affect price.
Now let's bootstrap the mean difference in price between waterfront houses and not.
bootStrapDiffs(df, 'waterfront', 0, 1, 'logPS')
bootStrapDiffs(df, 'waterfront', 0, 1, 'price')
Above we can see that the difference in mean scaled log price between non-waterfront and waterfront houses is about -2, meaning non-waterfront houses had a mean scaled log price roughly 2 lower than waterfront houses. In raw dollars this is a difference in mean price of just over \$1.1 million, a significant difference.
Now let's take a look at a plot for newly renovated houses versus not renovated or older renovated houses.
plotBox('newReno', 'logPS', df)
These distributions overlap quite a bit, but whether the house was newly renovated does seem to have an effect on price: the newly renovated houses' lower quartile is about the same as the mean of houses that were not newly renovated.
Let's bootstrap this as well to see if this provides a more clear result.
bootStrap(df, 'newReno', 0, 1, 'logPS')
Looking at the above graph, our 95% confidence intervals do not overlap at all; in fact there is a huge gap between the two distributions. It does seem that being newly renovated affects price.
Let's bootstrap the mean difference in log price of newly renovated houses and not.
bootStrapDiffs(df, 'newReno', 0, 1, 'logPS')
bootStrapDiffs(df, 'newReno', 0, 1, 'price')
Above we can see that the difference in mean scaled log price between houses that were not newly renovated and those that were is about -0.73, meaning houses without a new renovation had a mean scaled log price roughly 0.73 lower. In raw dollars this is a mean price difference of around \$280,000.
Now let's take a look at a plot for newly built houses versus older houses.
plotBox('newBuild', 'logPS', df)
These distributions seem to be very similar. Whether the house was newly built or not does not appear to have much of an effect on price.
Let's bootstrap this as well to see if this provides a more clear result.
bootStrap(df, 'newBuild', 0, 1, 'logPS')
Looking at the above graph, our 95% confidence intervals do not overlap at all; in fact there is a huge gap between the two distributions. It actually does seem that being newly built affects price.
Let's also bootstrap their mean differences.
bootStrapDiffs(df, 'newBuild', 0, 1, 'logPS')
bootStrapDiffs(df, 'newBuild', 0, 1, 'price')
Above we can see that the difference in mean scaled log price between older and newly built houses is about -0.26, meaning older houses had a mean scaled log price roughly 0.26 lower than newer houses. In raw dollars this is a price difference of about \$65,000.
Now let's take a look at a plot for houses with basements vs houses without.
plotBox('hasBsmnt', 'logPS', df)
These distributions seem to be very similar. Whether the house has a basement or not does not appear to have much of an effect on price.
Let's bootstrap this as well to see if this provides a more clear result.
bootStrap(df, 'hasBsmnt', 0, 1, 'logPS')
Looking at the above graph, our 95% confidence intervals do not overlap at all; in fact there is a huge gap between the two distributions. It actually does seem that having a basement affects price.
Let's also bootstrap their mean differences.
bootStrapDiffs(df, 'hasBsmnt', 0, 1, 'logPS')
bootStrapDiffs(df, 'hasBsmnt', 0, 1, 'price')
Above we can see that the difference in mean scaled log price between houses without basements and houses with basements is about -0.43, meaning houses without basements had a mean scaled log price roughly 0.43 lower. In raw dollars this is a price difference of about \$135,000.
We've gained some valuable insight from these features. So far, the features that affect price the most appear to be waterfront status, King County grade, bathroom count, zipcode, and whether the house was newly renovated.
Let's look at the correlation of our features with price.
house_corr = df[["logPS", "bedrooms", "bathrooms", 'condition', 'floors','lat','long','sqft_above', 'sqft_basement', "sqft_living", "sqft_living15", "sqft_lot", "sqft_lot15", 'view', "waterfront", "grade", "zipcode", "yr_renovated", "yr_built", "renovated", 'newReno', 'newBuild', 'hasBsmnt']].corr()
fig = plt.figure(figsize=(16,12)) # define plot area
ax = fig.gca()
ax.set_title('Correlation matrix for numeric features')
sns.heatmap(round(house_corr,2), ax = ax, linewidths=0.25,vmax=1.0, square=True, cmap="RdYlGn", linecolor='k', annot=True)
As we can see, the features most correlated with price are square feet of living space, grade, sqft_living15, sqft_above, and bathrooms, followed at a lower correlation by bedrooms, latitude, view, square feet of basement, floors, hasBsmnt, and waterfront.
Let's plot a pairplot of just these features, with a hue for whether the house is waterfront or not.
num_cols = ['logPS', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_living15', 'sqft_above', 'grade', 'lat', 'sqft_basement', 'view', 'floors', 'hasBsmnt', 'wfName']
sns.pairplot(df.loc[:, num_cols], hue='wfName',
             palette="seismic", diag_kind="kde",
             size=2)
This is a very illuminating plot; we seem to have found the features that affect price the most. The top row shows how price relates to each feature as that feature increases, and we can clearly see that waterfront houses command a higher price for a given feature value.
Let's take a look at a facet grid plot. Our rows will be the King County grade and our columns will be the view rating. We will plot scatter plots of square feet of living space against price, with a hue for each house's waterfront status.
g = sns.FacetGrid(df,
                  col='view',
                  row='grade',
                  hue='wfName',
                  palette='Set1')
g = (g.map(plt.scatter, 'sqft_living', 'logPS').add_legend())
Looking at the above facet grid, we can see that most houses fall between grades 5 and 11. We can also see that waterfront houses generally have much higher view ratings, with most rated 3 or 4. Prices clearly increase as the grade increases, and houses with higher view ratings sell at the upper end of a given grade's price range. We can also see that as grade increases, the average living space increases as well.
Now that we seem to have found some interesting relationships let's create some linear regression models of our data and see if our model is any good at predicting.
Let's take a quick look at our final original data frame first.
df.head()
df.columns
Let's create a data frame to use for training and testing. We will drop our excess price columns, and also drop sqft_above and sqft_basement since they are collinear with sqft_living. We will drop month, date, and year, since there is not enough data for these to be useful factors, and we will drop view, condition, grade, and their respective name columns since we have those dummied. We will also remove yr_built and yr_renovated, since our renovated, newReno, and newBuild features provide the same information in dummied form. We will then z-normalize all of our numerical variables, add an intercept column, and finally take a look at our prepared data frame.
#Create new df
dfReady = df.copy(deep = True)
#drop columns
dfReady.drop(['date','price', 'priceS', 'log_price','sqft_above', 'sqft_basement', 'month', 'year', 'view', 'condition', 'grade', 'viewName', 'condName', 'gradeName', 'yr_built', 'yr_renovated', 'wfName', 'zipTop'],axis=1,inplace = True)
#z normalize variables
dfReady.loc[:,['bedrooms','bathrooms','sqft_living','sqft_lot','floors','zipcode','lat','long','sqft_living15','sqft_lot15']] = df.loc[:,['bedrooms','bathrooms','sqft_living','sqft_lot','floors','zipcode','lat','long','sqft_living15','sqft_lot15']].apply(zscore)
#add intercept column
dfReady['intercept'] = 1
print(dfReady.shape)
dfReady.head()
dfReady.dtypes
Now let's split our ready data frame into a training and test set, with an 80/20 split.
from sklearn.model_selection import train_test_split
#train/test split
train, test = train_test_split(dfReady, test_size = .2)
print(train.shape)
print(test.shape)
Now let's create our Xtrain, yTrain, Xtest and yTest data frames.
Xtest = test.loc[:, test.drop(['logPS'], axis = 1).columns]
yTest = test.loc[:, 'logPS']
Xtrain = train.loc[:, train.drop(['logPS'], axis = 1).columns]
yTrain = train.loc[:, 'logPS']
Earlier we created dummy columns for our top 4 zipcodes; however, we also kept the zipcode column because we were not sure which would prove more useful for our model. Now we will create one list of X columns that keeps zipcode but drops the dummy columns, and another that keeps the dummy columns but drops zipcode. We do this so we do not pass collinear features to our model.
xCatsZipOnly = list(Xtrain.drop(['zip1','zip2','zip3','zip4'], axis = 1).columns) #Drops dummy column keeps zipcode
xCatsZipTop = list(Xtrain.drop(['zipcode'], axis = 1).columns) #drops zipcode keeps dummy columns
Now let's fit our first model to all features and the zipcode.
ols_model = sm.OLS(yTrain, Xtrain[xCatsZipOnly])
results = ols_model.fit()
n_points = Xtrain.shape[0]
y_output = yTrain
print('RMSE: {}'.format(np.sqrt(results.mse_model)))
# Get most of the linear regression statistics we are interested in:
print(results.summary())
# Plot a histogram of the residuals
sns.distplot(results.resid, hist=True)
plt.xlabel('Residual')
plt.ylabel('Frequency')
plt.title('Residual Histogram')
Now let's predict on our test set and calculate the SSR of our model.
y_pred = results.predict(Xtest[xCatsZipOnly])
compareOLS = pd.DataFrame({'Actual': yTest, 'Pred': y_pred})
compareOLS['resid'] = compareOLS['Actual'] - compareOLS['Pred']
compareOLS['resid2'] = compareOLS['resid'] ** 2
print('SSR: ', compareOLS.resid2.sum())
SSR = compareOLS.resid2.sum()
SST = np.sum(np.square(yTest - np.mean(yTest)))
R2 = 1.0 - (SSR / SST)
print('R-squared = {}'.format(R2))
compareOLS.head()
This is not a bad model already, with an R-squared of .76 and an SSR of 990.
Now let's see if our model would be better using the dummied zipcode features instead of the actual zipcodes.
ols_model = sm.OLS(yTrain, Xtrain[xCatsZipTop])
results = ols_model.fit()
n_points = Xtrain.shape[0]
y_output = yTrain
print('RMSE: {}'.format(np.sqrt(results.mse_model)))
# Get most of the linear regression statistics we are interested in:
print(results.summary())
# Plot a histogram of the residuals
sns.distplot(results.resid, hist=True)
plt.xlabel('Residual')
plt.ylabel('Frequency')
plt.title('Residual Histogram')
y_pred = results.predict(Xtest[xCatsZipTop])
compareOLS = pd.DataFrame({'Actual': yTest, 'Pred': y_pred})
compareOLS['resid'] = compareOLS['Actual'] - compareOLS['Pred']
compareOLS['resid2'] = compareOLS['resid'] ** 2
print('SSR: ', compareOLS.resid2.sum())
SSR = compareOLS.resid2.sum()
SST = np.sum(np.square(yTest - np.mean(yTest)))
R2 = 1.0 - (SSR / SST)
print('R-squared = {}'.format(R2))
compareOLS.head()
This model is clearly better than our original: our R-squared has increased to .791, our SSR has dropped to 886, the AIC has dropped, and our RMSE has dropped from 20 to 19.
Going forward we will use our dummied zipcode columns, not the actual column.
Now let's conduct a backward stepwise regression to see if we can find a better model.
def backward_selected(X, Y):
    # Start with all factors and intercept
    possible_factors = list(X.columns)
    best_aic = sm.OLS(Y, X).fit().aic
    current_aic = best_aic
    nochange = False
    #loop through all columns
    for i in range(len(X.columns)):
        #make sure a change was made on the last pass; if no change, the model is done
        if nochange == False:
            nochange = True
            to_try_remove = possible_factors
            aic_candidates = []
            for candidate in to_try_remove:
                columns = possible_factors.copy()
                columns.remove(candidate)
                aic = sm.OLS(Y, X[columns]).fit().aic
                # Append tuple of the form (aic, response)
                aic_candidates.append((aic, candidate))
            # Sort all the pairs by the first entry of tuple (default of sort() method)
            aic_candidates.sort()
            #get best new AIC and the feature whose removal produced it
            best_new_aic, best_candidate = aic_candidates.pop(0)
            # Now check if we have something better:
            if best_new_aic < current_aic:
                # Remove the best candidate's name from possible_factors,
                # set the current AIC to the best AIC,
                # and reset the no-change flag
                possible_factors.remove(best_candidate)
                current_aic = best_new_aic
                nochange = False
            # Now we repeat the process with all the remaining candidate columns
    # Here is the final formula!
    formula = "{} ~ {} + 1".format('logPS', ' + '.join(possible_factors))
    print(formula)
    # Get the model object
    model = sm.OLS(Y, X[possible_factors]).fit()
    return possible_factors, model #return best features list and model
#apply backwards stepwise regression
bestXCols, backwards_model = backward_selected(Xtrain[xCatsZipTop], yTrain)
print('\nAdjusted R-Squared: {}'.format(backwards_model.rsquared_adj))
print('AIC: {}'.format(backwards_model.aic))
Let's run this model with the best feature list to get a better idea of our result and to use it to predict on our test house prices.
ols_model = sm.OLS(yTrain,Xtrain[bestXCols])
results = ols_model.fit()
print('RMSE: {}'.format(np.sqrt(results.mse_model)))
# Get most of the linear regression statistics we are interested in:
print(results.summary())
# Plot a histogram of the residuals
sns.distplot(results.resid, hist=True)
plt.xlabel('Residual')
plt.ylabel('Frequency')
plt.title('Residual Histogram')
y_pred = results.predict(Xtest[bestXCols])
compareOLS = pd.DataFrame({'Actual': yTest, 'Pred': y_pred})
compareOLS['resid'] = compareOLS['Actual'] - compareOLS['Pred'] #Find resid
compareOLS['resid2'] = compareOLS['resid'] ** 2 #Resid squared
print('SSR: ', compareOLS.resid2.sum()) #Sum SSR
SSR = compareOLS.resid2.sum()
SST = np.sum(np.square(yTest - np.mean(yTest)))
R2 = 1.0 - (SSR / SST)
print('R-squared = {}'.format(R2))
compareOLS.head()
Our backwards-selected model has removed 3 features. This improves on our model: the AIC, SSR, RMSE, and R-squared are essentially the same, but we are using fewer features, which helps prevent overfitting. However, we can see above that renovated, grade3, grade8, and sqft_lot15 still have p-values over .05.
Let's remove the features with p-values over .05 from our model and see if this helps our model.
#Checks if feature in list and removes
if 'grade3' in bestXCols:
    bestXCols.remove('grade3')
if 'renovated' in bestXCols:
    bestXCols.remove('renovated')
if 'sqft_lot15' in bestXCols:
    bestXCols.remove('sqft_lot15')
if 'grade8' in bestXCols:
    bestXCols.remove('grade8')
ols_model = sm.OLS(yTrain,Xtrain[bestXCols])
results = ols_model.fit()
print('RMSE: {}'.format(np.sqrt(results.mse_model)))
# Get most of the linear regression statistics we are interested in:
print(results.summary())
# Plot a histogram of the residuals
sns.distplot(results.resid, hist=True)
plt.xlabel('Residual')
plt.ylabel('Frequency')
plt.title('Residual Histogram')
y_pred = results.predict(Xtest[bestXCols])
compareOLS = pd.DataFrame({'Actual': yTest, 'Pred': y_pred})
compareOLS['resid'] = compareOLS['Actual'] - compareOLS['Pred'] #Create resid
compareOLS['resid2'] = compareOLS['resid'] ** 2 #create resid squared
print('SSR: ', compareOLS.resid2.sum()) #sum of square resids
SSR = compareOLS.resid2.sum()
SST = np.sum(np.square(yTest - np.mean(yTest)))
R2 = 1.0 - (SSR / SST)
print('R-squared = {}'.format(R2))
compareOLS.head()
Looking at our model, the R-squared is slightly above that of the backwards-selected model and the AIC is the same; however, all of our p-values are now below the .05 cutoff, making this our best model so far. We seem to have found the best features. Let's use them to fit ridge, lasso, and elastic net regression models and predict with each, to see if they give us better results. We will train these models using only the features we found to be best.
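As an aside, the alpha of 0.000247875 used below was chosen by hand. One optional, data-driven alternative (not part of the original workflow) is to let cross-validation pick the penalty strength, for example with scikit-learn's RidgeCV and LassoCV; note that sklearn parameterizes the penalty differently from statsmodels' fit_regularized, so the resulting values are not directly interchangeable.
#Optional sketch: choose the regularization strength by cross-validation
from sklearn.linear_model import RidgeCV, LassoCV
alphas = np.logspace(-5, 1, 50)
ridge_cv = RidgeCV(alphas = alphas, cv = 5).fit(Xtrain[bestXCols], yTrain)
lasso_cv = LassoCV(alphas = alphas, cv = 5, max_iter = 10000).fit(Xtrain[bestXCols], yTrain)
print('Ridge CV alpha: ', ridge_cv.alpha_) #note: not on the same scale as the statsmodels alpha below
print('Lasso CV alpha: ', lasso_cv.alpha_)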
First let's fit the ridge regression model and predict using it.
ols_model = sm.OLS(yTrain, Xtrain[bestXCols])
results = ols_model.fit_regularized(method ='elastic_net', alpha=0.000247875, L1_wt = 0)
print(results.params) #Print coefficients
y_pred = results.predict(Xtest[bestXCols]) #predict on Xtest
compareOLS = pd.DataFrame({'Actual': yTest, 'Pred': y_pred})
compareOLS['resid'] = compareOLS['Actual'] - compareOLS['Pred'] #Create resid
compareOLS['resid2'] = compareOLS['resid'] ** 2 #create resid squared
print('SSR: ', compareOLS.resid2.sum()) #sum of square resids
SSR = compareOLS.resid2.sum()
SST = np.sum(np.square(yTest - np.mean(yTest)))
R2 = 1.0 - (SSR / SST)
print('R-squared = {}'.format(R2))
compareOLS.head()
Now let's fit the lasso regression model and predict on it.
ols_model = sm.OLS(yTrain, Xtrain[bestXCols])
results = ols_model.fit_regularized(method ='elastic_net', alpha=0.000247875, L1_wt = 1)
print(results.params) #Print coefficients and features
y_pred = results.predict(Xtest[bestXCols])
compareOLS = pd.DataFrame({'Actual': yTest, 'Pred': y_pred})
compareOLS['resid'] = compareOLS['Actual'] - compareOLS['Pred'] #Create resid
compareOLS['resid2'] = compareOLS['resid'] ** 2 #create resid squared
print('SSR: ', compareOLS.resid2.sum()) #sum of square resids
SSR = compareOLS.resid2.sum()
SST = np.sum(np.square(yTest - np.mean(yTest)))
R2 = 1.0 - (SSR / SST)
print('R-squared = {}'.format(R2))
compareOLS.head()
Now let's fit the elastic-net regression model and predict using it.
ols_model = sm.OLS(yTrain, Xtrain[bestXCols])
results = ols_model.fit_regularized(method ='elastic_net', alpha=0.000247875, L1_wt = .5)
print(results.params) #Print coefficients and features
y_pred = results.predict(Xtest[bestXCols])
compareOLS = pd.DataFrame({'Actual': yTest, 'Pred': y_pred})
compareOLS['resid'] = compareOLS['Actual'] - compareOLS['Pred'] #Create resid
compareOLS['resid2'] = compareOLS['resid'] ** 2 #create resid squared
print('SSR: ', compareOLS.resid2.sum()) #sum of square resids
SSR = compareOLS.resid2.sum()
SST = np.sum(np.square(yTest - np.mean(yTest)))
R2 = 1.0 - (SSR / SST)
print('R-squared = {}'.format(R2))
compareOLS.head()
Looking at the 3 models above, none shows a significant increase in performance over our OLS model. Elastic net gave the best results of the three, with the lowest SSR and highest R-squared on the test predictions; however, none proved better than our OLS model with the best-feature list, which had an SSR of 887 and an R-squared of .79.
Let's go ahead and create a SVD model and see if that performs better on our predictions.
trainVars = train.loc[:,xCatsZipTop]
M = trainVars.values
logPS = train.loc[:,'logPS']
testVars = test.loc[:,xCatsZipTop]
Mtest = testVars.values
logPStest = test.loc[:,'logPS']
print('1st 5 rows of Data Frame: ')
print(trainVars.head())
print('\n1st 5 rows in matrix Form: ')
print(M[0:5, :])
beta_coeffs, resids, rank, s = np.linalg.lstsq(M,logPS)
print('beta coefficients: {}'.format(beta_coeffs))
print('\nsingular values: {}'.format(s))
house_predictions = np.dot(Mtest, beta_coeffs) + np.mean(logPStest)
house_resids = house_predictions - logPStest
# Plot the residuals vs the predicted values
plt.subplot(1, 2, 1)
plt.plot(house_predictions, house_resids, '.')
plt.ylabel('residuals')
plt.xlabel('predicted logPS')
plt.title('Residuals vs Predicted')
# Histogram of residuals
plt.subplot(1, 2, 2)
plt.hist(house_resids)
plt.title('Histogram of Residuals')
SSR = np.sum(np.square(house_resids))
SST = np.sum(np.square(logPStest - np.mean(logPStest)))
print('SSR: {}'.format(SSR))
R2 = 1.0 - (SSR / SST)
print('R-squared = {}'.format(R2))
This model actually gives us very similar results to our best model. Its SSR is only slightly higher than that of our feature-selected OLS model, and its R-squared is very similar. So far our feature-selected OLS is the best, followed closely by this model and then the elastic net regression model.
Now let's see if using Principal component regression will provide us with a better model.
from sklearn.decomposition import PCA
#PCA
target_label = 'logPS'
pca = PCA() #set model
pca_result = pca.fit_transform(Xtrain)
column_names = ['pc' + str(ix+1) for ix in range(Xtrain.shape[1])]
pca_df = pd.DataFrame(data = pca_result, columns=column_names)
pca_df[target_label] = yTrain
plt.plot(pca.explained_variance_)
plt.title('Explained variance by Principal Component Num')
plt.xlabel('Principal Component')
plt.ylabel('Explained Variance')
#get cumulative percentage of explained variance for the first n principal components
for n in range(1, 18):
    label = "Component" if n == 1 else "Components"
    print("{} {}: ".format(n, label), pca.explained_variance_[0:n].sum() / pca.explained_variance_.sum())
We can see from above that 15 components is a good choice: it gives us a model that captures 96.2% of the variance in the feature matrix, and adding further components increases the explained variance only minimally.
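As a quick cross-check of that choice, the cumulative explained-variance fractions above can also be read directly off the fitted pca object with np.cumsum.
#Cross-check: cumulative explained variance fraction by component count
cum_var = np.cumsum(pca.explained_variance_ratio_)
print(np.round(cum_var[:17], 3))
print('Components needed to reach 96%: ', int(np.argmax(cum_var >= 0.96)) + 1)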
# Perform linear regression with the first N columns
n = 15
formula_start = target_label + ' ~ ' #create formula start for OLS
formula_terms = ['pc' + str(x+1) for x in range(n)] #Gather pc's to add to formula
formula_end = ' + '.join(formula_terms) #prepare end of formula
formula_final = formula_start + formula_end #final formula to use in OLS
pcr_model = smf.ols(formula = formula_final, data=pca_df)
results = pcr_model.fit()
# Get most of the linear regression statistics we are interested in:
print(results.summary())
# Plot a histogram of the residuals
sns.distplot(results.resid, hist=True)
plt.xlabel('Residual')
plt.ylabel('Frequency')
plt.title('Residual Histogram')
The PCR model has a much higher AIC than our previous models and is probably not as good at predicting. Since it also cannot show us which feature has what effect on price, and it is the worst model so far, we will not predict with it.
Out of all of our models, our feature-selected OLS model with all features with p-values over .05 removed is our best model.
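The test-set SSR and R-squared computation was repeated for each model above; a small helper like the one below (a sketch with the hypothetical name test_ssr_r2, not part of the original pipeline) could be used to tabulate those comparisons side by side.
#Sketch: helper for the test-set SSR and R-squared that were computed repeatedly above
def test_ssr_r2(y_true, y_pred):
    ssr = np.sum(np.square(y_true - y_pred))
    sst = np.sum(np.square(y_true - np.mean(y_true)))
    return ssr, 1.0 - ssr / sst
#Example: re-score the final feature-selected OLS model on the test set
best_ols = sm.OLS(yTrain, Xtrain[bestXCols]).fit()
print(test_ssr_r2(yTest, best_ols.predict(Xtest[bestXCols])))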
Our best model was created using the following features: [zip1, zip2, grade13, waterfront, zip4, grade12, zip3, grade11, cond5, view4, grade10, cond4, view3, lat, grade9, view1, newReno, cond3, sqft_living, view2, sqft_living15, hasBsmnt, floors, bathrooms, sqft_lot, long, bedrooms, newBuild, grade7, grade6, grade5, grade4]
Our best regression formula is: -0.226187 + 1.315281zip1 + 1.063486zip2 + 0.822946grade13 + 0.773717waterfront + 0.730778zip4 + 0.707111grade12 + 0.659465zip3 + 0.585843grade11 + 0.558247cond5 + 0.52146view4 + 0.485521grade10 + 0.392712cond4 + 0.377873view3 + 0.360219lat + 0.292872grade9 + 0.272636view1 + 0.266681newReno + 0.259037cond3 + 0.256832sqft_living + 0.241129view2 + 0.115955sqft_living15 + 0.088957hasBsmnt + 0.080396floors + 0.054347bathrooms + 0.039591sqft_lot - 0.015802long - 0.024311bedrooms - 0.185233newBuild - 0.280656grade7 - 0.588332grade6 - 0.845429grade5 - 0.981769grade4.
This formula will provide the predicted scaled log price of a house given the features above. It seems that the most effective price predictors are whether the house is in a high-priced zipcode, whether it is on the waterfront, its King County grade, its view rating, its latitude, whether it is a new renovation, and its square feet of living space.
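To make the model directly usable in the construction company's workflow, the sketch below shows how a candidate house could be scored: take a row already on the same scaled/encoded basis as the training data, apply hypothetical post-renovation changes, predict its scaled log price with the refit OLS model, and convert back to dollars. The candidate values here are purely illustrative, not part of the original analysis.
#Sketch: score a candidate house with the best model (illustrative values only)
best_ols = sm.OLS(yTrain, Xtrain[bestXCols]).fit() #refit the best feature-selected OLS model
candidate = Xtest[bestXCols].iloc[[0]].copy() #start from a real test row so scaling/encoding match
#zero out the existing grade/condition dummies before applying hypothetical renovation values
for col in ['grade4','grade5','grade6','grade7','grade9','grade11','grade12','grade13','cond3','cond4']:
    if col in candidate.columns:
        candidate[col] = 0
candidate['grade10'] = 1 #hypothetical: renovated up to grade 10
candidate['cond5'] = 1   #hypothetical: condition 5 after renovation
candidate['newReno'] = 1 #hypothetical: now counts as a new renovation
pred_logPS = float(np.asarray(best_ols.predict(candidate))[0])
mu = df['log_price'].mean()
sigma = df['log_price'].std(ddof = 0) #population std, to match sklearn's scale()
print('Predicted post-renovation price: ', round(math.exp(pred_logPS * sigma + mu)))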