A large-item auction site client is getting ready to acquire a smaller automobile auction company primarily dealing with classic cars and is interested in understanding how the price relates to certain automobile characteristics.
First let's import our libraries and define some auxiliary functions.
#Import libraries
import pandas as pd
import numpy as np
import math
import seaborn as sns
from sklearn.preprocessing import scale
import scipy.stats as ss
import scipy
import statsmodels.stats.weightstats as ws
import matplotlib.pyplot as plt
from scipy.stats import kstest
from statistics import mean
from statsmodels.stats.multicomp import pairwise_tukeyhsd
Let's create some functions that we will use in our analysis below.
#Create getLog Function to create the log(price) column
def getLog(price):
return math.log(price)
%matplotlib inline
def plot_cums(dist1, dist2):
## sort the first data sample:
data_sorted = np.sort(dist1)
# calculate the proportional values of samples
p = 1. * np.arange(len(dist1)) / (len(dist1) - 1)
# Now plot as line and scatter plot.
plt.plot(data_sorted, p)
plt.scatter(data_sorted, p, color = 'red')
# sort the seond data sample:
data_sorted = np.sort(dist2)
# And the second plot
plt.plot(data_sorted, p, color = 'yellow')
plt.scatter(data_sorted, p, color = 'green')
def ks_test(data, dist = 'norm'):
ks_stat, pvalue = kstest(data, dist)
print('KS-statistic = ' + str(ks_stat))
print('P-value = ' + str(pvalue))
def stratify(df,p,column, depVar):
groupedCounts = df.loc[:,column].value_counts()
groupedIndexes = groupedCounts.index.tolist()
maxNums = groupedCounts.min()
wantNums = int(p* groupedCounts.sum())
if wantNums <= maxNums:
item1Loc = df.loc[:, column] == groupedIndexes[0]
item2Loc = df.loc[:, column] == groupedIndexes[1]
else: #p is to large
pmax = maxNums / df.shape[0]
raise Exception('The maximum value of p = ' + str(pmax))
item1Data = df.loc[item1Loc]
item2Data = df.loc[item2Loc]
item1Samples = item1Data.sample(n=wantNums)
item2Samples = item2Data.sample(n=wantNums)
type1 = item1Samples.loc[:, [column,depVar]]
type2 = item2Samples.loc[:, [column,depVar]]
totalTypes = pd.concat([type1,type2])
return type1, type2, totalTypes
We've defined a few functions including a stratify function, ks test function, plot cumaltive distribution function and a function to add log price to our data frame.
Next let's define our t_test function. This function will return our degrees of freedom, the difference of mean, the t-statistic, p-value, and high and low confidence intervals.
def t_test(type1, type2, depVar, p):
a = type1.loc[:,depVar]
b = type2.loc[:,depVar]
diff = a.mean() - b.mean()
res = ss.ttest_ind(a, b)
means = ws.CompareMeans(ws.DescrStatsW(a), ws.DescrStatsW(b))
confint = means.tconfint_diff(alpha=p, alternative='two-sided', usevar='unequal')
degfree = means.dof_satt()
index = ['DegFreedom', 'Difference', 'Statistic', 'PValue', 'Low95CI', 'High95CI']
return pd.Series([degfree, diff, res[0], res[1], confint[0], confint[1]], index = index)
Now let's create our plothist function. This will create a subplot of two histograms that share an x and y axis. This will also graph the 95% confidence intervals for the first histogram with dashed vertical lines, and means for the first and second histogram with a solid vertical line.
def plotHist(type1, type2, depVar, p):
a = type1.loc[:,depVar]
b = type2.loc[:,depVar]
means = ws.CompareMeans(ws.DescrStatsW(a), ws.DescrStatsW(b))
confint = means.tconfint_diff(alpha=p, alternative='two-sided', usevar='unequal')
diff = a.mean() - b.mean()
ax1 = plt.subplot(211)
upper = mean(a) + confint[1] - diff
lower = mean(a) + confint[0] - diff
plt.hist(a, label = type1.iloc[0,0],bins=10, color='blue', density=True)
plt.legend()
plt.axvline(mean(a), color = 'red')
plt.axvline(upper, color = 'red', linestyle='--')
plt.axvline(lower, color = 'red', linestyle='--')
plt.ylabel('Frequency')
ax2 = plt.subplot(212, sharex=ax1, sharey=ax1)
plt.hist(b, label = type2.iloc[0,0],bins=10, color='green', density=True)
plt.legend()
plt.axvline(mean(b), color = 'red')
plt.ylabel('Frequency')
plt.xlabel('Value')
Now let's create a function that will create a data frame for each unique value in a given column in our data frame. It will return all data frames in a list along with a variable containing all unique values found in the given column. We will also create a cleanDF function that will take our data frame and extract only the specified column.
def createDfs(df, column):
uniquevals = df[column].unique().tolist()
allDf = []
for vals in uniquevals:
allDf.append(vals)
for i in range(len(uniquevals)):
strvals = str(uniquevals[i])
allDf[i] = df.copy(deep=True)
toDrop = allDf[i].loc[allDf[i][column] != strvals, column]
toDropIndex = toDrop.index.tolist()
allDf[i].drop(toDropIndex, axis=0, inplace=True)
return uniquevals, allDf
def cleanDf(df, depVar):
df = df.loc[:, depVar]
return df
Now that our auxiliary functions are defined let's load the data. We will create 2 data frames to use for different analysis. We will change our numeric columns to numeric and remove all rows of missing data. In dfNorm we will be removing all missing data from the normalized losses column along with the other changes made in the original df. Our cleaned up original Df will have 193 rows, and the dfNorm will have 160 rows.
#Load the data set and save it as a df
df = pd.read_csv(fileName,header = 'infer')
dfNorm = pd.read_csv(fileName,header = 'infer')
#Set columns with missing values to fix
cols = ['price', 'bore', 'stroke',
'horsepower', 'peak-rpm']
cols2 = ['price', 'bore', 'stroke',
'horsepower', 'peak-rpm', 'normalized-losses']
#Coerce columns to numeric and put NaN's for unknown values
for i in cols:
df.loc[:,i] = pd.to_numeric(df.loc[:,i], errors = 'coerce')
for i in cols2:
dfNorm.loc[:,i] = pd.to_numeric(dfNorm.loc[:,i], errors = 'coerce')
#Drop all columns with NaN values
df.dropna(axis=0, inplace=True)
dfNorm.dropna(axis=0, inplace=True)
#Cleanup ? from Num of Doors
column = 'num-of-doors'
toDrop = df.loc[df[column] == '?', column]
toDropIndex = toDrop.index.tolist()
df.drop(toDropIndex, axis = 0, inplace = True)
Now let's add the log price and standardized prices columns to our data frames.
#Add log price column. Use get log function to apply it to the data frame
df['log_price'] = df['price'].apply(getLog)
dfNorm['log_price'] = dfNorm['price'].apply(getLog)
#Standardize our price and log price and save to dfs
df['priceS'] = scale(df.loc[:,'price'])
df['logPS']= scale(df.loc[:,'log_price'])
dfNorm['priceS'] = scale(dfNorm.loc[:,'price'])
dfNorm['logPS']= scale(dfNorm.loc[:,'log_price'])
Now that our data is loaded and prepared let's take a look at the data and some statistics of the data to make sure everything has been loaded correctly and to get a feel for our data.
#Let's look at the top of our cleaned up data frame
pd.set_option('display.max_columns', 500)
print(df.head())
#Let's print the dtypes to make sure everything looks right.
print(df.dtypes)
#Let's print a few more statistics
print('Shape: ', df.shape)
print('Total Nulls: ',df.isnull().sum().sum()) #check for NaN's
print('Columns: ', df.columns)
#Let's print the descriptive statistics
df.describe()
Now let's look at all the value counts or unique values for our categories to get a further idea of the data we are working with, and to gather ideas for our analysis process.
col = 'symboling'
print("\n", col, ":\n", df.loc[:,col].value_counts())
col = 'make'
print("\n", col, ":\n", df.loc[:,col].value_counts())
col = 'num-of-doors'
print("\n", col, ":\n", df.loc[:,col].value_counts())
col = 'engine-location'
print("\n", col, ":\n", df.loc[:,col].value_counts())
col = 'engine-type'
print("\n", col, ":\n", df.loc[:,col].value_counts())
col = 'num-of-cylinders'
print("\n", col, ":\n", df.loc[:,col].value_counts())
col = 'fuel-system'
print("\n", col, ":\n", df.loc[:,col].value_counts())
col = 'fuel-type'
print("\n", col, ":\n", df.loc[:,col].value_counts())
col = 'aspiration'
print("\n", col, ":\n", df.loc[:,col].value_counts())
col = 'drive-wheels'
print("\n", col, ":\n", df.loc[:,col].value_counts())
col = 'body-style'
print("\n", col, ":\n", df.loc[:,col].value_counts())
col = 'engine-size'
print("\n", col, ":\n", df.loc[:,col].unique())
col = 'normalized-losses'
print("\n", col, ":\n", dfNorm.loc[:,col].unique())
col = 'city-mpg'
print("\n", col, ":\n", df.loc[:,col].unique())
col = 'highway-mpg'
print("\n", col, ":\n", df.loc[:,col].unique())
col = 'horsepower'
print("\n", col, ":\n", df.loc[:,col].unique())
col = 'peak-rpm'
print("\n", col, ":\n", df.loc[:,col].unique())
col = 'curb-weight'
print("\n", col, ":\n", df.loc[:,col].unique())
Let's create some Bins for our cylinders and insurance ratings. We will bin cars with a risk rating of 0 or less as Not Risky and those with a rating of 1 or more as Risky. We will consider cars with four or less cylinders as Normal and those with over four cylinders as Extra.
#Function to bin insurance risks
def binIns(symbol):
if symbol < 1:
return 'Not Risky'
elif symbol > 0:
return 'Risky'
#Function to bin cylinders
def binCyls(cylinders):
if cylinders == 'four':
return 'Normal'
elif cylinders == 'three':
return 'Normal'
elif cylinders == 'five':
return 'Extra'
elif cylinders == 'six':
return 'Extra'
elif cylinders == 'eight':
return 'Extra'
elif cylinders == 'twelve':
return 'Extra'
df['InsBin'] = df['symboling'].apply(binIns) #bin symboling column using our function and save in df
df['CylBin'] = df['num-of-cylinders'].apply(binCyls) #bin cylinder column using our function and save in df
dfNorm['InsBin'] = dfNorm['symboling'].apply(binIns) #bin symboling column using our function and save in df
dfNorm['CylBin'] = dfNorm['num-of-cylinders'].apply(binCyls) #bin cylinder column using our function and save in df
Now that our data is ready let's start with our first test. We will compare and test the Normality of the distributions of price and log price.
Let's plot the distributions of price and log price.
#Plot price vs log price using our function
plot_cums(df.loc[:,'price'], df.loc[:,'log_price'])
As we can see above these two distributions when plotted don't provide us a great plot. This is because log price naturally has much smaller values than price. Let's standardize price and log(price) so that we can plot them more effectively against each other.
#Plot our standardize price columns
plot_cums(df['priceS'], df['logPS'])
Now that they are standardized we can see that both log(price) and price have similiar distributions. It is hard to see if these look normal. Now let's create a function that will perform our kstest and return the KS statistic along with the P value. The function will test the distribution of our given data versus that of a normal distribution.
Next let's run our KS test for our standardized price and standardized log price against a normal distribution and compare their statistics and p values.
ks_test(df['priceS'])
ks_test(df['logPS'])
Judging by the statistics above, these distributions are not really normal. Their k-statistics are not too large but they both possess low p-values, meaning we must reject our null hypothesis that they are similiar to a normal distribution. However, we can see that the log price column is more normally distributed as it has a smaller KS statistic and a larger p-value than the normal price.
Let's look at some Kernel Density Estimate plots of some of our variables to get a feel for their distributions.
fig = plt.figure(figsize=(6,6))
ax = fig.gca()
sns.set_style("whitegrid")
sns.kdeplot(df.loc[:, 'log_price'], ax = ax)
ax.set_title('KDE plot of log auto price')
ax.set_xlabel('Log Auto price')
ax.set_ylabel('Density')
fig = plt.figure(figsize=(6,6))
ax = fig.gca()
sns.set_style("whitegrid")
sns.kdeplot(df.loc[:, 'price'], ax = ax)
ax.set_title('KDE plot of auto price')
ax.set_xlabel('Auto price')
ax.set_ylabel('Density')
The price distributions are clearly right skewed with the mode being around $8,000. The log price distribution is slightly less skewed but it is still clearly right skewed.
fig = plt.figure(figsize=(6,6))
ax = fig.gca()
sns.set_style("whitegrid")
sns.kdeplot(df.loc[:, 'curb-weight'], ax = ax)
ax.set_title('KDE plot of curb weight')
ax.set_xlabel('Curb Weight')
ax.set_ylabel('Density')
fig = plt.figure(figsize=(6,6))
ax = fig.gca()
sns.set_style("whitegrid")
sns.kdeplot(df.loc[:, 'symboling'], ax = ax)
ax.set_title('KDE plot of symboling')
ax.set_xlabel('Symboling')
ax.set_ylabel('Density')
We can see above that the insurance ratings are actually left skewed. This shows that most of the cars are more risky to the insurance companies; however the mode is at 0 so the highest count falls at normal risk.
fig = plt.figure(figsize=(6,6))
ax = fig.gca()
sns.set_style("whitegrid")
sns.kdeplot(dfNorm.loc[:, 'normalized-losses'], ax = ax)
ax.set_title('KDE plot of Normalized Losses')
ax.set_xlabel('Normalized Losses')
ax.set_ylabel('Density')
The above shows the distribution of our normalized losses. The mode is around 100 and the distribution is slightly skewed right. It is interesting to note that even though most cars are deemed risky their losses seem to be minimal.
fig = plt.figure(figsize=(6,6))
ax = fig.gca()
sns.set_style("whitegrid")
sns.kdeplot(df.loc[:, 'horsepower'], ax = ax)
ax.set_title('KDE plot of Horsepower')
ax.set_xlabel('Horsepower')
ax.set_ylabel('Density')
The horsepower distribution has a slighly similiar distribution as log price and price which makes sense as generally more horsepower on a car will cost more.
Now let's begin exploring the relationships between categories.
Let's test significance of some variables using both graphical methods and the t-test.
Let's begin by stratifying our fuel-type column. We will then perform a t-test on the two fuel types and print our test statistics.
p = .05 #set our probability for reject hypothesis, our cutoff confidence
column = 'fuel-type'
depVar = 'log_price'
type1, type2, totalTypes = stratify(df = df, p = p, column = column, depVar = depVar)
testStats = t_test(type1 = type1,type2 = type2, depVar = depVar, p = p)
print(testStats)
Looking at our t_test statistics our 95% confidence interval includes 0 and our p-value is large. Looking at these statistics we must accept our null hypothesis that the difference in means due to fuel type are from random variation alone. Let's look at the data graphically and see if it agrees with out what out t-test told us.
Now let's plot the histogram of our two fuel types.
plotHist(type1 = type1, type2 = type2, depVar=depVar, p = p)
Looking at the histogram it is clear that the difference in means between the two categories is due to random variation alone. The mean of the diesel histogram is inside the 95% confidence interval of the gas histogram. This combined with our t_test statistics is proof enough that we cannot reject the null hypothesis that the difference in means is from random variation. This means that the fuel type is not a statistically significant indicator of price.
Next let's look at the importance of aspiration on price.
p = .05 #set our p
column = 'aspiration'
depVar = 'log_price'
type1, type2, totalTypes = stratify(df = df, p = p, column = column, depVar = depVar)
testStats = t_test(type1 = type1,type2 = type2, depVar=depVar, p = p)
print(testStats)
Looking at our t-test statistics our 95% confidence interval includes 0 and our p-value is large. Looking at these statistics we must accept our null hypothesis that the difference in means due to aspiration are from random variation alone. Let's look at the data graphically and see if it agrees with what our t-test told us.
Now let's plot the histogram of our two aspiration types.
plotHist(type1 = type1, type2 = type2, depVar= depVar, p = p)
Looking at the histogram it seems that the difference in means between the two categories is due to random variation alone. The mean of the turbo histogram is inside the 95% confidence interval of the standard histogram. This combined with our t-test statistics is proof enough that we cannot reject the null hypothesis that the difference in means is from random variation. This means that the aspiration type is not a statistically significant indicator of price.
Now let's look at the importance of rear or front wheel drive on price. First let's create a new df and remove 4wd cars from the drive-wheels column. Then let's stratify our new df and do our t test.
dfdoors = df.copy(deep = True)
column = 'drive-wheels'
depVar = 'log_price'
wd4 = dfdoors.loc[dfdoors[column] == '4wd',column]
wd4Index = wd4.index.tolist()
dfdoors.drop(wd4Index, axis = 0, inplace = True)
p = .05
column = 'drive-wheels'
depVar = 'log_price'
type1, type2, totalTypes = stratify(df = dfdoors, p = p, column = column, depVar = depVar)
testStats = t_test(type1 = type1,type2 = type2, depVar=depVar, p = p)
print(testStats)
Looking at our t-test statistics our 95% confidence interval doesn't include 0 and our p-value is very small, well below our .05 cutoff. Looking at these statistics we must reject our null hypothesis that the difference in means due to front or rear wheel drive are from random variation alone. Let's look at the data graphically and see if it agrees with what our t_test told us.
Now let's plot the histogram of our two drive wheel types.
plotHist(type1 = type1, type2 = type2, depVar=depVar, p = p)
Looking at the histogram it seems that these two distributions are in fact different. The mean of the rear wheel drive histogram is outside the 95% confidence interval of the front wheel drive histogram. This combined with our t-test statistics is proof enough that we can reject the null hypothesis that the difference in means is from random variation. This means that the drive wheel type is a statistically significant indicator of price.
Now let's compare the Insurance bins to the curb weight. This will show us if not risky cars are heavier than risky cars or vice versa.
p = .05 #set our probability for reject hypothesis, our cutoff confidence
column = 'InsBin'
depVar = 'curb-weight'
type1, type2, totalTypes = stratify(df = df, p = p, column = column, depVar = depVar)
testStats = t_test(type1 = type1,type2 = type2, depVar = depVar, p = p)
print(testStats) s
Looking at our t_test statistics our 95% confidence interval did not include 0 and our p-value is small. Looking at these statistics we must reject our null hypothesis that the difference in mean curb weight due to risk are from random variation alone. Let's look at the data graphically and see if it agrees with out what out t-test told us.
Let's plot the histogram of our two risk types.
plotHist(type1 = type1, type2 = type2, depVar=depVar, p = p)
Looking at the histogram the difference in means between the two categories is not due to random variation alone. The mean of the not risky histogram is outside the 95% confidence interval of the risky histogram. This combined with our t_test statistics is proof enough that we can reject the null hypothesis that the difference in means is from random variation. This means that the riskyness is a statistically significant indicator of curb weight.
Now let's look at our cylinder bin compared to curb weight.
p = .05 #set our probability for reject hypothesis, our cutoff confidence
column = 'CylBin'
depVar = 'curb-weight'
type1, type2, totalTypes = stratify(df = df, p = p, column = column, depVar = depVar)
testStats = t_test(type1 = type1,type2 = type2, depVar = depVar, p = p)
print(testStats)
Looking at our t-test statistics our 95% confidence interval doesn't include 0 and our p-value is very small, well below our .05 cutoff. Looking at these statistics we must reject our null hypothesis that the difference in means due to number of cylinders are from random variation alone. Let's look at the data graphically and see if it agrees with what our t_test told us.
Now let's plot the histogram of our two cylinder bins.
plotHist(type1 = type1, type2 = type2, depVar=depVar, p = p) #plot histogram with means and CI
Looking at the histogram it seems that these two distributions are in fact different. The mean of the extra cylinders histogram is outside the 95% confidence interval of the normal cylinder histogram. This combined with our t-test statistics is proof enough that we can reject the null hypothesis that the difference in means is from random variation. This means that the number of cylinders is a statistically significant indicator of curb weight. Cars with more cylinders generally weigh more.
Now let's look at the cylinder bins in relation to the insurance risk ratings.
p = .05 #set our probability for reject hypothesis, our cutoff confidence
column = 'CylBin'
depVar = 'symboling'
type1, type2, totalTypes = stratify(df = df, p = p, column = column, depVar = depVar)
testStats = t_test(type1 = type1,type2 = type2, depVar = depVar, p = p)
print(testStats)
Looking at our t_test statistics our 95% confidence interval includes 0 and our p-value is large. Looking at these statistics we must accept our null hypothesis that the difference in means in insurance rating due to the number of cylinders are from random variation alone. Let's look at the data graphically and see if it agrees with out what out t-test told us.
Now let's plot the histogram of our two cylinder bins in relation to their risk ratings.
plotHist(type1 = type1, type2 = type2, depVar=depVar, p = p)
Looking at the histogram it is clear that the difference in means between the two categories is due to random variation alone. The mean of the extra histogram is inside the 95% confidence interval of the normal histogram. This combined with our t_test statistics is proof enough that we cannot reject the null hypothesis that the difference in means is from random variation. This means that the number of cylinders is not a statistically significant indicator of risk rating. It is interesting to see that the extra cylinders bin has the same risk as average cars.
Now let's look at Cylinder Bins vs normalized losses.
p = .05 #set our probability for reject hypothesis, our cutoff confidence
column = 'CylBin'
depVar = 'normalized-losses'
type1, type2, totalTypes = stratify(df = dfNorm, p = p, column = column, depVar = depVar) #let's stratify and save our dfs
testStats = t_test(type1 = type1,type2 = type2, depVar = depVar, p = p) #let's get our test stats
print(testStats) #print test stats
Looking at our t_test statistics our 95% confidence interval includes 0 and our p-value is large. Looking at these statistics we must accept our null hypothesis that the difference in mean normalized losses due to number of cylinders are from random variation alone. Let's look at the data graphically and see if it agrees with out what out t-test told us.
plotHist(type1 = type1, type2 = type2, depVar=depVar, p = p) #plot histogram with means and CI
Looking at the histogram we can see that the difference in means between the two categories is due to random variation alone. The mean of the extra histogram is inside the 95% confidence interval of the normal histogram. This combined with our t_test statistics is proof enough that we cannot reject the null hypothesis that the difference in means is from random variation. This means that the number of cylinders is not a statistically significant indicator of normalized losses.
Let's move on and Apply ANOVA and Tukey's HSD test to our data to compare the log price of autos stratified by body style using our createDF and cleanDf function. Let's start by creating our list of data frames for each body style. Let's also print how many unique data frames we have and the unique body styles.
column = 'body-style'
depVar = 'log_price'
uniques, allDfs = createDfs(df,column)
print(len(allDfs))
print(uniques)
#Unpack all dfs list to multiple different dfs
df1 = cleanDf(df = allDfs[0], depVar = depVar)
df2 = cleanDf(df = allDfs[1], depVar = depVar)
df3 = cleanDf(df = allDfs[2], depVar = depVar)
df4 = cleanDf(df = allDfs[3], depVar = depVar)
df5 = cleanDf(df = allDfs[4], depVar = depVar)
#Plot box plot of all body style data frames
plt.boxplot([df1, df2, df3, df4, df5])
plt.ylabel('Value')
plt.xlabel('Variable')
plt.title('Box plot of variables')
f_statistic, p_value = ss.f_oneway(df1, df2, df3, df4, df5)
print('F statistic = ' + str(f_statistic))
print('P-value = ' + str(p_value))
The F-Statistic is fairly large and the p-value is small. We can reject the null hypothesis that the 5 variables have the same mean, as the probability of the differences arrising from random chance is quite low.
Let's perform the Tukey HSD test on body style as well.
TukeyDF = df.copy(deep=True)
TukeyDF = TukeyDF.loc[:,[column, 'log_price']]
Tukey_HSD = pairwise_tukeyhsd(TukeyDF['log_price'], TukeyDF[column])
print(Tukey_HSD)
Looking at the Tukey HSD output above we can see that if two groups have a 95% confidence interval that includes 0 we cannot reject the Null hypothesis that the variation in means is due to random variation alone. Therefore we cannot reject the null hypothesis for the following relations: convertible/hardtop, convertible/sedan, convertible/wagon, hardtop/sedan, hardtop/wagon, hatchback/wagon, and sedan/wagon. This means relatively speaking these pairs prices would be drawn from similiar distributions. However, we can reject the null hypothesis for convertible/hatchback, hardtop/hatchback, and hatchback/sedan. This means relatively speaking these pairs prices are drawn from different distributions.
Let's look at these relations graphically.
Tukey_HSD.plot_simultaneous() #Tukey HSD Plot
The graph above shows us what we saw above. For instance, hatchback has overlap in the interval with wagon only. That is why when comparing hatchback to wagon we had to accept the null hypothesis that they are similiar. However, when comparing hatchback with all other types their is no overlap and therefore we can reject the null hypothesis and we accept that the prices would be drawn from different distributions.
Now let's move on to our number of cylinders column. First let's drop our eight, three and twelve cylinder rows as we only have 4,1,and 1 of them respectively. We will not be able to draw good insights with such a low count. Then let's seperate each number of cylinder types into seperate dfs and perform the ANOVA test.
column = 'num-of-cylinders'
depVar = 'log_price'
dfCylinders = df.copy(deep = True)
toDrop = dfCylinders.loc[dfCylinders[column] == 'eight', column]
toDropIndex = toDrop.index.tolist()
dfCylinders.drop(toDropIndex, axis = 0, inplace = True)
toDrop = dfCylinders.loc[dfCylinders[column] == 'three', column]
toDropIndex = toDrop.index.tolist()
dfCylinders.drop(toDropIndex, axis = 0, inplace = True)
toDrop = dfCylinders.loc[dfCylinders[column] == 'twelve', column]
toDropIndex = toDrop.index.tolist()
dfCylinders.drop(toDropIndex, axis = 0, inplace = True)
uniques, allDfs = createDfs(dfCylinders,column)
print(len(allDfs))
print(uniques)
df1 = cleanDf(df = allDfs[0], depVar = depVar)
df2 = cleanDf(df = allDfs[1], depVar = depVar)
df3 = cleanDf(df = allDfs[2], depVar = depVar)
plt.boxplot([df1, df2, df3])
plt.ylabel('Value')
plt.xlabel('Variable')
plt.title('Box plot of variables')
f_statistic, p_value = ss.f_oneway(df1, df2, df3)
print('F statistic = ' + str(f_statistic))
print('P-value = ' + str(p_value))
The F-Statistic is really large and the p-value is small. We can reject the null hypothesis that the 3 variables have the same mean, as the probability of the differences arrising from random chance is quite low.
Let's perform the Tukey HSD test on number of cylinders as well.
TukeyDF = dfCylinders.copy(deep = True)
TukeyDF = TukeyDF.loc[:,[column, 'log_price']]
Tukey_HSD = pairwise_tukeyhsd(TukeyDF['log_price'], TukeyDF[column])
print(Tukey_HSD)
Looking at the Tukey HSD output above we know that if two groups have a 95% confidence interval that includes 0 we cannot reject the Null hypothesis that the variation in means is due to random variation alone. Therefore we cannot reject the null hypothesis for five/six cylinder relationship. This means relatively speaking these pairs prices would be drawn from similiar distributions. However, we can reject the null hypothesis for five/four and four/six cylinder relationships. This means relatively speaking these pairs prices are drawn from different distributions.
Let's look at these relations graphically.
Tukey_HSD.plot_simultaneous()
The graph above shows us what we saw above. Four cylinder prices do not overlap with five or six therefore it is likely that they are drawn from a different distribution; however, five and six overlap and are therefore likely similiar.
Let's look at the engine types in relation to prices next.
column = 'engine-type'
depVar = 'log_price'
uniques, allDfs = createDfs(df,column)
print(len(allDfs))
print(uniques)
df1 = cleanDf(df = allDfs[0], depVar = depVar)
df2 = cleanDf(df = allDfs[1], depVar = depVar)
df3 = cleanDf(df = allDfs[2], depVar = depVar)
df4 = cleanDf(df = allDfs[3], depVar = depVar)
df5 = cleanDf(df = allDfs[4], depVar = depVar)
plt.boxplot([df1, df2, df3, df4, df5])
plt.ylabel('Value')
plt.xlabel('Variable')
plt.title('Box plot of variables')
f_statistic, p_value = ss.f_oneway(df1, df2, df3, df4, df5)
print('F statistic = ' + str(f_statistic))
print('P-value = ' + str(p_value))
The F-Statistic is fairly large and the p-value is small. We can reject the null hypothesis that the 5 variables have the same mean, as the probability of the differences arrising from random chance is quite low.
Let's perform the Tukey HSD test on engine type as well.
TukeyDF = df.copy(deep=True)
TukeyDF = TukeyDF.loc[:,[column, 'log_price']]
Tukey_HSD = pairwise_tukeyhsd(TukeyDF['log_price'], TukeyDF[column])
print(Tukey_HSD)
Looking at the Tukey HSD output above we know that if two groups have a 95% confidence interval that includes 0 we cannot reject the Null hypothesis that the variation in means is due to random variation alone. Therefore we cannot reject the null hypothesis for the following relations: dohc/l, dohc/ohcf, dohc/ohcv, l/ohc, l/ohcf, l/ohcv, and ohc/ohcf. This means relatively speaking these pairs prices would be drawn from similiar distributions. However, we can reject the null hypothesis for dohc/ohc, ohc/ohcv, and ohcf/ohcv. This means relatively speaking these pairs prices are drawn from different distributions.
Let's look at these relations graphically.
Tukey_HSD.plot_simultaneous()
The graph above shows us what we saw above. For instance, ohcv has overlap in the interval with dohc and l only. That is why when comparing ohcv to dohc and l we had to accept the null hypothesis that they are similiar. However, when comparing ohcv with all other types their is no overlap and therefore we can reject the null hypothesis and we accept that the prices would be drawn from different distributions.
Now let's look at our fuel systems in relation to price. We must get rid of the spfi row and the mfi row as there is only 1 of each and it will not provide us with useful information. Once that is done we will split up our seperate data frames by fuel system type and do our ANOVA test.
column = 'fuel-system'
depVar = 'log_price'
dfFuels = df.copy(deep = True)
toDrop = dfFuels.loc[dfFuels[column] == 'spfi', column]
toDropIndex = toDrop.index.tolist()
dfFuels.drop(toDropIndex, axis = 0, inplace = True)
toDrop = dfFuels.loc[dfFuels[column] == 'mfi', column]
toDropIndex = toDrop.index.tolist()
dfFuels.drop(toDropIndex, axis = 0, inplace = True)
uniques, allDfs = createDfs(dfFuels,column)
print(len(allDfs))
print(uniques)
df1 = cleanDf(df = allDfs[0], depVar = depVar)
df2 = cleanDf(df = allDfs[1], depVar = depVar)
df3 = cleanDf(df = allDfs[2], depVar = depVar)
df4 = cleanDf(df = allDfs[3], depVar = depVar)
df5 = cleanDf(df = allDfs[4], depVar = depVar)
plt.boxplot([df1, df2, df3, df4, df5])
plt.ylabel('Value')
plt.xlabel('Variable')
plt.title('Box plot of variables')
f_statistic, p_value = ss.f_oneway(df1, df2, df3, df4, df5)
print('F statistic = ' + str(f_statistic))
print('P-value = ' + str(p_value))
The F-Statistic is fairly large and the p-value is small. We can reject the null hypothesis that the 5 variables have the same mean, as the probability of the differences arrising from random chance is quite low.
Let's perform the Tukey HSD test on fuel system as well.
TukeyDF = dfFuels.copy(deep = True)
TukeyDF = TukeyDF.loc[:,[column, 'log_price']]
Tukey_HSD = pairwise_tukeyhsd(TukeyDF['log_price'], TukeyDF[column])
print(Tukey_HSD)
Looking at the Tukey HSD output above we know that if two groups have a 95% confidence interval that includes 0 we cannot reject the Null hypothesis that the variation in means is due to random variation alone. Therefore we cannot reject the null hypothesis for the following relations: 1bbl/2bbl, 1bbl/spdi, idi/mpfi, and idi/spdi. This means relatively speaking these pairs prices would be drawn from similiar distributions. However, we can reject the null hypothesis for 1bbl/idi, 1bbl/mpfi, 2bbl/idi, 2bbl/mpfi, 2bbl/spdi, and mpfi/spdi. This means relatively speaking these pairs prices are drawn from different distributions.
Let's look at these relations graphically.
Tukey_HSD.plot_simultaneous()
Looking at the chart above we can see that some have overlapping intervals and therefore we cannot reject the null hypothesis when comparing them with their difference in means. Some intervals do not overlap at all, for these we can reject the null hypothesis and assume that their price distributions are probably different.
Finally let's take a look at makes versus insurance risk ratings. We will drop makes that have less than 4 vehicles.
column = 'make'
depVar = 'symboling'
dfMakes = df.copy(deep = True)
toDrop = dfMakes.loc[dfMakes[column] == 'mercury', column]
toDropIndex = toDrop.index.tolist()
dfMakes.drop(toDropIndex, axis = 0, inplace = True)
toDrop = dfMakes.loc[dfMakes[column] == 'isuzu', column]
toDropIndex = toDrop.index.tolist()
dfMakes.drop(toDropIndex, axis = 0, inplace = True)
toDrop = dfMakes.loc[dfMakes[column] == 'jaguar', column]
toDropIndex = toDrop.index.tolist()
dfMakes.drop(toDropIndex, axis = 0, inplace = True)
toDrop = dfMakes.loc[dfMakes[column] == 'alfa-romero', column]
toDropIndex = toDrop.index.tolist()
dfMakes.drop(toDropIndex, axis = 0, inplace = True)
toDrop = dfMakes.loc[dfMakes[column] == 'chevrolet', column]
toDropIndex = toDrop.index.tolist()
dfMakes.drop(toDropIndex, axis = 0, inplace = True)
uniques, allDfs = createDfs(dfMakes,column) #create list of DataFrames for each unique in given column
print(len(allDfs))#Number of final DFs to unpack
print(uniques)
df1 = cleanDf(df = allDfs[0], depVar = depVar)
df2 = cleanDf(df = allDfs[1], depVar = depVar)
df3 = cleanDf(df = allDfs[2], depVar = depVar)
df4 = cleanDf(df = allDfs[3], depVar = depVar)
df5 = cleanDf(df = allDfs[4], depVar = depVar)
df6 = cleanDf(df = allDfs[5], depVar = depVar)
df7 = cleanDf(df = allDfs[6], depVar = depVar)
df8 = cleanDf(df = allDfs[7], depVar = depVar)
df9 = cleanDf(df = allDfs[8], depVar = depVar)
df10 = cleanDf(df = allDfs[9], depVar = depVar)
df11 = cleanDf(df = allDfs[10], depVar = depVar)
df12 = cleanDf(df = allDfs[11], depVar = depVar)
df13 = cleanDf(df = allDfs[12], depVar = depVar)
df14 = cleanDf(df = allDfs[13], depVar = depVar)
df15 = cleanDf(df = allDfs[14], depVar = depVar)
df16 = cleanDf(df = allDfs[15], depVar = depVar)
plt.boxplot([df1, df2, df3, df4, df5, df6, df7, df8, df9, df10, df11, df12, df13, df14, df15, df16])
plt.ylabel('Value')
plt.xlabel('Variable')
plt.title('Box plot of variables')
f_statistic, p_value = ss.f_oneway(df1, df2, df3, df4, df5, df6, df7, df8, df9, df10, df11, df12, df13, df14, df15, df16)
print('F statistic = ' + str(f_statistic))
print('P-value = ' + str(p_value))
The F-Statistic is fairly large and the p-value is small. We can reject the null hypothesis that the 16 variables have the same mean, as the probability of the differences arrising from random chance is quite low.
Let's perform the Tukey HSD test on this as well.
TukeyDF = dfMakes.copy(deep = True)
TukeyDF = TukeyDF.loc[:,[column, depVar]]
Tukey_HSD = pairwise_tukeyhsd(TukeyDF[depVar], TukeyDF[column])
print(Tukey_HSD)
Looking at the Tukey HSD output above we know that if two groups have a 95% confidence interval that includes 0 we cannot reject the Null hypothesis that the variation in means is due to random variation alone. Our output above shows us which distributions we can reject the Null hypothesis.
Let's look at these relations graphically.
Tukey_HSD.plot_simultaneous()
Looking at the above graphs we can see that Volvo by far makes the safest cars with all their cars having a below average risk rating. Most of the other cars fall around the same range with Porsche and Saab by far making the riskiest cars.
Next let's go ahead and bootstrap some of our data and then we will compare them graphically. Let's create the functions that will help us bootsrap our data. Our first functions will plot our data distributions or histograms with their 95% confidence intervals.
def plot_dists(a, b, nbins, a_label='pop_A', b_label='pop_B', depVar = 'variable', p=5):
# Create a single sequence of bins to be shared across both
#distribution plots for visualization consistency.
combined = pd.concat([a, b])
breaks = np.linspace(
combined.min(),
combined.max(),
num=nbins+1)
plt.subplot(2, 1, 1)
plt.hist(a, bins=breaks, alpha = .5)
plt.axvline(a.mean(), linewidth = 4)
# 95% confidence interval
plt.axvline(np.percentile(a, p/2.), color='red', linewidth=3)
plt.axvline(np.percentile(a, 100-p/2.), color='red', linewidth=3)
plt.title(a_label.capitalize())
plt.xlabel(depVar.capitalize())
plt.subplot(2, 1, 2) #plot 2
plt.hist(b, bins=breaks, alpha = .5)
plt.axvline(b.mean(), linewidth = 4)
# 95% confidence interval
plt.axvline(np.percentile(b, p/2.), color='red', linewidth=3)
plt.axvline(np.percentile(b, 100-p/2.), color='red', linewidth=3)
plt.title(b_label.capitalize())
plt.xlabel(depVar.capitalize())
plt.tight_layout()
def plot_hist(x, a, b, depVar = 'variable', p=5):
# Plot the distribution and mark the mean
plt.hist(x, alpha = .5)
plt.axvline(x.mean(), linewidth = 4)
# 95% confidence interval
plt.axvline(np.percentile(x, p/2.), color='red', linewidth=3)
plt.axvline(np.percentile(x, 100-p/2.), color='red', linewidth=3)
plt.title(a.capitalize() + " - " + b.capitalize())
plt.xlabel(depVar.capitalize() + " Mean Differences")
Now let's create our bootstrap function. This will take a column with two possible choices and bootsrap the mean of a dependent variable for the two choices. Then it will plot the distribution graph of the two choice's means. This will have a default replica value of 1000.
#Bootstrap
def bootStrap(df,column, choice1, choice2, depVar, n_replicas = 1000):
a = df[df[column] == choice1]
b = df[df[column] == choice2]
a_bootstrap_means = pd.Series([
a.sample(frac=1, replace=True)[depVar].mean()
for i in range(n_replicas)])
b_bootstrap_means = pd.Series([
b.sample(frac=1, replace=True)[depVar].mean()
for i in range(n_replicas)])
plot_dists(a = a_bootstrap_means, b = b_bootstrap_means,
nbins=80, a_label= choice1, b_label=choice2, depVar = depVar) #plot distributions
Our final bootstrap function will take a column in a data frame, and sample the mean differences of a variable between two choices in that column. It will have a default replica value of 1000.
#Bootstrap Mean Diffs
def bootStrapDiffs(df, column, choice1, choice2, depVar, n_replicas = 1000):
diffs = []
for i in range(n_replicas):
sample = df.sample(frac=1.0, replace=True)
a_sample_mean = sample[sample[column] == choice1][depVar].mean()
b_sample_mean = sample[sample[column] == choice2][depVar].mean()
diffs.append(a_sample_mean - b_sample_mean)
diffs = pd.Series(diffs)
plot_hist(x = diffs, a = choice1, b = choice2, depVar = depVar) #plot histogram
Now that our functions are ready let's bootstrap the insurance bins in relation to horsepower.
column = 'InsBin' #set column
choice1 = 'Risky' #set column 1st choice
choice2 = 'Not Risky' #set column 2nd choice
depVar = 'horsepower' #set dependent var
bootStrap(df = df, column = column, choice1 = choice1, choice2 = choice2, depVar = depVar) #bootstrap
Looking at the above graph we can see that the horsepower is not affected by insurance risks. This can be seen as the Not risky 95% confidence intervals do overlap the risky confidence intervals and the mean of the Not Risky distribution falls inside the 95% confidence interval of the risky distribution.
Let's bootstrap and plot their difference in means as well.
bootStrapDiffs(df = df, column = column, choice1 = choice1, choice2 = choice2, depVar = depVar) #bootstrap differences
We can see that in general risky cars have 8 less horsepower than their not risky counterparts.
Now let's look at the horsepower difference between the normal cylinder cars and the extra cylinder cars to see if the horsepower difference is significant.
column = 'CylBin' #set column
choice1 = 'Extra' #set column 1st choice
choice2 = 'Normal' #set column 2nd choice
depVar = 'horsepower' #set dependent var
bootStrap(df = df, column = column, choice1 = choice1, choice2 = choice2, depVar = depVar)
Looking at the above graph we can see that the Extra cylinder cars have more horsepower. This can be seen as the 95% confidence intervals do not overlap.
Let's bootstrap and plot their difference in means as well.
bootStrapDiffs(df = df, column = column, choice1 = choice1, choice2 = choice2, depVar = depVar)
We can see that in general greater than four cylinder cars have on average 64 extra horsepower than cars with four or less cylinders.
Now let's look at the difference in horsepower between gas and diesel cars.
column = 'fuel-type' #set column
choice1 = 'gas' #set column 1st choice
choice2 = 'diesel' #set column 2nd choice
depVar = 'horsepower' #set dependent var
bootStrap(df = df, column = column, choice1 = choice1, choice2 = choice2, depVar = depVar)
Looking at the above graph we can see that the gas cars in general have more horsepower. This can be seen as the 95% confidence intervals do not overlap.
Let's bootstrap and plot their difference in means as well.
bootStrapDiffs(df = df, column = column, choice1 = choice1, choice2 = choice2, depVar = depVar)
We can see that in general the gas cars have on average 20 extra horsepower than diesel cars.
Now let's look at the difference in horsepower between front and rear wheel drive cars.
column = 'drive-wheels' #set column
choice1 = 'fwd' #set column 1st choice
choice2 = 'rwd' #set column 2nd choice
depVar = 'horsepower' #set dependent var
bootStrap(df = df, column = column, choice1 = choice1, choice2 = choice2, depVar = depVar)
Looking at the above graph we can see that the rear wheel drive cars have more horsepower in general. This can be seen as the 95% confidence intervals do not overlap, in fact there is a considerable gap between the two distributions.
Let's bootstrap and plot their difference in means.
bootStrapDiffs(df = df, column = column, choice1 = choice1, choice2 = choice2, depVar = depVar)
We can see that in general the front wheel drive cars have on average 47 less horsepower than rear wheel drive cars.
Now let's look at the difference in city mpg between front and rear wheel drive cars.
column = 'drive-wheels' #set column
choice1 = 'fwd' #set column 1st choice
choice2 = 'rwd' #set column 2nd choice
depVar = 'city-mpg' #set dependent var
bootStrap(df = df, column = column, choice1 = choice1, choice2 = choice2, depVar = depVar)
Looking at the above graph we can see that the front wheel drive cars have better city gas mileage. This can be seen as the 95% confidence intervals above do not overlap.
Let's bootstrap and plot their difference in means.
bootStrapDiffs(df = df, column = column, choice1 = choice1, choice2 = choice2, depVar = depVar)
We can see that in general the front wheel drive cars have better gas mileage than rear wheel drive cars to the tune of 7 extra miles per gallon.
Now let's look at the difference in city mpg between gas and diesel cars.
column = 'fuel-type' #set column
choice1 = 'gas' #set column 1st choice
choice2 = 'diesel' #set column 2nd choice
depVar = 'city-mpg' #set dependent var
bootStrap(df = df, column = column, choice1 = choice1, choice2 = choice2, depVar = depVar)
Looking at the above graph we can see that the diesel cars in general have better gas mileage. This can be seen as the 95% confidence intervals do not overlap.
Let's bootstrap and plot their difference in means.
bootStrapDiffs(df = df, column = column, choice1 = choice1, choice2 = choice2, depVar = depVar)
We can see that in general the diesel cars have better gas mileage than gas cars to the tune of 5 extra miles per gallon.
Now let's move on and create some bayesian models and plot them with their credible intervals.
Let's create some functions that will help us along with creating our bayesian models.
Our first function will compute and return our normalized likelihood
def comp_like(p, x):
variance = np.std(x)**2
x_mean = np.asarray(x).mean()
print('Mean = %.3f, Standard deviation = %.3f' % (x_mean, np.std(x)))
n = len(x)
l = np.exp(-n * np.square(x_mean - p) / (2 * variance))
return l / l.sum()
Our next function will compute our posterior and return a normalized posterior distribution. This will be done by multiplying the prior with the likelihood then normalizing.
def posterior(prior, like):
post = prior * like
return post / sum(post)
Our next function will plot the posterior distribution along with the credible interval specified.
def plot_ci(p, post, num_samples, lower_q, upper_q, xLab = 'Parameter value'):
## Compute a large sample by resampling with replacement
samples = np.random.choice(p, size=num_samples, replace=True, p=post)
ci = scipy.percentile(samples, [lower_q*100, upper_q*100])
interval = upper_q - lower_q
plt.title('Posterior density with %.3f credible interval' % interval)
plt.plot(p, post, color='blue')
plt.xlabel(xLab)
plt.ylabel('Density')
plt.axvline(x=ci[0], color='red')
plt.axvline(x=ci[1], color='red')
print('The %.3f credible interval is %.3f to %.3f'
% (interval, ci[0], ci[1]))
Now that our functions are ready let's create our first bayesian model.
Let's calculate the posterior distributions of Price for Risky and Not Risky bins. We will calculate this given 50 samples.
num_samples = 50
risky = df[df['InsBin'] == 'Risky'].sample(n=num_samples)
nRisky = df[df['InsBin'] == 'Not Risky'].sample(n=num_samples)
#Set x range
N = 1000
p = np.linspace(8000, 20000, num=N)
#create posterior of risky
pp = ss.norm.pdf(p, loc= risky['price'].mean(), scale=1000)
pp = pp / pp.sum()
like_risky = comp_like(p, risky['price'])
post_risky = posterior(pp, like_risky)
#Create posterior of not risky
pp = ss.norm.pdf(p, loc= nRisky['price'].mean(), scale=1000)
pp = pp / pp.sum()
like_nRisky = comp_like(p, nRisky['price'])
post_nRisky = posterior(pp, like_nRisky)
#Plot posteriors with 95% credible intervals
plt.subplot(2,1,1)
plot_ci(p, post_risky, num_samples, lower_q=.025, upper_q=.975, xLab = 'Risky Price')
plt.subplot(2,1,2)
plot_ci(p, post_nRisky, num_samples, lower_q=.025, upper_q=.975, xLab = 'Not Risky Price')
plt.tight_layout()
Looking above we can see the 95% credible intervals of price for our Risky and Not Risky vehicles do not overlap. We can assume Not Risky vehicles are drawn from a different distribution than Risky vehicles and are generally more expensive.
Now let's create bayesian models of price distributions for Normal and Extra cylinders.
num_samples = 25
normalCyl = df[df['CylBin'] == 'Normal'].sample(n=num_samples)
extraCyl = df[df['CylBin'] == 'Extra'].sample(n=num_samples)
#Set x range
N = 1000
p = np.linspace(6000, 32000, num=N)
pp = ss.norm.pdf(p, loc= normalCyl['price'].mean(), scale=1000)
pp = pp / pp.sum()
like_normal = comp_like(p, normalCyl['price'])
post_normal = posterior(pp, like_normal)
pp = ss.norm.pdf(p, loc= extraCyl['price'].mean(), scale=1000)
pp = pp / pp.sum()
like_extra = comp_like(p, extraCyl['price'])
post_extra = posterior(pp, like_extra)
plt.subplot(2,1,1)
plot_ci(p, post_normal, num_samples, lower_q=.025, upper_q=.975, xLab = 'Normal Cylinder Price')
plt.subplot(2,1,2)
plot_ci(p, post_extra, num_samples, lower_q=.025, upper_q=.975, xLab = 'Extra Cylinder Price')
plt.tight_layout()
We can see from above that the 95% credible intervals of Normal and Extra cylinders do not overlap and are not even close to each other. It is safe to assume that Extra Cylinder cars are in general more expensive than Normal cylinder (4 cylinder or less) cars.
Now let's create bayesian models of risk factor distributions for Normal and Extra cylinders.
num_samples = 25
normalCyl = df[df['CylBin'] == 'Normal'].sample(n=num_samples)
extraCyl = df[df['CylBin'] == 'Extra'].sample(n=num_samples)
#set x range
N = 100
p = np.linspace(-2, 3, num=N)
pp = ss.norm.pdf(p, loc= normalCyl['symboling'].mean(), scale=1000)
pp = pp / pp.sum()
like_normal = comp_like(p, normalCyl['symboling'])
post_normal = posterior(pp, like_normal)
pp = ss.norm.pdf(p, loc= extraCyl['symboling'].mean(), scale=1000)
pp = pp / pp.sum()
like_extra = comp_like(p, extraCyl['symboling'])
post_extra = posterior(pp, like_extra)
plt.subplot(2,1,1)
plot_ci(p, post_normal, num_samples, lower_q=.025, upper_q=.975, xLab = 'Normal Cylinder Insurance Rating')
plt.subplot(2,1,2)
plot_ci(p, post_extra, num_samples, lower_q=.025, upper_q=.975, xLab = 'Extra Cylinder Insurance Rating')
plt.tight_layout()
Looking at the above graph we can see that the normal and extra cylinder cars were generally considered to be just as risky as each other as far as insurance standards. The 95% credible intervals overlap significantly showing that both distributions were drawn from similair populations.
Now let's create bayesian models of normalized loss distributions for Risky and Not Risky vehicles.
num_samples = 50
risky = dfNorm[dfNorm['InsBin'] == 'Risky'].sample(n=num_samples)
nRisky = dfNorm[dfNorm['InsBin'] == 'Not Risky'].sample(n=num_samples)
#set x range
N = 1000
p = np.linspace(70, 180, num=N)
pp = ss.norm.pdf(p, loc= risky['normalized-losses'].mean(), scale=1000)
pp = pp / pp.sum()
like_risky = comp_like(p, risky['normalized-losses'])
post_risky = posterior(pp, like_risky)
pp = ss.norm.pdf(p, loc= nRisky['normalized-losses'].mean(), scale=1000)
pp = pp / pp.sum()
like_nRisky = comp_like(p, nRisky['normalized-losses'])
post_nRisky = posterior(pp, like_nRisky)
plt.subplot(2,1,1)
plot_ci(p, post_risky, num_samples, lower_q=.025, upper_q=.975, xLab = 'Risky Normalized Losses')
plt.subplot(2,1,2)
plot_ci(p, post_nRisky, num_samples, lower_q=.025, upper_q=.975, xLab = 'Not Risky Normalized Losses')
plt.tight_layout()
Looking at the above graph we can see that the risky cars cost slightly more in normalized losses for the insurance companies. The 95% credible intervals do not overlap between the two distributions showing significance in their differences.
Now let's create bayesian models of city mpg distributions for Risky and Not Risky vehicles.
num_samples = 50
risky = df[df['InsBin'] == 'Risky'].sample(n=num_samples)
nRisky = df[df['InsBin'] == 'Not Risky'].sample(n=num_samples)
#set x range
N = 1000
p = np.linspace(18, 33, num=N)
pp = ss.norm.pdf(p, loc= risky['city-mpg'].mean(), scale=1000)
pp = pp / pp.sum()
like_risky = comp_like(p, risky['city-mpg'])
post_risky = posterior(pp, like_risky)
pp = ss.norm.pdf(p, loc= nRisky['city-mpg'].mean(), scale=1000)
pp = pp / pp.sum()
like_nRisky = comp_like(p, nRisky['city-mpg'])
post_nRisky = posterior(pp, like_nRisky)
plt.subplot(2,1,1)
plot_ci(p, post_risky, num_samples, lower_q=.025, upper_q=.975, xLab = 'Risky City MPG')
plt.subplot(2,1,2)
plot_ci(p, post_nRisky, num_samples, lower_q=.025, upper_q=.975, xLab = 'Not Risky City MPG')
plt.tight_layout()
Looking at the above graph we can see that the Risky and Not Risky cars generally have similiar city MPG. The 95% credible intervals overlap significantly showing that both distributions were drawn from similair populations.
Finally let's create bayesian models of city mpg distributions for Normal and Extra cylinder vehicles.
num_samples = 25
normalCyl = df[df['CylBin'] == 'Normal'].sample(n=num_samples)
extraCyl = df[df['CylBin'] == 'Extra'].sample(n=num_samples)
#set x range
N = 1000
p = np.linspace(15, 33, num=N)
pp = ss.norm.pdf(p, loc= normalCyl['city-mpg'].mean(), scale=1000)
pp = pp / pp.sum() # normalize
like_normal = comp_like(p, normalCyl['city-mpg'])
post_normal = posterior(pp, like_normal)
pp = ss.norm.pdf(p, loc= extraCyl['city-mpg'].mean(), scale=1000)
pp = pp / pp.sum()
like_extra = comp_like(p, extraCyl['city-mpg'])
post_extra = posterior(pp, like_extra)
plt.subplot(2,1,1)
plot_ci(p, post_normal, num_samples, lower_q=.025, upper_q=.975, xLab = 'Normal Cylinder City MPG')
plt.subplot(2,1,2)
plot_ci(p, post_extra, num_samples, lower_q=.025, upper_q=.975, xLab = 'Extra Cylinder City MPG')
plt.tight_layout()
We can see from above that the 95% credible intervals of Normal and Extra cylinders do not overlap and are not even close to each other. It is safe to assume that Normal Cylinder cars in general have better City MPG than extra cylinder cars.
We've had quite the journey. We started out by comparing our price and log price distributions to normal distributions to see that the log price was slightly more normally distributed. Then we stratified and compared fuel types, aspiration and rear vs front wheel drive to log price, Insurance Bins and Cylinder bins to curb weight, Cylinder bins to risk symboling and Cylinder bins to Normalized losses. We did this both graphically and with the t test.
We then performed Tukey HSD and ANOVA tests on some of our categorical columns. We found many different interesting relationships there and learned which relationships had similiar price distributions and which probably came from different distributions. We did this by looking at the F-statistic, the p values and the 95% confidence intervals. We also did this by looking at the Tukey HSD plot and by looking at histogram comparasions with means and 95% confidence intervals plotted.
We then bootstrapped the means and mean differences for Insurance Bins, Cylinder Bins, Fuel type and Drive wheels to horsepower and fuel type, and drive wheels to city mpg and plotted them.
Finally we created and analyzed Bayesian price models for Insurance bins and Cylinder bins. We created and analyzed bayesian city mpg models for Insurance bins and cylinder bins. And finally we also created bayesian models of normalized losses for Insurance bins and Risk symboling for cylinder bins. We analyzed all the different posterior density's of the above relationships with their credible intervals and determined relationship significance.