Hypothesis Simulation

Abstract

A large-item auction site client is getting ready to acquire a smaller automobile auction company primarily dealing with classic cars and is interested in understanding how the price relates to certain automobile characteristics.

First let's import our libraries and define some auxiliary functions.

In [1]:
#Import libraries
import pandas as pd
import numpy as np
import math
import seaborn as sns
from sklearn.preprocessing import scale
import scipy.stats as ss
import scipy
import statsmodels.stats.weightstats as ws
import matplotlib.pyplot as plt
from scipy.stats import kstest 
from statistics import mean
from statsmodels.stats.multicomp import pairwise_tukeyhsd

Auxiliary Functions

Let's create some functions that we will use in our analysis below.

In [2]:
#Create getLog Function to create the log(price) column
def getLog(price):
     return math.log(price)
In [3]:
%matplotlib inline
def plot_cums(dist1, dist2): 
    ## sort the first data sample:
    data_sorted = np.sort(dist1)
    # calculate the proportional values of samples
    p = 1. * np.arange(len(dist1)) / (len(dist1) - 1)

    # Now plot as line and scatter plot. 
    plt.plot(data_sorted, p)
    plt.scatter(data_sorted, p, color = 'red')
    
    # sort the seond data sample:
    data_sorted = np.sort(dist2)
    
    # And the second plot
    plt.plot(data_sorted, p, color = 'yellow')
    plt.scatter(data_sorted, p, color = 'green')
In [4]:
def ks_test(data, dist = 'norm'):
    ks_stat, pvalue = kstest(data, dist) 
    print('KS-statistic = ' + str(ks_stat)) 
    print('P-value = ' + str(pvalue)) 
In [5]:
def stratify(df,p,column, depVar):
    groupedCounts = df.loc[:,column].value_counts() 
    groupedIndexes = groupedCounts.index.tolist() 
    maxNums = groupedCounts.min() 
    wantNums = int(p* groupedCounts.sum()) 
    if wantNums <= maxNums: 
        item1Loc = df.loc[:, column] == groupedIndexes[0] 
        item2Loc = df.loc[:, column] == groupedIndexes[1] 
    else: #p is to large
        pmax = maxNums / df.shape[0]
        raise Exception('The maximum value of p = ' + str(pmax))
    item1Data = df.loc[item1Loc] 
    item2Data = df.loc[item2Loc] 
    item1Samples = item1Data.sample(n=wantNums) 
    item2Samples = item2Data.sample(n=wantNums) 
    type1 = item1Samples.loc[:, [column,depVar]] 
    type2 = item2Samples.loc[:, [column,depVar]] 
    totalTypes = pd.concat([type1,type2]) 
    return type1, type2, totalTypes 

We've defined a few functions including a stratify function, ks test function, plot cumaltive distribution function and a function to add log price to our data frame.

Next let's define our t_test function. This function will return our degrees of freedom, the difference of mean, the t-statistic, p-value, and high and low confidence intervals.

In [6]:
def t_test(type1, type2, depVar, p):
    a = type1.loc[:,depVar] 
    b = type2.loc[:,depVar] 
    diff = a.mean() - b.mean() 

    res = ss.ttest_ind(a, b) 
      
    means = ws.CompareMeans(ws.DescrStatsW(a), ws.DescrStatsW(b)) 
    confint = means.tconfint_diff(alpha=p, alternative='two-sided', usevar='unequal') 
    degfree = means.dof_satt() 

    index = ['DegFreedom', 'Difference', 'Statistic', 'PValue', 'Low95CI', 'High95CI'] 
    return pd.Series([degfree, diff, res[0], res[1], confint[0], confint[1]], index = index)   

Now let's create our plothist function. This will create a subplot of two histograms that share an x and y axis. This will also graph the 95% confidence intervals for the first histogram with dashed vertical lines, and means for the first and second histogram with a solid vertical line.

In [7]:
def plotHist(type1, type2, depVar, p):
    a = type1.loc[:,depVar] 
    b = type2.loc[:,depVar] 
    means = ws.CompareMeans(ws.DescrStatsW(a), ws.DescrStatsW(b))
    confint = means.tconfint_diff(alpha=p, alternative='two-sided', usevar='unequal')  
    diff = a.mean() - b.mean() 
    ax1 = plt.subplot(211) 
    upper = mean(a) + confint[1] - diff 
    lower = mean(a) + confint[0] - diff 
    plt.hist(a, label = type1.iloc[0,0],bins=10, color='blue', density=True)
    plt.legend() 
    plt.axvline(mean(a), color = 'red') 
    plt.axvline(upper, color = 'red', linestyle='--') 
    plt.axvline(lower, color = 'red', linestyle='--')
    plt.ylabel('Frequency') 
    ax2 = plt.subplot(212, sharex=ax1, sharey=ax1) 
    plt.hist(b, label = type2.iloc[0,0],bins=10, color='green', density=True) 
    plt.legend() 
    plt.axvline(mean(b), color = 'red')
    plt.ylabel('Frequency') 
    plt.xlabel('Value') 

Now let's create a function that will create a data frame for each unique value in a given column in our data frame. It will return all data frames in a list along with a variable containing all unique values found in the given column. We will also create a cleanDF function that will take our data frame and extract only the specified column.

In [8]:
def createDfs(df,  column):
     uniquevals = df[column].unique().tolist() 
     allDf = [] 
     for vals in uniquevals:
         allDf.append(vals) 
     for i in range(len(uniquevals)): 
          strvals = str(uniquevals[i]) 
          allDf[i] = df.copy(deep=True) 
          toDrop = allDf[i].loc[allDf[i][column] != strvals, column] 
          toDropIndex = toDrop.index.tolist()
          allDf[i].drop(toDropIndex, axis=0, inplace=True) 
     
     return uniquevals, allDf 

def cleanDf(df, depVar):
     df = df.loc[:, depVar] 
     return df 

Load and Prepare Data

Now that our auxiliary functions are defined let's load the data. We will create 2 data frames to use for different analysis. We will change our numeric columns to numeric and remove all rows of missing data. In dfNorm we will be removing all missing data from the normalized losses column along with the other changes made in the original df. Our cleaned up original Df will have 193 rows, and the dfNorm will have 160 rows.

In [10]:
#Load the data set and save it as a df
df = pd.read_csv(fileName,header = 'infer')
dfNorm = pd.read_csv(fileName,header = 'infer')
In [11]:
#Set columns with missing values to fix
cols = ['price', 'bore', 'stroke', 
        'horsepower', 'peak-rpm']
cols2 = ['price', 'bore', 'stroke', 
        'horsepower', 'peak-rpm', 'normalized-losses']

#Coerce columns to numeric and put NaN's for unknown values
for i in cols:
     df.loc[:,i] = pd.to_numeric(df.loc[:,i], errors = 'coerce')

for i in cols2:
     dfNorm.loc[:,i] = pd.to_numeric(dfNorm.loc[:,i], errors = 'coerce')

#Drop all columns with NaN values
df.dropna(axis=0, inplace=True)
dfNorm.dropna(axis=0, inplace=True)
In [12]:
#Cleanup ? from Num of Doors
column = 'num-of-doors' 
toDrop = df.loc[df[column] == '?', column] 
toDropIndex = toDrop.index.tolist() 
df.drop(toDropIndex, axis = 0, inplace = True) 

Now let's add the log price and standardized prices columns to our data frames.

In [13]:
#Add log price column. Use get log function to apply it to the data frame
df['log_price'] = df['price'].apply(getLog)
dfNorm['log_price'] = dfNorm['price'].apply(getLog)
In [14]:
#Standardize our price and log price and save to dfs
df['priceS'] = scale(df.loc[:,'price'])
df['logPS']= scale(df.loc[:,'log_price'])
dfNorm['priceS'] = scale(dfNorm.loc[:,'price'])
dfNorm['logPS']= scale(dfNorm.loc[:,'log_price'])

Now that our data is loaded and prepared let's take a look at the data and some statistics of the data to make sure everything has been loaded correctly and to get a feel for our data.

In [15]:
#Let's look at the top of our cleaned up data frame 
pd.set_option('display.max_columns', 500)
print(df.head())
   symboling normalized-losses         make fuel-type aspiration num-of-doors  \
0          3                 ?  alfa-romero       gas        std          two   
1          3                 ?  alfa-romero       gas        std          two   
2          1                 ?  alfa-romero       gas        std          two   
3          2               164         audi       gas        std         four   
4          2               164         audi       gas        std         four   

    body-style drive-wheels engine-location  wheel-base  length  width  \
0  convertible          rwd           front        88.6   168.8   64.1   
1  convertible          rwd           front        88.6   168.8   64.1   
2    hatchback          rwd           front        94.5   171.2   65.5   
3        sedan          fwd           front        99.8   176.6   66.2   
4        sedan          4wd           front        99.4   176.6   66.4   

   height  curb-weight engine-type num-of-cylinders  engine-size fuel-system  \
0    48.8         2548        dohc             four          130        mpfi   
1    48.8         2548        dohc             four          130        mpfi   
2    52.4         2823        ohcv              six          152        mpfi   
3    54.3         2337         ohc             four          109        mpfi   
4    54.3         2824         ohc             five          136        mpfi   

   bore  stroke  compression-ratio  horsepower  peak-rpm  city-mpg  \
0  3.47    2.68                9.0       111.0    5000.0        21   
1  3.47    2.68                9.0       111.0    5000.0        21   
2  2.68    3.47                9.0       154.0    5000.0        19   
3  3.19    3.40               10.0       102.0    5500.0        24   
4  3.19    3.40                8.0       115.0    5500.0        18   

   highway-mpg    price  log_price    priceS     logPS  
0           27  13495.0   9.510075  0.026025  0.311082  
1           27  16500.0   9.711116  0.398480  0.705417  
2           26  16500.0   9.711116  0.398480  0.705417  
3           30  13950.0   9.543235  0.082420  0.376125  
4           22  17450.0   9.767095  0.516227  0.815219  
In [16]:
#Let's print the dtypes to make sure everything looks right.
print(df.dtypes)
symboling              int64
normalized-losses     object
make                  object
fuel-type             object
aspiration            object
num-of-doors          object
body-style            object
drive-wheels          object
engine-location       object
wheel-base           float64
length               float64
width                float64
height               float64
curb-weight            int64
engine-type           object
num-of-cylinders      object
engine-size            int64
fuel-system           object
bore                 float64
stroke               float64
compression-ratio    float64
horsepower           float64
peak-rpm             float64
city-mpg               int64
highway-mpg            int64
price                float64
log_price            float64
priceS               float64
logPS                float64
dtype: object
In [17]:
#Let's print a few more statistics
print('Shape: ', df.shape)
print('Total Nulls: ',df.isnull().sum().sum()) #check for NaN's
print('Columns: ', df.columns)
Shape:  (193, 29)
Total Nulls:  0
Columns:  Index(['symboling', 'normalized-losses', 'make', 'fuel-type', 'aspiration',
       'num-of-doors', 'body-style', 'drive-wheels', 'engine-location',
       'wheel-base', 'length', 'width', 'height', 'curb-weight', 'engine-type',
       'num-of-cylinders', 'engine-size', 'fuel-system', 'bore', 'stroke',
       'compression-ratio', 'horsepower', 'peak-rpm', 'city-mpg',
       'highway-mpg', 'price', 'log_price', 'priceS', 'logPS'],
      dtype='object')
In [18]:
#Let's print the descriptive statistics 
df.describe()
Out[18]:
symboling wheel-base length width height curb-weight engine-size bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg price log_price priceS logPS
count 193.000000 193.000000 193.000000 193.000000 193.000000 193.000000 193.000000 193.000000 193.000000 193.000000 193.000000 193.000000 193.000000 193.000000 193.000000 193.000000 1.930000e+02 1.930000e+02
mean 0.797927 98.923834 174.326425 65.893782 53.869948 2561.507772 128.124352 3.330622 3.248860 10.143627 103.481865 5099.740933 25.326425 30.787565 13285.025907 9.351478 5.867500e-17 2.575948e-15
std 1.235582 6.152409 12.478593 2.137795 2.394770 526.700026 41.590452 0.272385 0.315421 3.977491 37.960107 468.694369 6.387828 6.816910 8089.082886 0.511149 1.002601e+00 1.002601e+00
min -2.000000 86.600000 141.100000 60.300000 47.800000 1488.000000 61.000000 2.540000 2.070000 7.000000 48.000000 4150.000000 13.000000 16.000000 5118.000000 8.540519 -1.012261e+00 -1.590668e+00
25% 0.000000 94.500000 166.300000 64.100000 52.000000 2145.000000 98.000000 3.150000 3.110000 8.500000 70.000000 4800.000000 19.000000 25.000000 7738.000000 8.953899 -6.875257e-01 -7.798385e-01
50% 1.000000 97.000000 173.200000 65.400000 54.100000 2414.000000 120.000000 3.310000 3.290000 9.000000 95.000000 5100.000000 25.000000 30.000000 10245.000000 9.234545 -3.767958e-01 -2.293598e-01
75% 2.000000 102.400000 184.600000 66.900000 55.700000 2952.000000 146.000000 3.590000 3.410000 9.400000 116.000000 5500.000000 30.000000 34.000000 16515.000000 9.712024 4.003389e-01 7.071996e-01
max 3.000000 120.900000 208.100000 72.000000 59.800000 4066.000000 326.000000 3.940000 4.170000 23.000000 262.000000 6600.000000 49.000000 54.000000 45400.000000 10.723267 3.980488e+00 2.690719e+00

Now let's look at all the value counts or unique values for our categories to get a further idea of the data we are working with, and to gather ideas for our analysis process.

In [19]:
col = 'symboling'
print("\n", col, ":\n", df.loc[:,col].value_counts())
col = 'make'
print("\n", col, ":\n", df.loc[:,col].value_counts())
col = 'num-of-doors'
print("\n", col, ":\n", df.loc[:,col].value_counts())
col = 'engine-location'
print("\n", col, ":\n", df.loc[:,col].value_counts())
col = 'engine-type'
print("\n", col, ":\n", df.loc[:,col].value_counts())
col = 'num-of-cylinders'
print("\n", col, ":\n", df.loc[:,col].value_counts())
col = 'fuel-system'
print("\n", col, ":\n", df.loc[:,col].value_counts())
col = 'fuel-type'
print("\n", col, ":\n", df.loc[:,col].value_counts())
col = 'aspiration'
print("\n", col, ":\n", df.loc[:,col].value_counts())
col = 'drive-wheels'
print("\n", col, ":\n", df.loc[:,col].value_counts())
col = 'body-style'
print("\n", col, ":\n", df.loc[:,col].value_counts())
col = 'engine-size'
print("\n", col, ":\n", df.loc[:,col].unique())
col = 'normalized-losses'
print("\n", col, ":\n", dfNorm.loc[:,col].unique())
col = 'city-mpg'
print("\n", col, ":\n", df.loc[:,col].unique())
col = 'highway-mpg'
print("\n", col, ":\n", df.loc[:,col].unique())
col = 'horsepower'
print("\n", col, ":\n", df.loc[:,col].unique())
col = 'peak-rpm'
print("\n", col, ":\n", df.loc[:,col].unique())
col = 'curb-weight'
print("\n", col, ":\n", df.loc[:,col].unique())
 symboling :
  0    63
 1    51
 2    31
 3    23
-1    22
-2     3
Name: symboling, dtype: int64

 make :
 toyota           32
nissan           18
honda            13
mitsubishi       13
mazda            12
subaru           12
volkswagen       12
volvo            11
peugot           11
mercedes-benz     8
bmw               8
dodge             8
plymouth          7
audi              6
saab              6
porsche           4
alfa-romero       3
chevrolet         3
jaguar            3
isuzu             2
mercury           1
Name: make, dtype: int64

 num-of-doors :
 four    112
two      81
Name: num-of-doors, dtype: int64

 engine-location :
 front    190
rear       3
Name: engine-location, dtype: int64

 engine-type :
 ohc     141
ohcf     15
ohcv     13
dohc     12
l        12
Name: engine-type, dtype: int64

 num-of-cylinders :
 four      153
six        24
five       10
eight       4
twelve      1
three       1
Name: num-of-cylinders, dtype: int64

 fuel-system :
 mpfi    88
2bbl    64
idi     19
1bbl    11
spdi     9
spfi     1
mfi      1
Name: fuel-system, dtype: int64

 fuel-type :
 gas       174
diesel     19
Name: fuel-type, dtype: int64

 aspiration :
 std      158
turbo     35
Name: aspiration, dtype: int64

 drive-wheels :
 fwd    114
rwd     71
4wd      8
Name: drive-wheels, dtype: int64

 body-style :
 sedan          92
hatchback      63
wagon          24
hardtop         8
convertible     6
Name: body-style, dtype: int64

 engine-size :
 [130 152 109 136 131 108 164 209  61  90  98 122 156  92  79 110 111 119
 258 326  91 140 134 183 234 308 304  97 103 120 181 151 194 121 146 171
 161 141 173 145]

 normalized-losses :
 [164. 158. 192. 188. 121.  98.  81. 118. 148. 110. 145. 137. 101.  78.
 106.  85. 107. 104. 113. 129. 115.  93. 142. 161. 153. 125. 128. 122.
 103. 168. 108. 194. 231. 119. 154.  74. 186. 150.  83. 102.  89.  87.
  77.  91. 134.  65. 197.  90.  94. 256.  95.]

 city-mpg :
 [21 19 24 18 17 23 20 16 15 47 38 37 31 49 30 27 25 13 26 22 14 45 28 32
 35 34 29 33]

 highway-mpg :
 [27 26 30 22 25 20 29 28 53 43 41 38 24 54 42 34 33 31 19 17 32 39 18 16
 37 50 23 36 47 46]

 horsepower :
 [111. 154. 102. 115. 110. 140. 101. 121. 182.  48.  70.  68.  88. 145.
  58.  76.  60.  86. 100.  78.  90. 176. 262.  84. 120.  72. 123. 155.
 184. 175. 116.  69.  55.  97. 152. 160. 200.  95. 142. 143. 207.  73.
  82.  94.  62.  56. 112.  92. 161. 156.  52.  85. 114. 162. 134. 106.]

 peak-rpm :
 [5000. 5500. 5800. 4250. 5400. 5100. 4800. 6000. 4750. 4200. 4350. 4500.
 5200. 4150. 5600. 5900. 5250. 4900. 4400. 6600. 5300.]

 curb-weight :
 [2548 2823 2337 2824 2507 2844 2954 3086 2395 2710 2765 3055 3230 3380
 3505 1488 1874 1909 1876 2128 1967 1989 2535 2811 1713 1819 1837 1940
 1956 2010 2024 2236 2289 2304 2372 2465 2293 2734 4066 3950 1890 1900
 1905 1945 1950 2385 2410 2425 2670 2700 3515 3750 3495 3770 3740 3685
 3900 3715 2910 1918 1944 2004 2145 2370 2328 2833 2921 2926 2365 2405
 2403 1889 2017 1938 1951 2028 1971 2037 2008 2324 2302 3095 3296 3060
 3071 3139 3020 3197 3430 3075 3252 3285 3485 3130 2191 2818 2778 2756
 2800 2658 2695 2707 2758 2808 2847 2050 2120 2240 2190 2340 2510 2290
 2455 2420 2650 1985 2040 2015 2280 3110 2081 2109 2275 2094 2122 2140
 2169 2204 2265 2300 2540 2536 2551 2679 2714 2975 2326 2480 2414 2458
 2976 3016 3131 3151 2261 2209 2264 2212 2319 2254 2221 2661 2579 2563
 2912 3034 2935 3042 3045 3157 2952 3049 3012 3217 3062]

Let's create some Bins for our cylinders and insurance ratings. We will bin cars with a risk rating of 0 or less as Not Risky and those with a rating of 1 or more as Risky. We will consider cars with four or less cylinders as Normal and those with over four cylinders as Extra.

In [20]:
#Function to bin insurance risks
def binIns(symbol):
    if symbol < 1:
        return 'Not Risky'
    elif symbol > 0:
        return 'Risky'
In [21]:
#Function to bin cylinders
def binCyls(cylinders):
    if cylinders == 'four':
        return 'Normal'
    elif cylinders == 'three':
        return 'Normal'   
    elif cylinders == 'five':
        return 'Extra'   
    elif cylinders == 'six':
        return 'Extra'
    elif cylinders == 'eight':
        return 'Extra'
    elif cylinders == 'twelve':
        return 'Extra'   
In [22]:
df['InsBin'] = df['symboling'].apply(binIns) #bin symboling column using our function and save in df
df['CylBin'] = df['num-of-cylinders'].apply(binCyls) #bin cylinder column using our function and save in df
dfNorm['InsBin'] = dfNorm['symboling'].apply(binIns) #bin symboling column using our function and save in df
dfNorm['CylBin'] = dfNorm['num-of-cylinders'].apply(binCyls) #bin cylinder column using our function and save in df

Distribution Analysis

Now that our data is ready let's start with our first test. We will compare and test the Normality of the distributions of price and log price.

Let's plot the distributions of price and log price.

In [23]:
#Plot price vs log price using our function
plot_cums(df.loc[:,'price'], df.loc[:,'log_price'])    

As we can see above these two distributions when plotted don't provide us a great plot. This is because log price naturally has much smaller values than price. Let's standardize price and log(price) so that we can plot them more effectively against each other.

In [24]:
#Plot our standardize price columns
plot_cums(df['priceS'], df['logPS'])  

Now that they are standardized we can see that both log(price) and price have similiar distributions. It is hard to see if these look normal. Now let's create a function that will perform our kstest and return the KS statistic along with the P value. The function will test the distribution of our given data versus that of a normal distribution.

Next let's run our KS test for our standardized price and standardized log price against a normal distribution and compare their statistics and p values.

In [25]:
ks_test(df['priceS'])  
KS-statistic = 0.15570653643875398
P-value = 0.00014924310941233365
In [26]:
ks_test(df['logPS']) 
KS-statistic = 0.10302687448682402
P-value = 0.030871835339169262

Judging by the statistics above, these distributions are not really normal. Their k-statistics are not too large but they both possess low p-values, meaning we must reject our null hypothesis that they are similiar to a normal distribution. However, we can see that the log price column is more normally distributed as it has a smaller KS statistic and a larger p-value than the normal price.

Let's look at some Kernel Density Estimate plots of some of our variables to get a feel for their distributions.

In [27]:
fig = plt.figure(figsize=(6,6)) 
ax = fig.gca() 
sns.set_style("whitegrid")
sns.kdeplot(df.loc[:, 'log_price'], ax = ax)
ax.set_title('KDE plot of log auto price') 
ax.set_xlabel('Log Auto price') 
ax.set_ylabel('Density')
Out[27]:
Text(0,0.5,'Density')
In [28]:
fig = plt.figure(figsize=(6,6)) 
ax = fig.gca()  
sns.set_style("whitegrid")
sns.kdeplot(df.loc[:, 'price'], ax = ax)
ax.set_title('KDE plot of auto price') 
ax.set_xlabel('Auto price') 
ax.set_ylabel('Density')
Out[28]:
Text(0,0.5,'Density')

The price distributions are clearly right skewed with the mode being around $8,000. The log price distribution is slightly less skewed but it is still clearly right skewed.

In [29]:
fig = plt.figure(figsize=(6,6)) 
ax = fig.gca() 
sns.set_style("whitegrid")
sns.kdeplot(df.loc[:, 'curb-weight'], ax = ax)
ax.set_title('KDE plot of curb weight') 
ax.set_xlabel('Curb Weight') 
ax.set_ylabel('Density')
Out[29]:
Text(0,0.5,'Density')
In [30]:
fig = plt.figure(figsize=(6,6)) 
ax = fig.gca() 
sns.set_style("whitegrid")
sns.kdeplot(df.loc[:, 'symboling'], ax = ax)
ax.set_title('KDE plot of symboling') 
ax.set_xlabel('Symboling') 
ax.set_ylabel('Density')
Out[30]:
Text(0,0.5,'Density')

We can see above that the insurance ratings are actually left skewed. This shows that most of the cars are more risky to the insurance companies; however the mode is at 0 so the highest count falls at normal risk.

In [31]:
fig = plt.figure(figsize=(6,6)) 
ax = fig.gca() 
sns.set_style("whitegrid")
sns.kdeplot(dfNorm.loc[:, 'normalized-losses'], ax = ax)
ax.set_title('KDE plot of Normalized Losses') 
ax.set_xlabel('Normalized Losses')
ax.set_ylabel('Density')
Out[31]:
Text(0,0.5,'Density')

The above shows the distribution of our normalized losses. The mode is around 100 and the distribution is slightly skewed right. It is interesting to note that even though most cars are deemed risky their losses seem to be minimal.

In [32]:
fig = plt.figure(figsize=(6,6)) 
ax = fig.gca() 
sns.set_style("whitegrid")
sns.kdeplot(df.loc[:, 'horsepower'], ax = ax)
ax.set_title('KDE plot of Horsepower') 
ax.set_xlabel('Horsepower') 
ax.set_ylabel('Density')
Out[32]:
Text(0,0.5,'Density')

The horsepower distribution has a slighly similiar distribution as log price and price which makes sense as generally more horsepower on a car will cost more.

Now let's begin exploring the relationships between categories.

T-Test and Histograms with Confidence Intervals

Let's test significance of some variables using both graphical methods and the t-test.

Let's begin by stratifying our fuel-type column. We will then perform a t-test on the two fuel types and print our test statistics.

In [33]:
p = .05 #set our probability for reject hypothesis, our cutoff confidence
column = 'fuel-type' 
depVar = 'log_price'
type1, type2, totalTypes = stratify(df = df, p = p, column = column, depVar = depVar) 
testStats = t_test(type1 = type1,type2 = type2, depVar = depVar, p = p) 
print(testStats)
DegFreedom    15.125833
Difference    -0.009848
Statistic     -0.041437
PValue         0.967460
Low95CI       -0.516031
High95CI       0.496336
dtype: float64

Looking at our t_test statistics our 95% confidence interval includes 0 and our p-value is large. Looking at these statistics we must accept our null hypothesis that the difference in means due to fuel type are from random variation alone. Let's look at the data graphically and see if it agrees with out what out t-test told us.

Now let's plot the histogram of our two fuel types.

In [34]:
plotHist(type1 = type1, type2 = type2, depVar=depVar, p = p) 

Looking at the histogram it is clear that the difference in means between the two categories is due to random variation alone. The mean of the diesel histogram is inside the 95% confidence interval of the gas histogram. This combined with our t_test statistics is proof enough that we cannot reject the null hypothesis that the difference in means is from random variation. This means that the fuel type is not a statistically significant indicator of price.

Next let's look at the importance of aspiration on price.

In [35]:
p = .05 #set our p
column = 'aspiration' 
depVar = 'log_price'
type1, type2, totalTypes = stratify(df = df, p = p, column = column, depVar = depVar) 
testStats = t_test(type1 = type1,type2 = type2, depVar=depVar, p = p) 
print(testStats) 
DegFreedom    14.446503
Difference    -0.312368
Statistic     -1.393047
PValue         0.182660
Low95CI       -0.791911
High95CI       0.167175
dtype: float64

Looking at our t-test statistics our 95% confidence interval includes 0 and our p-value is large. Looking at these statistics we must accept our null hypothesis that the difference in means due to aspiration are from random variation alone. Let's look at the data graphically and see if it agrees with what our t-test told us.

Now let's plot the histogram of our two aspiration types.

In [36]:
plotHist(type1 = type1, type2 = type2, depVar= depVar, p = p)

Looking at the histogram it seems that the difference in means between the two categories is due to random variation alone. The mean of the turbo histogram is inside the 95% confidence interval of the standard histogram. This combined with our t-test statistics is proof enough that we cannot reject the null hypothesis that the difference in means is from random variation. This means that the aspiration type is not a statistically significant indicator of price.

Now let's look at the importance of rear or front wheel drive on price. First let's create a new df and remove 4wd cars from the drive-wheels column. Then let's stratify our new df and do our t test.

In [37]:
dfdoors = df.copy(deep = True) 
column = 'drive-wheels'
depVar = 'log_price'
wd4 = dfdoors.loc[dfdoors[column] == '4wd',column] 
wd4Index = wd4.index.tolist() 
dfdoors.drop(wd4Index, axis = 0, inplace = True) 
In [38]:
p = .05
column = 'drive-wheels' 
depVar = 'log_price'
type1, type2, totalTypes = stratify(df = dfdoors, p = p, column = column, depVar = depVar)
testStats = t_test(type1 = type1,type2 = type2, depVar=depVar, p = p)
print(testStats)
DegFreedom    15.189520
Difference    -0.909287
Statistic     -4.985125
PValue         0.000135
Low95CI       -1.297641
High95CI      -0.520932
dtype: float64

Looking at our t-test statistics our 95% confidence interval doesn't include 0 and our p-value is very small, well below our .05 cutoff. Looking at these statistics we must reject our null hypothesis that the difference in means due to front or rear wheel drive are from random variation alone. Let's look at the data graphically and see if it agrees with what our t_test told us.

Now let's plot the histogram of our two drive wheel types.

In [39]:
plotHist(type1 = type1, type2 = type2, depVar=depVar, p = p)

Looking at the histogram it seems that these two distributions are in fact different. The mean of the rear wheel drive histogram is outside the 95% confidence interval of the front wheel drive histogram. This combined with our t-test statistics is proof enough that we can reject the null hypothesis that the difference in means is from random variation. This means that the drive wheel type is a statistically significant indicator of price.

Now let's compare the Insurance bins to the curb weight. This will show us if not risky cars are heavier than risky cars or vice versa.

In [40]:
p = .05 #set our probability for reject hypothesis, our cutoff confidence
column = 'InsBin'
depVar = 'curb-weight'
type1, type2, totalTypes = stratify(df = df, p = p, column = column, depVar = depVar) 
testStats = t_test(type1 = type1,type2 = type2, depVar = depVar, p = p) 
print(testStats) s
DegFreedom     15.991805
Difference    -28.000000
Statistic      -0.121272
PValue          0.904986
Low95CI      -517.476404
High95CI      461.476404
dtype: float64

Looking at our t_test statistics our 95% confidence interval did not include 0 and our p-value is small. Looking at these statistics we must reject our null hypothesis that the difference in mean curb weight due to risk are from random variation alone. Let's look at the data graphically and see if it agrees with out what out t-test told us.

Let's plot the histogram of our two risk types.

In [41]:
plotHist(type1 = type1, type2 = type2, depVar=depVar, p = p) 

Looking at the histogram the difference in means between the two categories is not due to random variation alone. The mean of the not risky histogram is outside the 95% confidence interval of the risky histogram. This combined with our t_test statistics is proof enough that we can reject the null hypothesis that the difference in means is from random variation. This means that the riskyness is a statistically significant indicator of curb weight.

Now let's look at our cylinder bin compared to curb weight.

In [42]:
p = .05 #set our probability for reject hypothesis, our cutoff confidence
column = 'CylBin' 
depVar = 'curb-weight'
type1, type2, totalTypes = stratify(df = df, p = p, column = column, depVar = depVar) 
testStats = t_test(type1 = type1,type2 = type2, depVar = depVar, p = p) 
print(testStats) 
DegFreedom      15.943009
Difference    -715.444444
Statistic       -3.589450
PValue           0.002453
Low95CI      -1138.104021
High95CI      -292.784868
dtype: float64

Looking at our t-test statistics our 95% confidence interval doesn't include 0 and our p-value is very small, well below our .05 cutoff. Looking at these statistics we must reject our null hypothesis that the difference in means due to number of cylinders are from random variation alone. Let's look at the data graphically and see if it agrees with what our t_test told us.

Now let's plot the histogram of our two cylinder bins.

In [43]:
plotHist(type1 = type1, type2 = type2, depVar=depVar, p = p) #plot histogram with means and CI

Looking at the histogram it seems that these two distributions are in fact different. The mean of the extra cylinders histogram is outside the 95% confidence interval of the normal cylinder histogram. This combined with our t-test statistics is proof enough that we can reject the null hypothesis that the difference in means is from random variation. This means that the number of cylinders is a statistically significant indicator of curb weight. Cars with more cylinders generally weigh more.

Now let's look at the cylinder bins in relation to the insurance risk ratings.

In [44]:
p = .05 #set our probability for reject hypothesis, our cutoff confidence
column = 'CylBin' 
depVar = 'symboling'
type1, type2, totalTypes = stratify(df = df, p = p, column = column, depVar = depVar)
testStats = t_test(type1 = type1,type2 = type2, depVar = depVar, p = p) 
print(testStats) 
DegFreedom    15.401785
Difference    -1.000000
Statistic     -1.537844
PValue         0.143627
Low95CI       -2.382856
High95CI       0.382856
dtype: float64

Looking at our t_test statistics our 95% confidence interval includes 0 and our p-value is large. Looking at these statistics we must accept our null hypothesis that the difference in means in insurance rating due to the number of cylinders are from random variation alone. Let's look at the data graphically and see if it agrees with out what out t-test told us.

Now let's plot the histogram of our two cylinder bins in relation to their risk ratings.

In [45]:
plotHist(type1 = type1, type2 = type2, depVar=depVar, p = p) 

Looking at the histogram it is clear that the difference in means between the two categories is due to random variation alone. The mean of the extra histogram is inside the 95% confidence interval of the normal histogram. This combined with our t_test statistics is proof enough that we cannot reject the null hypothesis that the difference in means is from random variation. This means that the number of cylinders is not a statistically significant indicator of risk rating. It is interesting to see that the extra cylinders bin has the same risk as average cars.

Now let's look at Cylinder Bins vs normalized losses.

In [46]:
p = .05 #set our probability for reject hypothesis, our cutoff confidence
column = 'CylBin' 
depVar = 'normalized-losses'
type1, type2, totalTypes = stratify(df = dfNorm, p = p, column = column, depVar = depVar) #let's stratify and save our dfs
testStats = t_test(type1 = type1,type2 = type2, depVar = depVar, p = p) #let's get our test stats
print(testStats) #print test stats
DegFreedom    11.495777
Difference   -14.375000
Statistic     -0.941219
PValue         0.362551
Low95CI      -47.814051
High95CI      19.064051
dtype: float64

Looking at our t_test statistics our 95% confidence interval includes 0 and our p-value is large. Looking at these statistics we must accept our null hypothesis that the difference in mean normalized losses due to number of cylinders are from random variation alone. Let's look at the data graphically and see if it agrees with out what out t-test told us.

In [47]:
plotHist(type1 = type1, type2 = type2, depVar=depVar, p = p) #plot histogram with means and CI

Looking at the histogram we can see that the difference in means between the two categories is due to random variation alone. The mean of the extra histogram is inside the 95% confidence interval of the normal histogram. This combined with our t_test statistics is proof enough that we cannot reject the null hypothesis that the difference in means is from random variation. This means that the number of cylinders is not a statistically significant indicator of normalized losses.

ANOVA and Tukey's HSD Test

Let's move on and Apply ANOVA and Tukey's HSD test to our data to compare the log price of autos stratified by body style using our createDF and cleanDf function. Let's start by creating our list of data frames for each body style. Let's also print how many unique data frames we have and the unique body styles.

In [48]:
column = 'body-style' 
depVar = 'log_price'
uniques, allDfs = createDfs(df,column) 
print(len(allDfs))
print(uniques)
5
['convertible', 'hatchback', 'sedan', 'wagon', 'hardtop']
In [49]:
#Unpack all dfs list to multiple different dfs 
df1 = cleanDf(df = allDfs[0], depVar = depVar) 
df2 = cleanDf(df = allDfs[1], depVar = depVar)  
df3 = cleanDf(df = allDfs[2], depVar = depVar) 
df4 = cleanDf(df = allDfs[3], depVar = depVar) 
df5 = cleanDf(df = allDfs[4], depVar = depVar)
In [50]:
#Plot box plot of all body style data frames
plt.boxplot([df1, df2, df3, df4, df5])
plt.ylabel('Value')
plt.xlabel('Variable')
plt.title('Box plot of variables')
Out[50]:
Text(0.5,1,'Box plot of variables')
In [51]:
f_statistic, p_value = ss.f_oneway(df1, df2, df3, df4, df5) 
print('F statistic = ' + str(f_statistic)) 
print('P-value = ' + str(p_value)) 
F statistic = 8.821980795225977
P-value = 1.5055219795547482e-06

The F-Statistic is fairly large and the p-value is small. We can reject the null hypothesis that the 5 variables have the same mean, as the probability of the differences arrising from random chance is quite low.

Let's perform the Tukey HSD test on body style as well.

In [52]:
TukeyDF = df.copy(deep=True)
TukeyDF = TukeyDF.loc[:,[column, 'log_price']]
In [53]:
Tukey_HSD = pairwise_tukeyhsd(TukeyDF['log_price'], TukeyDF[column]) 
print(Tukey_HSD)
  Multiple Comparison of Means - Tukey HSD,FWER=0.05 
=====================================================
   group1     group2  meandiff  lower   upper  reject
-----------------------------------------------------
convertible  hardtop  -0.0966  -0.8018  0.6085 False 
convertible hatchback -0.7854  -1.3432 -0.2276  True 
convertible   sedan   -0.4461  -0.9962  0.104  False 
convertible   wagon    -0.531  -1.1269  0.0649 False 
  hardtop   hatchback -0.6887  -1.1788 -0.1987  True 
  hardtop     sedan   -0.3495  -0.8307  0.1318 False 
  hardtop     wagon   -0.4344  -0.9674  0.0986 False 
 hatchback    sedan    0.3393   0.1258  0.5528  True 
 hatchback    wagon    0.2544  -0.0588  0.5675 False 
   sedan      wagon   -0.0849  -0.3842  0.2143 False 
-----------------------------------------------------

Looking at the Tukey HSD output above we can see that if two groups have a 95% confidence interval that includes 0 we cannot reject the Null hypothesis that the variation in means is due to random variation alone. Therefore we cannot reject the null hypothesis for the following relations: convertible/hardtop, convertible/sedan, convertible/wagon, hardtop/sedan, hardtop/wagon, hatchback/wagon, and sedan/wagon. This means relatively speaking these pairs prices would be drawn from similiar distributions. However, we can reject the null hypothesis for convertible/hatchback, hardtop/hatchback, and hatchback/sedan. This means relatively speaking these pairs prices are drawn from different distributions.

Let's look at these relations graphically.

In [54]:
Tukey_HSD.plot_simultaneous() #Tukey HSD Plot
Out[54]:

The graph above shows us what we saw above. For instance, hatchback has overlap in the interval with wagon only. That is why when comparing hatchback to wagon we had to accept the null hypothesis that they are similiar. However, when comparing hatchback with all other types their is no overlap and therefore we can reject the null hypothesis and we accept that the prices would be drawn from different distributions.

Now let's move on to our number of cylinders column. First let's drop our eight, three and twelve cylinder rows as we only have 4,1,and 1 of them respectively. We will not be able to draw good insights with such a low count. Then let's seperate each number of cylinder types into seperate dfs and perform the ANOVA test.

In [55]:
column = 'num-of-cylinders'
depVar = 'log_price'
dfCylinders = df.copy(deep = True)
toDrop = dfCylinders.loc[dfCylinders[column] == 'eight', column]
toDropIndex = toDrop.index.tolist()
dfCylinders.drop(toDropIndex, axis = 0, inplace = True)
toDrop = dfCylinders.loc[dfCylinders[column] == 'three', column]
toDropIndex = toDrop.index.tolist()
dfCylinders.drop(toDropIndex, axis = 0, inplace = True)
toDrop = dfCylinders.loc[dfCylinders[column] == 'twelve', column]
toDropIndex = toDrop.index.tolist()
dfCylinders.drop(toDropIndex, axis = 0, inplace = True)
In [56]:
uniques, allDfs = createDfs(dfCylinders,column) 
print(len(allDfs))
print(uniques)
3
['four', 'six', 'five']
In [57]:
df1 = cleanDf(df = allDfs[0], depVar = depVar) 
df2 = cleanDf(df = allDfs[1], depVar = depVar)  
df3 = cleanDf(df = allDfs[2], depVar = depVar) 
In [58]:
plt.boxplot([df1, df2, df3])
plt.ylabel('Value')
plt.xlabel('Variable')
plt.title('Box plot of variables')
Out[58]:
Text(0.5,1,'Box plot of variables')
In [59]:
f_statistic, p_value = ss.f_oneway(df1, df2, df3)
print('F statistic = ' + str(f_statistic))
print('P-value = ' + str(p_value))
F statistic = 73.1580287123013
P-value = 4.185126417218107e-24

The F-Statistic is really large and the p-value is small. We can reject the null hypothesis that the 3 variables have the same mean, as the probability of the differences arrising from random chance is quite low.

Let's perform the Tukey HSD test on number of cylinders as well.

In [60]:
TukeyDF = dfCylinders.copy(deep = True)
TukeyDF = TukeyDF.loc[:,[column, 'log_price']]
In [61]:
Tukey_HSD = pairwise_tukeyhsd(TukeyDF['log_price'], TukeyDF[column])
print(Tukey_HSD)
Multiple Comparison of Means - Tukey HSD,FWER=0.05
=============================================
group1 group2 meandiff  lower   upper  reject
---------------------------------------------
 five   four  -0.7861  -1.0616 -0.5105  True 
 five   six    0.0467  -0.2711  0.3644 False 
 four   six    0.8327   0.6474  1.0181  True 
---------------------------------------------

Looking at the Tukey HSD output above we know that if two groups have a 95% confidence interval that includes 0 we cannot reject the Null hypothesis that the variation in means is due to random variation alone. Therefore we cannot reject the null hypothesis for five/six cylinder relationship. This means relatively speaking these pairs prices would be drawn from similiar distributions. However, we can reject the null hypothesis for five/four and four/six cylinder relationships. This means relatively speaking these pairs prices are drawn from different distributions.

Let's look at these relations graphically.

In [62]:
Tukey_HSD.plot_simultaneous()
Out[62]:

The graph above shows us what we saw above. Four cylinder prices do not overlap with five or six therefore it is likely that they are drawn from a different distribution; however, five and six overlap and are therefore likely similiar.

Let's look at the engine types in relation to prices next.

In [63]:
column = 'engine-type' 
depVar = 'log_price'
uniques, allDfs = createDfs(df,column) 
print(len(allDfs))
print(uniques)
5
['dohc', 'ohcv', 'ohc', 'l', 'ohcf']
In [64]:
df1 = cleanDf(df = allDfs[0], depVar = depVar) 
df2 = cleanDf(df = allDfs[1], depVar = depVar)  
df3 = cleanDf(df = allDfs[2], depVar = depVar) 
df4 = cleanDf(df = allDfs[3], depVar = depVar) 
df5 = cleanDf(df = allDfs[4], depVar = depVar)
In [65]:
plt.boxplot([df1, df2, df3, df4, df5])
plt.ylabel('Value')
plt.xlabel('Variable')
plt.title('Box plot of variables')
Out[65]:
Text(0.5,1,'Box plot of variables')
In [66]:
f_statistic, p_value = ss.f_oneway(df1, df2, df3, df4, df5)
print('F statistic = ' + str(f_statistic))
print('P-value = ' + str(p_value))
F statistic = 11.60384058605387
P-value = 1.9261012780440077e-08

The F-Statistic is fairly large and the p-value is small. We can reject the null hypothesis that the 5 variables have the same mean, as the probability of the differences arrising from random chance is quite low.

Let's perform the Tukey HSD test on engine type as well.

In [67]:
TukeyDF = df.copy(deep=True)
TukeyDF = TukeyDF.loc[:,[column, 'log_price']]
In [68]:
Tukey_HSD = pairwise_tukeyhsd(TukeyDF['log_price'], TukeyDF[column])
print(Tukey_HSD)
Multiple Comparison of Means - Tukey HSD,FWER=0.05
=============================================
group1 group2 meandiff  lower   upper  reject
---------------------------------------------
 dohc    l    -0.1815  -0.7017  0.3387 False 
 dohc   ohc   -0.4846  -0.8678 -0.1014  True 
 dohc   ohcf  -0.4163  -0.9098  0.0772 False 
 dohc   ohcv   0.3087  -0.2014  0.8188 False 
  l     ohc   -0.3031  -0.6863  0.0801 False 
  l     ohcf  -0.2348  -0.7283  0.2587 False 
  l     ohcv   0.4902  -0.0199  1.0003 False 
 ohc    ohcf   0.0684  -0.2777  0.4144 False 
 ohc    ohcv   0.7933   0.424   1.1627  True 
 ohcf   ohcv   0.725    0.2421  1.2078  True 
---------------------------------------------

Looking at the Tukey HSD output above we know that if two groups have a 95% confidence interval that includes 0 we cannot reject the Null hypothesis that the variation in means is due to random variation alone. Therefore we cannot reject the null hypothesis for the following relations: dohc/l, dohc/ohcf, dohc/ohcv, l/ohc, l/ohcf, l/ohcv, and ohc/ohcf. This means relatively speaking these pairs prices would be drawn from similiar distributions. However, we can reject the null hypothesis for dohc/ohc, ohc/ohcv, and ohcf/ohcv. This means relatively speaking these pairs prices are drawn from different distributions.

Let's look at these relations graphically.

In [69]:
Tukey_HSD.plot_simultaneous()
Out[69]:

The graph above shows us what we saw above. For instance, ohcv has overlap in the interval with dohc and l only. That is why when comparing ohcv to dohc and l we had to accept the null hypothesis that they are similiar. However, when comparing ohcv with all other types their is no overlap and therefore we can reject the null hypothesis and we accept that the prices would be drawn from different distributions.

Now let's look at our fuel systems in relation to price. We must get rid of the spfi row and the mfi row as there is only 1 of each and it will not provide us with useful information. Once that is done we will split up our seperate data frames by fuel system type and do our ANOVA test.

In [70]:
column = 'fuel-system'
depVar = 'log_price'
dfFuels = df.copy(deep = True)
toDrop = dfFuels.loc[dfFuels[column] == 'spfi', column]
toDropIndex = toDrop.index.tolist()
dfFuels.drop(toDropIndex, axis = 0, inplace = True)
toDrop = dfFuels.loc[dfFuels[column] == 'mfi', column]
toDropIndex = toDrop.index.tolist()
dfFuels.drop(toDropIndex, axis = 0, inplace = True)
In [71]:
uniques, allDfs = createDfs(dfFuels,column) 
print(len(allDfs))
print(uniques)
5
['mpfi', '2bbl', '1bbl', 'idi', 'spdi']
In [72]:
df1 = cleanDf(df = allDfs[0], depVar = depVar) 
df2 = cleanDf(df = allDfs[1], depVar = depVar)  
df3 = cleanDf(df = allDfs[2], depVar = depVar) 
df4 = cleanDf(df = allDfs[3], depVar = depVar)  
df5 = cleanDf(df = allDfs[4], depVar = depVar) 
In [73]:
plt.boxplot([df1, df2, df3, df4, df5])
plt.ylabel('Value')
plt.xlabel('Variable')
plt.title('Box plot of variables')
Out[73]:
Text(0.5,1,'Box plot of variables')
In [74]:
f_statistic, p_value = ss.f_oneway(df1, df2, df3, df4, df5)
print('F statistic = ' + str(f_statistic))
print('P-value = ' + str(p_value))
F statistic = 53.02195443924761
P-value = 9.34039038031428e-30

The F-Statistic is fairly large and the p-value is small. We can reject the null hypothesis that the 5 variables have the same mean, as the probability of the differences arrising from random chance is quite low.

Let's perform the Tukey HSD test on fuel system as well.

In [75]:
TukeyDF = dfFuels.copy(deep = True)
TukeyDF = TukeyDF.loc[:,[column, 'log_price']]
In [76]:
Tukey_HSD = pairwise_tukeyhsd(TukeyDF['log_price'], TukeyDF[column])
print(Tukey_HSD)
Multiple Comparison of Means - Tukey HSD,FWER=0.05
=============================================
group1 group2 meandiff  lower   upper  reject
---------------------------------------------
 1bbl   2bbl   -0.018  -0.3372  0.3011 False 
 1bbl   idi    0.6566   0.2862  1.0271  True 
 1bbl   mpfi   0.7807   0.4681  1.0934  True 
 1bbl   spdi   0.362   -0.0775  0.8014 False 
 2bbl   idi    0.6747   0.4192  0.9301  True 
 2bbl   mpfi   0.7988   0.6382  0.9594  True 
 2bbl   spdi    0.38    0.0319  0.7281  True 
 idi    mpfi   0.1241  -0.1232  0.3715 False 
 idi    spdi  -0.2947  -0.6903  0.101  False 
 mpfi   spdi  -0.4188   -0.761 -0.0766  True 
---------------------------------------------

Looking at the Tukey HSD output above we know that if two groups have a 95% confidence interval that includes 0 we cannot reject the Null hypothesis that the variation in means is due to random variation alone. Therefore we cannot reject the null hypothesis for the following relations: 1bbl/2bbl, 1bbl/spdi, idi/mpfi, and idi/spdi. This means relatively speaking these pairs prices would be drawn from similiar distributions. However, we can reject the null hypothesis for 1bbl/idi, 1bbl/mpfi, 2bbl/idi, 2bbl/mpfi, 2bbl/spdi, and mpfi/spdi. This means relatively speaking these pairs prices are drawn from different distributions.

Let's look at these relations graphically.

In [77]:
Tukey_HSD.plot_simultaneous()
Out[77]:

Looking at the chart above we can see that some have overlapping intervals and therefore we cannot reject the null hypothesis when comparing them with their difference in means. Some intervals do not overlap at all, for these we can reject the null hypothesis and assume that their price distributions are probably different.

Finally let's take a look at makes versus insurance risk ratings. We will drop makes that have less than 4 vehicles.

In [78]:
column = 'make'
depVar = 'symboling'
dfMakes = df.copy(deep = True)
toDrop = dfMakes.loc[dfMakes[column] == 'mercury', column]
toDropIndex = toDrop.index.tolist()
dfMakes.drop(toDropIndex, axis = 0, inplace = True)
toDrop = dfMakes.loc[dfMakes[column] == 'isuzu', column]
toDropIndex = toDrop.index.tolist()
dfMakes.drop(toDropIndex, axis = 0, inplace = True)
toDrop = dfMakes.loc[dfMakes[column] == 'jaguar', column]
toDropIndex = toDrop.index.tolist()
dfMakes.drop(toDropIndex, axis = 0, inplace = True)
toDrop = dfMakes.loc[dfMakes[column] == 'alfa-romero', column]
toDropIndex = toDrop.index.tolist()
dfMakes.drop(toDropIndex, axis = 0, inplace = True)
toDrop = dfMakes.loc[dfMakes[column] == 'chevrolet', column]
toDropIndex = toDrop.index.tolist()
dfMakes.drop(toDropIndex, axis = 0, inplace = True)
In [79]:
uniques, allDfs = createDfs(dfMakes,column) #create list of DataFrames for each unique in given column
print(len(allDfs))#Number of final DFs to unpack
print(uniques)
16
['audi', 'bmw', 'dodge', 'honda', 'mazda', 'mercedes-benz', 'mitsubishi', 'nissan', 'peugot', 'plymouth', 'porsche', 'saab', 'subaru', 'toyota', 'volkswagen', 'volvo']
In [80]:
df1 = cleanDf(df = allDfs[0], depVar = depVar) 
df2 = cleanDf(df = allDfs[1], depVar = depVar)  
df3 = cleanDf(df = allDfs[2], depVar = depVar) 
df4 = cleanDf(df = allDfs[3], depVar = depVar)  
df5 = cleanDf(df = allDfs[4], depVar = depVar) 
df6 = cleanDf(df = allDfs[5], depVar = depVar) 
df7 = cleanDf(df = allDfs[6], depVar = depVar)  
df8 = cleanDf(df = allDfs[7], depVar = depVar) 
df9 = cleanDf(df = allDfs[8], depVar = depVar)  
df10 = cleanDf(df = allDfs[9], depVar = depVar) 
df11 = cleanDf(df = allDfs[10], depVar = depVar) 
df12 = cleanDf(df = allDfs[11], depVar = depVar)  
df13 = cleanDf(df = allDfs[12], depVar = depVar) 
df14 = cleanDf(df = allDfs[13], depVar = depVar)  
df15 = cleanDf(df = allDfs[14], depVar = depVar) 
df16 = cleanDf(df = allDfs[15], depVar = depVar)
In [81]:
plt.boxplot([df1, df2, df3, df4, df5, df6, df7, df8, df9, df10, df11, df12, df13, df14, df15, df16])
plt.ylabel('Value')
plt.xlabel('Variable')
plt.title('Box plot of variables')
Out[81]:
Text(0.5,1,'Box plot of variables')
In [82]:
f_statistic, p_value = ss.f_oneway(df1, df2, df3, df4, df5, df6, df7, df8, df9, df10, df11, df12, df13, df14, df15, df16)
print('F statistic = ' + str(f_statistic))
print('P-value = ' + str(p_value))
F statistic = 9.560239615067879
P-value = 6.000306953381814e-16

The F-Statistic is fairly large and the p-value is small. We can reject the null hypothesis that the 16 variables have the same mean, as the probability of the differences arrising from random chance is quite low.

Let's perform the Tukey HSD test on this as well.

In [83]:
TukeyDF = dfMakes.copy(deep = True)
TukeyDF = TukeyDF.loc[:,[column, depVar]]
In [84]:
Tukey_HSD = pairwise_tukeyhsd(TukeyDF[depVar], TukeyDF[column])
print(Tukey_HSD)
     Multiple Comparison of Means - Tukey HSD,FWER=0.05    
===========================================================
    group1        group2    meandiff  lower   upper  reject
-----------------------------------------------------------
     audi          bmw       -1.125  -2.9066  0.6566 False 
     audi         dodge       -0.5   -2.2816  1.2816 False 
     audi         honda     -0.8846  -2.5128  0.7436 False 
     audi         mazda     -0.9167  -2.5661  0.7328 False 
     audi     mercedes-benz   -1.5   -3.2816  0.2816 False 
     audi       mitsubishi   0.3462   -1.282  1.9743 False 
     audi         nissan      -0.5   -2.0551  1.0551 False 
     audi         peugot      -1.5   -3.1743  0.1743 False 
     audi        plymouth     -0.5   -2.3354  1.3354 False 
     audi        porsche      1.5    -0.6294  3.6294 False 
     audi          saab       1.0    -0.9046  2.9046 False 
     audi         subaru      -1.0   -2.6495  0.6495 False 
     audi         toyota    -0.9375  -2.4051  0.5301 False 
     audi       volkswagen   0.1667  -1.4828  1.8161 False 
     audi         volvo     -2.7727   -4.447 -1.0985  True 
     bmw          dodge      0.625   -1.0245  2.2745 False 
     bmw          honda      0.2404   -1.242  1.7228 False 
     bmw          mazda      0.2083  -1.2974  1.7141 False 
     bmw      mercedes-benz  -0.375  -2.0245  1.2745 False 
     bmw        mitsubishi   1.4712  -0.0112  2.9536 False 
     bmw          nissan     0.625   -0.7768  2.0268 False 
     bmw          peugot     -0.375  -1.9079  1.1579 False 
     bmw         plymouth    0.625   -1.0824  2.3324 False 
     bmw         porsche     2.625    0.6048  4.6452  True 
     bmw           saab      2.125    0.3434  3.9066  True 
     bmw          subaru     0.125   -1.3807  1.6307 False 
     bmw          toyota     0.1875  -1.1165  1.4915 False 
     bmw        volkswagen   1.2917  -0.2141  2.7974 False 
     bmw          volvo     -1.6477  -3.1806 -0.1148  True 
    dodge         honda     -0.3846   -1.867  1.0978 False 
    dodge         mazda     -0.4167  -1.9224  1.0891 False 
    dodge     mercedes-benz   -1.0   -2.6495  0.6495 False 
    dodge       mitsubishi   0.8462  -0.6362  2.3286 False 
    dodge         nissan      0.0    -1.4018  1.4018 False 
    dodge         peugot      -1.0   -2.5329  0.5329 False 
    dodge        plymouth     0.0    -1.7074  1.7074 False 
    dodge        porsche      2.0    -0.0202  4.0202 False 
    dodge          saab       1.5    -0.2816  3.2816 False 
    dodge         subaru      -0.5   -2.0057  1.0057 False 
    dodge         toyota    -0.4375  -1.7415  0.8665 False 
    dodge       volkswagen   0.6667  -0.8391  2.1724 False 
    dodge         volvo     -2.2727  -3.8056 -0.7398  True 
    honda         mazda     -0.0321  -1.3527  1.2886 False 
    honda     mercedes-benz -0.6154  -2.0978  0.867  False 
    honda       mitsubishi   1.2308  -0.0632  2.5247 False 
    honda         nissan     0.3846  -0.8161  1.5853 False 
    honda         peugot    -0.6154  -1.9669  0.7361 False 
    honda        plymouth    0.3846  -1.1619  1.9312 False 
    honda        porsche     2.3846   0.4984  4.2708  True 
    honda          saab      1.8846   0.2564  3.5128  True 
    honda         subaru    -0.1154   -1.436  1.2052 False 
    honda         toyota    -0.0529  -1.1379  1.0321 False 
    honda       volkswagen   1.0513  -0.2693  2.3719 False 
    honda         volvo     -1.8881  -3.2396 -0.5366  True 
    mazda     mercedes-benz -0.5833  -2.0891  0.9224 False 
    mazda       mitsubishi   1.2628  -0.0578  2.5834 False 
    mazda         nissan     0.4167  -0.8128  1.6461 False 
    mazda         peugot    -0.5833  -1.9604  0.7937 False 
    mazda        plymouth    0.4167  -1.1523  1.9856 False 
    mazda        porsche     2.4167   0.512   4.3213  True 
    mazda          saab      1.9167   0.2672  3.5661  True 
    mazda         subaru    -0.0833  -1.4301  1.2634 False 
    mazda         toyota    -0.0208  -1.1375  1.0959 False 
    mazda       volkswagen   1.0833  -0.2634  2.4301 False 
    mazda         volvo     -1.8561  -3.2331  -0.479  True 
mercedes-benz   mitsubishi   1.8462   0.3638  3.3286  True 
mercedes-benz     nissan      1.0    -0.4018  2.4018 False 
mercedes-benz     peugot      0.0    -1.5329  1.5329 False 
mercedes-benz    plymouth     1.0    -0.7074  2.7074 False 
mercedes-benz    porsche      3.0     0.9798  5.0202  True 
mercedes-benz      saab       2.5     0.7184  4.2816  True 
mercedes-benz     subaru      0.5    -1.0057  2.0057 False 
mercedes-benz     toyota     0.5625  -0.7415  1.8665 False 
mercedes-benz   volkswagen   1.6667   0.1609  3.1724  True 
mercedes-benz     volvo     -1.2727  -2.8056  0.2602 False 
  mitsubishi      nissan    -0.8462  -2.0469  0.3546 False 
  mitsubishi      peugot    -1.8462  -3.1976 -0.4947  True 
  mitsubishi     plymouth   -0.8462  -2.3927  0.7004 False 
  mitsubishi     porsche     1.1538  -0.7324  3.0401 False 
  mitsubishi       saab      0.6538  -0.9743  2.282  False 
  mitsubishi      subaru    -1.3462  -2.6668 -0.0255  True 
  mitsubishi      toyota    -1.2837  -2.3687 -0.1986  True 
  mitsubishi    volkswagen  -0.1795  -1.5001  1.1411 False 
  mitsubishi      volvo     -3.1189  -4.4704 -1.7674  True 
    nissan        peugot      -1.0   -2.2625  0.2625 False 
    nissan       plymouth     0.0    -1.4695  1.4695 False 
    nissan       porsche      2.0     0.1765  3.8235  True 
    nissan         saab       1.5    -0.0551  3.0551 False 
    nissan        subaru      -0.5   -1.7294  0.7294 False 
    nissan        toyota    -0.4375  -1.4095  0.5345 False 
    nissan      volkswagen   0.6667  -0.5628  1.8961 False 
    nissan        volvo     -2.2727  -3.5352 -1.0102  True 
    peugot       plymouth     1.0     -0.595  2.595  False 
    peugot       porsche      3.0     1.0738  4.9262  True 
    peugot         saab       2.5     0.8257  4.1743  True 
    peugot        subaru      0.5     -0.877  1.877  False 
    peugot        toyota     0.5625  -0.5905  1.7155 False 
    peugot      volkswagen   1.6667   0.2896  3.0437  True 
    peugot        volvo     -1.2727  -2.6794  0.1339 False 
   plymouth      porsche      2.0    -0.0677  4.0677 False 
   plymouth        saab       1.5    -0.3354  3.3354 False 
   plymouth       subaru      -0.5    -2.069  1.069  False 
   plymouth       toyota    -0.4375   -1.814  0.939  False 
   plymouth     volkswagen   0.6667  -0.9023  2.2356 False 
   plymouth       volvo     -2.2727  -3.8677 -0.6777  True 
   porsche         saab       -0.5   -2.6294  1.6294 False 
   porsche        subaru      -2.5   -4.4046 -0.5954  True 
   porsche        toyota    -2.4375   -4.187  -0.688  True 
   porsche      volkswagen  -1.3333   -3.238  0.5713 False 
   porsche        volvo     -4.2727  -6.1989 -2.3466  True 
     saab         subaru      -2.0   -3.6495 -0.3505  True 
     saab         toyota    -1.9375  -3.4051 -0.4699  True 
     saab       volkswagen  -0.8333  -2.4828  0.8161 False 
     saab         volvo     -3.7727   -5.447 -2.0985  True 
    subaru        toyota     0.0625  -1.0542  1.1792 False 
    subaru      volkswagen   1.1667  -0.1801  2.5134 False 
    subaru        volvo     -1.7727  -3.1498 -0.3957  True 
    toyota      volkswagen   1.1042  -0.0125  2.2209 False 
    toyota        volvo     -1.8352  -2.9882 -0.6822  True 
  volkswagen      volvo     -2.9394  -4.3164 -1.5623  True 
-----------------------------------------------------------

Looking at the Tukey HSD output above we know that if two groups have a 95% confidence interval that includes 0 we cannot reject the Null hypothesis that the variation in means is due to random variation alone. Our output above shows us which distributions we can reject the Null hypothesis.

Let's look at these relations graphically.

In [85]:
Tukey_HSD.plot_simultaneous()
Out[85]:

Looking at the above graphs we can see that Volvo by far makes the safest cars with all their cars having a below average risk rating. Most of the other cars fall around the same range with Porsche and Saab by far making the riskiest cars.

Bootstrap Means and Difference in Means

Next let's go ahead and bootstrap some of our data and then we will compare them graphically. Let's create the functions that will help us bootsrap our data. Our first functions will plot our data distributions or histograms with their 95% confidence intervals.

In [86]:
def plot_dists(a, b, nbins, a_label='pop_A', b_label='pop_B', depVar = 'variable', p=5):
    # Create a single sequence of bins to be shared across both
    #distribution plots for visualization consistency.
    combined = pd.concat([a, b])
    breaks = np.linspace(
        combined.min(), 
        combined.max(), 
        num=nbins+1)

    plt.subplot(2, 1, 1) 
    plt.hist(a, bins=breaks, alpha = .5) 
    plt.axvline(a.mean(), linewidth = 4) 
    # 95% confidence interval    
    plt.axvline(np.percentile(a, p/2.), color='red', linewidth=3) 
    plt.axvline(np.percentile(a, 100-p/2.), color='red', linewidth=3) 
    plt.title(a_label.capitalize()) 
    plt.xlabel(depVar.capitalize()) 
    
    plt.subplot(2, 1, 2) #plot 2
    plt.hist(b, bins=breaks, alpha = .5) 
    plt.axvline(b.mean(), linewidth = 4) 
    # 95% confidence interval    
    plt.axvline(np.percentile(b, p/2.), color='red', linewidth=3) 
    plt.axvline(np.percentile(b, 100-p/2.), color='red', linewidth=3) 
    plt.title(b_label.capitalize()) 
    plt.xlabel(depVar.capitalize()) 
    
    plt.tight_layout() 
    

def plot_hist(x, a, b, depVar = 'variable', p=5):
    # Plot the distribution and mark the mean
    plt.hist(x, alpha = .5)
    plt.axvline(x.mean(), linewidth = 4)
    # 95% confidence interval    
    plt.axvline(np.percentile(x, p/2.), color='red', linewidth=3)
    plt.axvline(np.percentile(x, 100-p/2.), color='red', linewidth=3)
    plt.title(a.capitalize() + " - " + b.capitalize())
    plt.xlabel(depVar.capitalize() + " Mean Differences")

Now let's create our bootstrap function. This will take a column with two possible choices and bootsrap the mean of a dependent variable for the two choices. Then it will plot the distribution graph of the two choice's means. This will have a default replica value of 1000.

In [87]:
#Bootstrap
def bootStrap(df,column, choice1, choice2, depVar, n_replicas = 1000):
    a = df[df[column] == choice1] 
    b = df[df[column] == choice2] 

    a_bootstrap_means = pd.Series([
        a.sample(frac=1, replace=True)[depVar].mean()
        for i in range(n_replicas)]) 

    b_bootstrap_means = pd.Series([
        b.sample(frac=1, replace=True)[depVar].mean()
        for i in range(n_replicas)]) 

    plot_dists(a = a_bootstrap_means, b = b_bootstrap_means, 
               nbins=80, a_label= choice1, b_label=choice2, depVar = depVar) #plot distributions

Our final bootstrap function will take a column in a data frame, and sample the mean differences of a variable between two choices in that column. It will have a default replica value of 1000.

In [88]:
#Bootstrap Mean Diffs
def bootStrapDiffs(df, column, choice1, choice2, depVar, n_replicas = 1000):
    diffs = [] 
    for i in range(n_replicas):
        sample = df.sample(frac=1.0, replace=True)
        a_sample_mean = sample[sample[column] == choice1][depVar].mean() 
        b_sample_mean = sample[sample[column] == choice2][depVar].mean() 
        diffs.append(a_sample_mean - b_sample_mean)
    diffs = pd.Series(diffs) 

    plot_hist(x = diffs, a = choice1, b = choice2, depVar = depVar) #plot histogram

Now that our functions are ready let's bootstrap the insurance bins in relation to horsepower.

In [89]:
column = 'InsBin' #set column
choice1 = 'Risky' #set column 1st choice
choice2 = 'Not Risky' #set column 2nd choice
depVar = 'horsepower' #set dependent var

bootStrap(df = df, column = column, choice1 = choice1, choice2 = choice2, depVar = depVar) #bootstrap

Looking at the above graph we can see that the horsepower is not affected by insurance risks. This can be seen as the Not risky 95% confidence intervals do overlap the risky confidence intervals and the mean of the Not Risky distribution falls inside the 95% confidence interval of the risky distribution.

Let's bootstrap and plot their difference in means as well.

In [90]:
bootStrapDiffs(df = df, column = column, choice1 = choice1, choice2 = choice2, depVar = depVar) #bootstrap differences

We can see that in general risky cars have 8 less horsepower than their not risky counterparts.

Now let's look at the horsepower difference between the normal cylinder cars and the extra cylinder cars to see if the horsepower difference is significant.

In [91]:
column = 'CylBin' #set column
choice1 = 'Extra' #set column 1st choice
choice2 = 'Normal' #set column 2nd choice
depVar = 'horsepower' #set dependent var

bootStrap(df = df, column = column, choice1 = choice1, choice2 = choice2, depVar = depVar)

Looking at the above graph we can see that the Extra cylinder cars have more horsepower. This can be seen as the 95% confidence intervals do not overlap.

Let's bootstrap and plot their difference in means as well.

In [92]:
bootStrapDiffs(df = df, column = column, choice1 = choice1, choice2 = choice2, depVar = depVar)

We can see that in general greater than four cylinder cars have on average 64 extra horsepower than cars with four or less cylinders.

Now let's look at the difference in horsepower between gas and diesel cars.

In [93]:
column = 'fuel-type' #set column
choice1 = 'gas' #set column 1st choice
choice2 = 'diesel' #set column 2nd choice
depVar = 'horsepower' #set dependent var

bootStrap(df = df, column = column, choice1 = choice1, choice2 = choice2, depVar = depVar)

Looking at the above graph we can see that the gas cars in general have more horsepower. This can be seen as the 95% confidence intervals do not overlap.

Let's bootstrap and plot their difference in means as well.

In [94]:
bootStrapDiffs(df = df, column = column, choice1 = choice1, choice2 = choice2, depVar = depVar)

We can see that in general the gas cars have on average 20 extra horsepower than diesel cars.

Now let's look at the difference in horsepower between front and rear wheel drive cars.

In [95]:
column = 'drive-wheels' #set column
choice1 = 'fwd' #set column 1st choice
choice2 = 'rwd' #set column 2nd choice
depVar = 'horsepower' #set dependent var

bootStrap(df = df, column = column, choice1 = choice1, choice2 = choice2, depVar = depVar)

Looking at the above graph we can see that the rear wheel drive cars have more horsepower in general. This can be seen as the 95% confidence intervals do not overlap, in fact there is a considerable gap between the two distributions.

Let's bootstrap and plot their difference in means.

In [96]:
bootStrapDiffs(df = df, column = column, choice1 = choice1, choice2 = choice2, depVar = depVar)

We can see that in general the front wheel drive cars have on average 47 less horsepower than rear wheel drive cars.

Now let's look at the difference in city mpg between front and rear wheel drive cars.

In [97]:
column = 'drive-wheels' #set column
choice1 = 'fwd' #set column 1st choice
choice2 = 'rwd' #set column 2nd choice
depVar = 'city-mpg' #set dependent var

bootStrap(df = df, column = column, choice1 = choice1, choice2 = choice2, depVar = depVar)

Looking at the above graph we can see that the front wheel drive cars have better city gas mileage. This can be seen as the 95% confidence intervals above do not overlap.

Let's bootstrap and plot their difference in means.

In [98]:
bootStrapDiffs(df = df, column = column, choice1 = choice1, choice2 = choice2, depVar = depVar)

We can see that in general the front wheel drive cars have better gas mileage than rear wheel drive cars to the tune of 7 extra miles per gallon.

Now let's look at the difference in city mpg between gas and diesel cars.

In [99]:
column = 'fuel-type' #set column
choice1 = 'gas' #set column 1st choice
choice2 = 'diesel' #set column 2nd choice
depVar = 'city-mpg' #set dependent var

bootStrap(df = df, column = column, choice1 = choice1, choice2 = choice2, depVar = depVar)

Looking at the above graph we can see that the diesel cars in general have better gas mileage. This can be seen as the 95% confidence intervals do not overlap.

Let's bootstrap and plot their difference in means.

In [100]:
bootStrapDiffs(df = df, column = column, choice1 = choice1, choice2 = choice2, depVar = depVar)

We can see that in general the diesel cars have better gas mileage than gas cars to the tune of 5 extra miles per gallon.

Now let's move on and create some bayesian models and plot them with their credible intervals.

Bayesian Models

Let's create some functions that will help us along with creating our bayesian models.

Our first function will compute and return our normalized likelihood

In [101]:
def comp_like(p, x): 
    variance = np.std(x)**2 
    x_mean = np.asarray(x).mean()  
    print('Mean = %.3f, Standard deviation = %.3f' % (x_mean, np.std(x))) 
    n = len(x)
    l = np.exp(-n * np.square(x_mean - p) / (2 * variance))
    return l / l.sum() 

Our next function will compute our posterior and return a normalized posterior distribution. This will be done by multiplying the prior with the likelihood then normalizing.

In [102]:
def posterior(prior, like):
    post = prior * like 
    return post / sum(post) 

Our next function will plot the posterior distribution along with the credible interval specified.

In [103]:
def plot_ci(p, post, num_samples, lower_q, upper_q, xLab = 'Parameter value'):
    ## Compute a large sample by resampling with replacement
    samples = np.random.choice(p, size=num_samples, replace=True, p=post)
    ci = scipy.percentile(samples, [lower_q*100, upper_q*100]) 
    
    interval = upper_q - lower_q 
    plt.title('Posterior density with %.3f credible interval' % interval) 
    plt.plot(p, post, color='blue') 
    plt.xlabel(xLab) 
    plt.ylabel('Density')
    plt.axvline(x=ci[0], color='red')
    plt.axvline(x=ci[1], color='red') 
    print('The %.3f credible interval is %.3f to %.3f'  
          % (interval, ci[0], ci[1]))  

Now that our functions are ready let's create our first bayesian model.

Let's calculate the posterior distributions of Price for Risky and Not Risky bins. We will calculate this given 50 samples.

In [104]:
num_samples = 50
risky = df[df['InsBin'] == 'Risky'].sample(n=num_samples) 
nRisky = df[df['InsBin'] == 'Not Risky'].sample(n=num_samples) 
In [105]:
#Set x range
N = 1000
p = np.linspace(8000, 20000, num=N)
In [106]:
#create posterior of risky
pp = ss.norm.pdf(p, loc= risky['price'].mean(), scale=1000) 
pp = pp / pp.sum()
like_risky = comp_like(p, risky['price']) 
post_risky = posterior(pp, like_risky) 
Mean = 10912.340, Standard deviation = 6657.256
In [107]:
#Create posterior of not risky
pp = ss.norm.pdf(p, loc= nRisky['price'].mean(), scale=1000) 
pp = pp / pp.sum()
like_nRisky = comp_like(p, nRisky['price']) 
post_nRisky = posterior(pp, like_nRisky) 
Mean = 16223.820, Standard deviation = 8488.007
In [108]:
#Plot posteriors with 95% credible intervals
plt.subplot(2,1,1)
plot_ci(p, post_risky, num_samples, lower_q=.025, upper_q=.975, xLab = 'Risky Price')
plt.subplot(2,1,2)
plot_ci(p, post_nRisky, num_samples, lower_q=.025, upper_q=.975, xLab = 'Not Risky Price')
plt.tight_layout()
The 0.950 credible interval is 9636.336 to 12531.832
The 0.950 credible interval is 14867.568 to 17610.210

Looking above we can see the 95% credible intervals of price for our Risky and Not Risky vehicles do not overlap. We can assume Not Risky vehicles are drawn from a different distribution than Risky vehicles and are generally more expensive.

Now let's create bayesian models of price distributions for Normal and Extra cylinders.

In [109]:
num_samples = 25
normalCyl = df[df['CylBin'] == 'Normal'].sample(n=num_samples)
extraCyl = df[df['CylBin'] == 'Extra'].sample(n=num_samples)
In [110]:
#Set x range
N = 1000
p = np.linspace(6000, 32000, num=N)
In [111]:
pp = ss.norm.pdf(p, loc= normalCyl['price'].mean(), scale=1000) 
pp = pp / pp.sum() 
like_normal = comp_like(p, normalCyl['price']) 
post_normal = posterior(pp, like_normal) 
Mean = 9957.720, Standard deviation = 4420.081
In [112]:
pp = ss.norm.pdf(p, loc= extraCyl['price'].mean(), scale=1000) 
pp = pp / pp.sum() 
like_extra = comp_like(p, extraCyl['price']) 
post_extra = posterior(pp, like_extra) 
Mean = 25601.640, Standard deviation = 8128.966
In [113]:
plt.subplot(2,1,1)
plot_ci(p, post_normal, num_samples, lower_q=.025, upper_q=.975, xLab = 'Normal Cylinder Price')
plt.subplot(2,1,2)
plot_ci(p, post_extra, num_samples, lower_q=.025, upper_q=.975, xLab = 'Extra Cylinder Price')
plt.tight_layout()
The 0.950 credible interval is 8951.351 to 10809.610
The 0.950 credible interval is 24082.883 to 26763.564

We can see from above that the 95% credible intervals of Normal and Extra cylinders do not overlap and are not even close to each other. It is safe to assume that Extra Cylinder cars are in general more expensive than Normal cylinder (4 cylinder or less) cars.

Now let's create bayesian models of risk factor distributions for Normal and Extra cylinders.

In [114]:
num_samples = 25
normalCyl = df[df['CylBin'] == 'Normal'].sample(n=num_samples) 
extraCyl = df[df['CylBin'] == 'Extra'].sample(n=num_samples)
In [115]:
#set x range
N = 100
p = np.linspace(-2, 3, num=N)
In [116]:
pp = ss.norm.pdf(p, loc= normalCyl['symboling'].mean(), scale=1000) 
pp = pp / pp.sum() 
like_normal = comp_like(p, normalCyl['symboling'])
post_normal = posterior(pp, like_normal) 
Mean = 0.760, Standard deviation = 1.141
In [117]:
pp = ss.norm.pdf(p, loc= extraCyl['symboling'].mean(), scale=1000) 
pp = pp / pp.sum() 
like_extra = comp_like(p, extraCyl['symboling']) 
post_extra = posterior(pp, like_extra) 
Mean = 0.840, Standard deviation = 1.488
In [118]:
plt.subplot(2,1,1)
plot_ci(p, post_normal, num_samples, lower_q=.025, upper_q=.975, xLab = 'Normal Cylinder Insurance Rating') 
plt.subplot(2,1,2) 
plot_ci(p, post_extra, num_samples, lower_q=.025, upper_q=.975, xLab = 'Extra Cylinder Insurance Rating')
plt.tight_layout() 
The 0.950 credible interval is 0.414 to 1.313
The 0.950 credible interval is 0.131 to 1.303

Looking at the above graph we can see that the normal and extra cylinder cars were generally considered to be just as risky as each other as far as insurance standards. The 95% credible intervals overlap significantly showing that both distributions were drawn from similair populations.

Now let's create bayesian models of normalized loss distributions for Risky and Not Risky vehicles.

In [119]:
num_samples = 50
risky = dfNorm[dfNorm['InsBin'] == 'Risky'].sample(n=num_samples)
nRisky = dfNorm[dfNorm['InsBin'] == 'Not Risky'].sample(n=num_samples)
In [120]:
#set x range
N = 1000
p = np.linspace(70, 180, num=N)
In [121]:
pp = ss.norm.pdf(p, loc= risky['normalized-losses'].mean(), scale=1000)
pp = pp / pp.sum() 
like_risky = comp_like(p, risky['normalized-losses']) 
post_risky = posterior(pp, like_risky)
Mean = 133.480, Standard deviation = 29.818
In [122]:
pp = ss.norm.pdf(p, loc= nRisky['normalized-losses'].mean(), scale=1000)
pp = pp / pp.sum()
like_nRisky = comp_like(p, nRisky['normalized-losses']) 
post_nRisky = posterior(pp, like_nRisky)
Mean = 108.540, Standard deviation = 33.697
In [123]:
plt.subplot(2,1,1) 
plot_ci(p, post_risky, num_samples, lower_q=.025, upper_q=.975, xLab = 'Risky Normalized Losses')
plt.subplot(2,1,2) 
plot_ci(p, post_nRisky, num_samples, lower_q=.025, upper_q=.975, xLab = 'Not Risky Normalized Losses')
plt.tight_layout() 
The 0.950 credible interval is 125.281 to 141.472
The 0.950 credible interval is 99.804 to 115.707

Looking at the above graph we can see that the risky cars cost slightly more in normalized losses for the insurance companies. The 95% credible intervals do not overlap between the two distributions showing significance in their differences.

Now let's create bayesian models of city mpg distributions for Risky and Not Risky vehicles.

In [124]:
num_samples = 50
risky = df[df['InsBin'] == 'Risky'].sample(n=num_samples)
nRisky = df[df['InsBin'] == 'Not Risky'].sample(n=num_samples)
In [125]:
#set x range
N = 1000
p = np.linspace(18, 33, num=N)
In [126]:
pp = ss.norm.pdf(p, loc= risky['city-mpg'].mean(), scale=1000) 
pp = pp / pp.sum() 
like_risky = comp_like(p, risky['city-mpg'])
post_risky = posterior(pp, like_risky) 
Mean = 26.280, Standard deviation = 6.533
In [127]:
pp = ss.norm.pdf(p, loc= nRisky['city-mpg'].mean(), scale=1000) 
pp = pp / pp.sum()
like_nRisky = comp_like(p, nRisky['city-mpg'])
post_nRisky = posterior(pp, like_nRisky) 
Mean = 23.940, Standard deviation = 5.453
In [128]:
plt.subplot(2,1,1) 
plot_ci(p, post_risky, num_samples, lower_q=.025, upper_q=.975, xLab = 'Risky City MPG')
plt.subplot(2,1,2)
plot_ci(p, post_nRisky, num_samples, lower_q=.025, upper_q=.975, xLab = 'Not Risky City MPG')
plt.tight_layout()
The 0.950 credible interval is 24.787 to 28.388
The 0.950 credible interval is 22.482 to 25.322

Looking at the above graph we can see that the Risky and Not Risky cars generally have similiar city MPG. The 95% credible intervals overlap significantly showing that both distributions were drawn from similair populations.

Finally let's create bayesian models of city mpg distributions for Normal and Extra cylinder vehicles.

In [129]:
num_samples = 25
normalCyl = df[df['CylBin'] == 'Normal'].sample(n=num_samples)
extraCyl = df[df['CylBin'] == 'Extra'].sample(n=num_samples)
In [130]:
#set x range
N = 1000
p = np.linspace(15, 33, num=N)
In [131]:
pp = ss.norm.pdf(p, loc= normalCyl['city-mpg'].mean(), scale=1000) 
pp = pp / pp.sum() # normalize
like_normal = comp_like(p, normalCyl['city-mpg']) 
post_normal = posterior(pp, like_normal) 
Mean = 28.360, Standard deviation = 4.068
In [132]:
pp = ss.norm.pdf(p, loc= extraCyl['city-mpg'].mean(), scale=1000) 
pp = pp / pp.sum() 
like_extra = comp_like(p, extraCyl['city-mpg']) 
post_extra = posterior(pp, like_extra) 
Mean = 18.440, Standard deviation = 2.772
In [133]:
plt.subplot(2,1,1) 
plot_ci(p, post_normal, num_samples, lower_q=.025, upper_q=.975, xLab = 'Normal Cylinder City MPG')
plt.subplot(2,1,2) 
plot_ci(p, post_extra, num_samples, lower_q=.025, upper_q=.975, xLab = 'Extra Cylinder City MPG')
plt.tight_layout()
The 0.950 credible interval is 26.723 to 29.742
The 0.950 credible interval is 17.548 to 19.180

We can see from above that the 95% credible intervals of Normal and Extra cylinders do not overlap and are not even close to each other. It is safe to assume that Normal Cylinder cars in general have better City MPG than extra cylinder cars.

Conclusion

We've had quite the journey. We started out by comparing our price and log price distributions to normal distributions to see that the log price was slightly more normally distributed. Then we stratified and compared fuel types, aspiration and rear vs front wheel drive to log price, Insurance Bins and Cylinder bins to curb weight, Cylinder bins to risk symboling and Cylinder bins to Normalized losses. We did this both graphically and with the t test.

We then performed Tukey HSD and ANOVA tests on some of our categorical columns. We found many different interesting relationships there and learned which relationships had similiar price distributions and which probably came from different distributions. We did this by looking at the F-statistic, the p values and the 95% confidence intervals. We also did this by looking at the Tukey HSD plot and by looking at histogram comparasions with means and 95% confidence intervals plotted.

We then bootstrapped the means and mean differences for Insurance Bins, Cylinder Bins, Fuel type and Drive wheels to horsepower and fuel type, and drive wheels to city mpg and plotted them.

Finally we created and analyzed Bayesian price models for Insurance bins and Cylinder bins. We created and analyzed bayesian city mpg models for Insurance bins and cylinder bins. And finally we also created bayesian models of normalized losses for Insurance bins and Risk symboling for cylinder bins. We analyzed all the different posterior density's of the above relationships with their credible intervals and determined relationship significance.