Today we will be looking at a dataset provided by the University of California, Irvine. The dataset contains records for 155 hepatitis patients of various ages and health levels, with 19 columns of health-related information about each patient plus a column indicating whether the patient was alive or dead at the time the data were collected. We will use these health statistics, together with each patient's survival outcome, to try to predict how patients with similar health attributes might fare with hepatitis.
First let's import pandas, numpy, and matplotlib. Let's also import our classifiers and evaluation metrics from sklearn.
import pandas as pd
import numpy as np
from sklearn.metrics import *
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
Next we will save the URL of the data file in a url variable and use pandas to read the data set into a DataFrame.
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/hepatitis/hepatitis.data'
df = pd.read_csv(url, header = None)
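As a quick optional check, the shape of the DataFrame should match the 155 patients and 20 columns described above.
print("\n", df.shape)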
Next let's assign our column names to make the data easier to use and understand.
df_names = [
'Status', 'Age', 'Sex', 'Steroid', 'Antivirals', 'Fatigue', 'Malaise', 'Anorexia',
'Big Liver', 'Firm Liver', 'Spleen Palpable', 'Spiders', 'Ascites', 'Varices',
'BILIRUBIN', 'Alk Phosphate', 'SGOT', 'Albumin', 'Protime', 'Histology'
]
df.columns = df_names
Let's take a look at our data frame so far.
print("\n", df.head())
Now let's check our data types.
print("\n", df.dtypes)
Let's begin by dealing with our categorical data. We will decode all variables and then deal with any aberrant categorical values. We will also look at some bar charts to see what categorical data we are dealing with.
First let's look at status. This tells us whether a person is alive or dead.
df.loc[:,"Status"].unique()
df.loc[df.loc[:,"Status"] == 1, "Status"] = "Dead"
df.loc[df.loc[:,"Status"] == 2, "Status"] = "Alive"
df.loc[:,"Status"].value_counts().plot(kind="Bar")
Now let's look at the sex column. This tells us whether a person is male or female.
i = "Sex"
df.loc[:,i].unique()
df.loc[df.loc[:,i] == 1, i] = "M"
df.loc[df.loc[:,i] == 2, i] = "F"
df.loc[:,i].value_counts().plot(kind="bar")
Our next column will tell us whether a person has used steroids or not.
i = "Steroid"
df.loc[:,i].unique()
df.loc[df.loc[:,i] == "1", i] = "No"
df.loc[df.loc[:,i] == "2", i] = "Yes"
df.loc[df.loc[:,i] == "?", i] = "Yes"
df.loc[:,i].value_counts().plot(kind="bar")
Now let's see if most people in our data set were using antivirals or not.
i = "Antivirals"
df.loc[:,i].unique()
df.loc[df.loc[:,i] == 1, i] = "No"
df.loc[df.loc[:,i] == 2, i] = "Yes"
df.loc[:,i].value_counts().plot(kind="bar")
Our next column will give us information about whether a person is fatigued or not. Let's take a look at what our data set shows.
i = "Fatigue"
df.loc[:,i].unique()
df.loc[df.loc[:,i] == "1", i] = "No"
df.loc[df.loc[:,i] == "2", i] = "Yes"
df.loc[df.loc[:,i] == "?", i] = "No"
df.loc[:,i].value_counts().plot(kind="bar")
Let's keep going with our next column, the malaise column. This shows us if a person was generally uncomfortable.
i = "Malaise"
df.loc[:,i].unique()
df.loc[df.loc[:,i] == "1", i] = "No"
df.loc[df.loc[:,i] == "2", i] = "Yes"
df.loc[df.loc[:,i] == "?", i] = "Yes"
df.loc[:,i].value_counts().plot(kind="bar")
Our next column tells us whether the patient was also suffering from anorexia or not.
i = "Anorexia"
df.loc[:,i].unique()
df.loc[df.loc[:,i] == "1", i] = "No"
df.loc[df.loc[:,i] == "2", i] = "Yes"
df.loc[df.loc[:,i] == "?", i] = "Yes"
df.loc[:,i].value_counts().plot(kind="bar")
Next let's see if our patients were exhibiting signs of hepatitis such as an enlarged ("big") liver.
i = "Big Liver"
df.loc[:,i].unique()
df.loc[df.loc[:,i] == "1", i] = "No"
df.loc[df.loc[:,i] == "2", i] = "Yes"
df.loc[df.loc[:,i] == "?", i] = "Yes"
df.loc[:,i].value_counts().plot(kind="bar")
Our next column shows us if the patients were exhibiting signs of hepatitis such as a firm liver.
i = "Firm Liver"
df.loc[:,i].unique()
df.loc[df.loc[:,i] == "1", i] = "No"
df.loc[df.loc[:,i] == "2", i] = "Yes"
df.loc[df.loc[:,i] == "?", i] = "Yes"
df.loc[:,i].value_counts().plot(kind="bar")
Let's keep going and look at our Spleen Palpable column. This shows us which of our patients had a palpable spleen.
i = "Spleen Palpable"
df.loc[:,i].unique()
df.loc[df.loc[:,i] == "1", i] = "No"
df.loc[df.loc[:,i] == "2", i] = "Yes"
df.loc[df.loc[:,i] == "?", i] = "Yes"
df.loc[:,i].value_counts().plot(kind="bar")
Our next column shows us whether our patients were exhibiting spider angiomas (the Spiders column).
i = "Spiders"
df.loc[:,i].unique()
df.loc[df.loc[:,i] == "1", i] = "No"
df.loc[df.loc[:,i] == "2", i] = "Yes"
df.loc[df.loc[:,i] == "?", i] = "Yes"
df.loc[:,i].value_counts().plot(kind="bar")
We only have a few more categorical variables, so let's keep going. Our next column shows us if the person showed signs of ascites.
i = "Ascites"
df.loc[:,i].unique()
df.loc[df.loc[:,i] == "1", i] = "No"
df.loc[df.loc[:,i] == "2", i] = "Yes"
df.loc[df.loc[:,i] == "?", i] = "Yes"
df.loc[:,i].value_counts().plot(kind="bar")
Our next column shows us if the person showed signs of Varices.
i = "Varices"
df.loc[:,i].unique()
df.loc[df.loc[:,i] == "1", i] = "No"
df.loc[df.loc[:,i] == "2", i] = "Yes"
df.loc[df.loc[:,i] == "?", i] = "Yes"
df.loc[:,i].value_counts().plot(kind="bar")
Finally, let's take a look at our last categorical column. This one records whether histology data was obtained for the patient.
i = "Histology"
df.loc[:,i].unique()
df.loc[df.loc[:,i] == 1, i] = "No"
df.loc[df.loc[:,i] == 2, i] = "Yes"
df.loc[:,i].value_counts().plot(kind="bar")
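As an aside, the per-column blocks above all follow the same decode, impute, and plot pattern. A small helper like the sketch below could have replaced most of them; the fill value for '?' is an assumption you would pick per column, just as we did above.
def decodeYesNo(df, col, missingFill):
    # Map the dataset's 1/2 codes (as ints or strings) to No/Yes, fill '?' with a chosen value, then plot.
    df.loc[df.loc[:, col].isin([1, "1"]), col] = "No"
    df.loc[df.loc[:, col].isin([2, "2"]), col] = "Yes"
    df.loc[df.loc[:, col] == "?", col] = missingFill
    df.loc[:, col].value_counts().plot(kind="bar")
For example, decodeYesNo(df, "Varices", "Yes") would reproduce the Varices block above.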
All of our categorical data has been decoded, and our graphs gave us a rough sense of how our hepatitis patients are distributed across the various categories. Now let's impute all missing values in our numeric columns with the median.
# Convert the five lab-value columns to numeric and impute missing values with the median.
for x in range(14, 19):
    i = df_names[x]
    df.loc[:, i] = pd.to_numeric(df.loc[:, i], errors='coerce')
    HasNan = np.isnan(df.loc[:, i])
    df.loc[HasNan, i] = np.nanmedian(df.loc[:, i])
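As an optional sanity check, we can confirm that no missing values remain in the numeric columns after the imputation.
print("\n", df.iloc[:, 14:19].isna().sum())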
Now let's deal with outliers in our numerical columns. Let's write a function to handle the repetition. We will assume that any numerical value more than 2 standard deviations from the mean is an outlier. The mean and standard deviation used to flag outliers are computed on the numeric data after all missing numerical values were imputed with the median.
def replaceOutliers(df, col, r):
    # Flag values more than 2 standard deviations from the mean and replace them
    # with the mean of the non-outlier values, rounded to r decimal places.
    LimitHi = np.mean(df.loc[:, col]) + 2*np.std(df.loc[:, col])
    LimitLo = np.mean(df.loc[:, col]) - 2*np.std(df.loc[:, col])
    FlagBad = (df.loc[:, col] > LimitHi) | (df.loc[:, col] < LimitLo)
    repMean = round(np.mean(df.loc[:, col][~FlagBad]), r)
    df.loc[FlagBad, col] = repMean
replaceOutliers(df, "BILIRUBIN", 1)
replaceOutliers(df, "Alk Phosphate", 0)
replaceOutliers(df, "SGOT", 0)
replaceOutliers(df, "Albumin", 1)
replaceOutliers(df, "Protime", 0)
Now that we've dealt with all aberrant data in our categorical and numerical variables, let's z-normalize our numerical variables. Let's create a function to assist us.
def normZ(df):
    # Z-normalize each column: subtract the column mean and divide by the column standard deviation.
    R, c = df.shape
    normed = np.zeros([R, c])
    for i in range(c):
        oMean = np.mean(df.iloc[:, i])
        oSD = np.std(df.iloc[:, i])
        normed[:, i] = (df.iloc[:, i] - oMean)/oSD
    return normed
df.iloc[:,14:19] = normZ(df.iloc[:,14:19])
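As a quick optional check, the normalized columns should now have means of roughly 0 and standard deviations of roughly 1.
print("\n", df.iloc[:, 14:19].describe().round(2))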
Next let's bin our Age variable. We will split the ages into seven bins of roughly ten years each.
NB = 7
freq, bounds = np.histogram(df.loc[:,"Age"], NB)
def bin2(df, col, b):  # col = column to bin, b = array of bin boundaries
    nb = len(b)
    N = len(df.loc[:, col])
    y = np.empty(N, int)  # empty integer array to store the bin numbers (output)
    for i in range(1, nb):  # repeat for each pair of bin boundaries
        y[(df.loc[:, col] >= b[i-1]) & (df.loc[:, col] < b[i])] = i
    y[df.loc[:, col] == b[-1]] = nb - 1  # ensure that the borderline cases are also binned appropriately
    return y
df.loc[:,"Age"] = bin2(df, "Age", bounds)
Now let's turn back to our categorical data. All our categorical variables are binary, so let's one-hot encode them and then drop the obsolete original columns.
df.loc[:, "isAlive"] = (df.loc[:, "Status"] == "Alive").astype(int)
df.loc[:, "isMale"] = (df.loc[:, "Sex"] == "M").astype(int)
df.loc[:, "isSteroidUser"] = (df.loc[:, "Steroid"] == "Yes").astype(int)
df.loc[:, "usesAntivirals"] = (df.loc[:, "Antivirals"] == "Yes").astype(int)
df.loc[:, "isFatigued"] = (df.loc[:, "Fatigue"] == "Yes").astype(int)
df.loc[:, "isMalaised"] = (df.loc[:, "Malaise"] == "Yes").astype(int)
df.loc[:, "isAnorexic"] = (df.loc[:, "Anorexia"] == "Yes").astype(int)
df.loc[:, "hasBigLiver"] = (df.loc[:, "Big Liver"] == "Yes").astype(int)
df.loc[:, "hasFirmLiver"] = (df.loc[:, "Firm Liver"] == "Yes").astype(int)
df.loc[:, "hasPalpableSpleen"] = (df.loc[:, "Spleen Palpable"] == "Yes").astype(int)
df.loc[:, "hasSpiders"] = (df.loc[:, "Spiders"] == "Yes").astype(int)
df.loc[:, "hasAscites"] = (df.loc[:, "Ascites"] == "Yes").astype(int)
df.loc[:, "hasVarices"] = (df.loc[:, "Varices"] == "Yes").astype(int)
df.loc[:, "hasHistology"] = (df.loc[:, "Histology"] == "Yes").astype(int)
df = df.drop("Status", axis=1)
df = df.drop("Sex", axis=1)
df = df.drop("Steroid", axis=1)
df = df.drop("Antivirals", axis=1)
df = df.drop("Fatigue", axis=1)
df = df.drop("Malaise", axis=1)
df = df.drop("Anorexia", axis=1)
df = df.drop("Big Liver", axis=1)
df = df.drop("Firm Liver", axis=1)
df = df.drop("Spleen Palpable", axis=1)
df = df.drop("Spiders", axis=1)
df = df.drop("Ascites", axis=1)
df = df.drop("Varices", axis=1)
df = df.drop("Histology", axis=1)
Now let's look at all our final data types.
print("\n", df.dtypes)
Now let's print the head of our final result.
print("\n", df.head())
Now that our data is loaded, all aberrant data has been dealt with, our Age variable has been binned, and our categorical variables have been one-hot encoded, let's move on to splitting our data. We will first train a few classifiers and then use them to predict our patients' survival. We will then use some evaluation measures to determine the best classifier to apply.
def split_dataset(df, colName, r):
    # Randomly hold out a fraction r of the rows as the test set; the rest becomes the training set.
    N = len(df)
    if r >= 1:
        print("Parameter r needs to be smaller than 1!")
        return
    elif r <= 0:
        print("Parameter r needs to be larger than 0!")
        return
    n = int(round(N*r))  # number of rows in the test set
    ind = -np.ones(n, int)  # indices of the test rows
    R = np.random.randint(N)
    for i in range(n):
        while R in ind: R = np.random.randint(N)
        ind[i] = R
    ind_ = list(set(range(N)).difference(ind))  # indices of the training rows
    ind = list(np.sort(ind))
    trainFeat = df.iloc[ind_, :].drop(colName, axis=1)
    testFeat = df.iloc[ind, :].drop(colName, axis=1)
    trainTarg = df.iloc[ind_, df.columns.get_loc(colName)]
    testTarg = df.iloc[ind, df.columns.get_loc(colName)]
    return trainFeat, testFeat, trainTarg, testTarg
r = 0.2
trainFeat, testFeat, trainTarg, testTarg = split_dataset(df, "isAlive", r)
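A quick optional check on the split sizes: with r = 0.2 we expect roughly 80% of the rows in the training set and 20% in the test set.
print("\nTrain features:", trainFeat.shape, " Test features:", testFeat.shape)
print("Train targets:", trainTarg.shape, " Test targets:", testTarg.shape)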
Now let's pick our predictive models. We will train them on the training set and evaluate them on the test set.
Let's start with the Logistic Regression classifier.
print ('\n\n\nLogistic regression classifier\n')
C_parameter = 50. / len(trainFeat)
class_parameter = 'ovr'
penalty_parameter = 'l1'
solver_parameter = 'saga'
tolerance_parameter = 0.1
clfLR = LogisticRegression(C=C_parameter, multi_class=class_parameter, penalty=penalty_parameter, solver=solver_parameter, tol=tolerance_parameter)
clfLR.fit(trainFeat, trainTarg)
print ('coefficients:')
print (clfLR.coef_)
print ('intercept:')
print (clfLR.intercept_)
predictedLR = clfLR.predict(testFeat)
predProbsLR = clfLR.predict_proba(testFeat)[:,1]
print ("predictions for test set:")
print (predictedLR)
print ('actual class values:')
print (testTarg.values.reshape(1,-1))
Now that we have trained and predicted with our first classifier, let's take a look at our confusion matrix.
CMLR = confusion_matrix(testTarg, predictedLR, labels=[1,0])
print ("\n\nConfusion matrix:\n", CMLR)
tpLR, fnLR, fpLR, tnLR = CMLR.ravel()
print ("TP:", tpLR, ", FP:", fpLR,", FN:,", fnLR, ", TN:", tnLR)
And now let's take a look at the accuracy measures for the Logistic Regression classifier.
ARLR = accuracy_score(testTarg,predictedLR)
print ("\nAccuracy rate:", ARLR)
ERLR = (fpLR + fnLR)/len(testTarg)
print ("\nError rate:", ERLR)
PLR = tpLR/(tpLR + fpLR)
print ("\nPrecision:", np.round(PLR, 2))
RLR = tpLR/(tpLR+fnLR)
print ("\nRecall:", np.round(RLR, 2))
F1LR = 2*tpLR/(2*tpLR + fpLR + fnLR)
print ("\nF1 score:", np.round(F1LR, 2))
Let's continue our analysis of the model by looking at the FPR, TPR, ROC curve, and AUC.
fprLR, tprLR, thLR = roc_curve(testTarg, predProbsLR)
AUCLR = auc(fprLR, tprLR)
print ("\nTP rates:", np.round(tprLR, 2))
print ("\nFP rates:", np.round(fprLR, 2))
print ("\nProbability thresholds:", np.round(thLR, 2))
LW = 1.5
LL = "lower right"
LC = 'darkgreen'
plt.figure()
plt.title('ROC curve Logistic Regression Classifier')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.plot(fprLR, tprLR, color=LC,lw=LW, label='ROC curve (area = %0.2f)' % AUCLR)
plt.plot([0, 1], [0, 1], color='navy', lw=LW, linestyle='--')
plt.legend(loc=LL)
plt.show()
print ("\nAUC score:", np.round(AUCLR, 2))
Now let's look at the K-Nearest Neighbors classifier.
print ('\n\nK nearest neighbors classifier\n')
k = 5
distance_metric = 'manhattan'
knn = KNeighborsClassifier(n_neighbors=k, metric=distance_metric)
knn.fit(trainFeat, trainTarg)
predictedKNN = knn.predict(testFeat)
predProbsKNN = knn.predict_proba(testFeat)[:,1]
print ("predictions for test set:")
print (predictedKNN)
print ('actual class values:')
print (testTarg.values.reshape(1,-1))
Now that we have trained and predicted with our second classifier, let's take a look at our confusion matrix.
CMknn = confusion_matrix(testTarg, predictedKNN, labels=[1,0])
print ("\n\nConfusion matrix:\n", CMknn)
tpKNN, fnKNN, fpKNN, tnKNN = CMknn.ravel()
print ("TP:", tpKNN, ", FP:", fpKNN,", FN:,", fnKNN, ", TN:", tnKNN)
And now let's take a look at the accuracy measures for the K-Nearest Neighbors classifier.
ARknn = accuracy_score(testTarg,predictedKNN)
print ("\nAccuracy rate:", ARknn)
ERknn = (fpKNN + fnKNN)/len(testTarg)
print ("\nError rate:", ERknn)
Pknn = tpKNN/(tpKNN + fpKNN)
print ("\nPrecision:", np.round(Pknn, 2))
Rknn = tpKNN/(tpKNN+fnKNN)
print ("\nRecall:", np.round(Rknn, 2))
F1knn = 2*tpKNN/(2*tpKNN + fpKNN + fnKNN)
print ("\nF1 score:", np.round(F1knn, 2))
Let's continue our analysis of the model by looking at the FPR, TPR, ROC curve, and AUC.
fprKNN, tprKNN, thKNN = roc_curve(testTarg, predProbsKNN)
AUCknn = auc(fprKNN, tprKNN)
print ("\nTP rates:", np.round(tprKNN, 2))
print ("\nFP rates:", np.round(fprKNN, 2))
print ("\nProbability thresholds:", np.round(thKNN, 2))
LW = 1.5
LL = "lower right"
LC = 'darkgreen'
plt.figure()
plt.title('ROC curve KNN')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.plot(fprKNN, tprKNN, color=LC,lw=LW, label='ROC curve (area = %0.2f)' % AUCknn)
plt.plot([0, 1], [0, 1], color='navy', lw=LW, linestyle='--')
plt.legend(loc=LL)
plt.show()
print ("\nAUC score:", np.round(AUCknn, 2))
Finally, let's look at how a Random Forest classifier does at predicting the survival of our hepatitis patients.
estimators = 100
mss = 2
print ('\n\nRandom Forest classifier\n')
clfFST = RandomForestClassifier(n_estimators=estimators, min_samples_split=mss) # these match sklearn's default values
clfFST.fit(trainFeat, trainTarg)
predictedFST = clfFST.predict(testFeat)
predProbsFST = clfFST.predict_proba(testFeat)[:,1]
print ("predictions for test set:")
print (predictedFST)
print ('actual class values:')
print (testTarg.values.reshape(1,-1))
Now that we have trained and predicted with our final classifier, let's take a look at our confusion matrix.
CMfst = confusion_matrix(testTarg, predictedFST, labels=[1,0])
print ("\n\nConfusion matrix:\n", CMfst)
tpFST, fnFST, fpFST, tnFST = CMfst.ravel()
print ("TP:", tpFST, ", FP:", fpFST, ", FN:,", fnFST, ", TN:", tnFST)
Now let's take a look at the accuracy measures for the Random Forest classifier.
ARfst = accuracy_score(testTarg,predictedFST)
print ("\nAccuracy rate:", ARfst)
ERfst = (fpFST + fnFST)/len(testTarg)
print ("\nError rate:", ERfst)
Pfst = tpFST/(tpFST + fpFST)
print ("\nPrecision:", np.round(Pfst, 2))
Rfst = tpFST/(tpFST+fnFST)
print ("\nRecall:", np.round(Rfst, 2))
F1fst = 2*tpFST/(2*tpFST + fpFST + fnFST)
print ("\nF1 score:", np.round(F1fst, 2))
Let's continue our analysis of the model by looking at the FPR, TPR, ROC curve, and AUC.
fprFST, tprFST, thFST = roc_curve(testTarg, predProbsFST) # False Positive Rate, True Positive Rate, probability thresholds
AUCfst = auc(fprFST, tprFST)
print ("\nTP rates:", np.round(tprFST, 2))
print ("\nFP rates:", np.round(fprFST, 2))
print ("\nProbability thresholds:", np.round(thFST, 2))
LW = 1.5
LL = "lower right"
LC = 'darkgreen'
plt.figure()
plt.title('ROC curve Random Forest')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.plot(fprFST, tprFST, color=LC,lw=LW, label='ROC curve (area = %0.2f)' % AUCfst)
plt.plot([0, 1], [0, 1], color='navy', lw=LW, linestyle='--')
plt.legend(loc=LL)
plt.show()
print ("\nAUC score:", np.round(AUCfst, 2))
This model was built to predict a person's chance of surviving hepatitis from their various health statistics. The three classifiers compared were the Random Forest classifier, the K-Nearest Neighbors classifier, and the Logistic Regression classifier. The Logistic Regression classifier was run with a tolerance of 0.1 and the other parameters listed above, the KNN was run with 5 neighbors using Manhattan distance, and the Random Forest was run with 100 trees and a minimum sample split of 2. Missing values were imputed with the median before outliers were handled; outliers, defined as values more than 2 standard deviations from the mean, were then replaced with the mean of the remaining values.
The code was run 14 times and the classifiers' predictions and metrics were averaged and then compared. The Random Forest was generally the best of the classifiers: its average AUC was over 0.88, while Logistic Regression averaged 0.85 and KNN averaged 0.83, and its ROC curves generally appeared to enclose the most area. The accuracy rate of all the classifiers was right around 85%. The Random Forest had the highest precision at 0.92, while KNN and Logistic Regression averaged 0.89. The KNN classifier had the highest recall at 0.93, while the Random Forest and Logistic Regression averaged 0.90. The F1 scores for all classifiers averaged 0.91. Looking at the confusion matrices, the Random Forest averaged 2 false positives and 2 false negatives for 4 misclassifications, while the KNN and Logistic Regression classifiers each averaged 3 false positives and 2 false negatives for 5 misclassifications.
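The run-by-run averaging described above was done by hand; below is a hedged sketch of how it could be automated for the Random Forest. The run count of 14 mirrors the description above, and averaging only AUC and accuracy here is a simplification.
aucScores = []
accScores = []
for run in range(14):
    # Re-split, re-train, and score the Random Forest on each run, then average the metrics.
    trF, teF, trT, teT = split_dataset(df, "isAlive", 0.2)
    clf = RandomForestClassifier(n_estimators=100, min_samples_split=2)
    clf.fit(trF, trT)
    runProbs = clf.predict_proba(teF)[:, 1]
    runFpr, runTpr, _ = roc_curve(teT, runProbs)
    aucScores.append(auc(runFpr, runTpr))
    accScores.append(accuracy_score(teT, clf.predict(teF)))
print("\nMean AUC over runs:", np.round(np.mean(aucScores), 2))
print("Mean accuracy over runs:", np.round(np.mean(accScores), 2))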
What a journey! That was an interesting problem to work through. We were able to import our data and make it usable, after which we created three different models that can predict the chance of survival for our hepatitis patients. Our best model was the Random Forest, but the K-Nearest Neighbors model was not far behind.