Today we will be looking at a dataset provided by the University of California, Irvine. The dataset contains records for 155 hepatitis patients of various ages and health levels, with 19 columns of health-related information about each patient plus a column indicating whether the patient was alive or dead at the time the data were collected. We will use these health statistics, together with each patient's survival outcome, to try to predict how patients with similar health attributes might fare with hepatitis.
First let's import pandas, numpy, and matplotlib. Let's also import our classifiers and evaluation metrics from sklearn.
import pandas as pd
import numpy as np
from sklearn.metrics import *
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
Next we will save the URL of the data file in a url variable and use pandas to read the data set into a DataFrame.
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/hepatitis/hepatitis.data'
df = pd.read_csv(url, header = None)
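As a quick optional check, the shape of the DataFrame should match the 155 patients and 20 columns described above.
print("\n", df.shape)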
Next let's assign our column names to make the data easier to use and understand.
df_names = [
'Status', 'Age', 'Sex', 'Steroid', 'Antivirals', 'Fatigue', 'Malaise', 'Anorexia',
'Big Liver', 'Firm Liver', 'Spleen Palpable', 'Spiders', 'Ascites', 'Varices',
'BILIRUBIN', 'Alk Phosphate', 'SGOT', 'Albumin', 'Protime', 'Histology'
]
df.columns = df_names
Let's take a look at our data frame so far.
print("\n", df.head())
Now let's check our data types.
print("\n", df.dtypes)
Let's begin by dealing with our categorical data. We will decode all variables and then deal with any aberrant categorical values. We will also look at some bar charts to see what categorical data we are dealing with.
First let's look at status. This tells us whether a person is alive or dead.
df.loc[:,"Status"].unique()
df.loc[df.loc[:,"Status"] == 1, "Status"] = "Dead"
df.loc[df.loc[:,"Status"] == 2, "Status"] = "Alive"
df.loc[:,"Status"].value_counts().plot(kind="Bar")
Now let's look at the sex column. This tells us whether a person is male or female.
i = "Sex"
df.loc[:,i].unique()
df.loc[df.loc[:,i] == 1, i] = "M"
df.loc[df.loc[:,i] == 2, i] = "F"
df.loc[:,i].value_counts().plot(kind="bar")
Our next column will tell us whether a person has used steroids or not.
i = "Steroid"
df.loc[:,i].unique()
df.loc[df.loc[:,i] == "1", i] = "No"
df.loc[df.loc[:,i] == "2", i] = "Yes"
df.loc[df.loc[:,i] == "?", i] = "Yes"
df.loc[:,i].value_counts().plot(kind="bar")
Now let's see if most people in our data set were using antivirals or not.
i = "Antivirals"
df.loc[:,i].unique()
df.loc[df.loc[:,i] == 1, i] = "No"
df.loc[df.loc[:,i] == 2, i] = "Yes"
df.loc[:,i].value_counts().plot(kind="bar")
Our next column will give us information about whether a person is fatigued or not. Let's take a look at what our data set shows.
i = "Fatigue"
df.loc[:,i].unique()
df.loc[df.loc[:,i] == "1", i] = "No"
df.loc[df.loc[:,i] == "2", i] = "Yes"
df.loc[df.loc[:,i] == "?", i] = "No"
df.loc[:,i].value_counts().plot(kind="bar")
Let's keep going with our next column, the malaise column. This shows us if a person was generally uncomfortable.
i = "Malaise"
df.loc[:,i].unique()
df.loc[df.loc[:,i] == "1", i] = "No"
df.loc[df.loc[:,i] == "2", i] = "Yes"
df.loc[df.loc[:,i] == "?", i] = "Yes"
df.loc[:,i].value_counts().plot(kind="bar")
Our next column tells us whether the patient was also suffering from anorexia or not.
i = "Anorexia"
df.loc[:,i].unique()
df.loc[df.loc[:,i] == "1", i] = "No"
df.loc[df.loc[:,i] == "2", i] = "Yes"
df.loc[df.loc[:,i] == "?", i] = "Yes"
df.loc[:,i].value_counts().plot(kind="bar")
Next let's see if our patients were exhibiting signs of hepatitis such as an enlarged ("big") liver.
i = "Big Liver"
df.loc[:,i].unique()
df.loc[df.loc[:,i] == "1", i] = "No"
df.loc[df.loc[:,i] == "2", i] = "Yes"
df.loc[df.loc[:,i] == "?", i] = "Yes"
df.loc[:,i].value_counts().plot(kind="bar")
Our next column shows us if the patients were exhibiting signs of hepatitis such as a firm liver.
i = "Firm Liver"
df.loc[:,i].unique()
df.loc[df.loc[:,i] == "1", i] = "No"
df.loc[df.loc[:,i] == "2", i] = "Yes"
df.loc[df.loc[:,i] == "?", i] = "Yes"
df.loc[:,i].value_counts().plot(kind="bar")
Let's keep going and look at our Spleen Palpable column. This shows us which of our patients had a palpable spleen.
i = "Spleen Palpable"
df.loc[:,i].unique()
df.loc[df.loc[:,i] == "1", i] = "No"
df.loc[df.loc[:,i] == "2", i] = "Yes"
df.loc[df.loc[:,i] == "?", i] = "Yes"
df.loc[:,i].value_counts().plot(kind="bar")
Our next column shows us whether our patients were exhibiting spider angiomas (the Spiders column).
i = "Spiders"
df.loc[:,i].unique()
df.loc[df.loc[:,i] == "1", i] = "No"
df.loc[df.loc[:,i] == "2", i] = "Yes"
df.loc[df.loc[:,i] == "?", i] = "Yes"
df.loc[:,i].value_counts().plot(kind="bar")
We only have a few more categorical variables, so let's keep going. Our next column shows us if the person showed signs of ascites.
i = "Ascites"
df.loc[:,i].unique()
df.loc[df.loc[:,i] == "1", i] = "No"
df.loc[df.loc[:,i] == "2", i] = "Yes"
df.loc[df.loc[:,i] == "?", i] = "Yes"
df.loc[:,i].value_counts().plot(kind="bar")
Our next column shows us if the person showed signs of Varices.
i = "Varices"
df.loc[:,i].unique()
df.loc[df.loc[:,i] == "1", i] = "No"
df.loc[df.loc[:,i] == "2", i] = "Yes"
df.loc[df.loc[:,i] == "?", i] = "Yes"
df.loc[:,i].value_counts().plot(kind="bar")
Finally, let's take a look at our last categorical column. This one records whether histology data was obtained for the patient.
i = "Histology"
df.loc[:,i].unique()
df.loc[df.loc[:,i] == 1, i] = "No"
df.loc[df.loc[:,i] == 2, i] = "Yes"
df.loc[:,i].value_counts().plot(kind="bar")
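As an aside, the per-column blocks above all follow the same decode, impute, and plot pattern. A small helper like the sketch below could have replaced most of them; the fill value for '?' is an assumption you would pick per column, just as we did above.
def decodeYesNo(df, col, missingFill):
    # Map the dataset's 1/2 codes (as ints or strings) to No/Yes, fill '?' with a chosen value, then plot.
    df.loc[df.loc[:, col].isin([1, "1"]), col] = "No"
    df.loc[df.loc[:, col].isin([2, "2"]), col] = "Yes"
    df.loc[df.loc[:, col] == "?", col] = missingFill
    df.loc[:, col].value_counts().plot(kind="bar")
For example, decodeYesNo(df, "Varices", "Yes") would reproduce the Varices block above.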
All of our categorical data has been decoded, and our graphs gave us a rough sense of how our hepatitis patients are distributed across the various categories. Now let's impute all missing values in our numeric columns with the median.
# Convert the five lab-value columns to numeric and impute missing values with the median.
for x in range(14, 19):
    i = df_names[x]
    df.loc[:, i] = pd.to_numeric(df.loc[:, i], errors='coerce')
    HasNan = np.isnan(df.loc[:, i])
    df.loc[HasNan, i] = np.nanmedian(df.loc[:, i])
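As an optional sanity check, we can confirm that no missing values remain in the numeric columns after the imputation.
print("\n", df.iloc[:, 14:19].isna().sum())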
Now let's deal with outliers in our numerical columns. Let's write a function to handle the repetition. We will assume that any numerical value more than 2 standard deviations from the mean is an outlier. The mean and standard deviation used to flag outliers are computed on the numeric data after all missing numerical values were imputed with the median.
def replaceOutliers(df, col, r):
    # Flag values more than 2 standard deviations from the mean and replace them
    # with the mean of the non-outlier values, rounded to r decimal places.
    LimitHi = np.mean(df.loc[:, col]) + 2*np.std(df.loc[:, col])
    LimitLo = np.mean(df.loc[:, col]) - 2*np.std(df.loc[:, col])
    FlagBad = (df.loc[:, col] > LimitHi) | (df.loc[:, col] < LimitLo)
    repMean = round(np.mean(df.loc[:, col][~FlagBad]), r)
    df.loc[FlagBad, col] = repMean
replaceOutliers(df, "BILIRUBIN", 1)
replaceOutliers(df, "Alk Phosphate", 0)
replaceOutliers(df, "SGOT", 0)
replaceOutliers(df, "Albumin", 1)
replaceOutliers(df, "Protime", 0)
Now that we've dealt with all aberrant data in our categorical and numerical variables, let's z-normalize our numerical variables. Let's create a function to assist us.
def normZ(df):
    # Z-normalize each column: subtract the column mean and divide by the column standard deviation.
    R, c = df.shape
    normed = np.zeros([R, c])
    for i in range(c):
        oMean = np.mean(df.iloc[:, i])
        oSD = np.std(df.iloc[:, i])
        normed[:, i] = (df.iloc[:, i] - oMean)/oSD
    return normed
df.iloc[:,14:19] = normZ(df.iloc[:,14:19])
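As a quick optional check, the normalized columns should now have means of roughly 0 and standard deviations of roughly 1.
print("\n", df.iloc[:, 14:19].describe().round(2))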
Next let's bin our Age variable. We will split the ages into seven bins of roughly ten years each.
NB = 7
freq, bounds = np.histogram(df.loc[:,"Age"], NB)
def bin2(df, col, b):  # col = column to bin, b = array of bin boundaries
    nb = len(b)
    N = len(df.loc[:, col])
    y = np.empty(N, int)  # empty integer array to store the bin numbers (output)
    for i in range(1, nb):  # repeat for each pair of bin boundaries
        y[(df.loc[:, col] >= b[i-1]) & (df.loc[:, col] < b[i])] = i
    y[df.loc[:, col] == b[-1]] = nb - 1  # ensure that the borderline cases are also binned appropriately
    return y
df.loc[:,"Age"] = bin2(df, "Age", bounds)
Now let's turn back to our categorical data. All our categorical variables are binary, so let's one-hot encode them and then drop the obsolete original columns.
df.loc[:, "isAlive"] = (df.loc[:, "Status"] == "Alive").astype(int)
df.loc[:, "isMale"] = (df.loc[:, "Sex"] == "M").astype(int)
df.loc[:, "isSteroidUser"] = (df.loc[:, "Steroid"] == "Yes").astype(int)
df.loc[:, "usesAntivirals"] = (df.loc[:, "Antivirals"] == "Yes").astype(int)
df.loc[:, "isFatigued"] = (df.loc[:, "Fatigue"] == "Yes").astype(int)
df.loc[:, "isMalaised"] = (df.loc[:, "Malaise"] == "Yes").astype(int)
df.loc[:, "isAnorexic"] = (df.loc[:, "Anorexia"] == "Yes").astype(int)
df.loc[:, "hasBigLiver"] = (df.loc[:, "Big Liver"] == "Yes").astype(int)
df.loc[:, "hasFirmLiver"] = (df.loc[:, "Firm Liver"] == "Yes").astype(int)
df.loc[:, "hasPalpableSpleen"] = (df.loc[:, "Spleen Palpable"] == "Yes").astype(int)
df.loc[:, "hasSpiders"] = (df.loc[:, "Spiders"] == "Yes").astype(int)
df.loc[:, "hasAscites"] = (df.loc[:, "Ascites"] == "Yes").astype(int)
df.loc[:, "hasVarices"] = (df.loc[:, "Varices"] == "Yes").astype(int)
df.loc[:, "hasHistology"] = (df.loc[:, "Histology"] == "Yes").astype(int)
df = df.drop("Status", axis=1)
df = df.drop("Sex", axis=1)
df = df.drop("Steroid", axis=1)
df = df.drop("Antivirals", axis=1)
df = df.drop("Fatigue", axis=1)
df = df.drop("Malaise", axis=1)
df = df.drop("Anorexia", axis=1)
df = df.drop("Big Liver", axis=1)
df = df.drop("Firm Liver", axis=1)
df = df.drop("Spleen Palpable", axis=1)
df = df.drop("Spiders", axis=1)
df = df.drop("Ascites", axis=1)
df = df.drop("Varices", axis=1)
df = df.drop("Histology", axis=1)
Now let's look at all our final data types.
print("\n", df.dtypes)
Now let's print the head of our final result.
print("\n", df.head())
Now that our data is loaded, all aberrant data has been dealt with, our Age variable has been binned, and our categorical variables have been one-hot encoded, let's move on to splitting our data. We will first train a few classifiers and then use them to predict our patients' survival. We will then use some evaluation measures to determine the best classifier to apply.
def split_dataset(df, colName, r):
    # Randomly hold out a fraction r of the rows as the test set; the rest becomes the training set.
    N = len(df)
    if r >= 1:
        print("Parameter r needs to be smaller than 1!")
        return
    elif r <= 0:
        print("Parameter r needs to be larger than 0!")
        return
    n = int(round(N*r))  # number of rows in the test set
    ind = -np.ones(n, int)  # indices of the test rows
    R = np.random.randint(N)
    for i in range(n):
        while R in ind: R = np.random.randint(N)
        ind[i] = R
    ind_ = list(set(range(N)).difference(ind))  # indices of the training rows
    ind = list(np.sort(ind))
    trainFeat = df.iloc[ind_, :].drop(colName, axis=1)
    testFeat = df.iloc[ind, :].drop(colName, axis=1)
    trainTarg = df.iloc[ind_, df.columns.get_loc(colName)]
    testTarg = df.iloc[ind, df.columns.get_loc(colName)]
    return trainFeat, testFeat, trainTarg, testTarg
r = 0.2
trainFeat, testFeat, trainTarg, testTarg = split_dataset(df, "isAlive", r)
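A quick optional check on the split sizes: with r = 0.2 we expect roughly 80% of the rows in the training set and 20% in the test set.
print("\nTrain features:", trainFeat.shape, " Test features:", testFeat.shape)
print("Train targets:", trainTarg.shape, " Test targets:", testTarg.shape)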
Now let's pick our predictive models. We will train them on the training set and evaluate them on the test set.
Let's start with the Logistic Regression classifier.
print ('\n\n\nLogistic regression classifier\n')
C_parameter = 50. / len(trainFeat)
class_parameter = 'ovr'
penalty_parameter = 'l1'
solver_parameter = 'saga'
tolerance_parameter = 0.1
clfLR = LogisticRegression(C=C_parameter, multi_class=class_parameter, penalty=penalty_parameter, solver=solver_parameter, tol=tolerance_parameter)
clfLR.fit(trainFeat, trainTarg)
print ('coefficients:')
print (clfLR.coef_)
print ('intercept:')
print (clfLR.intercept_)
predictedLR = clfLR.predict(testFeat)
predProbsLR = clfLR.predict_proba(testFeat)[:,1]
print ("predictions for test set:")
print (predictedLR)
print ('actual class values:')
print (testTarg.values.reshape(1,-1))
Now that we have trained and predicted with our first classifier, let's take a look at our confusion matrix.
CMLR = confusion_matrix(testTarg, predictedLR, labels=[1,0])
print ("\n\nConfusion matrix:\n", CMLR)
tpLR, fnLR, fpLR, tnLR = CMLR.ravel()
print ("TP:", tpLR, ", FP:", fpLR,", FN:,", fnLR, ", TN:", tnLR)
And now let's take a look at the accuracy measures for the Logistic Regression classifier.
ARLR = accuracy_score(testTarg,predictedLR)
print ("\nAccuracy rate:", ARLR)
ERLR = (fpLR + fnLR)/len(testTarg)
print ("\nError rate:", ERLR)
PLR = tpLR/(tpLR + fpLR)
print ("\nPrecision:", np.round(PLR, 2))
RLR = tpLR/(tpLR+fnLR)
print ("\nRecall:", np.round(RLR, 2))
F1LR = 2*tpLR/(2*tpLR + fpLR + fnLR)
print ("\nF1 score:", np.round(F1LR, 2))
Let's continue our analysis of the model by looking at the FPR, TPR, ROC curve, and AUC.
fprLR, tprLR, thLR = roc_curve(testTarg, predProbsLR)
AUCLR = auc(fprLR, tprLR)
print ("\nTP rates:", np.round(tprLR, 2))
print ("\nFP rates:", np.round(fprLR, 2))
print ("\nProbability thresholds:", np.round(thLR, 2))
LW = 1.5
LL = "lower right"
LC = 'darkgreen'
plt.figure()
plt.title('ROC curve Logistic Regression Classifier')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.plot(fprLR, tprLR, color=LC,lw=LW, label='ROC curve (area = %0.2f)' % AUCLR)
plt.plot([0, 1], [0, 1], color='navy', lw=LW, linestyle='--')
plt.legend(loc=LL)
plt.show()
print ("\nAUC score:", np.round(AUCLR, 2))
Now let's look at the K-Nearest Neighbors classifier.
print ('\n\nK nearest neighbors classifier\n')
k = 5
distance_metric = 'manhattan'
knn = KNeighborsClassifier(n_neighbors=k, metric=distance_metric)
knn.fit(trainFeat, trainTarg)
predictedKNN = knn.predict(testFeat)
predProbsKNN = knn.predict_proba(testFeat)[:,1]
print ("predictions for test set:")
print (predictedKNN)
print ('actual class values:')
print (testTarg.values.reshape(1,-1))
Now that we have trained and predicted with our second classifier, let's take a look at our confusion matrix.
CMknn = confusion_matrix(testTarg, predictedKNN, labels=[1,0])
print ("\n\nConfusion matrix:\n", CMknn)
tpKNN, fnKNN, fpKNN, tnKNN = CMknn.ravel()
print ("TP:", tpKNN, ", FP:", fpKNN,", FN:,", fnKNN, ", TN:", tnKNN)
And now let's take a look at the accuracy measures for the K-Nearest Neighbors classifier.
ARknn = accuracy_score(testTarg,predictedKNN)
print ("\nAccuracy rate:", ARknn)
ERknn = (fpKNN + fnKNN)/len(testTarg)
print ("\nError rate:", ERknn)
Pknn = tpKNN/(tpKNN + fpKNN)
print ("\nPrecision:", np.round(Pknn, 2))
Rknn = tpKNN/(tpKNN+fnKNN)
print ("\nRecall:", np.round(Rknn, 2))
F1knn = 2*tpKNN/(2*tpKNN + fpKNN + fnKNN)
print ("\nF1 score:", np.round(F1knn, 2))
Let's continue our analysis of the model by looking at the FPR, TPR, ROC curve, and AUC.
fprKNN, tprKNN, thKNN = roc_curve(testTarg, predProbsKNN)
AUCknn = auc(fprKNN, tprKNN)
print ("\nTP rates:", np.round(tprKNN, 2))
print ("\nFP rates:", np.round(fprKNN, 2))
print ("\nProbability thresholds:", np.round(thKNN, 2))
LW = 1.5
LL = "lower right"
LC = 'darkgreen'
plt.figure()
plt.title('ROC curve KNN')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.plot(fprKNN, tprKNN, color=LC,lw=LW, label='ROC curve (area = %0.2f)' % AUCknn)
plt.plot([0, 1], [0, 1], color='navy', lw=LW, linestyle='--')
plt.legend(loc=LL)
plt.show()
print ("\nAUC score:", np.round(AUCknn, 2))
Finally, let's look at how a Random Forest classifier does at predicting the survival of our hepatitis patients.
estimators = 100
mss = 2
print ('\n\nRandom Forest classifier\n')
clfFST = RandomForestClassifier(n_estimators=estimators, min_samples_split=mss) # these match sklearn's default values
clfFST.fit(trainFeat, trainTarg)
predictedFST = clfFST.predict(testFeat)
predProbsFST = clfFST.predict_proba(testFeat)[:,1]
print ("predictions for test set:")
print (predictedFST)
print ('actual class values:')
print (testTarg.values.reshape(1,-1))
Now that we have trained and predicted with our final classifier, let's take a look at our confusion matrix.
CMfst = confusion_matrix(testTarg, predictedFST, labels=[1,0])
print ("\n\nConfusion matrix:\n", CMfst)
tpFST, fnFST, fpFST, tnFST = CMfst.ravel()
print ("TP:", tpFST, ", FP:", fpFST, ", FN:,", fnFST, ", TN:", tnFST)
Now let's take a look at the accuracy measures for the Random Forest classifier.
ARfst = accuracy_score(testTarg,predictedFST)
print ("\nAccuracy rate:", ARfst)
ERfst = (fpFST + fnFST)/len(testTarg)
print ("\nError rate:", ERfst)
Pfst = tpFST/(tpFST + fpFST)
print ("\nPrecision:", np.round(Pfst, 2))
Rfst = tpFST/(tpFST+fnFST)
print ("\nRecall:", np.round(Rfst, 2))
F1fst = 2*tpFST/(2*tpFST + fpFST + fnFST)
print ("\nF1 score:", np.round(F1fst, 2))
Let's continue our analysis of the model by looking at the FPR, TPR, ROC curve, and AUC.
fprFST, tprFST, thFST = roc_curve(testTarg, predProbsFST) # False Positive Rate, True Positive Rate, probability thresholds
AUCfst = auc(fprFST, tprFST)
print ("\nTP rates:", np.round(tprFST, 2))
print ("\nFP rates:", np.round(fprFST, 2))
print ("\nProbability thresholds:", np.round(thFST, 2))
LW = 1.5
LL = "lower right"
LC = 'darkgreen'
plt.figure()
plt.title('ROC curve Random Forest')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.plot(fprFST, tprFST, color=LC,lw=LW, label='ROC curve (area = %0.2f)' % AUCfst)
plt.plot([0, 1], [0, 1], color='navy', lw=LW, linestyle='--')
plt.legend(loc=LL)
plt.show()
print ("\nAUC score:", np.round(AUCfst, 2))
This model was built to predict a person's chance of surviving hepatitis from their various health statistics. The three classifiers compared were the Random Forest classifier, the K-Nearest Neighbors classifier, and the Logistic Regression classifier. The Logistic Regression classifier was run with a tolerance of 0.1 and the other parameters listed above, the KNN was run with 5 neighbors using Manhattan distance, and the Random Forest was run with 100 trees and a minimum sample split of 2. Missing values were imputed with the median before outliers were handled; outliers, defined as values more than 2 standard deviations from the mean, were then replaced with the mean of the remaining values.
The code was run 14 times and the classifiers' predictions and metrics were averaged and then compared. The Random Forest was generally the best of the classifiers: its average AUC was over 0.88, while Logistic Regression averaged 0.85 and KNN averaged 0.83, and its ROC curves generally appeared to enclose the most area. The accuracy rate of all the classifiers was right around 85%. The Random Forest had the highest precision at 0.92, while KNN and Logistic Regression averaged 0.89. The KNN classifier had the highest recall at 0.93, while the Random Forest and Logistic Regression averaged 0.90. The F1 scores for all classifiers averaged 0.91. Looking at the confusion matrices, the Random Forest averaged 2 false positives and 2 false negatives for 4 misclassifications, while the KNN and Logistic Regression classifiers each averaged 3 false positives and 2 false negatives for 5 misclassifications.
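The run-by-run averaging described above was done by hand; below is a hedged sketch of how it could be automated for the Random Forest. The run count of 14 mirrors the description above, and averaging only AUC and accuracy here is a simplification.
aucScores = []
accScores = []
for run in range(14):
    # Re-split, re-train, and score the Random Forest on each run, then average the metrics.
    trF, teF, trT, teT = split_dataset(df, "isAlive", 0.2)
    clf = RandomForestClassifier(n_estimators=100, min_samples_split=2)
    clf.fit(trF, trT)
    runProbs = clf.predict_proba(teF)[:, 1]
    runFpr, runTpr, _ = roc_curve(teT, runProbs)
    aucScores.append(auc(runFpr, runTpr))
    accScores.append(accuracy_score(teT, clf.predict(teF)))
print("\nMean AUC over runs:", np.round(np.mean(aucScores), 2))
print("Mean accuracy over runs:", np.round(np.mean(accScores), 2))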
What a journey! That was an interesting problem to work through. We were able to import our data and make it usable, after which we created three different models that can predict the chance of survival for our hepatitis patients. Our best model was the Random Forest, but the K-Nearest Neighbors model was not far behind.