Stroke Prediction: EDA & Classification Models in Python

Introduction

First, what is a stroke?

  • Stroke is a medical emergency. A stroke occurs when blood flow to part of your brain is interrupted or reduced, preventing brain tissue from getting oxygen and nutrients. Brain cells begin to die within minutes. Through this data we will try to learn more about strokes and build a model that attempts to predict them.

Risk factors for having a stroke include:

  • Age: people aged 55 years and over
  • Hypertension: if the systolic pressure is 140 mm Hg or more, or the diastolic pressure is 90 mm Hg or more
  • Hypercholesterolemia: if the blood cholesterol level is 200 milligrams per deciliter or more
  • Smoking
  • Diabetes
  • Obesity: if the body mass index (BMI) is 30 or more

Import

import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
plt.style.use('ggplot')
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report




Read & Explore

df = pd.read_csv("D:/Dataset/healthcare-dataset-stroke-data.csv")
df.head()

df.info()
df.describe()

Numeric Feature Distributions

fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(12, 10))
df.plot(kind="hist", y="age", bins=70, color="b", ax=axes[0][0])
df.plot(kind="hist", y="bmi", bins=100, color="r", ax=axes[0][1])
df.plot(kind="hist", y="heart_disease", bins=6, color="g", ax=axes[1][0])
df.plot(kind="hist", y="avg_glucose_level", bins=100, color="orange", ax=axes[1][1])
plt.show()
  • We have a good distribution for age
  • bmi appears to contain some outliers
  • The average glucose distribution looks reasonable: normal average blood glucose is below 140 mg/dL, so most readings fall in the normal range, which may limit how useful this feature is for linking diabetes to strokes

Data Summary (Check for Missing Values)

print ("Rows     : " , df.shape[0])
print ("Columns  : " , df.shape[1])
print ("\nFeatures : \n" , df.columns.tolist())
print ("\nMissing values :  ", df.isnull().sum().values.sum())
print ("\nUnique values :  \n",df.nunique())

Data Visualization

Stroke Pie Chart

labels =df['stroke'].value_counts(sort = True).index
sizes = df['stroke'].value_counts(sort = True)

colors = ["lightblue","red"]
explode = (0.05,0) 
 
plt.figure(figsize=(7,7))
plt.pie(sizes, explode=explode, labels=labels, colors=colors, autopct='%1.1f%%', shadow=True, startangle=90,)

plt.title('Stroke Breakdown')
plt.show()

Only about 5% of the people in this dataset have had a stroke!

Gender

plt.figure(figsize=(10,5))
sns.countplot(data=df,x='gender');

There is a difference of about 1,000 between the number of females and males in the data

Correlation with average glucose level

Visualize some features that may be correlated with the average glucose level

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(15, 5))
df.plot(kind='scatter', x='age', y='avg_glucose_level', alpha=0.5, color='green', ax=axes[0], title="Age vs. avg_glucose_level")
df.plot(kind='scatter', x='bmi', y='avg_glucose_level', alpha=0.5, color='red', ax=axes[1], title="bmi vs. avg_glucose_level")
plt.show()
  • Average glucose level tends to be higher in older people
  • People with BMI > 40 tend to have lower average glucose

Heatmap Correlation

plt.figure(figsize=(15,7))
# Note: corr() uses only the numeric columns (newer pandas may require df.corr(numeric_only=True))
sns.heatmap(df.corr(),annot=True);

The heatmap shows almost no correlation between stroke and bmi

BMI Boxplot

plt.figure(figsize=(10,7))
sns.boxplot(data=df, x='bmi', color='green');

There are many outliers, but before fixing them we should look at BMI more closely.

BMI

Body mass index (BMI) is a value derived from a person's mass (weight) and height.
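
As a quick illustration of the formula (the dataset already stores a precomputed bmi column; the weight and height below are hypothetical example values):

# BMI = weight in kilograms divided by height in metres squared
weight_kg = 80
height_m = 1.75
bmi_example = weight_kg / height_m ** 2
print(round(bmi_example, 1))  # ~26.1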

# Treat BMI values above 50 as outliers
bmi_outliers = df.loc[df['bmi'] > 50]
print("BMI outliers:", bmi_outliers['bmi'].shape[0])
# How many of these outlier rows had a stroke?
print(bmi_outliers['stroke'].value_counts())
print ("\nMissing values :  ", df.isnull().sum().values.sum())

Cap Outliers & Fill Missing Values

df["bmi"] = df["bmi"].apply(lambda x: 50 if x>50 else x)
df["bmi"] = df["bmi"].fillna(28.4)
print ("\nMissing values :  ", df.isnull().sum().values.sum())

Stroke vs. Categorical Features

cat_df = df[['gender','Residence_type','smoking_status','stroke']]
summary = pd.concat([pd.crosstab(cat_df[x], cat_df.stroke) for x in cat_df.columns[:-1]], keys=cat_df.columns[:-1])
summary
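
Because the classes are imbalanced, raw counts can be misleading; normalizing the crosstab by row (a small extra sketch, not part of the original notebook) shows the stroke rate within each category instead:

# Stroke rate within each category (each row sums to 1)
for col in ['gender', 'Residence_type', 'smoking_status']:
    print(pd.crosstab(cat_df[col], cat_df.stroke, normalize='index'), "\n")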

Stroke/Ever Married

plt.figure(figsize=(10,5))
strok=df.loc[df['stroke']==1]
sns.countplot(data=strok,x='ever_married',palette='inferno');

Stroke/Work Type

plt.figure(figsize=(10,5))
sns.countplot(data=strok,x='work_type',palette='cool');

Most stroke cases in this data are among people with private-sector jobs

Stroke/Smoking Status

plt.figure(figsize=(10,5))
sns.countplot(data=strok,x='smoking_status',palette='autumn');

Being a smoker or a former smoker increases the risk of having a stroke

Residence Type

plt.figure(figsize=(10,5))
sns.countplot(data=strok,x='Residence_type',palette='Greens');

Residence type shows little relationship with stroke, so we cannot use it as a distinguishing feature

Stroke/Heart Disease

plt.figure(figsize=(10,5))
sns.countplot(data=strok,x='heart_disease',palette='Reds');

Most people who had a stroke do not have heart disease, but that does not rule it out as an influential factor

plt.figure(figsize=(10,5))
sns.countplot(data=strok,x='hypertension',palette='Pastel2');

More than 25% of stroke cases had hypertension
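
That proportion can be checked directly on the strok subset defined above (a quick sketch):

# Share of stroke cases with (1) and without (0) hypertension
print(strok['hypertension'].value_counts(normalize=True))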

Notes

  • Average glucose level tends to be higher in older people
  • People with BMI > 40 tend to have lower average glucose
  • Most stroke cases occur among people who have ever been married
  • Being a smoker or a former smoker increases the risk of having a stroke
  • More than 25% of stroke cases had hypertension

Data preprocessing

Encoding Categorical Features

df["Residence_type"] = df["Residence_type"].apply(lambda x: 1 if x=="Urban" else 0)
df["ever_married"] = df["ever_married"].apply(lambda x: 1 if x=="Yes" else 0)
df["gender"] = df["gender"].apply(lambda x: 1 if x=="Male" else 0)

 
df = pd.get_dummies(data=df, columns=['smoking_status'])
df = pd.get_dummies(data=df, columns=['work_type'])
df

Scaling the Numeric Features

std=StandardScaler()
columns = ['avg_glucose_level','bmi','age']
scaled = std.fit_transform(df[['avg_glucose_level','bmi','age']])
scaled = pd.DataFrame(scaled,columns=columns)
df=df.drop(columns=columns,axis=1)
df=df.merge(scaled, left_index=True, right_index=True, how = "left")
df.head()
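
Note that the scaler above is fit on the full dataset for simplicity. In a stricter pipeline you would fit it only on the training rows so that test-set statistics do not leak into training; a minimal sketch of that variant (raw_df is a hypothetical name for the dataframe before scaling):

# Leakage-free variant (sketch): learn the mean/std from the training rows only
train_raw, test_raw = train_test_split(raw_df[columns], test_size=0.3, random_state=0)
scaler = StandardScaler().fit(train_raw)
train_scaled = scaler.transform(train_raw)
test_scaled = scaler.transform(test_raw)   # training statistics reused on the test rows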

Drop ID feature and check for nulls

df=df.drop(columns='id',axis=1)
df.head()
df[df.isnull().any(axis=1)]

Classification Models

Target & Features

X = df.drop(['stroke'], axis=1).values 
y = df['stroke'].values

Splitting

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
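
Because only about 5% of the rows are stroke cases, it can help to stratify the split so both sets keep the same class ratio (a small variant of the split above; random_state is only for reproducibility):

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)
print(y_train.mean(), y_test.mean())  # stroke rate should be ~0.05 in both splits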

AdaBoost Classification

# Create the AdaBoost classifier (in scikit-learn >= 1.2 the argument is named `estimator`)
ab_clf = AdaBoostClassifier(base_estimator=DecisionTreeClassifier(), n_estimators=100, 
                            learning_rate=0.5, random_state=100)

# Train the AdaBoost classifier
print("training....\n")
ab_clf.fit(X_train, y_train)

# Make predictions on the test set
ab_pred_stroke = ab_clf.predict(X_test)
print('prediction: \n', ab_pred_stroke)

print('\nparams: \n', ab_clf.get_params())

# Score (mean accuracy on the test set)
ab_clf_score = ab_clf.score(X_test, y_test)
print("\nmean accuracy: %.2f" % ab_clf_score)

Gradient Boosting

# Note: this is scikit-learn's GradientBoostingClassifier, not the separate xgboost library
xgboost = GradientBoostingClassifier(random_state=0)
xgboost.fit(X_train, y_train)
#== 
#Score 
#== 
xgboost_score = xgboost.score(X_train, y_train)
xgboost_test = xgboost.score(X_test, y_test)
#== 
#testing model 
#== 
y_pred = xgboost.predict(X_test)
#== 
#evaluation
#== 
cm = confusion_matrix(y_test,y_pred)
print('Training Score',xgboost_score)
print('Testing Score \n',xgboost_test)

#=== 
#Confusion Matrix 
plt.figure(figsize=(14,5))

conf_matrix = pd.DataFrame(data=cm,columns=['Predicted:0','Predicted:1'],index=['Actual:0','Actual:1'])
sns.heatmap(conf_matrix, annot=True,fmt='d',cmap="Greens");
print(accuracy_score(y_test,y_pred))
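
For comparison, the actual XGBoost library could be used in much the same way (a minimal sketch, assuming the separate xgboost package is installed; it exposes a scikit-learn-compatible API):

from xgboost import XGBClassifier  # requires: pip install xgboost

xgb_clf = XGBClassifier(n_estimators=100, random_state=0)
xgb_clf.fit(X_train, y_train)
print("XGBoost test accuracy:", xgb_clf.score(X_test, y_test))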

SVM

svc = SVC(random_state=0)
svc.fit(X_train, y_train)
#== 
#Score 
#== 
svc_score = svc.score(X_train, y_train)
svc_test = svc.score(X_test, y_test)
#== 
#testing model 
#== 
y_pred = svc.predict(X_test)
#== 
#evaluation
#== 
cm = confusion_matrix(y_test,y_pred)
print('Training Score',svc_score)
print('Testing Score \n',svc_test)
print(cm)

Random Forest Classifier

forest = RandomForestClassifier(n_estimators = 100)
#== 
forest.fit(X_train, y_train)
#== 
#Score 
#== 
forest_score = forest.score(X_train, y_train)
forest_test = forest.score(X_test, y_test)
#== 
#testing model 
#== 
y_pred = forest.predict(X_test)
#== 
#evaluation
#== 
cm = confusion_matrix(y_test,y_pred)
print('Training Score',forest_score)
print('Testing Score \n',forest_test)
print(cm)

Logistic Regression

model = LogisticRegression()
model.fit(X_train, y_train)

score = model.score(X_test, y_test)
print('Testing Score \n',score)
logistic_score = model.score(X_train, y_train)
logistic_test = model.score(X_test, y_test)
#== 
y_pred= model.predict(X_test)
print(classification_report(y_test, y_pred))
#== 
cm = confusion_matrix(y_test,y_pred)
print(cm)
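
Accuracy is high mainly because about 95% of the rows are non-stroke. One common adjustment worth trying (a sketch, not part of the original notebook) is to weight the minority class more heavily:

# Give stroke cases more weight so the model is not rewarded for predicting only the majority class
lr_balanced = LogisticRegression(class_weight='balanced', max_iter=1000)
lr_balanced.fit(X_train, y_train)
print(classification_report(y_test, lr_balanced.predict(X_test)))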

Feature Importance using Logistic Regression

coef = model.coef_[0]
coef = [abs(number) for number in coef]
print(coef)

cols = list(df.columns)
#== 
#Delete the target label so cols lines up with the feature columns in X
#== 
del cols[cols.index('stroke')]
cols

sorted_index = sorted(range(len(coef)), key=lambda k: coef[k], reverse=True)
for idx in sorted_index:
    print(cols[idx])
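
To see the coefficient magnitudes next to the feature names (a small extra sketch):

for name, value in sorted(zip(cols, coef), key=lambda t: t[1], reverse=True):
    print(f"{name}: {value:.3f}")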

Although BMI is considered an indicator of stroke risk, a large share of the BMI values lie in the normal range, so the feature does not strongly indicate a stroke here.

MLP NN Classifier

X=df.drop(['stroke','gender','bmi','Residence_type','work_type_Never_worked','smoking_status_Unknown'], axis=1).values 
#X = df.drop(['stroke','bmi'], axis=1).values 
y = df['stroke'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
# mlp = MLPClassifier(hidden_layer_sizes=(1000,300, 300, 300), solver='adam', shuffle=False, tol = 0.0001)
mlp=MLPClassifier(hidden_layer_sizes=(300,300,300), max_iter=1000, alpha=0.00001,
                     solver='adam', verbose=10,  random_state=21)
mlp.fit(X_train, y_train)
mlp_score = mlp.score(X_train, y_train)
mlp_test = mlp.score(X_test, y_test)

# Predictions on the test set
y_pred = mlp.predict(X_test)
#== 
#evaluation
#== 
cm = confusion_matrix(y_test,y_pred)
print('Training Score',mlp_score)
print('Testing Score \n',mlp_test)
print(cm)
Iteration 1, loss = 0.25073982
Iteration 2, loss = 0.15601721
Iteration 3, loss = 0.15148236
...
Iteration 135, loss = 0.06479217
Iteration 136, loss = 0.06493533
Iteration 137, loss = 0.06678607
Training loss did not improve more than tol=0.000100 for 10 consecutive epochs. Stopping.
Training Score 0.9751188146491473
Testing Score 
 0.9347684279191129
[[1420   29]
 [  71   13]]
plt.figure(figsize=(14,5))
cm = confusion_matrix(y_test,y_pred)
conf_matrix = pd.DataFrame(data=cm,columns=['Predicted:0','Predicted:1'],index=['Actual:0','Actual:1'])
sns.heatmap(conf_matrix, annot=True,fmt='d',cmap="Reds");

Sensitivity & Specificity

TN = cm[0, 0]
TP = cm[1, 1]
FN = cm[1, 0]
FP = cm[0, 1]
accuracy = (TP + TN) / float(TP + TN + FP + FN)
sensitivity = TP / float(TP + FN)
specificity = TN / float(TN + FP)
print('Accuracy = (TP+TN)/(TP+TN+FP+FN) =', accuracy)
print('Misclassification = 1 - Accuracy =', 1 - accuracy)
print('Sensitivity (True Positive Rate) = TP/(TP+FN) =', sensitivity)
print('Specificity (True Negative Rate) = TN/(TN+FP) =', specificity)
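
The same numbers can be cross-checked with scikit-learn's metrics: sensitivity is the recall of the positive class, and specificity is the recall of the negative class.

from sklearn.metrics import recall_score

print('Sensitivity:', recall_score(y_test, y_pred))               # recall of class 1
print('Specificity:', recall_score(y_test, y_pred, pos_label=0))  # recall of class 0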

This notebook was written on Kaggle by Ahmed Ashour.
