Introduction
First, what is a stroke?
- A stroke is a medical emergency. It occurs when blood flow to part of the brain is interrupted or reduced, preventing brain tissue from getting oxygen and nutrients; brain cells begin to die within minutes. In this notebook we explore the data to learn more about strokes and build a model for stroke prediction.
Risk factors for having a stroke include (a small threshold-check sketch follows the list):
- Age: People aged 55 years and over
- Hypertension: if the systolic pressure is 140 mm Hg or more, or the diastolic pressure is 90 mm Hg or more
- Hypercholesterolemia: if the blood cholesterol level is 200 milligrams per deciliter or more
- Smoking
- Diabetes
- Obesity: if the body mass index (BMI) is 30 or more
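A minimal sketch of these cut-offs as simple boolean checks; the function and field names are illustrative and are not columns of the dataset used below.
# Illustrative only: flags the thresholds listed above for a single person.
def risk_flags(age, systolic, diastolic, cholesterol_mg_dl, bmi, smoker, diabetic):
    return {
        "age_55_plus": age >= 55,
        "hypertension": systolic >= 140 or diastolic >= 90,
        "hypercholesterolemia": cholesterol_mg_dl >= 200,
        "smoking": smoker,
        "diabetes": diabetic,
        "obesity": bmi >= 30,
    }

print(risk_flags(age=62, systolic=150, diastolic=85, cholesterol_mg_dl=210,
                 bmi=31.2, smoker=False, diabetic=True))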
Import
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('ggplot')
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
df=pd.read_csv("D:/Dataset/healthcare-dataset-stroke-data.csv")
df.head()

Read & Explore
df.info()

df.describe()

Numeric Feature Distributions
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(12, 10))
df.plot(kind="hist", y="age", bins=70, color="b", ax=axes[0][0])
df.plot(kind="hist", y="bmi", bins=100, color="r", ax=axes[0][1])
df.plot(kind="hist", y="heart_disease", bins=6, color="g", ax=axes[1][0])
df.plot(kind="hist", y="avg_glucose_level", bins=100, color="orange", ax=axes[1][1])
plt.show()

- Age has a good spread across the dataset
- BMI appears to contain outliers
- The average glucose distribution looks plausible, since a normal average blood glucose is below 140 mg/dL; on the other hand, that may mean this feature will not be very helpful for telling whether diabetes correlates with stroke (a quick check of the 140 mg/dL cut-off follows below)
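A quick sketch of that 140 mg/dL cut-off against the data, assuming the same df loaded above:
# share of patients whose average glucose exceeds 140 mg/dL, and the stroke
# rate on each side of that cut-off
high_glucose = df['avg_glucose_level'] > 140
print("Share above 140 mg/dL:", round(high_glucose.mean(), 3))
print(df.groupby(high_glucose)['stroke'].mean())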
Data Summary (check for missing values)
print ("Rows : " , df.shape[0])
print ("Columns : " , df.shape[1])
print ("\nFeatures : \n" , df.columns.tolist())
print ("\nMissing values : ", df.isnull().sum().values.sum())
print ("\nUnique values : \n",df.nunique())

Data Visualization
Stroke Pie Chart
labels =df['stroke'].value_counts(sort = True).index
sizes = df['stroke'].value_counts(sort = True)
colors = ["lightblue","red"]
explode = (0.05,0)
plt.figure(figsize=(7,7))
plt.pie(sizes, explode=explode, labels=labels, colors=colors, autopct='%1.1f%%', shadow=True, startangle=90,)
plt.title('Stroke Breakdown')
plt.show()

Only about 5% of the people in the dataset have had a stroke!
Gender
plt.figure(figsize=(10,5))
sns.countplot(data=df,x='gender');

There is a difference of about 1,000 between females and males in the data
Correlation with average glucose level
Visualize some features that may correlate with average glucose level
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(15, 5))
df.plot(kind='scatter', x='age', y='avg_glucose_level', alpha=0.5, color='green', ax=axes[0], title="Age vs. avg_glucose_level")
df.plot(kind='scatter', x='bmi', y='avg_glucose_level', alpha=0.5, color='red', ax=axes[1], title="bmi vs. avg_glucose_level")
plt.show()

- Average glucose level tends to be higher in older people
- Patients with BMI > 40 mostly have low average glucose
Correlation Heatmap
plt.figure(figsize=(15,7))
sns.heatmap(df.corr(),annot=True);

There is little to no correlation between stroke and BMI
BMI Boxplot
plt.figure(figsize=(10,7))
sns.boxplot(data=df,x=df["bmi"],color='green');

We have many outliers, but before fixing them we should look at what BMI means.
BMI
Body mass index (BMI) is a value derived from a person's weight and height.
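For reference, the standard formula is weight in kilograms divided by height in metres squared; a one-line sketch with illustrative values:
# BMI = weight (kg) / height (m)^2
def bmi(weight_kg, height_m):
    return weight_kg / height_m ** 2

print(round(bmi(80, 1.75), 1))  # ~26.1, in the overweight range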

bmi_outliers=df.loc[df['bmi']>50]
bmi_outliers['bmi'].shape

# stroke counts among the BMI outliers (bmi > 50)
print(bmi_outliers['stroke'].value_counts())

print ("\nMissing values : ", df.isnull().sum().values.sum())

Cap outliers, fill missing BMI values, and double-check
df["bmi"] = df["bmi"].apply(lambda x: 50 if x>50 else x)
df["bmi"] = df["bmi"].fillna(28.4)
print ("\nMissing values : ", df.isnull().sum().values.sum())

Stroke vs. Categorical Features
cat_df = df[['gender','Residence_type','smoking_status','stroke']]
summary = pd.concat([pd.crosstab(cat_df[x], cat_df.stroke) for x in cat_df.columns[:-1]], keys=cat_df.columns[:-1])
summary
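Raw counts can hide the base rates because the groups differ in size; a short sketch of the stroke rate within each category, reusing the same cat_df:
# stroke rate (share of stroke == 1) per category
for col in ['gender', 'Residence_type', 'smoking_status']:
    rates = pd.crosstab(cat_df[col], cat_df['stroke'], normalize='index')
    print(rates[1].round(3).rename('stroke_rate'), '\n')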

Stroke/Ever Married
plt.figure(figsize=(10,5))
strok=df.loc[df['stroke']==1]
sns.countplot(data=strok,x='ever_married',palette='inferno');

Stroke/Work Type
plt.figure(figsize=(10,5))
sns.countplot(data=strok,x='work_type',palette='cool');

Most stroke cases in this data are among private-sector workers
Stroke/Smoking Status
plt.figure(figsize=(10,5))
sns.countplot(data=strok,x='smoking_status',palette='autumn');

Being a current or former smoker increases your risk of having a stroke
Residence Type
plt.figure(figsize=(10,5))
sns.countplot(data=strok,x='Residence_type',palette='Greens');

Residence type shows no clear relationship with stroke, so we cannot use it as an indicator
Stroke/Heart Disease
plt.figure(figsize=(10,5))
sns.countplot(data=strok,x='heart_disease',palette='Reds');

Most people who had a stroke do not have heart disease, but that does not rule it out as an influential factor
Stroke/Hypertension
plt.figure(figsize=(10,5))
sns.countplot(data=strok,x='hypertension',palette='Pastel2');

More than 25% of stroke cases had hypertension
Notes
- Average glucose level tends to be higher in older people
- Patients with BMI > 40 mostly have low average glucose
- Unmarried people in this data show fewer strokes
- Being a current or former smoker increases your risk of having a stroke
- More than 25% of stroke cases had hypertension
Data preprocessing
Encoding Categorical Features
df["Residence_type"] = df["Residence_type"].apply(lambda x: 1 if x=="Urban" else 0)
df["ever_married"] = df["ever_married"].apply(lambda x: 1 if x=="Yes" else 0)
df["gender"] = df["gender"].apply(lambda x: 1 if x=="Male" else 0)
df = pd.get_dummies(data=df, columns=['smoking_status'])
df = pd.get_dummies(data=df, columns=['work_type'])
df

Scaling the Numeric Features
std=StandardScaler()
columns = ['avg_glucose_level','bmi','age']
scaled = std.fit_transform(df[['avg_glucose_level','bmi','age']])
scaled = pd.DataFrame(scaled,columns=columns)
df=df.drop(columns=columns,axis=1)
df=df.merge(scaled, left_index=True, right_index=True, how = "left")
df.head()
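One caveat: the scaler above is fit on the whole dataset before the train/test split, so test-set statistics leak into training. A minimal sketch of the leak-free pattern, where df_raw and num_cols are placeholder names for an unscaled frame and its numeric columns (train_test_split and StandardScaler are already imported above):
# fit the scaler on the training split only, then reuse its statistics for the test split
def split_then_scale(df_raw, num_cols, test_size=0.3, seed=0):
    train, test = train_test_split(df_raw, test_size=test_size, random_state=seed)
    train, test = train.copy(), test.copy()
    scaler = StandardScaler().fit(train[num_cols])
    train[num_cols] = scaler.transform(train[num_cols])
    test[num_cols] = scaler.transform(test[num_cols])
    return train, test, scaler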

Drop ID feature and check for nulls
df=df.drop(columns='id',axis=1)
df.head()

df[df.isnull().any(axis=1)]

Classification Models
Target & Features
X = df.drop(['stroke'], axis=1).values
y = df['stroke'].values
Splitting
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
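Since only about 5% of cases are positive, an unstratified split can shift the class ratio between train and test. A sketch using the stratify argument; the X_tr/y_tr names and random_state are illustrative and do not replace the split used by the models below:
# stratify=y keeps the stroke/no-stroke ratio the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)
print("train positive rate:", round(float(y_tr.mean()), 4),
      "| test positive rate:", round(float(y_te.mean()), 4))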
AdaBoost Classifier
#create AdaBoost classifier with a decision tree base estimator
ab_clf = AdaBoostClassifier(base_estimator=DecisionTreeClassifier(), n_estimators=100,
                            learning_rate=0.5, random_state=100)
#train the AdaBoost classification model
print("training....\n")
ab_clf.fit(X_train, y_train)
#make predictions using the test set
ab_pred_stroke = ab_clf.predict(X_test)
print('prediction: \n', ab_pred_stroke)
print('\nparams: \n', ab_clf.get_params())
#score
ab_clf_score = ab_clf.score(X_test, y_test)
print("\nmean accuracy: %.2f" % ab_clf_score)

Gradient Boosting
xgboost = GradientBoostingClassifier(random_state=0)  # scikit-learn's GradientBoostingClassifier, not the separate XGBoost library
xgboost.fit(X_train, y_train)
#==
#Score
#==
xgboost_score = xgboost.score(X_train, y_train)
xgboost_test = xgboost.score(X_test, y_test)
#==
#testing model
#==
y_pred = xgboost.predict(X_test)
#==
#evaluation
#==
cm = confusion_matrix(y_test,y_pred)
print('Training Score',xgboost_score)
print('Testing Score \n',xgboost_test)
#===
#Confusion Matrix
plt.figure(figsize=(14,5))
conf_matrix = pd.DataFrame(data=cm,columns=['Predicted:0','Predicted:1'],index=['Actual:0','Actual:1'])
sns.heatmap(conf_matrix, annot=True,fmt='d',cmap="Greens");
print(accuracy_score(y_test,y_pred))
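Since roughly 95% of cases are non-stroke, accuracy alone says little about the minority class; a per-class report for the same predictions is more informative:
# precision, recall and F1 per class for the gradient boosting predictions
print(classification_report(y_test, y_pred))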

SVM
svc = SVC(random_state=0)
svc.fit(X_train, y_train)
#==
#Score
#==
svc_score = svc.score(X_train, y_train)
svc_test = svc.score(X_test, y_test)
#==
#testing model
#==
y_pred = svc.predict(X_test)
#==
#evaluation
#==
cm = confusion_matrix(y_test,y_pred)
print('Training Score',svc_score)
print('Testing Score \n',svc_test)
print(cm)

Random Forest Classifier
forest = RandomForestClassifier(n_estimators = 100)
#==
forest.fit(X_train, y_train)
#==
#Score
#==
forest_score = forest.score(X_train, y_train)
forest_test = forest.score(X_test, y_test)
#==
#testing model
#==
y_pred = forest.predict(X_test)
#==
#evaluation
#==
cm = confusion_matrix(y_test,y_pred)
print('Training Score',forest_score)
print('Testing Score \n',forest_test)
print(cm)
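Random forests also expose impurity-based feature importances; a sketch that pairs them with the column names (the feature order follows df with 'stroke' dropped, exactly as X was built):
# sort features by the forest's impurity-based importance
feature_names = df.drop(['stroke'], axis=1).columns
for name, importance in sorted(zip(feature_names, forest.feature_importances_),
                               key=lambda t: t[1], reverse=True):
    print(f"{name}: {importance:.3f}")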

Logistic Regression
model = LogisticRegression()
model.fit(X_train, y_train)
score = model.score(X_test, y_test)
print('Testing Score \n',score)
logistic_score = model.score(X_train, y_train)
logistic_test = model.score(X_test, y_test)
#==
y_pred= model.predict(X_test)
print(classification_report(y_test, y_pred))
#==
cm = confusion_matrix(y_test,y_pred)
print(cm)

Feature Importance using Logistic Regression
coef = model.coef_[0]
coef = [abs(number) for number in coef]
print(coef)

cols = list(df.columns)
stroke_idx = cols.index('stroke')
#==
#Delete the target label so the remaining names line up with the coefficients
#==
del cols[stroke_idx]
cols

sorted_index = sorted(range(len(coef)), key=lambda k: coef[k], reverse=True)
for idx in sorted_index:
    print(cols[idx])

Although BMI is considered an indicator of stroke risk, most of its values fall in the normal range rather than at levels that would point to a stroke, so it does not stand out as a strong predictor in this data.
MLP NN Classifier
X=df.drop(['stroke','gender','bmi','Residence_type','work_type_Never_worked','smoking_status_Unknown'], axis=1).values
#X = df.drop(['stroke','bmi'], axis=1).values
y = df['stroke'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
# mlp = MLPClassifier(hidden_layer_sizes=(1000,300, 300, 300), solver='adam', shuffle=False, tol = 0.0001)
mlp = MLPClassifier(hidden_layer_sizes=(300, 300, 300), max_iter=1000, alpha=0.00001,
                    solver='adam', verbose=10, random_state=21)
mlp.fit(X_train, y_train)
mlp_pred= mlp.predict(X_test)
mlp_score = mlp.score(X_train, y_train)
mlp_test = mlp.score(X_test, y_test)
y_pred =mlp.predict(X_test)
#==
#evaluation
#==
cm = confusion_matrix(y_test,y_pred)
print('Training Score',mlp_score)
print('Testing Score \n',mlp_test)
print(cm)
Iteration 1, loss = 0.25073982
Iteration 2, loss = 0.15601721
...
Iteration 136, loss = 0.06493533
Iteration 137, loss = 0.06678607
Training loss did not improve more than tol=0.000100 for 10 consecutive epochs. Stopping.
Training Score 0.9751188146491473
Testing Score 0.9347684279191129
[[1420   29]
 [  71   13]]
plt.figure(figsize=(14,5))
cm = confusion_matrix(y_test,y_pred)
conf_matrix = pd.DataFrame(data=cm,columns=['Predicted:0','Predicted:1'],index=['Actual:0','Actual:1'])
sns.heatmap(conf_matrix, annot=True,fmt='d',cmap="Reds");

Sensitivity & Specificity

TN = cm[0,0]
TP = cm[1,1]
FN = cm[1,0]
FP = cm[0,1]
sensitivity = TP/float(TP+FN)
specificity = TN/float(TN+FP)
accuracy = (TP+TN)/float(TP+TN+FP+FN)
print('The accuracy of the model = (TP+TN)/(TP+TN+FP+FN) =', accuracy)
print('The misclassification rate = 1-Accuracy =', 1-accuracy)
print('Sensitivity or True Positive Rate = TP/(TP+FN) =', sensitivity)
print('Specificity or True Negative Rate = TN/(TN+FP) =', specificity)
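As a cross-check, the same figures can be computed with sklearn's metric functions (recall on class 1 is sensitivity, recall on class 0 is specificity):
from sklearn.metrics import recall_score, precision_score
print("Sensitivity (recall, class 1):", round(recall_score(y_test, y_pred), 3))
print("Specificity (recall, class 0):", round(recall_score(y_test, y_pred, pos_label=0), 3))
print("Precision (class 1):", round(precision_score(y_test, y_pred, zero_division=0), 3))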

This notebook was written on Kaggle by Ahmed Ashour.