
Heart Disease? Explaining the ML Model | Part 2

The Explanation

Before reading this explanation: if you have not yet read the machine learning implementation in Part 1 of this post, read it here first.

Now let’s see what the model gives us through these ML explainability tools.

Permutation importance is the first tool for understanding a machine learning model. After a model has been fit, it shuffles individual variables in the validation data and measures the effect on accuracy. Learn more here.

Let’s take a look,

import eli5
from eli5.sklearn import PermutationImportance

# Shuffle each column of the validation data in turn and measure the score drop
perm = PermutationImportance(model, random_state=1).fit(X_test, y_test)
eli5.show_weights(perm, feature_names=X_test.columns.tolist())

So, it looks like the most important factor, in terms of permutation importance, is a thalassemia result of ‘reversible defect’. The high importance of ‘max heart rate achieved’ also makes sense, as this is the immediate state of the patient at the time of examination (as opposed to, say, age, which is a much more general factor).
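The shuffle-and-score procedure eli5 performs can be sketched by hand. Below is a minimal, self-contained version using a toy stand-in model (a simple threshold rule, not the fitted model from Part 1): the informative feature suffers a large accuracy drop when shuffled, while the noise feature does not.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy validation set: feature 0 drives the label, feature 1 is pure noise.
X_val = rng.normal(size=(200, 2))
y_val = (X_val[:, 0] > 0).astype(int)

def predict(X):
    # Stand-in "fitted model": thresholds the informative feature.
    return (X[:, 0] > 0).astype(int)

def accuracy(y_true, y_pred):
    return float(np.mean(y_true == y_pred))

base = accuracy(y_val, predict(X_val))

importances = []
for j in range(X_val.shape[1]):
    X_perm = X_val.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])  # shuffle one column only
    importances.append(base - accuracy(y_val, predict(X_perm)))
```

Shuffling feature 0 destroys the model’s accuracy, so its importance is large; shuffling the unused feature 1 changes nothing, so its importance is exactly zero.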

Let’s take a closer look at the number of major vessels using a Partial Dependence Plot (learn more here). These plots take a single row, vary one variable across a range of values, and record the effect on the prediction; they repeat this for many rows and plot the average effect. Let’s take a look at the ‘num_major_vessels’ variable, which was high on the permutation importance list,

from pdpbox import pdp

base_features = X_test.columns.values.tolist()

feat_name = 'num_major_vessels'
pdp_dist = pdp.pdp_isolate(model=model, dataset=X_test,
                           model_features=base_features, feature=feat_name)

pdp.pdp_plot(pdp_dist, feat_name)

So, we can see that as the number of major blood vessels increases, the probability of heart disease decreases. That makes sense, as it means more blood can get to the heart.
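The averaging that pdpbox performs can be reproduced manually: pin a feature to each grid value in turn, predict for every row, and average. The sketch below uses a toy stand-in model (a logistic curve on feature 0, not the blog’s fitted model) so the resulting curve is easy to verify.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))

def predict_proba(X):
    # Stand-in model: predicted probability falls as feature 0 grows.
    return 1.0 / (1.0 + np.exp(X[:, 0]))

grid = np.linspace(-2, 2, 9)
pdp_vals = []
for v in grid:
    X_mod = X.copy()
    X_mod[:, 0] = v                            # force every row to the grid value
    pdp_vals.append(predict_proba(X_mod).mean())  # average prediction over rows
```

Because the stand-in model decreases in feature 0, the partial dependence curve decreases monotonically along the grid, which is exactly the shape we read off the real plot above.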

What about ‘age’?

feat_name = 'age'
pdp_dist = pdp.pdp_isolate(model=model, dataset=X_test, model_features=base_features, feature=feat_name)

pdp.pdp_plot(pdp_dist, feat_name)

That’s a bit odd. The higher the age, the lower the chance of heart disease? Although the blue confidence regions show that this might not be true (the red baseline is within the blue zone).

What about ‘st_depression’?

feat_name = 'st_depression'
pdp_dist = pdp.pdp_isolate(model=model, dataset=X_test, model_features=base_features, feature=feat_name)

pdp.pdp_plot(pdp_dist, feat_name)

Interestingly, this variable also shows a reduction in probability the higher it goes. What exactly is this? A search on Google brought me to the following description by Anthony L. Komaroff, MD, an internal medicine specialist: “An electrocardiogram (ECG) measures the heart’s electrical activity. The waves that appear on it are labeled P, QRS, and T. Each corresponds to a different part of the heartbeat. The ST segment represents the heart’s electrical activity immediately after the right and left ventricles have contracted, pumping blood to the lungs and the rest of the body. Following this big effort, ventricular muscle cells relax and get ready for the next contraction.

During this period, little or no electricity is flowing, so the ST segment is even with the baseline or sometimes slightly above it. The faster the heart is beating during an ECG, the shorter all of the waves become. The shape and direction of the ST segment are far more important than its length. Upward or downward shifts can represent decreased blood flow to the heart from a variety of causes, including heart attack, spasms in one or more coronary arteries (Prinzmetal’s angina), infection of the lining of the heart (pericarditis) or the heart muscle itself (myocarditis), an excess of potassium in the bloodstream, a heart rhythm problem, or a blood clot in the lungs (pulmonary embolism).”

So, this variable, which is described as ‘ST depression induced by exercise relative to rest’, seems to suggest the higher the value the higher the probability of heart disease, but the plot above shows the opposite. Perhaps it’s not just the depression amount that’s important, but the interaction with the slope type? Let’s check with a 2D PDP,

inter1 = pdp.pdp_interact(model=model, dataset=X_test, model_features=base_features,
                          features=['st_slope_upsloping', 'st_depression'])

pdp.pdp_interact_plot(pdp_interact_out=inter1, feature_names=['st_slope_upsloping', 'st_depression'],
                      plot_type='contour')

inter1 = pdp.pdp_interact(model=model, dataset=X_test, model_features=base_features,
                          features=['st_slope_flat', 'st_depression'])

pdp.pdp_interact_plot(pdp_interact_out=inter1, feature_names=['st_slope_flat', 'st_depression'],
                      plot_type='contour')

It looks like a low depression is bad in both cases. Odd.
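The 2D interaction plot works the same way as the 1D case, but pins two features at once and averages the predictions over the rest of the row. A minimal sketch, again with a toy stand-in model (containing a deliberate interaction term, not the fitted model from Part 1):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))

def predict_proba(X):
    # Stand-in model with an interaction between features 0 and 1.
    return 1.0 / (1.0 + np.exp(-(X[:, 0] * X[:, 1])))

grid0 = np.linspace(-1, 1, 5)
grid1 = np.linspace(-1, 1, 5)
surface = np.empty((5, 5))
for i, a in enumerate(grid0):
    for k, b in enumerate(grid1):
        X_mod = X.copy()
        X_mod[:, 0] = a                 # pin both features at once
        X_mod[:, 1] = b
        surface[i, k] = predict_proba(X_mod).mean()
# surface[i, k] is the averaged prediction with both features pinned,
# i.e. one cell of the contour plot
```

When both pinned features move together, the multiplicative term drives the surface up in the corners, which is the kind of structure the contour plot is designed to reveal.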

Let’s see what the SHAP values tell us. These work by showing the influence of the values of every variable in a single row, compared to their baseline values (learn more here).

import shap

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

# shap_values[1] holds the SHAP values for the positive (disease) class
shap.summary_plot(shap_values[1], X_test, plot_type="bar")

The number of major vessels is at the top. Let’s use a summary plot of the SHAP values,

shap.summary_plot(shap_values[1], X_test)

The number of major vessels division is pretty clear, and it’s saying that low values are bad (blue on the right). The thalassemia ‘reversible defect’ division is very clear (yes = red = good, no = blue = bad).

You can see some clear separation in many of the other variables. Exercise induced angina has a clear separation, although not as expected, as ‘no’ (blue) increases the probability. Another clear one is the st_slope. It looks like when it’s flat, that’s a bad sign (red on the right).

It’s also odd that the men (red) have a reduced chance of heart disease in this model. Why is this? Domain knowledge tells us that men have a greater chance.
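The per-row, baseline-relative attribution SHAP computes can be illustrated with a toy model small enough that exact Shapley values are easy to enumerate by hand. This is a hand-rolled sketch, not the shap library: for each feature we average its marginal contribution over both orderings, replacing “missing” features with background data.

```python
import numpy as np

# Tiny background dataset used to represent "feature not included".
X_bg = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0], [1.0, 2.0]])

def f(X):
    # Toy additive model standing in for the classifier's output.
    return 3.0 * X[:, 0] - 2.0 * X[:, 1]

def value(x, subset):
    # E[f] with features in `subset` fixed to x's values, rest from background.
    Xs = X_bg.copy()
    for j in subset:
        Xs[:, j] = x[j]
    return f(Xs).mean()

x = np.array([2.0, 1.0])          # the single row being explained
phi = []
for j in range(2):
    other = 1 - j
    # Average feature j's marginal contribution over both orderings.
    phi.append(0.5 * ((value(x, [j]) - value(x, [])) +
                      (value(x, [j, other]) - value(x, [other]))))

baseline = value(x, [])
# baseline + sum(phi) reproduces the model output for this row exactly,
# which is the property the force plots below rely on
```

The key invariant: the baseline (average prediction) plus the per-feature SHAP values sums exactly to the model’s prediction for that row, so each phi reads as “how much this feature pushed this patient away from the baseline”.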

Next, let’s pick out individual patients and see how the different variables are affecting their outcomes,

def heart_disease_risk_factors(model, patient):
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(patient)
    return shap.force_plot(explainer.expected_value[1], shap_values[1], patient)

data_for_prediction = X_test.iloc[1, :].astype(float)
heart_disease_risk_factors(model, data_for_prediction)

For this person, their prediction is 36% (compared to a baseline of 58.4%). Many things are working in their favour, including having a major vessel, a reversible thalassemia defect, and not having a flat st_slope.

Let’s check another,

data_for_prediction = X_test.iloc[3,:].astype(float)
heart_disease_risk_factors(model, data_for_prediction)

For this person, their prediction is 70% (compared to a baseline of 58.4%). Not working in their favour are things like having no major vessels, a flat st_slope, and not a reversible thalassemia defect.

We can also plot something called ‘SHAP dependence contribution plots’ (learn more here), which are pretty self-explanatory in the context of SHAP values,

shap.dependence_plot('num_major_vessels', shap_values[1], X_test, interaction_index="st_depression")

You can see the stark effect on the number of major vessels, but there doesn’t seem to be a lot to take from the colour (st_depression).

The final plot, for me, is one of the most effective. It shows the predictions and influencing factors for many (in this case 50) patients, all together. It’s also interactive, which is great. Hover over to see why each person ended up either red (prediction of disease) or blue (prediction of no disease),

shap_values = explainer.shap_values(X_test.iloc[:50])
shap.force_plot(explainer.expected_value[1], shap_values[1], X_test.iloc[:50])


This dataset is old and small by today’s standards. However, it’s allowed us to create a simple model and then use various machine learning explainability tools and techniques to peek inside. At the start, I hypothesized, using (Googled) domain knowledge, that factors such as cholesterol and age would be major factors in the model. This dataset didn’t show that. Instead, the number of major vessels and aspects of ECG results dominated. I actually feel like I’ve learnt a thing or two about heart disease!

I suspect this sort of approach will become increasingly important as machine learning has a greater and greater role in health care.

