Bootstrapping and CIs with Veteran Suicides

Bagging
Bagging Feature Image

Introduction

This very sobering dataset documents the occurrences of suicides between veterans and non-veterans by state. As I stated in the introduction, state-side suicide rates amongst veterans are highly elevated over those of ordinary civilians. We can use this dataset to understand how much higher they are by building a confidence interval (often shorthanded to CI) around the data.

Confidence Intervals

You may also want to read the hypothesis testing notebook. And The salient point from that notebook was the fact that, due to the Central Limit Theorem. An estimator drawn more than approximately 30 times will have errors .(Deviations from the true value) that are approximately normally distributed.

To build a confidence interval, we first center on our estimator. In this case, the mean of the statewide suicide rates is an estimate for the nationwide rate x¯. We then plug it into the following formula:

formula

Where s is the standard deviation of the dataset , t* is the t-score; and n is the number of samples. Standard Deviation is the square root of its variance, a population parameter describing the amount of spread in the data.

A standard normal distribution is centered on the mean x¯. With decreasing probability as we go further out to the left or right of the distribution . The t* describes how far to the left and how far out on a standard normal distribution. We are going to try to capture “enough” of the distribution to be meaningfully accurate. The more confidence we want to be in our confidence interval, the larger the interval will be. Because the more cumulative probability we will have to cover. This is an accuracy-precision tradeoff.

Analysis

t* is based on the amount of confidence we want to have (95%,99%, etc.). And also on the number of samples in the distribution. A distribution with a very low number of samples will have significantly longer intervals. In other words, the errors will be distributed according to the standard normal. Only in the case when the number of samples is large enough! In cases with very few points, version of the normal with longer tails, the Student’s t-distribution, will be used instead. This limiting case, the standard normal, is what we can stick to in most practical cases (n≥15n≥15 or so) however; it is also known as the z score.

Implementation

With that jargon out of the way let’s go ahead and write up a confidence interval!

import pandas as pd
import numpy as np
veterans_2005 = pd.read_csv("../input/2005.csv", index_col=0)
veterans_2005.head(3)
import
df_2005 = pd.DataFrame(
    {'vet': veterans_2005['vet_suicides'] / veterans_2005['vet_pop'],
     'civ': (veterans_2005['all_suicides'] - veterans_2005['vet_suicides']) / 
            (veterans_2005['overall_pop_18'] - veterans_2005['vet_pop'])}
)
df_2005.head(3)
3
The formula is relatively straightforward to implement, though finding the t
t slash z
z score requires a table lookup in scipy.
import scipy.stats as st

def confidence_interval(X, c):
    x_bar = np.mean(X)
    z_score = st.norm.ppf(1 - ((1 - c) / 2))
    sqrt_n = np.sqrt(len(X))
    std_dev = np.std(X)
    
    delta = z_score * (std_dev / sqrt_n)
    return np.array([x_bar - delta, x_bar + delta])

Here are our 95% confidence intervals for suicide rates in 2005 for civilian and veteran populations, respectively.

confidence_interval(df_2005.civ, 0.95)
2
confidence_interval(df_2005.vet, 0.95)
4

You can clearly see a very large difference in these confidence intervals. However, the numbers themselves are hard to interpret, so let’s multiply things through a bit. The following values are 95% confidence intervals. This is for the annualized suicide risk for civilians and veterans, respectively, per one million population.

confidence_interval(df_2005.civ, 0.95) * 1000000
5
confidence_interval(df_2005.vet, 0.95) * 1000000
6

Non-military civilian suffer approximately 150 suicides per million per year. We are 95% confident the true mean lies between 142 and 166. Veterans, meanwhile, suffer approximately 300 suicides per year. We are 95% confident the true mean lies between 274 and 331.

Overall, the data backs up the fact that veterans commit suicide twice as often as non-military civilians do . This result is collaborated by research summarized e.g. here

Bootstrapping

I’m going to now introduce another concept, bootstrapping, which will help illustrate why confidence intervals work the way they do.

Bootstrapping is an extremely useful technique for non-parametrically . That is, without paying attention to distributions). Estimating things that we are interested in estimating. We bootstrap data by randomly sampling estimates from our data and applying our estimator to the random samples. The samples should be a small but significant slice of the overall data.

In this illustrative case I’ll use n=20.

draws = np.array([np.random.choice(df_2005.civ, size=20) for _ in range(10000)]) * 1000000
civ_means = np.array([np.mean(draw) for draw in draws])

draws = np.array([np.random.choice(df_2005.vet, size=20) for _ in range(10000)]) * 1000000
vet_means = np.array([np.mean(draw) for draw in draws])

del draws
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

fig, axarr = plt.subplots(1, 2, figsize=(12, 4))
plt.suptitle("Suicides Per Million Estimators, Bootstrapped").set_position([.5, 1.05])

import seaborn as sns
sns.distplot(pd.Series(civ_means), ax=axarr[0])
axarr[0].set_title("Civilians")

sns.distplot(pd.Series(vet_means), ax=axarr[1])
axarr[1].set_title("Veterans")
pass
9

Bootstrapping is great visual tool. Because it lets us get a handle on what numbers like “we are 95% confident” mean. In this case we see that, indeed, about 95 percent of our 20-state averages are between 142 and 166. A similar story emerges with the 274 and 331 values for veterans.

Change over Time

The analysis above was based on 2005 data. What has the change in the intervening years in the data been like? Has the military made progress on or made an impact on “closing the gap” in this troubling mental health outcome?

years = range(2005, 2012)

cis = []
for year in years:
    df = pd.read_csv("../input/{0}.csv".format(year), index_col=0)
    df = pd.DataFrame(
        {'vet': df['vet_suicides'] / df['vet_pop'],
         'civ': (df['all_suicides'] - df['vet_suicides']) / 
                (df['overall_pop_18'] - df['vet_pop'])}
    )
    cis.append({'civ': confidence_interval(df.civ, 0.95),
                'vet': confidence_interval(df.vet, 0.95)})
civ_means = [np.mean(c['civ'])*10**6 for c in cis]
vet_means = [np.mean(c['vet'])*10**6 for c in cis]

civ_mins = [0]*10**6 for c in cis]
vet_mins = [0]*10**6 for c in cis]

civ_maxs = [1]*10**6 for c in cis]
vet_maxs = [1]*10**6 for c in cis]
ind = pd.Index(range(2005, 2012))

fig, axarr = plt.subplots(1, 2, figsize=(12, 4))
plt.suptitle("Suicides Per Million Estimates, 2005-2011").set_position([.5, 1.05])

pd.Series(civ_means, index=ind).plot.line(color='black', ax=axarr[0])
pd.Series(civ_mins, index=ind).plot.line(color='steelblue', ax=axarr[0])
pd.Series(civ_maxs, index=ind).plot.line(color='steelblue', ax=axarr[0])
axarr[0].set_title("Civilians")

pd.Series(vet_means, index=ind).plot.line(color='black', ax=axarr[1])
pd.Series(vet_mins, index=ind).plot.line(color='steelblue', ax=axarr[1])
pd.Series(vet_maxs, index=ind).plot.line(color='steelblue', ax=axarr[1])
axarr[1].set_title("Veterans")
pass

No. In fact it appears that suicides are increasing in both the civilian and veteran populations.

That’s all folks. Thanks for reading on geekycodes.

Leave a Reply

%d bloggers like this: