Exploratory Data Analysis (EDA) With Python

Exploratory Data Analysis (EDA) is a method statisticians use to reveal patterns and important results in data, mainly by visualizing it with various graphs. In data analysis, EDA is a very important step for spotting and recognizing the valuable patterns within the data.

Requirements:
  1. Python
  2. Pandas
  3. Seaborn library (imported as sns)
  4. Matplotlib library
  5. NumPy

In this post I am going to perform a simple EDA on the Titanic dataset available on Kaggle. You can download it from here.

Importing Required Python Libraries.
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns 
%matplotlib inline

# Set default matplotlib figure size (pylab was never imported, so use plt)
plt.rcParams['figure.figsize'] = (10.0, 8.0)
Importing the DataSet using Pandas
titanic_df = pd.read_csv('train.csv')
Check first 5 rows of the dataset
titanic_df.head()
Check Column Names
titanic_df.columns
Information about the dataset.
titanic_df.info()
Number of Passengers in each class
titanic_df.groupby('Pclass')['Pclass'].count()
Plot the classes

As we can see, there are 3 classes of passengers in the dataset: class 1 has 216 passengers, class 2 has 184 and class 3 has 491. Now let’s plot the classes using Seaborn.
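
As a quick sketch of how per-class counts and their shares can be tabulated before plotting, the snippet below uses a small made-up frame (toy_df is hypothetical and only stands in for titanic_df; its numbers are not the Titanic figures quoted above).

```python
import pandas as pd

# Hypothetical stand-in for titanic_df; the real data comes from train.csv.
toy_df = pd.DataFrame({'Pclass': [1, 3, 3, 2, 3, 1, 3, 2]})

# Absolute number of passengers per class, ordered by class number.
class_counts = toy_df['Pclass'].value_counts().sort_index()

# The same counts expressed as a share of all passengers.
class_share = toy_df['Pclass'].value_counts(normalize=True).sort_index()

print(class_counts)
print(class_share)
```

Running the same two lines on titanic_df['Pclass'] should give the counts discussed above together with their proportions.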

Plot By Sex
titanic_df.groupby('Sex')['Sex'].count()

This gives the count of male and female passengers.

sns.catplot(x='Sex', data=titanic_df, kind='count', aspect=1.5)

Number of Women and Men in each passenger Class

# Number of men and women in each of the passenger class
titanic_df.groupby(['Sex', 'Pclass'])['Sex'].count()
# Again use seaborn to group by sex and class
g = sns.catplot(x='Pclass', data=titanic_df, hue='Sex', kind='count', aspect=1.75)
g.set_xlabels('Class')
As shown in the figure above, there are more than twice as many males as females in class 3. In classes 1 and 2, however, the ratio of males to females is much closer to 1.
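
The male-to-female ratio described above can also be computed directly rather than read off the bars. This is a minimal sketch on made-up rows (toy_df is hypothetical); on the real data the same two lines would run on titanic_df.

```python
import pandas as pd

# Hypothetical toy rows standing in for titanic_df.
toy_df = pd.DataFrame({
    'Sex':    ['male', 'female', 'male', 'male', 'female', 'male', 'female', 'male'],
    'Pclass': [1,      1,        2,      3,      3,        3,      2,        3],
})

# Rows are classes, columns are sexes, cells are passenger counts.
counts = pd.crosstab(toy_df['Pclass'], toy_df['Sex'])

# Male-to-female ratio within each class.
ratio = counts['male'] / counts['female']
print(ratio)
```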

Now let’s look at the numbers of males and females who survived the Titanic sinking, grouped by class.

titanic_df.pivot_table(values='Survived', index='Sex', columns='Pclass', aggfunc='sum', margins=True)
not_survived = titanic_df[titanic_df['Survived']==0]
sns.catplot(x='Survived', data=titanic_df, kind='count')
len(not_survived)

Now let’s look at the number of people who didn’t survive in each class, grouped by sex.

not_survived.pivot_table('Survived', 'Sex', 'Pclass', aggfunc=len, margins=True)
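
Counts of survivors and non-survivors are easier to compare once they are turned into rates. The sketch below, on a hypothetical toy_df, relies on the fact that the mean of a 0/1 column is the fraction of ones.

```python
import pandas as pd

# Hypothetical toy rows standing in for titanic_df.
toy_df = pd.DataFrame({
    'Sex':      ['male', 'male', 'female', 'female', 'male', 'female'],
    'Pclass':   [1, 3, 1, 3, 3, 3],
    'Survived': [1, 0, 1, 1, 0, 0],
})

# Mean of the 0/1 Survived column = survival rate per (sex, class) cell.
rate = toy_df.pivot_table(values='Survived', index='Sex',
                          columns='Pclass', aggfunc='mean')
print(rate)
```
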
Passengers who survived and who didn’t survive grouped by class and sex
table = pd.crosstab(index=[titanic_df.Survived,titanic_df.Pclass], columns=[titanic_df.Sex,titanic_df.Embarked])
table.unstack()
table.columns, table.index
# set_levels returns a new index (the inplace option was removed in pandas 2.0)
table.columns = table.columns.set_levels(['Female', 'Male'], level=0)
table.columns = table.columns.set_levels(['Cherbourg','Queenstown','Southampton'], level=1)
print('Average and median age of passengers are %0.f and %0.f years old, respectively' % (titanic_df.Age.mean(), titanic_df.Age.median()))
# Show the first rows that contain at least one missing value
titanic_df[titanic_df.isnull().any(axis=1)].head()
age = titanic_df['Age'].dropna()
age_dist = sns.histplot(age, kde=True)
age_dist.set_title("Distribution of Passengers' Ages")

Another way to plot a histogram of ages is shown below

titanic_df['Age'].hist(bins=50)

Now let’s check the data types of some columns in the dataset

titanic_df['Parch'].dtype, titanic_df['SibSp'].dtype, len(titanic_df.Cabin.dropna())
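
A compact way to see data types and missing values side by side is sketched below. The frame toy_df and its values are made up for illustration; the same expression can be pointed at titanic_df.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few missing entries.
toy_df = pd.DataFrame({
    'Age':   [22.0, np.nan, 35.0, np.nan],
    'Cabin': ['C85', None, None, 'E46'],
    'Parch': [0, 1, 0, 2],
})

# One row per column: its dtype and how many values are missing.
summary = pd.DataFrame({
    'dtype': toy_df.dtypes.astype(str),
    'missing': toy_df.isna().sum(),
})
print(summary)
```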

Now let’s create a function to define those who are children.

def male_female_child(passenger):
    age, sex = passenger
    
    if age < 16:
        return 'child'
    else:
        return sex
titanic_df['person'] = titanic_df[['Age', 'Sex']].apply(male_female_child, axis=1)
titanic_df[:10]
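
The row-wise apply above can be slow on large frames. A vectorized sketch of the same rule is shown here, assuming NumPy's where function and a hypothetical toy_df; it is an alternative, not the method used in the rest of this post.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows, including one passenger with unknown age.
toy_df = pd.DataFrame({
    'Age': [4.0, 30.0, np.nan, 15.0, 16.0],
    'Sex': ['male', 'female', 'male', 'female', 'male'],
})

# NaN < 16 evaluates to False, so passengers with unknown age keep
# their sex label, which matches the behaviour of the row-wise function.
toy_df['person'] = np.where(toy_df['Age'] < 16, 'child', toy_df['Sex'])
print(toy_df)
```

Note that a passenger aged exactly 16 is not labelled as a child, because the comparison is strict.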

Let's draw a count plot of the passengers split by sex, children and class

sns.catplot(x='Pclass', data=titanic_df, kind='count', hue='person', order=[1,2,3], hue_order=['child','female','male'], aspect=2)

Count number of men, women and children

titanic_df['person'].value_counts()

Do the same as above, but split the passengers into either survived or not.

sns.catplot(x='Pclass', data=titanic_df, kind='count', hue='person', col='Survived', order=[1,2,3], hue_order=['child','female','male'], aspect=1.25, height=5)

There are many more children in third class than in first and second class. One might instead have expected more children in first and second class than in third.

KDE plot: Distribution of Passengers’ Ages

Grouped by Gender
fig = sns.FacetGrid(titanic_df, hue='Sex', aspect=4)
fig.map(sns.kdeplot, 'Age', fill=True)
oldest = titanic_df['Age'].max()
fig.set(xlim=(0,oldest))
fig.set(title='Distribution of Age Grouped by Gender')
fig.add_legend()
fig = sns.FacetGrid(titanic_df, hue='person', aspect=4)
fig.map(sns.kdeplot, 'Age', fill=True)
oldest = titanic_df['Age'].max()
fig.set(xlim=(0,oldest))
fig.add_legend()
Grouped By Class
fig = sns.FacetGrid(titanic_df, hue='Pclass', aspect=4)
fig.map(sns.kdeplot, 'Age', fill=True)
oldest = titanic_df['Age'].max()
fig.set(xlim=(0,oldest))
fig.set(title='Distribution of Age Grouped by Class')
fig.add_legend()

From the plot above, the ages in class 1 look roughly normally distributed, while the distributions for classes 2 and 3 are skewed toward passengers in their 20s and 30s.
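
A visual impression of skew can be backed by a number. The sketch below computes the sample skewness of age per class on a hypothetical toy_df: a value near 0 suggests a symmetric distribution and a positive value a longer tail toward older ages.

```python
import pandas as pd

# Hypothetical ages: class 1 is symmetric, class 3 has one much older passenger.
toy_df = pd.DataFrame({
    'Pclass': [1] * 5 + [3] * 5,
    'Age':    [20, 30, 40, 50, 60, 20, 21, 22, 23, 60],
})

# Sample skewness of Age within each class.
age_skew = toy_df.groupby('Pclass')['Age'].skew()
print(age_skew)
```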

What cabins did the Passengers stay in?
deck = titanic_df['Cabin'].dropna()
deck.head()

Grab the first letter of each cabin code, which identifies the deck

d = []
for c in deck:
    d.append(c[0])
d[0:10]
from collections import Counter
Counter(d)
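
The loop and Counter above can also be written with pandas string methods. This is a minimal sketch on a hypothetical toy_df with made-up cabin codes.

```python
import pandas as pd

# Hypothetical cabin codes, with missing values as in the real column.
toy_df = pd.DataFrame({'Cabin': ['C85', None, 'E46', 'C123', None, 'G6']})

# Drop missing cabins, then take the first character of each code.
deck_letters = toy_df['Cabin'].dropna().str[0]

# Number of cabins per deck letter, in alphabetical order.
deck_counts = deck_letters.value_counts().sort_index()
print(deck_counts)
```

Because deck_letters keeps the original row index, it can later be assigned back to the matching rows without relying on list order.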

Now let's plot the cabin counts. First turn the list d into a data frame, then name its column Cabin

cabin_df = DataFrame(d)
cabin_df.columns=['Cabin']
sns.catplot(x='Cabin', data=cabin_df, kind='count', order=['A','B','C','D','E','F','G','T'], aspect=2, palette='winter_d')
#Drop the T cabin
cabin_df = cabin_df[cabin_df['Cabin'] != 'T']

Then replot the cabin counts as above

sns.catplot(x='Cabin', data=cabin_df, kind='count', order=['A','B','C','D','E','F','G'], aspect=2, palette='Greens_d')
# Below is a link to the list of matplotlib colormaps
url = 'http://matplotlib.org/api/pyplot_summary.html?highlight=colormaps#matplotlib.pyplot.colormaps'
import webbrowser
webbrowser.open(url)

Where did the passengers come from, i.e. at which port did they board the ship?

sns.catplot(x='Embarked', data=titanic_df, kind='count', hue='Pclass', hue_order=[1, 2, 3], aspect=2, order=['C','Q','S'])

From the figure above, one may conclude that almost all of the passengers who boarded at Queenstown were in third class. On the other hand, many who boarded at Cherbourg were in first class. The largest share of passengers boarded at Southampton, of whom 353 were in third class, 164 in second class and 127 in first class. In such cases, one may need to look at the economic situation of these towns at that time to understand why, for example, most passengers who boarded at Queenstown were in third class.

titanic_df.Embarked.value_counts()

For tabulated values, use the pandas crosstab function instead of a Seaborn plot

port = pd.crosstab(index=[titanic_df.Pclass], columns=[titanic_df.Embarked])
port.columns = ['Cherbourg','Queenstown','Southampton']
port
port.index
port.columns
port.index = ['First','Second','Third']
port
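
Assigning a new list of labels depends on the labels being in the right order. A slightly safer sketch, on a hypothetical toy_df, renames by explicit mapping and also shows the normalize option of crosstab for row-wise shares.

```python
import pandas as pd

# Hypothetical toy rows standing in for titanic_df.
toy_df = pd.DataFrame({
    'Pclass':   [1, 3, 3, 2, 3, 1],
    'Embarked': ['C', 'S', 'Q', 'S', 'S', 'S'],
})

# Counts per (class, port), then readable labels via explicit mappings.
port = pd.crosstab(toy_df['Pclass'], toy_df['Embarked'])
port = port.rename(
    index={1: 'First', 2: 'Second', 3: 'Third'},
    columns={'C': 'Cherbourg', 'Q': 'Queenstown', 'S': 'Southampton'},
)

# Share of each class's passengers per port; every row sums to 1.
port_share = pd.crosstab(toy_df['Pclass'], toy_df['Embarked'], normalize='index')

print(port)
print(port_share)
```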

Who was alone and who was with parents or siblings?

titanic_df[['SibSp','Parch']].head()
# Alone dataframe, i.e. the passenger has no siblings, spouse, parents or children aboard
alone_df = titanic_df[(titanic_df['SibSp'] == 0) & (titanic_df['Parch'] == 0)].copy()
# Add Alone column
alone_df['Alone'] = 'Alone'
# Not alone dataframe, i.e. the passenger has at least one family member aboard
not_alone_df = titanic_df[(titanic_df['SibSp'] != 0) | (titanic_df['Parch'] != 0)].copy()
not_alone_df['Alone'] = 'With family'

# Merge the above dataframes
comb = [alone_df, not_alone_df]
# Merge and sort by index
titanic_df = pd.concat(comb).sort_index()
[len(alone_df), len(not_alone_df)]
alone_df.head()

Not Alone Dataframe

not_alone_df.head()
titanic_df.head()
""" Another way to perform the above
titanic_df['Alone'] = np.where(titanic_df.SibSp + titanic_df.Parch > 0, 'With family', 'Alone')"""
fg = sns.catplot(x='Alone', data=titanic_df, kind='count', hue='Pclass', col='person', hue_order=[1, 2, 3], palette='Blues')
fg.set_xlabels('Status')

From the figure above, it is clear that most children traveled with family in third class. For men, most traveled alone in third class. On the other hand, the number of female passengers who traveled either with family or alone among the second and third class is comparable. However, more women traveled with family than alone in first class.

Factors Affecting Survival

Now let's look at the factors that helped someone survive the sinking. We start this analysis by adding a new column to the Titanic data frame: use the map method to map the Survived column to the new column, with 0 becoming 'no' and 1 becoming 'yes'.

titanic_df['Survivor'] = titanic_df.Survived.map({0:'no', 1:'yes'})
titanic_df.head()
Class Factor

Survived vs. class Grouped by gender

sns.catplot(x='Pclass', y='Survived', hue='person', data=titanic_df, kind='point', order=[1, 2, 3], hue_order=['child','female','male'])

From the figure above, being male or travelling in third class reduces one's chance of survival.

sns.catplot(x='Survivor', data=titanic_df, hue='Pclass', kind='count', palette='Pastel2', hue_order=[1, 2, 3], col='person')
Age Factor

Linear plot of age vs. survived

sns.lmplot(x='Age', y='Survived', data=titanic_df)

There seems to be a general linear trend between age and the Survived field: the older the passenger, the lower the chance of survival.
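
By default lmplot draws an ordinary least-squares line, so the trend can be summarized by the slope of that line. The sketch below fits the same kind of line with NumPy's polyfit on a hypothetical toy_df; the numbers are made up and say nothing about the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; one age is missing, as in the real column.
toy_df = pd.DataFrame({
    'Age':      [10.0, 20.0, np.nan, 30.0, 40.0],
    'Survived': [1, 1, 0, 0, 0],
})

# A least-squares fit cannot handle NaN, so drop rows without an age.
clean = toy_df.dropna(subset=['Age'])

# Degree-1 polynomial fit: Survived is approximated by slope * Age + intercept.
slope, intercept = np.polyfit(clean['Age'], clean['Survived'], deg=1)
print(slope, intercept)
```

A negative slope corresponds to the downward trend described above. Since Survived only takes the values 0 and 1, a straight line is a rough summary at best.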

Survived vs. Age grouped by Sex
sns.lmplot(x='Age', y='Survived', data=titanic_df, hue='Sex')

Older women have a higher survival rate than older men, as shown in the figure above. Also, older women have a higher survival rate than younger women, the opposite of the trend for male passengers.

Survived vs. Age grouped by class
sns.lmplot(x='Age', y='Survived', hue='Pclass', data=titanic_df, palette='winter', hue_order=[1, 2, 3])

In all three classes, the chance of survival decreased as passenger age increased.

# Create a generation bin
generations = [10,20,40,60,80]
sns.lmplot(x='Age', y='Survived', hue='Pclass', data=titanic_df, x_bins=generations, hue_order=[1, 2, 3])
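
The binned plot has a tabular counterpart. The sketch below groups a hypothetical toy_df into age bands with the pandas cut function and takes the survival rate per band. Note the assumption: cut treats the list as bin edges, whereas the x_bins argument above is, as far as I understand the seaborn documentation, interpreted as bin centers, so the two groupings are related but not identical.

```python
import pandas as pd

# Hypothetical toy rows standing in for titanic_df.
toy_df = pd.DataFrame({
    'Age':      [5, 15, 25, 35, 45, 70],
    'Survived': [1, 1, 0, 1, 0, 0],
})

# Bin edges (an illustrative choice, not taken from the plot above).
edges = [0, 20, 40, 60, 80]
toy_df['AgeBand'] = pd.cut(toy_df['Age'], bins=edges)

# Survival rate within each age band.
band_rate = toy_df.groupby('AgeBand', observed=False)['Survived'].mean()
print(band_rate)
```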

Deck Factor

titanic_df.columns
titanic_DF = titanic_df.dropna(subset=['Cabin']).copy()
d[0:10]
len(titanic_DF), len(d)
titanic_DF['Deck'] = d
titanic_DF = titanic_DF[titanic_DF.Deck != 'T']
titanic_DF.head()
sns.catplot(x='Deck', y='Survived', data=titanic_DF, kind='point', order=['A','B','C','D','E','F','G'])

There does not seem to be any relation between deck and the survival rate as shown in the above figure!

Family Status Factor
sns.catplot(x='Alone', y='Survived', data=titanic_df, kind='point', palette='winter')

It seems that the survival rate diminishes significantly for those who were alone. However, let's check whether gender or age plays a role. From the figure below, one may conclude that the survival rates for women and children are much higher than that for men, as was concluded previously and as anticipated. However, the difference in survival rate between those who were with family and those who were alone is not significant for either gender or for children. Moreover, the survival rate for women and children increases for those who were alone. For men, the survival rate diminishes slightly for those who were alone versus those who were with family.

sns.catplot(x='Alone', y='Survived', data=titanic_df, kind='point', palette='winter', hue='person', hue_order=['child', 'female', 'male'])
# Let's split it by class now!
sns.catplot(x='Alone', y='Survived', data=titanic_df, kind='point', palette='summer', hue='person', hue_order=['child', 'female', 'male'], col='Pclass', col_order=[1, 2, 3])
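
When comparing rates across small subgroups, it helps to see how many passengers each rate is based on. The sketch below reports both the rate and the group size on a hypothetical toy_df with the same person, Alone and Survived columns.

```python
import pandas as pd

# Hypothetical toy rows standing in for titanic_df.
toy_df = pd.DataFrame({
    'person':   ['male', 'male', 'female', 'female', 'child', 'child'],
    'Alone':    ['Alone', 'With family', 'Alone', 'With family',
                 'With family', 'With family'],
    'Survived': [0, 1, 1, 1, 1, 0],
})

# Survival rate (mean) and group size (count) per person type and family status.
summary = toy_df.groupby(['person', 'Alone'])['Survived'].agg(['mean', 'count'])
print(summary)
```

A rate computed from only a handful of passengers deserves less weight than one computed from hundreds, which the count column makes visible.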
