
Analysis on campus recruitment data

Campus recruitment is a strategy for sourcing, engaging and hiring young talent for internship and entry-level positions. College recruiting is typically a tactic for medium- to large-sized companies with high-volume recruiting needs, but can range from small efforts (like working with university career centers to source potential candidates) to large-scale operations (like visiting a wide array of colleges and attending recruiting events throughout the spring and fall semester). Campus recruitment often involves working with university career services centers and attending career fairs to meet in-person with college students and recent graduates.

Context of our dataset: Our dataset revolves around the placement season of a business school in India. It captures various factors that influence whether a candidate gets hired, such as work experience and exam percentages, and it also records the recruitment status and remuneration details.

Kernel Goals

There are three primary goals of this kernel.

  • Do an exploratory analysis of the Recruitment dataset
  • Do a visualization analysis of the Recruitment dataset
  • Prediction: predict whether a student got placed or not using classification models.

Importing libraries and exploring Data

Importing Libraries

Python is a fantastic language with a vibrant community that produces many amazing libraries. I am not a big fan of importing everything at once, especially for newcomers. So I am going to introduce a few necessary libraries for now, and as we go on, we will keep unboxing new libraries when it seems appropriate.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn 
%matplotlib inline

Extracting dataset

#Loading the single csv file to a variable named 'placement'
placement=pd.read_csv("../input/factors-affecting-campus-placement/Placement_Data_Full_Class.csv")

Examining the dataset

placement_copy=placement.copy()
placement_copy.head()

sl_no | gender | ssc_p | ssc_b   | hsc_p | hsc_b   | hsc_s    | degree_p | degree_t  | workex | etest_p | specialisation | mba_p | status     | salary
1     | M      | 67.00 | Others  | 91.00 | Others  | Commerce | 58.00    | Sci&Tech  | No     | 55.0    | Mkt&HR         | 58.80 | Placed     | 270000.0
2     | M      | 79.33 | Central | 78.33 | Others  | Science  | 77.48    | Sci&Tech  | Yes    | 86.5    | Mkt&Fin        | 66.28 | Placed     | 200000.0
3     | M      | 65.00 | Central | 68.00 | Central | Arts     | 64.00    | Comm&Mgmt | No     | 75.0    | Mkt&Fin        | 57.80 | Placed     | 250000.0
4     | M      | 56.00 | Central | 52.00 | Central | Science  | 52.00    | Sci&Tech  | No     | 66.0    | Mkt&HR         | 59.43 | Not Placed | NaN
5     | M      | 85.80 | Central | 73.60 | Central | Commerce | 73.30    | Comm&Mgmt | No     | 96.8    | Mkt&Fin        | 55.50 | Placed     | 425000.0

Inference

  • We have gender and educational qualification data
  • We have all the educational performance (score) data
  • We have the status of placement and salary details
  • We can expect null values in salary, as candidates who weren't placed would have no salary
  • Status of placement is our target variable; the rest are independent variables, except salary
print ("The shape of the  data is (row, column):"+ str(placement.shape))
print (placement_copy.info())
The shape of the  data is (row, column):(215, 15)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 215 entries, 0 to 214
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   sl_no           215 non-null    int64  
 1   gender          215 non-null    object 
 2   ssc_p           215 non-null    float64
 3   ssc_b           215 non-null    object 
 4   hsc_p           215 non-null    float64
 5   hsc_b           215 non-null    object 
 6   hsc_s           215 non-null    object 
 7   degree_p        215 non-null    float64
 8   degree_t        215 non-null    object 
 9   workex          215 non-null    object 
 10  etest_p         215 non-null    float64
 11  specialisation  215 non-null    object 
 12  mba_p           215 non-null    float64
 13  status          215 non-null    object 
 14  salary          148 non-null    float64
dtypes: float64(6), int64(1), object(8)
memory usage: 25.3+ KB
None

We have 215 candidate records, and the columns hold a mix of datatypes. As expected, there are a few missing values in the salary column, since those are the candidates who didn't get hired.

#Looking at the datatypes of each factor
placement_copy.dtypes
sl_no               int64
gender             object
ssc_p             float64
ssc_b              object
hsc_p             float64
hsc_b              object
hsc_s              object
degree_p          float64
degree_t           object
workex             object
etest_p           float64
specialisation     object
mba_p             float64
status             object
salary            float64
dtype: object

We have 1 integer, 6 float and 8 object columns in our dataset.
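
If you want to tally those counts quickly, value_counts on the dtypes does it:

#Counting how many columns of each datatype we have
placement_copy.dtypes.value_counts()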

Checking for missing data

Datasets in the real world are often messy. However, this dataset is almost clean and simple. Let's analyze and see what we have here.

import missingno as msno 
msno.matrix(placement)
<matplotlib.axes._subplots.AxesSubplot at 0x7f0c4586f2d0>
Missing Data

As per our inference, we can visualize the null values in salary. Let’s see the count

print('Data columns with null values:',placement_copy.isnull().sum(), sep = '\n')
Data columns with null values:
sl_no              0
gender             0
ssc_p              0
ssc_b              0
hsc_p              0
hsc_b              0
hsc_s              0
degree_p           0
degree_t           0
workex             0
etest_p            0
specialisation     0
mba_p              0
status             0
salary            67
dtype: int64

Inference

  • There are 67 null values in our data, which correspond to the 67 unhired candidates.
  • We can't drop these rows, as they carry valuable information about why candidates failed to get hired.
  • We can't impute them with mean/median values either, as that would go against the context of this dataset and make it look like unhired candidates received a salary.
  • The best way to deal with these null values is to impute them with '0', which shows they have no income.

Data Cleaning

Handling missing values

First let's focus on the missing data in the salary feature. If we drop the rows that have null values we might throw away potential information from the dataset, so instead we impute values into the NaN records, which leads to more reliable models. Since it is a salary feature, it is best to fill the records of unhired candidates with '0'.

placement_copy['salary'].fillna(value=0, inplace=True)
print('Salary column with null values:',placement_copy['salary'].isnull().sum(), sep = '\n')
Salary column with null values:
0

Yay! The salary column now has zero null values. Now it's time to drop unwanted features!

placement_copy.drop(['sl_no','ssc_b','hsc_b'], axis = 1,inplace=True) 
placement_copy.head()

  | gender | ssc_p | hsc_p | hsc_s    | degree_p | degree_t  | workex | etest_p | specialisation | mba_p | status     | salary
0 | M      | 67.00 | 91.00 | Commerce | 58.00    | Sci&Tech  | No     | 55.0    | Mkt&HR         | 58.80 | Placed     | 270000.0
1 | M      | 79.33 | 78.33 | Science  | 77.48    | Sci&Tech  | Yes    | 86.5    | Mkt&Fin        | 66.28 | Placed     | 200000.0
2 | M      | 65.00 | 68.00 | Arts     | 64.00    | Comm&Mgmt | No     | 75.0    | Mkt&Fin        | 57.80 | Placed     | 250000.0
3 | M      | 56.00 | 52.00 | Science  | 52.00    | Sci&Tech  | No     | 66.0    | Mkt&HR         | 59.43 | Not Placed | 0.0
4 | M      | 85.80 | 73.60 | Commerce | 73.30    | Comm&Mgmt | No     | 96.8    | Mkt&Fin        | 55.50 | Placed     | 425000.0

We have dropped the serial number, since the default index already serves that purpose, and we have dropped the boards of school education, as I believe they don't matter for recruitment.

Outliers

Outliers are unusual values in your dataset, and they can distort statistical analyses and violate their assumptions. Unfortunately, all analysts will confront outliers and be forced to make decisions about what to do with them. Given the problems they can cause, you might think that it’s best to remove them from your data. But, that’s not always the case. Removing outliers is legitimate only for specific reasons.

Outliers can be very informative about the subject area and data collection process. It's essential to understand how outliers occur and whether they might happen again as a normal part of the process or study area. Unfortunately, resisting the temptation to remove outliers inappropriately can be difficult. Outliers increase the variability in your data, which decreases statistical power. Consequently, excluding outliers can cause your results to become statistically significant. In our case, let's first visualize our data and then decide what to do with the outliers.

plt.figure(figsize = (15, 10))
plt.style.use('seaborn-white')
ax=plt.subplot(221)
plt.boxplot(placement_copy['ssc_p'])
ax.set_title('Secondary school percentage')
ax=plt.subplot(222)
plt.boxplot(placement_copy['hsc_p'])
ax.set_title('Higher Secondary school percentage')
ax=plt.subplot(223)
plt.boxplot(placement_copy['degree_p'])
ax.set_title('UG Degree percentage')
ax=plt.subplot(224)
plt.boxplot(placement_copy['etest_p'])
ax.set_title('Employability percentage')
Text(0.5, 1.0, 'Employability percentage')
Employee Percentage

As you can see, there are very few outliers in our features, and most of them are in the hsc percentage. Let's clear them up!

Q1 = placement_copy['hsc_p'].quantile(0.25)
Q3 = placement_copy['hsc_p'].quantile(0.75)
IQR = Q3 - Q1    #IQR is interquartile range. 

filter = (placement_copy['hsc_p'] >= Q1 - 1.5 * IQR) & (placement_copy['hsc_p'] <= Q3 + 1.5 *IQR)
placement_filtered=placement_copy.loc[filter]
plt.figure(figsize = (15, 5))
plt.style.use('seaborn-white')
ax=plt.subplot(121)
plt.boxplot(placement_copy['hsc_p'])
ax.set_title('Before removing outliers(hsc_p)')
ax=plt.subplot(122)
plt.boxplot(placement_filtered['hsc_p'])
ax.set_title('After removing outliers(hsc_p)')
Text(0.5, 1.0, 'After removing outliers(hsc_p)')
after removing outliers

Voilà! We have removed the outliers.

Data Visualizations

Count of categorical features- Count plot

plt.figure(figsize = (15, 7))
plt.style.use('seaborn-white')

#Specialisation
plt.subplot(234)
ax=sns.countplot(x="specialisation", data=placement_filtered, facecolor=(0, 0, 0, 0),
                 linewidth=5,edgecolor=sns.color_palette("magma", 3))
fig = plt.gcf()
fig.set_size_inches(10,10)
ax.set_xticklabels(ax.get_xticklabels(),fontsize=12)

#Work experience
plt.subplot(235)
ax=sns.countplot(x="workex", data=placement_filtered, facecolor=(0, 0, 0, 0),
                 linewidth=5,edgecolor=sns.color_palette("cividis", 3))
fig = plt.gcf()
fig.set_size_inches(10,10)
ax.set_xticklabels(ax.get_xticklabels(),fontsize=12)

#Degree type
plt.subplot(233)
ax=sns.countplot(x="degree_t", data=placement_filtered, facecolor=(0, 0, 0, 0),
                 linewidth=5,edgecolor=sns.color_palette("viridis", 3))
fig = plt.gcf()
fig.set_size_inches(10,10)
ax.set_xticklabels(ax.get_xticklabels(),fontsize=12,rotation=20)

#Gender
plt.subplot(231)
ax=sns.countplot(x="gender", data=placement_filtered, facecolor=(0, 0, 0, 0),
                 linewidth=5,edgecolor=sns.color_palette("hot", 3))
fig = plt.gcf()
fig.set_size_inches(10,10)
ax.set_xticklabels(ax.get_xticklabels(),fontsize=12)

#Higher secondary specialisation
plt.subplot(232)
ax=sns.countplot(x="hsc_s", data=placement_filtered, facecolor=(0, 0, 0, 0),
                 linewidth=5,edgecolor=sns.color_palette("rocket", 3))
fig = plt.gcf()
fig.set_size_inches(10,10)
ax.set_xticklabels(ax.get_xticklabels(),fontsize=12)

#Status of recruitment
plt.subplot(236)
ax=sns.countplot(x="status", data=placement_filtered, facecolor=(0, 0, 0, 0),
                 linewidth=5,edgecolor=sns.color_palette("copper", 3))
fig = plt.gcf()
fig.set_size_inches(10,10)
ax.set_xticklabels(ax.get_xticklabels(),fontsize=12)
[Text(0, 0, 'Placed'), Text(0, 0, 'Not Placed')]

Inference

  • We have more male candidates than female candidates
  • Commerce is the most common stream, both in higher secondary and at undergraduate level
  • Science-background candidates are the second highest in both cases
  • Candidates with the Marketing and Finance dual specialisation form the larger group
  • Most of the candidates in our dataset don't have any work experience
  • Most of the candidates in our dataset got placed in a company

Distribution Salary- Placed Students

sns.set(rc={'figure.figsize':(12,8)})
f, (ax_box, ax_hist) = plt.subplots(2, sharex=True, gridspec_kw={"height_ratios": (.15, .85)})

placement_placed = placement_filtered[placement_filtered.salary != 0]
sns.boxplot(placement_placed["salary"], ax=ax_box)
sns.distplot(placement_placed["salary"], ax=ax_hist)
 
# Remove x axis name for the boxplot
ax_box.set(xlabel='')
[Text(0.5, 0, '')]
Salary

Inference

  • Many candidates who got placed received a package between 2L and 4L per annum
  • Only one candidate got around 10L per annum
  • The average salary is a little more than 2 LPA (a quick numeric check follows)
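
To check these figures numerically, a quick summary of the placed candidates' salaries will do. This is a minimal sketch using the placement_placed frame defined above:

#Summary statistics of salary for placed candidates (salary != 0)
print(placement_placed['salary'].describe())

#Share of placed candidates in the 2L-4L per annum band
print("Share in 2L-4L band: {:.1%}".format(placement_placed['salary'].between(200000, 400000).mean()))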

Employability score vs Salary- Joint plot

sns.set(rc={'figure.figsize':(12,8)})
sns.set(style="white", color_codes=True)
sns.jointplot(x=placement_filtered["etest_p"], y=placement_filtered["salary"], kind='kde', color="skyblue")
<seaborn.axisgrid.JointGrid at 0x7f0c4517de50>

Inference

  • Most of the candidates who scored around 60 percent got a decent package of around 3 lakhs per annum
  • Not many candidates received a salary of more than 4 lakhs per annum
  • The dense region at the bottom shows the candidates who were not placed

Distribution of all percentages

plt.figure(figsize = (15, 7))
plt.style.use('seaborn-white')
plt.subplot(231)
sns.distplot(placement_filtered['ssc_p'])
fig = plt.gcf()
fig.set_size_inches(10,10)

plt.subplot(232)
sns.distplot(placement_filtered['hsc_p'])
fig = plt.gcf()
fig.set_size_inches(10,10)

plt.subplot(233)
sns.distplot(placement_filtered['degree_p'])
fig = plt.gcf()
fig.set_size_inches(10,10)

plt.subplot(234)
sns.distplot(placement_filtered['etest_p'])
fig = plt.gcf()
fig.set_size_inches(10,10)

plt.subplot(235)
sns.distplot(placement_filtered['mba_p'])
fig = plt.gcf()
fig.set_size_inches(10,10)

plt.subplot(236)
sns.distplot(placement_placed['salary'])
fig = plt.gcf()
fig.set_size_inches(10,10)

Distribution Percentage

Inference

  • All the distributions look approximately normal except the salary feature
  • Most of the candidates' educational performances lie between 60% and 80%
  • The salary distribution has outliers, where a few candidates received around 7.5L and 10L per annum

Work experience Vs Placement Status

#Code forked from-https://www.kaggle.com/biphili/hospitality-in-era-of-airbnb
plt.style.use('seaborn-white')
f,ax=plt.subplots(1,2,figsize=(18,8))
placement_filtered['workex'].value_counts().plot.pie(explode=[0,0.05],autopct='%1.1f%%',ax=ax[0],shadow=True)
ax[0].set_title('Work experience')
sns.countplot(x = 'workex',hue = "status",data = placement_filtered)
ax[1].set_title('Influence of experience on placement')
plt.show()
work experience

Inference

  • Nearly 66.2% of candidates never had any work experience
  • More candidates without work experience got hired than candidates with experience
  • We can conclude that work experience is not a deciding factor in the recruitment process (a quick numeric check follows)
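
A quick way to put a number on that is a row-normalised crosstab of work experience against placement status. This is just a sketch; it assumes placement_filtered still holds the original string labels at this point:

#Placement rate within each work-experience group
print(pd.crosstab(placement_filtered['workex'], placement_filtered['status'], normalize='index'))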

MBA marks vs Placement Status- Does your academic score influence?

g = sns.boxplot(y = "status",x = 'mba_p',data = placement_filtered, whis=np.inf)
g = sns.swarmplot(y = "status",x = 'mba_p',data = placement_filtered, size = 7,color = 'black')
sns.despine()
g.figure.set_size_inches(12,8)
plt.show()
Academic Score

Inference
Comparatively, there is only a slight difference between the percentage scores of the two groups, but placed candidates still have the upper hand in numbers, as you can see in the swarm. So, as per the plot, the MBA percentage does influence the placement status to some extent; the per-group summary below makes the comparison concrete.
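
For that concrete comparison, a per-group summary of the MBA percentage works. A small sketch, under the same assumption that status is still a string label at this point:

#mba_p summary statistics for placed vs. not placed candidates
print(placement_filtered.groupby('status')['mba_p'].describe())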

Does MBA percentage and Employability score correlate?

import plotly_express as px
px.scatter(placement_filtered,x="mba_p",y="etest_p",color="status",facet_col="workex")
MBA Correlation

Inference

  • There is no clear relation between MBA percentage and the employability test score
  • Many candidates without work experience did not get placed
  • Most of the candidates who performed better in both tests have got placed.

Is there any gender bias while offering remuneration?

px.violin(placement_placed,y="salary",x="specialisation",color="gender",box=True,points="all")
Specialization

Inference

  • The top salaries were given to male candidates
  • The average salary offered was also higher for male candidates
  • More male candidates were placed compared to female candidates


Correlation between academic percentages

sns.heatmap(placement_placed.corr(),annot=True,fmt='.1g',cmap='Greys')
<matplotlib.axes._subplots.AxesSubplot at 0x7f0c3db47910>
Heat Plot

Inference

  • Candidates who were good in their academics performed well throughout school, undergrad, MBA and even the employability test
  • These percentages don't have any influence over their salary (see the salary correlations below)
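
To back up the second point, we can look directly at the correlations with salary. A small sketch; note that corr() here silently keeps only the numeric columns, just as in the heatmap above:

#Correlation of each numeric feature with salary, for placed candidates only
print(placement_placed.corr()['salary'].sort_values(ascending=False))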


Distribution of our data

sns.pairplot(placement_filtered,vars=['ssc_p','hsc_p','degree_p','mba_p','etest_p'],hue="status")
<seaborn.axisgrid.PairGrid at 0x7f0c3db8de10>

Inference

  • Candidates with high scores in higher secondary and undergrad mostly got placed
  • Candidates who scored well in school also tended to get placed
  • Comparing placed counts, candidates with good MBA and employability-test percentages were also more likely to be placed

Preprocessing data for classification models

Now let’s welcome our data to the model.Before jumping onto creating models we have to prepare our dataset for the models. We dont have to perform imputation as we dont have any missing values but we have categorical variables which needs to be encoded.

Label Encoding

We have used the label encoder for the categorical columns that have only two classes.

import warnings
warnings.filterwarnings('ignore')
from sklearn.preprocessing import LabelEncoder

# Categorical columns with exactly two classes, to be label encoded
object_cols=['gender','workex','specialisation','status']

# Apply label encoder to each column with categorical data
label_encoder = LabelEncoder()
for col in object_cols:
    placement_filtered[col] = label_encoder.fit_transform(placement_filtered[col])
placement_filtered.head()

  | gender | ssc_p | hsc_p | hsc_s    | degree_p | degree_t  | workex | etest_p | specialisation | mba_p | status | salary
0 | 1      | 67.00 | 91.00 | Commerce | 58.00    | Sci&Tech  | 0      | 55.0    | 1              | 58.80 | 1      | 270000.0
1 | 1      | 79.33 | 78.33 | Science  | 77.48    | Sci&Tech  | 1      | 86.5    | 0              | 66.28 | 1      | 200000.0
2 | 1      | 65.00 | 68.00 | Arts     | 64.00    | Comm&Mgmt | 0      | 75.0    | 0              | 57.80 | 1      | 250000.0
3 | 1      | 56.00 | 52.00 | Science  | 52.00    | Sci&Tech  | 0      | 66.0    | 1              | 59.43 | 0      | 0.0
4 | 1      | 85.80 | 73.60 | Commerce | 73.30    | Comm&Mgmt | 0      | 96.8    | 0              | 55.50 | 1      | 425000.0

One hot encoding

We have used the get_dummies function for the categorical columns that have more than two classes.

dummy_hsc_s=pd.get_dummies(placement_filtered['hsc_s'], prefix='dummy')
dummy_degree_t=pd.get_dummies(placement_filtered['degree_t'], prefix='dummy')
placement_coded = pd.concat([placement_filtered,dummy_hsc_s,dummy_degree_t],axis=1)
placement_coded.drop(['hsc_s','degree_t','salary'],axis=1, inplace=True)
placement_coded.head()

  | gender | ssc_p | hsc_p | degree_p | workex | etest_p | specialisation | mba_p | status | dummy_Arts | dummy_Commerce | dummy_Science | dummy_Comm&Mgmt | dummy_Others | dummy_Sci&Tech
0 | 1      | 67.00 | 91.00 | 58.00    | 0      | 55.0    | 1              | 58.80 | 1      | 0          | 1              | 0             | 0               | 0            | 1
1 | 1      | 79.33 | 78.33 | 77.48    | 1      | 86.5    | 0              | 66.28 | 1      | 0          | 0              | 1             | 0               | 0            | 1
2 | 1      | 65.00 | 68.00 | 64.00    | 0      | 75.0    | 0              | 57.80 | 1      | 1          | 0              | 0             | 1               | 0            | 0
3 | 1      | 56.00 | 52.00 | 52.00    | 0      | 66.0    | 1              | 59.43 | 0      | 0          | 0              | 1             | 0               | 0            | 1
4 | 1      | 85.80 | 73.60 | 73.30    | 0      | 96.8    | 0              | 55.50 | 1      | 0          | 1              | 0             | 1               | 0            | 0
feature_cols=['gender','ssc_p','hsc_p','degree_p','workex','etest_p','specialisation','mba_p',
              'dummy_Arts','dummy_Commerce','dummy_Science','dummy_Comm&Mgmt','dummy_Others','dummy_Sci&Tech']
len(feature_cols)
14

Assigning the target(y) and predictor variable(X)

Our target is to find whether the candidate is placed or not. We use the rest of the features except 'salary', as it can't contribute to the prediction: in a real-world scenario a student receives a salary only after getting placed, so we can't use a future feature to predict something that happens in the present.

X=placement_coded.drop(['status'],axis=1)
y=placement_coded.status

Train and Test Split (80:20)

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,train_size=0.8,random_state=1)
print("Input Training:",X_train.shape)
print("Input Test:",X_test.shape)
print("Output Training:",y_train.shape)
print("Output Test:",y_test.shape)
Input Training: (165, 14)
Input Test: (42, 14)
Output Training: (165,)
Output Test: (42,)

Machine Learning models

Now let’s feed the models with our data Objective: To predict whether a student got placed or not

Logistic Regression

Let’s fit the model in logistic regression and figure out the accuracy of our model.

from sklearn.linear_model import LogisticRegression
from sklearn import metrics
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(X_test, y_test)))
Accuracy of logistic regression classifier on test set: 0.81

Confusion matrix and Classification report

from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)  # use a new name so the imported function isn't shadowed
print("Confusion Matrix:\n",cm)
from sklearn.metrics import classification_report
print("Classification Report:\n",classification_report(y_test, y_pred))
Confusion Matrix:
 [[ 8  7]
 [ 1 26]]
Classification Report:
               precision    recall  f1-score   support

           0       0.89      0.53      0.67        15
           1       0.79      0.96      0.87        27

    accuracy                           0.81        42
   macro avg       0.84      0.75      0.77        42
weighted avg       0.82      0.81      0.80        42

Insights:

  • The confusion matrix tells us that we have 8+26 correct predictions and 7+1 incorrect predictions.
  • The classification report shows a macro-average precision of 84%. Precision measures how well the classifier avoids labelling an instance positive when it is actually negative, and it matters here because when you are hiring you want to avoid Type I errors at all cost; they are culture killers. In hiring, a false positive is when you THINK a candidate is a good fit, but in actuality they're not. The sketch below recomputes the per-class precision from the confusion matrix.
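
For clarity, here is how those precision figures fall out of the confusion matrix above. A minimal sketch that recomputes them by hand (the function is re-imported under an alias so nothing is shadowed):

from sklearn.metrics import confusion_matrix as cm_matrix

#The four cells of the binary confusion matrix
tn, fp, fn, tp = cm_matrix(y_test, y_pred).ravel()
print("Precision (Placed):", round(tp / (tp + fp), 2))        #0.79, matching the report above
print("Precision (Not Placed):", round(tn / (tn + fn), 2))    #0.89, matching the report above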


ROC Curve

Let’s check out the performance of our model through ROC curve

from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
logit_roc_auc = roc_auc_score(y_test, logreg.predict(X_test))
fpr, tpr, thresholds = roc_curve(y_test, logreg.predict_proba(X_test)[:,1])
plt.figure()
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([-0.01, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.show()

From the ROC curve we can infer that our logistic model classifies the placed students well without producing too many false positives. The closer the ROC curve (blue) hugs the top-left corner, the better our model is. If precision matters most, we can also raise the decision threshold to 0.8 or 0.9 to cut down on false positives, as sketched below.
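
If we do want to bias the model toward precision, we can apply a stricter probability threshold than the default 0.5. A hedged sketch, not part of the original kernel; the threshold value is only illustrative:

#Predict 'Placed' only when the estimated probability is at least 0.8
threshold = 0.8
y_pred_strict = (logreg.predict_proba(X_test)[:, 1] >= threshold).astype(int)
print("Precision at threshold {}: {:.2f}".format(threshold, metrics.precision_score(y_test, y_pred_strict)))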

Decision Tree

Let’s checkout how the model makes the decision using Decision Tree Classifier

from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier(criterion="gini", max_depth=3)
dt = dt.fit(X_train,y_train)
y_pred = dt.predict(X_test)
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
Accuracy: 0.7380952380952381

Woah, 73% accuracy using the gini index as the splitting criterion. I tried entropy, which gave a higher accuracy but considered fewer features for splitting, so I switched to gini, which considered more features; the sketch below reproduces the comparison.
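
The gini-versus-entropy comparison mentioned above is easy to reproduce. A small sketch fitting the same depth-3 tree with both criteria; the exact numbers depend on the train/test split:

#Fit the same depth-3 tree with each criterion and compare test accuracy
for criterion in ['gini', 'entropy']:
    tree_clf = DecisionTreeClassifier(criterion=criterion, max_depth=3)
    tree_clf.fit(X_train, y_train)
    print(criterion, "accuracy:", metrics.accuracy_score(y_test, tree_clf.predict(X_test)))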

pip install pydotplus
Collecting pydotplus
  Downloading pydotplus-2.0.2.tar.gz (278 kB)
     |████████████████████████████████| 278 kB 7.6 MB/s 
Requirement already satisfied: pyparsing>=2.0.1 in /opt/conda/lib/python3.7/site-packages (from pydotplus) (2.4.7)
Building wheels for collected packages: pydotplus
  Building wheel for pydotplus (setup.py) ... - \ done
  Created wheel for pydotplus: filename=pydotplus-2.0.2-py3-none-any.whl size=24566 sha256=682f7fcb5c2103353d12080fb1ce22f6390ff700de34725d200be229d886c120
  Stored in directory: /root/.cache/pip/wheels/1e/7b/04/7387cf6cc9e48b4a96e361b0be812f0708b394b821bf8c9c50
Successfully built pydotplus
Installing collected packages: pydotplus
Successfully installed pydotplus-2.0.2
WARNING: You are using pip version 20.1; however, version 20.1.1 is available.
You should consider upgrading via the '/opt/conda/bin/python -m pip install --upgrade pip' command.
Note: you may need to restart the kernel to use updated packages.
from io import StringIO  # sklearn.externals.six was removed in newer scikit-learn releases
from sklearn.tree import export_graphviz
from IPython.display import Image  
import pydotplus

dot_data = StringIO()
export_graphviz(dt, out_file=dot_data,  
                filled=True, rounded=True,
                special_characters=True,feature_names = feature_cols,class_names=['0','1'], precision=1)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
Image(graph.create_png())

Inference

  • We have 4 leaf nodes of placed and not-placed students
  • The tree splits first on ssc_p, followed by hsc_p, mba_p and etest_p
  • We limited the depth to 3 to prevent overfitting
  • A few leaf-node classes still have non-zero gini values (impurity).
  • The pure classes were obtained under the criteria ssc_p <= 63.7 and etest_p <= 82.5.

So the best splits can be achieved through the etest_p feature.

Random Forest

Since one tree can't produce accurate results on its own, let's use a random forest to build an aggregation of trees and produce more accurate results.

from sklearn.ensemble import RandomForestClassifier
rt=RandomForestClassifier(n_estimators=100)
rt.fit(X_train,y_train)
y_pred=rt.predict(X_test)
from sklearn import metrics
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
Accuracy: 0.8333333333333334

We have an accuracy of 83%. Not bad. But let's check out the important features and try to boost the precision.

Looking at Feature Importance

Let’s see which feature influences more on making the decision and we should cut it off to make our model accurate.

feature_imp = pd.Series(rt.feature_importances_,index=feature_cols).sort_values(ascending=False)
# Creating a bar plot
sns.barplot(x=feature_imp, y=feature_imp.index)
# Add labels to your graph
plt.xlabel('Feature Importance Score')
plt.ylabel('Features')
plt.title("Visualizing Important Features")
Text(0.5, 1.0, 'Visualizing Important Features')

As we can see, the higher secondary stream and undergrad degree-type dummies have little influence in the classification. But it is really weird to see ssc_p being the most influential feature.

Pruning out less important feature

Let’s cut off the less important feature and check for model accuracy.

X=placement_coded.drop(['status','dummy_Comm&Mgmt','dummy_Sci&Tech','dummy_Science','dummy_Commerce',
                        'dummy_Arts','dummy_Others'],axis=1)
y=placement_coded.status
X_train, X_test, y_train, y_test = train_test_split(X, y,train_size=0.8,random_state=1)
rt2=RandomForestClassifier(n_estimators=100)
rt2.fit(X_train,y_train)
y_pred=rt2.predict(X_test)
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
roc_value = roc_auc_score(y_test, y_pred)
print("ROC Value:",roc_value)
Accuracy: 0.8095238095238095
ROC Value: 0.7333333333333334

Great. Now we have an accuracy of 81%, and the ROC value of 73% indicates the model classifies reasonably well without producing too many false positive predictions.

K Nearest Neighbors

Let’s try out a lazy supervised classification algorithm. Our beloved, KNNlinkcode

Choosing a K value

Let’s decide on the K value

from sklearn.neighbors import KNeighborsClassifier
error_rate = []

for i in range(1,40):
    
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train,y_train)
    pred_i = knn.predict(X_test)
    error_rate.append(np.mean(pred_i != y_test))

Error rate vs K-value

plt.figure(figsize=(10,6))
plt.plot(range(1,40),error_rate,color='blue', linestyle='dashed', marker='o',
         markerfacecolor='red', markersize=10)
plt.title('Error Rate vs. K Value')
plt.xlabel('K')
plt.ylabel('Error Rate')
Text(0, 0.5, 'Error Rate')

There are a lot of ups and downs in our graph. If we pick a value between 10 and 15 the model may not generalise reliably, so let's stick to the first trough. Our K value is 5.

from sklearn.metrics import confusion_matrix
knn = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
cm = confusion_matrix(y_test, y_pred)  # avoid shadowing the imported function
print("Confusion Matrix:\n",cm)
print("Classification Report:\n",classification_report(y_test, y_pred))
Confusion Matrix:
 [[ 6  9]
 [ 1 26]]
Classification Report:
               precision    recall  f1-score   support

           0       0.86      0.40      0.55        15
           1       0.74      0.96      0.84        27

    accuracy                           0.76        42
   macro avg       0.80      0.68      0.69        42
weighted avg       0.78      0.76      0.73        42

Insights:

  • Our model classified the Not Placed category with 86% precision and the Placed category with 74% precision
  • In numbers, that is 26+6 correct classifications and 1+9 incorrect ones (false negatives and false positives)
  • We should use precision as our metric, because committing a false positive is especially costly in recruitment

Naïve Bayes Classifier with Cross Validation

Let’s use Naïve Bayes model for our dataset. Since our outcome feature has 1,0(placed, not placed) we can go with Bernoulli Naïve Bayes algorithm and also let’s measure the accuracy with cross validation.

#Importing and fitting
from sklearn.naive_bayes import BernoulliNB 
from sklearn.model_selection import cross_val_score
gnb = BernoulliNB() 
gnb.fit(X_train, y_train) 
  
#Applying and predicting 
y_pred = gnb.predict(X_test) 
cv_scores = cross_val_score(gnb, X, y, 
                            cv=10,
                            scoring='precision')
print("Cross-validation precision: %f" % cv_scores.mean()
Cross-validation precision: 0.735883

Our cross validation precision is approximately 73.5%

Support Vector Machine

Let’s use SVM to classify our output feature

from sklearn.metrics import confusion_matrix
from sklearn.svm import SVC
svclassifier = SVC(kernel='linear')
svclassifier.fit(X_train, y_train)
y_pred = svclassifier.predict(X_test)
cm = confusion_matrix(y_test,y_pred)  # avoid shadowing the imported function
print("Confusion Matrix:\n",cm)
print("Classification Report:\n",classification_report(y_test,y_pred))
Confusion Matrix:
 [[ 9  6]
 [ 2 25]]
Classification Report:
               precision    recall  f1-score   support

           0       0.82      0.60      0.69        15
           1       0.81      0.93      0.86        27

    accuracy                           0.81        42
   macro avg       0.81      0.76      0.78        42
weighted avg       0.81      0.81      0.80        42

Inference

  • We got 82% and 81% precision for the two classes.
  • 9+25 samples were correctly classified and 2+6 were misclassified (false negatives and false positives).


XGBoost

Let’s try our the state of art ensemble model XGBoost. We have used RMSE metrics for model performance.

import xgboost as xgb
from sklearn.metrics import mean_squared_error
xg_reg = xgb.XGBClassifier(objective ='reg:logistic', colsample_bytree = 0.3, learning_rate = 0.1,
                max_depth = 5, alpha = 10, n_estimators = 10)
xg_reg.fit(X_train,y_train)

preds = xg_reg.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, preds))
print("RMSE: %f" % (rmse))
RMSE: 0.577350

Great. The error value of our model is just 0.577. Now let's use cross validation and try to minimise it further.

XGBoost with Cross Validation

In this approach we use DMatrix to convert our dataset into XGBoost's internal matrix format and produce the cross-validation results as a data frame. Algorithm inspired by DataCamp.

data_dmatrix = xgb.DMatrix(data=X,label=y)
params = {"objective":"reg:logistic",'colsample_bytree': 0.3,'learning_rate': 0.1,
                'max_depth': 5, 'alpha': 10}

cv_results = xgb.cv(dtrain=data_dmatrix, params=params, nfold=3,
                    num_boost_round=50,early_stopping_rounds=10,metrics="rmse", as_pandas=True, seed=123)
cv_results.head()

  | train-rmse-mean | train-rmse-std | test-rmse-mean | test-rmse-std
0 | 0.486073        | 0.000249       | 0.488303       | 0.000572
1 | 0.476171        | 0.003445       | 0.480417       | 0.002367
2 | 0.470651        | 0.004592       | 0.475556       | 0.003135
3 | 0.465350        | 0.005761       | 0.471855       | 0.003667
4 | 0.458325        | 0.006155       | 0.466840       | 0.004404
print((cv_results["test-rmse-mean"]).tail(1))
49    0.414635
Name: test-rmse-mean, dtype: float64

Nice. We have reduced our model error to 0.41

Report Summary

From the analysis of the Campus Recruitment dataset, here are my conclusions:

  • Educational percentages are highly influential in whether a candidate gets placed
  • Past work experience doesn't have much influence on final MBA placements
  • There is no apparent gender discrimination while hiring, but higher packages were given to male candidates
  • Academic percentages have no relation to the salary package.

For more Python-related blogs, visit us at Geekycodes. Follow us on Instagram.

If you’re a college student and have skills in programming languages, Want to earn through blogging? Mail us at geekycomail@gmail.com
