
All the Imports Needed for the Data

import pandas as pd
pd.options.display.max_colwidth = 80

import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

from sklearn.linear_model import SGDRegressor
from sklearn.svm import SVC # SVM model with kernels
from sklearn.model_selection import GridSearchCV

from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error


import warnings
warnings.filterwarnings('ignore')

Loading and Exploring Data

There are two files on student performance in two subjects: math and Portuguese (the dataset comes from Portugal). Important notice: the description (referred to later as DESCR) says that “there are several (382) students that belong to both datasets”. Since the Portuguese dataset is considerably larger than the math one, I will use the former.

student_por = pd.read_csv('/kaggle/input/student-performance-data-set/student-por.csv')
student_por.head()

   school  sex  age  address  famsize  Pstatus  Medu  Fedu     Mjob      Fjob  ...  famrel  freetime  goout  Dalc  Walc  health  absences  G1  G2  G3
0      GP    F   18        U      GT3        A     4     4  at_home   teacher  ...       4         3      4     1     1       3         4   0  11  11
1      GP    F   17        U      GT3        T     1     1  at_home     other  ...       5         3      3     1     1       3         2   9  11  11
2      GP    F   15        U      LE3        T     1     1  at_home     other  ...       4         3      2     2     3       3         6  12  13  12
3      GP    F   15        U      GT3        T     4     2   health  services  ...       3         2      2     1     1       5         0  14  14  14
4      GP    F   16        U      GT3        T     3     3    other     other  ...       4         3      2     1     2       5         0  11  13  13

5 rows × 33 columns

# check missing values in variables

student_por.isnull().sum()
school        0
sex           0
age           0
address       0
famsize       0
Pstatus       0
Medu          0
Fedu          0
Mjob          0
Fjob          0
reason        0
guardian      0
traveltime    0
studytime     0
failures      0
schoolsup     0
famsup        0
paid          0
activities    0
nursery       0
higher        0
internet      0
romantic      0
famrel        0
freetime      0
goout         0
Dalc          0
Walc          0
health        0
absences      0
G1            0
G2            0
G3            0
dtype: int64
student_por.isnull().any()
school        False
sex           False
age           False
address       False
famsize       False
Pstatus       False
Medu          False
Fedu          False
Mjob          False
Fjob          False
reason        False
guardian      False
traveltime    False
studytime     False
failures      False
schoolsup     False
famsup        False
paid          False
activities    False
nursery       False
higher        False
internet      False
romantic      False
famrel        False
freetime      False
goout         False
Dalc          False
Walc          False
health        False
absences      False
G1            False
G2            False
G3            False
dtype: bool
student_por.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 649 entries, 0 to 648
Data columns (total 33 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   school      649 non-null    object
 1   sex         649 non-null    object
 2   age         649 non-null    int64 
 3   address     649 non-null    object
 4   famsize     649 non-null    object
 5   Pstatus     649 non-null    object
 6   Medu        649 non-null    int64 
 7   Fedu        649 non-null    int64 
 8   Mjob        649 non-null    object
 9   Fjob        649 non-null    object
 10  reason      649 non-null    object
 11  guardian    649 non-null    object
 12  traveltime  649 non-null    int64 
 13  studytime   649 non-null    int64 
 14  failures    649 non-null    int64 
 15  schoolsup   649 non-null    object
 16  famsup      649 non-null    object
 17  paid        649 non-null    object
 18  activities  649 non-null    object
 19  nursery     649 non-null    object
 20  higher      649 non-null    object
 21  internet    649 non-null    object
 22  romantic    649 non-null    object
 23  famrel      649 non-null    int64 
 24  freetime    649 non-null    int64 
 25  goout       649 non-null    int64 
 26  Dalc        649 non-null    int64 
 27  Walc        649 non-null    int64 
 28  health      649 non-null    int64 
 29  absences    649 non-null    int64 
 30  G1          649 non-null    int64 
 31  G2          649 non-null    int64 
 32  G3          649 non-null    int64 
dtypes: int64(16), object(17)
memory usage: 167.4+ KB

I know from DESCR that G1 and G2 are grades from earlier in the school year (midterm exams, so to speak), so they precede the final grade and correlate a great deal with our target variable G3. For that reason I won’t be making another column with the average of these three.
After inspecting the dataset description, I’m curious how the health and absences values correlate. Perhaps I could make one feature out of them. But before looking for correlation we should normalize these features, because their ranges differ very much.
UPD: normalizing the values didn’t help. That is expected: Pearson’s correlation coefficient does not change when a variable is shifted or rescaled by a positive constant. Nonetheless, I leave the code in the cell below just as a reminder to myself.

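copied = student_por.copy()

def mean_normalization(column):
    # (x - mean) / (max - min), with the statistics taken from the column itself
    return (column - column.mean()) / (column.max() - column.min())

copied['absences'] = mean_normalization(copied['absences'])
copied['health'] = mean_normalization(copied['health'])

# numeric_only=True skips the string-valued columns (needed on recent pandas versions)
corr_matrix = copied.corr(numeric_only=True)

corr_matrix["absences"].sort_values(ascending=False)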
absences      1.000000
Dalc          0.172952
Walc          0.156373
age           0.149998
failures      0.122779
goout         0.085374
Fedu          0.029859
traveltime   -0.008149
Medu         -0.008577
freetime     -0.018716
health       -0.030235
famrel       -0.089534
G3           -0.091379
studytime    -0.118389
G2           -0.124745
G1           -0.147149
Name: absences, dtype: float64
# same check on the original, non-normalized data
corr_matrix = student_por.corr(numeric_only=True)

corr_matrix["absences"].sort_values(ascending=False)
A little bit about correlation.

Since the dataset is not too large, we can easily compute the standard correlation coefficient (also called Pearson’s r) between every pair of attributes using the corr() method. The correlation coefficient ranges from –1 to 1. When it is close to 1, there is a strong positive correlation; when it is close to –1, there is a strong negative correlation. Finally, coefficients close to 0 mean that there is no linear correlation.

The correlation coefficient only measures linear correlations (“if x goes up, then y generally goes up/down”). It may completely miss out on nonlinear relationships (e.g., “if x is close to 0, then y generally goes up”)
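To make the two points above concrete, here is a minimal sketch in plain Python. The data is a toy example, not the student dataset, and `pearson_r` is a helper written only for this illustration. It shows r close to 1 for a linear relationship, r close to 0 for a purely quadratic one, and no change in r after an arbitrary shift and rescale of x.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy data: x values symmetric around 0
xs = [i / 10 for i in range(-20, 21)]
linear = [3 * x + 2 for x in xs]      # perfectly linear in x
quadratic = [x ** 2 for x in xs]      # fully determined by x, but not linearly

r_linear = pearson_r(xs, linear)
r_quadratic = pearson_r(xs, quadratic)

# An arbitrary shift and rescale of x (a linear transformation) leaves r unchanged
rescaled = [(x - 5.7) / 75 for x in xs]
r_rescaled = pearson_r(rescaled, linear)

print(r_linear, r_quadratic, r_rescaled)
```

The quadratic case is the kind of relationship a correlation matrix can miss entirely, which is one reason to also look at plots.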

Let’s look at how much each numerical attribute correlates with the G3 value

corr_matrix["G3"].sort_values(ascending=False)
G3            1.000000
G2            0.918548
G1            0.826387
studytime     0.249789
Medu          0.240151
Fedu          0.211800
famrel        0.063361
goout        -0.087641
absences     -0.091379
health       -0.098851
age          -0.106505
freetime     -0.122705
traveltime   -0.127173
Walc         -0.176619
Dalc         -0.204719
failures     -0.393316
Name: G3, dtype: float64
Apparently, G3 correlates not only with G1 and G2 but also with studytime, failures, Dalc, Walc, traveltime, freetime, age, Medu (mother’s education) and Fedu (father’s education).
Another way to check for correlation between attributes is to use the pandas scatter_matrix() function, which plots every numerical attribute against every other numerical attribute. Since there are 16 numerical attributes, we would get 16×16 = 256 plots, which would not fit on a page, so let’s just focus on a few promising attributes that seem most correlated with G3.

from pandas.plotting import scatter_matrix

# I don't take G2 and G1 into account, because they are an obvious choice
attributes = ["G3", "studytime", "Fedu", "failures", "Dalc", "Walc"] 

scatter_matrix(student_por[attributes], figsize=(16, 12))
Choosing features. The goal is to predict G3

And yet another way to check numeric data for correlations

import seaborn as sns

# numeric_only=True skips the string-valued columns (needed on recent pandas versions)
corr_matrix = student_por.corr(numeric_only=True)

plt.figure(figsize=(20,20))
sns.heatmap(corr_matrix, annot=True, cmap="Blues")
plt.title('Correlation Heatmap', fontsize=20)
plt.show()

Judging by this heatmap and by the previous correlation matrices, studytime, failures, Dalc, Walc, traveltime, freetime, age, Medu and Fedu might really have an impact on G1-G3
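When picking features from a correlation matrix, the sign matters less than the strength, so it can help to rank attributes by absolute correlation with the target. The sketch below is only an illustration: the helper `rank_by_abs_corr` and the synthetic data frame are assumptions made up for this example, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def rank_by_abs_corr(df, target):
    """Correlations with `target`, strongest first regardless of sign."""
    corr = df.select_dtypes("number").corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

# Synthetic stand-in for the student data (all values are made up)
rng = np.random.default_rng(0)
n = 200
failures = rng.integers(0, 4, size=n)
studytime = rng.integers(1, 5, size=n)
toy = pd.DataFrame({
    "failures": failures,
    "studytime": studytime,
    "unrelated": rng.normal(0, 1, size=n),
    "G3": 12 - 2.0 * failures + 0.5 * studytime + rng.normal(0, 1, size=n),
})

ranking = rank_by_abs_corr(toy, "G3")
print(ranking)
```

With the real data, the same call on student_por would put strongly negative attributes such as failures next to the strongly positive ones instead of at the bottom of the list.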

Let’s now analyze the categorical variables
# comparing sex with G3
sns.boxplot(x="sex", y="G3", data=student_por)
plt.show()
# comparing school with G3
sns.boxplot(x="school", y="G3", data=student_por)
plt.show()
# comparing address with G3
sns.boxplot(x="address", y="G3", data=student_por)
plt.show()
# comparing parents' jobs with G3 (shown separately so the two plots don't overlap)
sns.boxplot(x="Mjob", y="G3", data=student_por)
plt.show()
sns.boxplot(x="Fjob", y="G3", data=student_por)
plt.show()
# comparing famsize with G3
sns.boxplot(x="famsize", y="G3", data=student_por)
plt.show()
# comparing Pstatus with G3
sns.boxplot(x="Pstatus", y="G3", data=student_por)
plt.show()
# comparing reason with G3
sns.boxplot(x="reason", y="G3", data=student_por)
plt.show()
# comparing guardian with G3
sns.boxplot(x="guardian", y="G3", data=student_por)
plt.show()
# comparing schoolsup with G3
sns.boxplot(x="schoolsup", y="G3", data=student_por)
plt.show()
# comparing famsup with G3
sns.boxplot(x="famsup", y="G3", data=student_por)
plt.show()
# comparing paid with G3
sns.boxplot(x="paid", y="G3", data=student_por)
plt.show()
We can draw similar boxplots for the other features.
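A numeric complement to reading boxplots by eye is to tabulate the target per category. The following is a small sketch on a made-up data frame; the helper name `summarize_by_category` is hypothetical and only serves this example.

```python
import pandas as pd

def summarize_by_category(df, categorical_cols, target):
    """Median, mean and count of `target` for every level of each column."""
    summaries = {}
    for col in categorical_cols:
        summaries[col] = df.groupby(col)[target].agg(["median", "mean", "count"])
    return summaries

# Made-up rows that only mimic the column names of the student data
toy = pd.DataFrame({
    "higher": ["yes", "yes", "no", "yes", "no", "yes"],
    "internet": ["yes", "no", "no", "yes", "yes", "no"],
    "G3": [14, 12, 8, 15, 9, 11],
})

summaries = summarize_by_category(toy, ["higher", "internet"], "G3")
for col, table in summaries.items():
    print(col)
    print(table, end="\n\n")
```

A large gap between the medians of two levels, backed by a reasonable count in each, is roughly what a visibly shifted boxplot shows.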

After examining the boxplots, I’ve come to the conclusion that the following numerical and categorical features have an impact on G3:

  • Numerical: studytime, failures, Dalc, Walc, traveltime, freetime, Medu, Fedu, G1, G2
  • Categorical: sex, school, address, Mjob, Fjob, reason, guardian, schoolsup, higher, internet

See the dataset description for info about each feature.

# making dataframe I'm gonna work with + target G3

features_chosen = ['studytime', 'failures', 'Dalc', 'Walc', 'traveltime', 'freetime',  'Medu', 'Fedu', 
                   'sex', 'school', 'address', 'Mjob', 'Fjob', 'reason', 'guardian', 'schoolsup', 
                   'higher', 'internet', 'G1', 'G2', 'G3']

student_reduced = student_por[features_chosen].copy()

student_reduced

I have given it a lot of thought, and here is what I’m thinking.

The point of this notebook is to predict G3, of course by selecting the best model and the best features for that. We are visualizing and analysing features such as travel time from home to school, possible drinking problems, romantic relationships, family status and so on; in other words, we are thinking about the things that influence our grades. Based on that, it would be better to get rid of G1 and G2, since these are the grades for the first and second halves of the year respectively, and they reflect the chosen features just as much as G3 does. Instead of having three grades, we could make one mean grade G out of them.

student_reduced["G"]=(student_reduced["G1"]+student_reduced["G2"]+student_reduced["G3"])/3
# dropping initial grades and leaving mean 
student_reduced.drop(['G1', 'G2', 'G3'], axis=1, inplace=True)

But for now, I will leave G1, G2 and G3 as they are.

Another quick way to get a feel of the type of data we are dealing with is to plot a histogram for each numerical attribute. A histogram shows the number of instances (on the vertical axis) that have a given value range (on the horizontal axis).

student_reduced.hist(bins=20, figsize=(20,15))
plt.show()

Looking at the data, we can see string-valued features.
They are not arbitrary text: each has a limited number of possible values, and each value represents a category. So these are categorical attributes. Most Machine Learning algorithms prefer to work with numbers, so let’s convert these categories from text to numbers. For this we can use Scikit-Learn’s OneHotEncoder class, which is well suited to nominal categorical variables. For the numerical values I will use StandardScaler. I will combine these two transformers in one ColumnTransformer.
As far as I know, in a Scikit-Learn Pipeline all but the last estimator must be transformers (i.e., they must have a fit_transform() method).

from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer

features_cat = ['sex','school','address','Mjob','Fjob','reason','schoolsup','guardian','higher','internet']
features_num = ['studytime', 'failures', 'Dalc', 'Walc', 'traveltime', 'freetime', 'Medu', 'Fedu']

# scale the numerical columns and one-hot encode the categorical ones
full_pipeline = ColumnTransformer([
    ("num", StandardScaler(), features_num),
    ("encoder", OneHotEncoder(), features_cat),
])

# ColumnTransformer selects columns by name, so it needs a DataFrame as input
X_prepared = full_pipeline.fit_transform(student_reduced)
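The remark about transformers and the last estimator can be sketched as follows: a ColumnTransformer serves as the preprocessing step of a Pipeline whose final step is the regressor. This is a minimal illustration on made-up data, not the notebook's actual workflow; the step names and the `handle_unknown="ignore"` setting are my own choices.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows that only mimic the shape of the student data
rng = np.random.default_rng(42)
n = 120
toy = pd.DataFrame({
    "studytime": rng.integers(1, 5, size=n),
    "failures": rng.integers(0, 4, size=n),
    "sex": rng.choice(["F", "M"], size=n),
    "school": rng.choice(["GP", "MS"], size=n),
})
target = 12 + toy["studytime"] - 2 * toy["failures"] + rng.normal(0, 1, size=n)

preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["studytime", "failures"]),
    # handle_unknown="ignore" avoids errors on categories unseen during fit
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["sex", "school"]),
])

model = Pipeline([
    ("preprocess", preprocess),                                   # transformer step
    ("regressor", SGDRegressor(penalty="l2", random_state=42)),   # final estimator
])

model.fit(toy, target)
predictions = model.predict(toy.head(3))
print(predictions)
```

Wrapping everything in one Pipeline means the scaler and encoder are refitted on the training folds only whenever the pipeline is passed to cross-validation utilities.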

UPD: I thought of a better way to transform my features than this pipeline. Still, for the sake of my experiments, I am leaving the pipeline code above as it is.

The get_dummies() function from pandas turns every value of every categorical feature into its own column and assigns 1 to the rows where that value is present and 0 to the rows where it is not. It affects only the categorical features.

features_cat = ['sex','school','address','Mjob','Fjob','reason','schoolsup','guardian','higher','internet']

student_reduced_cat = pd.get_dummies(student_reduced, columns = features_cat)
student_reduced_cat

     studytime  failures  Dalc  Walc  traveltime  freetime  Medu  Fedu          G  ...
0            2         0     1     1           2         3     4     4   7.333333  ...
1            2         0     1     1           1         3     1     1  10.333333  ...
2            2         0     2     3           1         3     1     1  12.333333  ...
3            3         0     1     1           1         2     4     2  14.000000  ...
4            2         0     1     2           1         3     3     3  12.333333  ...
...        ...       ...   ...   ...         ...       ...   ...   ...        ...  ...
644          3         1     1     2           1         4     2     3  10.333333  ...
645          2         0     1     1           1         3     3     1  15.333333  ...
646          2         0     1     1           2         1     1     1  10.666667  ...
647          1         0     3     4           2         4     3     1  10.000000  ...
648          1         0     3     4           3         4     3     2  10.666667  ...

649 rows × 38 columns

student_reduced_cat.columns
Index(['studytime', 'failures', 'Dalc', 'Walc', 'traveltime', 'freetime',
       'Medu', 'Fedu', 'G1', 'G2', 'G3', 'sex_F', 'sex_M', 'school_GP',
       'school_MS', 'address_R', 'address_U', 'Mjob_at_home', 'Mjob_health',
       'Mjob_other', 'Mjob_services', 'Mjob_teacher', 'Fjob_at_home',
       'Fjob_health', 'Fjob_other', 'Fjob_services', 'Fjob_teacher',
       'reason_course', 'reason_home', 'reason_other', 'reason_reputation',
       'schoolsup_no', 'schoolsup_yes', 'guardian_father', 'guardian_mother',
       'guardian_other', 'higher_no', 'higher_yes', 'internet_no',
       'internet_yes'],
      dtype='object')
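To see on a small scale what get_dummies() does to the column names listed above, here is a toy example with three made-up rows. The `drop_first=True` variant is shown only as an option; it is not what the notebook uses.

```python
import pandas as pd

toy = pd.DataFrame({
    "studytime": [2, 3, 1],
    "guardian": ["mother", "father", "other"],
    "internet": ["yes", "no", "yes"],
})

# Only the listed columns are expanded; numeric columns pass through unchanged
encoded = pd.get_dummies(toy, columns=["guardian", "internet"])
print(list(encoded.columns))

# drop_first=True keeps k-1 indicator columns per feature, which removes
# the redundancy between e.g. internet_no and internet_yes
compact = pd.get_dummies(toy, columns=["guardian", "internet"], drop_first=True)
print(list(compact.columns))
```

Each new column is named after the original feature and the category value, which is why the index above contains names such as guardian_mother and internet_yes.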

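Predictor and Target Variables

# drop(columns=...) replaces the positional axis argument, which pandas 2.0 no longer accepts
X = np.array(student_reduced_cat.drop(columns=['G3']))
y = np.array(student_reduced_cat['G3'])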
Scaling numerical variables
scaler = StandardScaler()

X = scaler.fit_transform(X)
X.shape
(649, 39)

Before looking at the data any further, I need to create a test set, put it aside, and never look at it. (c) Aurélien Géron

X_train, X_test,y_train, y_test = train_test_split(X, y, test_size=0.24, random_state=42)

X_train.shape, X_test.shape
((493, 39), (156, 39))

I guess we have a sufficient number of instances in the dataset for each stratum, so there is no need for stratified sampling
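One detail worth noting about the steps above: the scaler was fitted on all rows before the split, so the test rows influenced the scaling statistics. A common alternative, sketched here on random toy data rather than the student data, is to split first and fit the scaler on the training part only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Random stand-in features and target (not the student data)
rng = np.random.default_rng(0)
X_toy = rng.normal(loc=10.0, scale=3.0, size=(100, 4))
y_toy = rng.normal(size=100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.24, random_state=42
)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)   # statistics learned from training rows only
X_te_scaled = scaler.transform(X_te)       # test rows reuse those statistics

print(X_tr_scaled.mean(axis=0).round(6))   # close to 0 by construction
print(X_te_scaled.mean(axis=0).round(6))   # close to 0, but not exactly
```

On a dataset of this size the difference in results is usually small, but keeping the test set out of every fitting step matches the spirit of the quote above.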

Selecting and Training the Model

I’ll try Linear Regression with regularization

from sklearn.linear_model import SGDRegressor

sgd_reg = SGDRegressor(penalty="l2") # specifying Ridge Regression

sgd_reg.fit(X_train, y_train)
SGDRegressor()
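r2 = sgd_reg.score(X_test, y_test)  # for regressors, score() returns R², not accuracy
r2
0.8630675285309015
An R² of 0.86 is really good, although much of it is likely owed to G1 and G2 being among the features.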
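For a regressor, Scikit-Learn's score() method returns the coefficient of determination R², that is, one minus the ratio of the residual sum of squares to the total sum of squares. A minimal hand computation, using made-up grades and predictions purely for illustration:

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up grades and predictions
y_true = [10, 12, 14, 8, 16]
y_pred = [11, 12, 13, 9, 15]

r2_model = r_squared(y_true, y_pred)
r2_mean_baseline = r_squared(y_true, [12] * 5)   # always predicting the mean grade

print(r2_model, r2_mean_baseline)
```

A value of 1 means perfect predictions, 0 means the model does no better than always predicting the mean, and negative values are possible for models that do worse than that.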

But perhaps the model underfits or overfits.

There are a few ways to find that out:

  • Learning curves – these are plots of the model’s performance on the training set and the validation set as a function of the training set size
  • Cross-validation – if a model performs well on the training data but generalizes poorly according to the cross-validation metrics, then your model is overfitting. If it performs poorly on both, then it is underfitting.
Learning Curves
def plot_learning_curves(model, X, y):
    # hold out 20% of the data as a validation set
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
    train_errors, val_errors = [], []

    # retrain on the first m training instances and record both errors
    for m in range(1, len(X_train)):
        model.fit(X_train[:m], y_train[:m])
        y_train_predict = model.predict(X_train[:m])
        y_val_predict = model.predict(X_val)
        train_errors.append(mean_squared_error(y_train[:m], y_train_predict))
        val_errors.append(mean_squared_error(y_val, y_val_predict))

    plt.plot(np.sqrt(train_errors), "r-+", linewidth=2, label="train")
    plt.plot(np.sqrt(val_errors), "b-", linewidth=3, label="val")
    plt.legend(loc="upper right")
    plt.xlabel("Training set size")
    plt.ylabel("RMSE")
Estimating SGDRegressor’s generalization performance
sgd_reg_curves = SGDRegressor(penalty='l2') 

plot_learning_curves(sgd_reg_curves, X, y)

From what I can understand by looking at the learning curves, the model is fine
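Scikit-Learn also ships a learning_curve() utility that produces the same kind of curves with cross-validation at each training size, instead of a single validation split. The sketch below runs it on synthetic data (the coefficients and noise level are arbitrary assumptions) and prints the RMSE values rather than plotting them.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import learning_curve

# Synthetic linear data with noise (not the student data)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 5))
y_toy = X_toy @ np.array([2.0, -1.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.5, size=300)

train_sizes, train_scores, val_scores = learning_curve(
    SGDRegressor(penalty="l2", random_state=42),
    X_toy, y_toy,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5,
    scoring="neg_mean_squared_error",
)

# Scores are negative MSE per fold; average over folds, then convert to RMSE
train_rmse = np.sqrt(-train_scores.mean(axis=1))
val_rmse = np.sqrt(-val_scores.mean(axis=1))

for size, tr, va in zip(train_sizes, train_rmse, val_rmse):
    print(f"{size:4d} samples  train RMSE {tr:.3f}  val RMSE {va:.3f}")
```

If the two curves end up close together at a low error, the model is neither overfitting nor underfitting much; a persistent gap points to overfitting, and two high plateaus point to underfitting.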

Better Evaluation Using Cross-Validation

Scikit-Learn’s cross-validation features expect a utility function (greater is better) rather than a cost function (lower is better), so the scoring function is actually the opposite of the MSE (i.e., a negative value), which is why the code below computes -scores before calculating the square root.

scores = cross_val_score(sgd_reg, X_train, y_train, scoring="neg_mean_squared_error", cv=10) 

sgd_reg_scores = np.sqrt(-scores)
sgd_reg_scores
array([1.17959516, 1.67437116, 1.88229932, 1.52199586, 1.60085582,
       1.11392725, 1.24186602, 0.94190442, 1.07220833, 0.82125016])

Let’s look at the results

def display_scores(scores):
    print('Scores:', scores)
    print('Std.  :', scores.std())
    print('Mean  :', scores.mean())
    
display_scores(sgd_reg_scores)
Scores: [1.17959516 1.67437116 1.88229932 1.52199586 1.60085582 1.11392725
 1.24186602 0.94190442 1.07220833 0.82125016]
Std.  : 0.32872370834150016
Mean  : 1.3050273502279368
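To apply the overfitting check described earlier (good on training data, poor under cross-validation), the training error of each fold is needed as well. One way to get both, sketched here on synthetic data with arbitrary coefficients, is cross_validate() with return_train_score=True.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import cross_validate

# Synthetic linear data with noise (not the student data)
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(200, 4))
y_toy = X_toy @ np.array([1.5, -2.0, 0.0, 0.7]) + rng.normal(scale=0.5, size=200)

results = cross_validate(
    SGDRegressor(penalty="l2", random_state=42),
    X_toy, y_toy,
    scoring="neg_mean_squared_error",
    cv=10,
    return_train_score=True,
)

# Convert negative MSE to RMSE for both the training and validation folds
train_rmse = np.sqrt(-results["train_score"])
val_rmse = np.sqrt(-results["test_score"])

print("train RMSE mean:", train_rmse.mean())
print("val   RMSE mean:", val_rmse.mean())
```

A validation RMSE far above the training RMSE would suggest overfitting, while two similar values suggest the model generalizes about as well as it fits.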

So, in conclusion, Linear Regression with regularization works fine: the cross-validated RMSE is about 1.3 grade points on average.
