#### All Needed Imports for the Data

```
import pandas as pd
pd.options.display.max_colwidth = 80
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDRegressor
from sklearn.svm import SVC # SVM model with kernels
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error
import warnings
warnings.filterwarnings('ignore')
```

#### Loading and Exploring Data

There are two files of student performance data, one per subject: math and Portuguese (Portugal is the country the dataset comes from). Important notice: the dataset description (from here on, DESCR) tells us that “there are several (382) students that belong to both datasets”. Since the Portuguese dataset is roughly twice as large as the math one, I will be working with the former.

```
student_por = pd.read_csv('/kaggle/input/student-performance-data-set/student-por.csv')
student_por.head()
```

| | school | sex | age | address | famsize | Pstatus | Medu | Fedu | Mjob | Fjob | … | famrel | freetime | goout | Dalc | Walc | health | absences | G1 | G2 | G3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | GP | F | 18 | U | GT3 | A | 4 | 4 | at_home | teacher | … | 4 | 3 | 4 | 1 | 1 | 3 | 4 | 0 | 11 | 11 |
| 1 | GP | F | 17 | U | GT3 | T | 1 | 1 | at_home | other | … | 5 | 3 | 3 | 1 | 1 | 3 | 2 | 9 | 11 | 11 |
| 2 | GP | F | 15 | U | LE3 | T | 1 | 1 | at_home | other | … | 4 | 3 | 2 | 2 | 3 | 3 | 6 | 12 | 13 | 12 |
| 3 | GP | F | 15 | U | GT3 | T | 4 | 2 | health | services | … | 3 | 2 | 2 | 1 | 1 | 5 | 0 | 14 | 14 | 14 |
| 4 | GP | F | 16 | U | GT3 | T | 3 | 3 | other | other | … | 4 | 3 | 2 | 1 | 2 | 5 | 0 | 11 | 13 | 13 |

5 rows × 33 columns
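As a quick sanity check of the DESCR claim about overlapping students, one could merge the two files on the identifying attributes and count the matches. This is only a sketch: the math file name and the join columns are assumptions based on the merge script distributed with the UCI dataset.

```
# hypothetical sketch: count students appearing in both files, per the DESCR note
student_mat = pd.read_csv('/kaggle/input/student-performance-data-set/student-mat.csv')
key_cols = ['school', 'sex', 'age', 'address', 'famsize', 'Pstatus',
            'Medu', 'Fedu', 'Mjob', 'Fjob', 'reason', 'nursery', 'internet']
both = student_por.merge(student_mat, on=key_cols)
len(both)   # the DESCR reports 382 students present in both datasets
```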

```
# check missing values in variables
student_por.isnull().sum()
```

All 33 columns report 0 missing values.

```
student_por.isnull().any()
```

All 33 columns return False, again confirming there are no missing values.

```
student_por.info()
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 649 entries, 0 to 648
Data columns (total 33 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   school      649 non-null    object
 1   sex         649 non-null    object
 2   age         649 non-null    int64
 3   address     649 non-null    object
 4   famsize     649 non-null    object
 5   Pstatus     649 non-null    object
 6   Medu        649 non-null    int64
 7   Fedu        649 non-null    int64
 8   Mjob        649 non-null    object
 9   Fjob        649 non-null    object
 10  reason      649 non-null    object
 11  guardian    649 non-null    object
 12  traveltime  649 non-null    int64
 13  studytime   649 non-null    int64
 14  failures    649 non-null    int64
 15  schoolsup   649 non-null    object
 16  famsup      649 non-null    object
 17  paid        649 non-null    object
 18  activities  649 non-null    object
 19  nursery     649 non-null    object
 20  higher      649 non-null    object
 21  internet    649 non-null    object
 22  romantic    649 non-null    object
 23  famrel      649 non-null    int64
 24  freetime    649 non-null    int64
 25  goout       649 non-null    int64
 26  Dalc        649 non-null    int64
 27  Walc        649 non-null    int64
 28  health      649 non-null    int64
 29  absences    649 non-null    int64
 30  G1          649 non-null    int64
 31  G2          649 non-null    int64
 32  G3          649 non-null    int64
dtypes: int64(16), object(17)
memory usage: 167.4+ KB
```

**I know from the DESCR that G1 and G2 are grades for midterm exams, so they lead up to the final exam and correlate a great deal with our target variable G3; for now I won’t be making an extra column with the average of the three.**

**After inspecting the dataset description, I’m curious how the health and absences values correlate. Perhaps I could make one feature out of them. But before looking for correlation, we should normalize these features, because their ranges differ very much.**

**UPD: normalizing the values didn’t help; normalized or not, nothing changes. In hindsight this makes sense: Pearson’s r is invariant to shifting and scaling a variable, so mean normalization cannot change the correlation matrix. Nonetheless, I leave the code in the cell below as a reminder to myself.**

```
copied = student_por.copy()

# hard-coded statistics used for the mean normalization
mean = 5.7     # mean of 'absences'
max_min = 75   # range (max - min)

def mean_normalization(x):
    return (x - mean) / max_min

# note: the same constants are applied to both columns here
copied['absences'] = copied['absences'].apply(mean_normalization)
copied['health'] = copied['health'].apply(mean_normalization)

corr_matrix = copied.corr()
corr_matrix["absences"].sort_values(ascending=False)
```

```
absences      1.000000
Dalc          0.172952
Walc          0.156373
age           0.149998
failures      0.122779
goout         0.085374
Fedu          0.029859
traveltime   -0.008149
Medu         -0.008577
freetime     -0.018716
health       -0.030235
famrel       -0.089534
G3           -0.091379
studytime    -0.118389
G2           -0.124745
G1           -0.147149
Name: absences, dtype: float64
```

```
corr_matrix = student_por.corr()
corr_matrix["absences"].sort_values(ascending=False)
```
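As the UPD above notes, the un-normalized correlations come out the same. The reason is the invariance of Pearson’s r under affine transforms; a minimal check on two of the columns:

```
# check: affine-transforming a column does not change Pearson's r
raw = student_por['absences'].corr(student_por['health'])
norm = ((student_por['absences'] - 5.7) / 75).corr(student_por['health'])
print(raw, norm)   # identical up to floating-point error
```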

###### A little bit about correlation.

Since the dataset is not too large, we can easily compute the standard correlation coefficient (also called Pearson’s r) between every pair of attributes using the *corr()* method. The correlation coefficient ranges from –1 to 1. When it is close to 1, there is a strong positive correlation; when it is close to –1, there is a strong negative correlation. Finally, coefficients close to 0 mean that there is no linear correlation.

The correlation coefficient only measures linear correlations (“if x goes up, then y generally goes up/down”). It may completely miss nonlinear relationships (e.g., “if x is close to 0, then y generally goes up”).
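A tiny synthetic illustration of that caveat (made-up data, not from the dataset):

```
# Pearson's r can be ~0 even when y is completely determined by x
x_demo = np.linspace(-1, 1, 201)
y_demo = x_demo ** 2                       # perfect, but nonlinear, dependence
print(np.corrcoef(x_demo, y_demo)[0, 1])   # ~0: no *linear* correlation
```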

Let’s look at how much each numerical attribute correlates with the G3 value.

```
corr_matrix["G3"].sort_values(ascending=False)
```

```
G3            1.000000
G2            0.918548
G1            0.826387
studytime     0.249789
Medu          0.240151
Fedu          0.211800
famrel        0.063361
goout        -0.087641
absences     -0.091379
health       -0.098851
age          -0.106505
freetime     -0.122705
traveltime   -0.127173
Walc         -0.176619
Dalc         -0.204719
failures     -0.393316
Name: G3, dtype: float64
```

**Apparently, G3 correlates not only with G1 and G2 but also with studytime, failures, Dalc, Walc, traveltime, freetime, age, Medu (mother’s education) and Fedu (father’s education).**

**Another way to check for correlation between attributes is to use the pandas scatter_matrix() function, which plots every numerical attribute against every other numerical attribute. Since there are 16 numerical attributes, we would get 16 × 16 = 256 plots, which would not fit on a page, so let’s just focus on a few promising attributes that seem most correlated with G3.**

```
from pandas.plotting import scatter_matrix
# I don't take G2 and G1 into account, because they are an obvious choice
attributes = ["G3", "studytime", "Fedu", "failures", "Dalc", "Walc"]
scatter_matrix(student_por[attributes], figsize=(16, 12))
```

##### Choosing features. The goal is to predict *G3*

**And yet another way to check numeric data for correlations**

```
import seaborn as sns
corr_matrix = student_por.corr()
plt.figure(figsize=(20,20))
sns.heatmap(corr_matrix, annot=True, cmap="Blues")
plt.title('Correlation Heatmap', fontsize=20)
```


Judging by this heatmap and also by the previous correlation matrices, studytime, failures, Dalc, Walc, traveltime, freetime, age, Medu and Fedu might really have an impact on G1–G3.

###### Let’s now analyze categorical variables

```
#comparing sex with G3
sns.boxplot(x="sex", y="G3", data=student_por)
```


```
#comparing school with G3
sns.boxplot(x="school", y="G3", data=student_por)
```


```
#comparing address with G3
sns.boxplot(x="address", y="G3", data=student_por)
```


```
#comparing parents' jobs with G3; plotted separately so the two boxplots don't overlap
sns.boxplot(x="Mjob", y="G3", data=student_por)
plt.show()
sns.boxplot(x="Fjob", y="G3", data=student_por)
plt.show()
```


```
#comparing famsize with G3
sns.boxplot(x="famsize", y="G3", data=student_por)
```


```
#comparing Pstatus with G3
sns.boxplot(x="Pstatus", y="G3", data=student_por)
```


```
#comparing reason with G3
sns.boxplot(x="reason", y="G3", data=student_por)
```


```
#comparing guardian with G3
sns.boxplot(x="guardian", y="G3", data=student_por)
```


```
#comparing schoolsup with G3
sns.boxplot(x="schoolsup", y="G3", data=student_por)
```


```
#comparing famsup with G3
sns.boxplot(x="famsup", y="G3", data=student_por)
```


```
#comparing paid with G3
sns.boxplot(x="paid", y="G3", data=student_por)
```


###### We can draw similar boxplots for the other features, as sketched below
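Rather than one cell per feature, the remaining yes/no features could be plotted in a single pass. A sketch; the column names come from the dataset:

```
# draw boxplots for the remaining binary features in one figure
remaining = ['activities', 'nursery', 'higher', 'internet', 'romantic']
fig, axes = plt.subplots(1, len(remaining), figsize=(20, 4))
for ax, col in zip(axes, remaining):
    sns.boxplot(x=col, y="G3", data=student_por, ax=ax)
plt.show()
```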

**After examining the boxplots, I’ve come to the conclusion that the following numerical and categorical features have an impact on G3:**

- Numerical: studytime, failures, Dalc, Walc, traveltime, freetime, Medu, Fedu, G1, G2
- Categorical: sex, school, address, Mjob + Fjob, reason, guardian, schoolsup, higher, internet

See the dataset description for info about each feature.

```
# making dataframe I'm gonna work with + target G3
features_chosen = ['studytime', 'failures', 'Dalc', 'Walc', 'traveltime', 'freetime', 'Medu', 'Fedu',
'sex', 'school', 'address', 'Mjob', 'Fjob', 'reason', 'guardian', 'schoolsup',
'higher', 'internet', 'G1', 'G2', 'G3']
student_reduced = student_por[features_chosen].copy()
student_reduced
```

**I have given it a lot of thought, and here is what I’m thinking.**

The point of this notebook is to predict G3, of course by selecting the best model and the best features for that. And we are visualizing and analysing these features, such as travel time from home to school, possible drinking problems, romantic relationships, family status and so on: we are basically thinking about the things that influence grades. Based on these thoughts, it would be better to get rid of G1 and G2, since these are the grades for the first and second periods of the year, and they are, as much as G3, reflections of the features chosen. Instead of having three grades, we could make one mean grade G out of them.

```
student_reduced["G"]=(student_reduced["G1"]+student_reduced["G2"]+student_reduced["G3"])/3
# dropping initial grades and leaving mean
student_reduced.drop(['G1', 'G2', 'G3'], axis=1, inplace=True)
```

But for now, I will leave them be.

Another quick way to get a feel for the data is to plot a histogram for each numerical attribute. A histogram shows the number of instances (on the vertical axis) that fall within a given value range (on the horizontal axis).

```
student_reduced.hist(bins=20, figsize=(20,15))
plt.show()
```

Looking at the data, we can see string-valued features.

These are not arbitrary text: each takes one of a limited set of possible values, and each value represents a category, so these are categorical attributes. Most machine learning algorithms prefer to work with numbers, so let’s convert these categories from text to numbers. For this, we can use Scikit-Learn’s OneHotEncoder class, one of the best options for nominal categorical variables. For the numerical values I will use StandardScaler. I will put these two transformers into one ColumnTransformer.

As far as I know, in a Pipeline all but the last estimator must be transformers (i.e., they must have a fit_transform() method).

```
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer

features_cat = ['sex', 'school', 'address', 'Mjob', 'Fjob', 'reason', 'schoolsup', 'guardian', 'higher', 'internet']
features_num = ['studytime', 'failures', 'Dalc', 'Walc', 'traveltime', 'freetime', 'Medu', 'Fedu']

full_pipeline = ColumnTransformer([
    ("num", StandardScaler(), features_num),     # scale the numerical features
    ("encoder", OneHotEncoder(), features_cat),  # one-hot encode the categorical features
])
# note: X_train here would have to be a DataFrame split off beforehand;
# in this notebook the split only happens further below
X_train_prepared = full_pipeline.fit_transform(X_train)
```
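For completeness, this ColumnTransformer could also serve as the first step of a full Pipeline, which is where the “all but the last estimator must be transformers” rule applies. A sketch, not used further in this notebook:

```
from sklearn.pipeline import Pipeline

# sketch: preprocessing + model in one object; every step except the
# last must be a transformer (i.e., provide fit_transform())
full_model = Pipeline([
    ("preprocess", full_pipeline),
    ("reg", SGDRegressor(penalty="l2")),   # final estimator only needs fit()
])
# full_model.fit(X_train, y_train) would then preprocess and train in one call
```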

**UPD: instead of this pipeline, I thought of a better way to transform my features. Anyway, for the sake of my experiments, I am leaving the pipeline discussed above in its code block.**

**The get_dummies() method from pandas turns every value of every categorical feature into a column name and assigns 1 to instances where this value is present and 0 where it is not. The method affects only the listed categorical features.**

```
features_cat = ['sex','school','address','Mjob','Fjob','reason','schoolsup','guardian','higher','internet']
student_reduced_cat = pd.get_dummies(student_reduced, columns = features_cat)
student_reduced_cat
```

| | studytime | failures | Dalc | Walc | traveltime | freetime | Medu | Fedu | G | … | reason_reputation | schoolsup_no | schoolsup_yes | guardian_father | guardian_mother | guardian_other | higher_no | higher_yes | internet_no | internet_yes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 0 | 1 | 1 | 2 | 3 | 4 | 4 | 7.333333 | … | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 |
| 1 | 2 | 0 | 1 | 1 | 1 | 3 | 1 | 1 | 10.333333 | … | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |
| 2 | 2 | 0 | 2 | 3 | 1 | 3 | 1 | 1 | 12.333333 | … | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |
| 3 | 3 | 0 | 1 | 1 | 1 | 2 | 4 | 2 | 14.000000 | … | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |
| 4 | 2 | 0 | 1 | 2 | 1 | 3 | 3 | 3 | 12.333333 | … | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| 644 | 3 | 1 | 1 | 2 | 1 | 4 | 2 | 3 | 10.333333 | … | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |
| 645 | 2 | 0 | 1 | 1 | 1 | 3 | 3 | 1 | 15.333333 | … | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |
| 646 | 2 | 0 | 1 | 1 | 2 | 1 | 1 | 1 | 10.666667 | … | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 |
| 647 | 1 | 0 | 3 | 4 | 2 | 4 | 3 | 1 | 10.000000 | … | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |
| 648 | 1 | 0 | 3 | 4 | 3 | 4 | 3 | 2 | 10.666667 | … | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |

649 rows × 38 columns

```
student_reduced_cat.columns
```

```
Index(['studytime', 'failures', 'Dalc', 'Walc', 'traveltime', 'freetime',
       'Medu', 'Fedu', 'G1', 'G2', 'G3', 'sex_F', 'sex_M', 'school_GP',
       'school_MS', 'address_R', 'address_U', 'Mjob_at_home', 'Mjob_health',
       'Mjob_other', 'Mjob_services', 'Mjob_teacher', 'Fjob_at_home',
       'Fjob_health', 'Fjob_other', 'Fjob_services', 'Fjob_teacher',
       'reason_course', 'reason_home', 'reason_other', 'reason_reputation',
       'schoolsup_no', 'schoolsup_yes', 'guardian_father', 'guardian_mother',
       'guardian_other', 'higher_no', 'higher_yes', 'internet_no',
       'internet_yes'],
      dtype='object')
```
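A possible refinement, not applied below: for binary features such as schoolsup or internet, the two dummy columns are perfectly redundant (one is always 1 minus the other), so pd.get_dummies can keep just one of them. A sketch:

```
# optional: one dummy per binary feature instead of two redundant columns
student_reduced_alt = pd.get_dummies(student_reduced, columns=features_cat, drop_first=True)
student_reduced_alt.shape
```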

Predictor and target variables

```
X = np.array(student_reduced_cat.drop(['G3'], axis=1))
y = np.array(student_reduced_cat['G3'])
```

##### Scaling numerical variables

```
# note: this scales every column, including the 0/1 dummies, and the
# scaler is fit on the full dataset before the train/test split
scaler = StandardScaler()
X = scaler.fit_transform(X)
```
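As flagged in the comments above, this also rescales the dummy columns. If one wanted to scale only the original numeric columns instead, a sketch (not used below; X_alt is a hypothetical name) could look like this:

```
# hypothetical variant: scale only the numeric columns, keep dummies as 0/1
features_num = ['studytime', 'failures', 'Dalc', 'Walc', 'traveltime', 'freetime', 'Medu', 'Fedu']
X_df = student_reduced_cat.drop(['G3'], axis=1)
X_df[features_num] = StandardScaler().fit_transform(X_df[features_num])
X_alt = X_df.to_numpy()
```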

```
X.shape
```

(649, 39)

Before looking at the data any further, I need to create a test set, put it aside, and never look at it. (c) Aurélien Géron

```
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.24, random_state=42)
```

```
X_train.shape, X_test.shape
```

((493, 39), (156, 39))

I guess we have a sufficient number of instances for each stratum in the dataset, so there is no need for *stratified sampling*; if there were, the split could be stratified on binned grades, as sketched below.
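A sketch of what that could look like, using pd.cut to bin the target (the bin count is an assumption; very sparse bins might need merging before stratifying):

```
# hypothetical sketch: stratified train/test split on binned target grades
y_bins = pd.cut(y, bins=5, labels=False)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.24, stratify=y_bins, random_state=42)
```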

##### Selecting and Training the Model

I’ll try Linear Regression with regularization

```
from sklearn.linear_model import SGDRegressor
sgd_reg = SGDRegressor(penalty="l2") # specifying Ridge Regression
sgd_reg.fit(X_train, y_train)
```

SGDRegressor()

```
# for a regressor, score() returns R² (coefficient of determination), not accuracy
r2 = sgd_reg.score(X_test, y_test)
r2
```

0.8630675285309015

###### An R² score of 0.86 is really good.

**But perhaps the model underfits or overfits.**

**There are a few ways to find that out:**

- Learning curves – these are plots of the model’s performance on the training set and the validation set as a function of the training set size
- Cross-validation – if a model performs well on the training data but generalizes poorly according to the cross-validation metrics, then your model is overfitting. If it performs poorly on both, then it is underfitting.

##### Learning Curves

```
def plot_learning_curves(model, X, y):
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
    train_errors, val_errors = [], []
    for m in range(1, len(X_train)):
        model.fit(X_train[:m], y_train[:m])
        y_train_predict = model.predict(X_train[:m])
        y_val_predict = model.predict(X_val)
        train_errors.append(mean_squared_error(y_train[:m], y_train_predict))
        val_errors.append(mean_squared_error(y_val, y_val_predict))
    plt.plot(np.sqrt(train_errors), "r-+", linewidth=2, label="train")
    plt.plot(np.sqrt(val_errors), "b-", linewidth=3, label="val")
    plt.xlabel("Training set size")
    plt.ylabel("RMSE")
    plt.legend()
```

##### Estimating SGDRegressor’s generalization performance

```
sgd_reg_curves = SGDRegressor(penalty='l2')
plot_learning_curves(sgd_reg_curves, X, y)
```

From what I can understand looking at the learning curves, the model is fine: the training and validation curves end up close together at a fairly low error, with no strong sign of overfitting or underfitting.

###### Better Evaluation Using Cross-Validation

Scikit-Learn’s cross-validation features expect a utility function (greater is better) rather than a cost function (lower is better), so the scoring function is actually the opposite of the MSE (i.e., a negative value), which is why the following code computes -scores before calculating the square root.

```
scores = cross_val_score(sgd_reg, X_train, y_train, scoring="neg_mean_squared_error", cv=10)
sgd_reg_scores = np.sqrt(-scores)
```

```
sgd_reg_scores
```

array([1.17959516, 1.67437116, 1.88229932, 1.52199586, 1.60085582, 1.11392725, 1.24186602, 0.94190442, 1.07220833, 0.82125016])

Let’s look at the results

```
def display_scores(scores):
    print('Scores:', scores)
    print('Std. :', scores.std())
    print('Mean :', scores.mean())

display_scores(sgd_reg_scores)
```

```
Scores: [1.17959516 1.67437116 1.88229932 1.52199586 1.60085582 1.11392725
 1.24186602 0.94190442 1.07220833 0.82125016]
Std. : 0.32872370834150016
Mean : 1.3050273502279368
```

So, as a conclusion, I must say that linear regression with regularization works fine here: a cross-validated RMSE of about 1.3 on a 0–20 grade scale is quite acceptable.
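As a possible next step, the GridSearchCV imported at the top (but unused so far) could tune the regularization strength. A sketch; the grid values are guesses, not tuned results:

```
# sketch: search over SGDRegressor's regularization strength alpha
param_grid = {'alpha': [1e-4, 1e-3, 1e-2, 1e-1]}
grid_search = GridSearchCV(SGDRegressor(penalty="l2"), param_grid,
                           scoring="neg_mean_squared_error", cv=5)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_, np.sqrt(-grid_search.best_score_))
```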
