
Precision and Recall with Scikit Learn

In this post we will demonstrate how to use scikit-learn to calculate the precision and recall of a machine learning classifier in Python.

Precision refers to the fraction of positive predictions that are actually correct. In the binary case, we determine precision by dividing the number of true positives by the number of true positives plus false positives: precision = TP / (TP + FP).

Recall refers to the fraction of actual positive examples that are correctly identified. We define recall by dividing the number of true positives by the number of true positives plus false negatives: recall = TP / (TP + FN).
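Both metrics are available directly in scikit-learn. As a quick illustrative sketch (the labels below are made up purely for demonstration):

from sklearn.metrics import precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]  # hypothetical ground-truth labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # hypothetical model predictions

# these predictions contain 3 true positives, 1 false positive, 1 false negative
print(precision_score(y_true, y_pred))  # TP / (TP + FP) = 3/4 = 0.75
print(recall_score(y_true, y_pred))     # TP / (TP + FN) = 3/4 = 0.75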

Now we will calculate precision and recall with a Python example using the Iris dataset.

from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
df = pd.read_csv("../input/iris/Iris.csv")
# display the first 5 rows
df.head()

   Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm      Species
0   1            5.1           3.5            1.4           0.2  Iris-setosa
1   2            4.9           3.0            1.4           0.2  Iris-setosa
2   3            4.7           3.2            1.3           0.2  Iris-setosa
3   4            4.6           3.1            1.5           0.2  Iris-setosa
4   5            5.0           3.6            1.4           0.2  Iris-setosa

# main characteristics of the dataframe
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             150 non-null    int64  
 1   SepalLengthCm  150 non-null    float64
 2   SepalWidthCm   150 non-null    float64
 3   PetalLengthCm  150 non-null    float64
 4   PetalWidthCm   150 non-null    float64
 5   Species        150 non-null    object 
dtypes: float64(4), int64(1), object(1)
memory usage: 7.2+ KB

The dataframe has 150 non-null rows and 6 variables, all of them in the right data type. The first variable, “Id”, is redundant and unnecessary for our analysis, so we can drop it and keep the rest of the variables.

# drop Id, axis = 1: tells python to drop the entire column
# Do not run this cell more than once
df = df.drop("Id", axis = 1)
df.head()

   SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm      Species
0            5.1           3.5            1.4           0.2  Iris-setosa
1            4.9           3.0            1.4           0.2  Iris-setosa
2            4.7           3.2            1.3           0.2  Iris-setosa
3            4.6           3.1            1.5           0.2  Iris-setosa
4            5.0           3.6            1.4           0.2  Iris-setosa
# summary statistics
df.describe()

From the summary statistics we can notice that sepal leaves are longer and wider than petal leaves.
# How many species in our dataframe?
# is the data balanced?
df["Species"].value_counts()
Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
Name: Species, dtype: int64

The data is clean and balanced, with exactly the same number of flowers per species: 50. But why do we care about the balance between the number of observations per class?

Imbalanced classifications pose a challenge for predictive modeling as most of the machine learning algorithms used for classification were designed around the assumption of an equal number of examples for each class.

For example, an imbalanced multiclass classification problem may have 80 percent examples in the first class, 18 percent in the second class, and 2 percent in a third class.

The minority class is harder to predict because there are few examples of this class, by definition. This means it is more challenging for a model to learn the characteristics of examples from this class, and to differentiate examples from this class from the majority class (or classes).

This is a problem because typically, the minority class is more important and therefore the problem is more sensitive to classification errors for the minority class than the majority class.

A more detailed explanation can be found here.
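To see concretely why balance matters, consider a small sketch (the 90/10 split below is made up): on an imbalanced dataset, a "model" that always predicts the majority class looks accurate while completely missing the minority class.

import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# hypothetical labels: 90 majority-class (0) and 10 minority-class (1) examples
y_true = np.array([0] * 90 + [1] * 10)
y_pred = np.zeros(100, dtype = int)  # always predict the majority class

print(accuracy_score(y_true, y_pred))  # 0.9 -- looks impressive
print(recall_score(y_true, y_pred))    # 0.0 -- every minority example is missed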

Data visualization and exploratory data analysis


This section focuses on how to produce and analyze charts that meet best practices in both academia and industry. We will try to meet the following criteria in each graph:

  1. Choose the right graph for the variable type: to display the distribution of a categorical variable we might opt for a count or bar plot, while for a continuous variable we might go with a histogram. If we want to study the distribution of a continuous variable for each class of a categorical variable, we can use box plots or a kde plot with the hue parameter, etc.
  2. Maximize the data-ink ratio: the ink used to display the data divided by the total ink used in the graph. Try not to use many colors without a good reason, and avoid background colors, borders, or any other unnecessary decoration.
  3. Use clear, well-written titles, labels, and tick marks.
fig, axes = plt.subplots(2, 2, figsize=(10,5), dpi = 100)
fig.suptitle('Distribution of (sepal length, sepal width, petal length, petal width) per Species')

# Distribution of sepal length per Species
# (seaborn renamed kdeplot's shade= parameter to fill=; shade was removed in v0.14)
sns.kdeplot(ax = axes[0,0], data = df, x = 'SepalLengthCm', hue = "Species", alpha = 0.5, fill = True)
axes[0,0].set_xlabel("Sepal Length CM")
axes[0,0].get_legend().remove()

# Distribution of sepal width per Species
sns.kdeplot(ax = axes[0,1], data = df, x = 'SepalWidthCm', hue = "Species", alpha = 0.5, fill = True)
axes[0,1].set_xlabel("Sepal width CM")
axes[0,1].get_legend().remove()

# Distribution of petal length per Species
sns.kdeplot(ax = axes[1,0], data = df, x = 'PetalLengthCm', hue = "Species", alpha = 0.5, fill = True)
axes[1,0].set_xlabel("Petal Length CM")
axes[1,0].get_legend().remove()

# Distribution of petal width per Species
sns.kdeplot(ax = axes[1,1], data = df, x = 'PetalWidthCm', hue = "Species", alpha = 0.5, fill = True)
axes[1,1].set_xlabel("Petal Width CM")

plt.tight_layout()

Main conclusions from the graph:

  1. Setosa is easily separable from the other species, this means that the model will be able to classify it accurately.
  2. Petal length and width are expected to be better predictors of species than sepal length and width.

Both conclusions can be seen in the following plots, where Setosa is clearly different from the other species, especially when it comes to its petal leaves: it has very small petal width and length compared to the other species.

# Scatter plot of petal length vs petal width
plt.figure(figsize = (7, 3), dpi = 100)
sns.scatterplot(data = df, x = 'PetalLengthCm', y = 'PetalWidthCm', hue = "Species")
plt.title("Species clusters based on Sepal length and width")
plt.xlabel("Petal Length Cm")
plt.ylabel("Petal Width Cm")
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()
# Scatter plot of sepal length vs sepal width
plt.figure(figsize = (7, 3), dpi = 100)
sns.scatterplot(data = df, x = 'SepalLengthCm', y = 'SepalWidthCm', hue = "Species")
plt.title("Species clusters based on Sepal length and width")
plt.xlabel("Sepal Length Cm")
plt.ylabel("Sepal Width Cm")
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()
#box plots
fig, axes = plt.subplots(2, 2, figsize=(10,5), dpi = 100)

# Sepal Length distribution (box plots show medians and quartiles, not means)
sns.boxplot(ax = axes[0,0], data = df, x = "Species", y = 'SepalLengthCm')
axes[0,0].set_xlabel(None)
axes[0,0].set_ylabel(None)
axes[0,0].set_title("Sepal Length")


# Sepal Width distribution
sns.boxplot(ax = axes[0,1], data = df, x = "Species", y = 'SepalWidthCm')
axes[0,1].set_xlabel(None)
axes[0,1].set_ylabel(None)
axes[0,1].set_title("Sepal Width")

# Petal Length distribution
sns.boxplot(ax = axes[1,0], data = df, x = "Species", y = 'PetalLengthCm')
axes[1,0].set_xlabel(None)
axes[1,0].set_ylabel(None)
axes[1,0].set_title("Petal Length")

# Petal Width distribution
sns.boxplot(ax = axes[1,1], data = df, x = "Species", y = 'PetalWidthCm')
axes[1,1].set_xlabel(None)
axes[1,1].set_ylabel(None)
axes[1,1].set_title("Petal Width")

plt.tight_layout()
plt.subplots_adjust(hspace=0.5)

Scatter and box plots confirm the aforementioned conclusion: Setosa is easily separable based on petal length and width.

# Correlation map
plt.figure(figsize = (8, 4), dpi = 100)
# numeric_only=True keeps the object-typed Species column out of the correlation
sns.heatmap(df.corr(numeric_only = True), annot = True, cmap = "viridis", vmin = -1, vmax = 1)
plt.title("Correlation map between variables")
plt.show()

Correlation is a statistical measure that expresses the extent to which two variables are linearly related (meaning they change together at a constant rate). It’s a common tool for describing simple relationships without making a statement about cause and effect.

Correlation coefficient ranges between -1 (perfect negative correlation) and 1 (perfect positive correlation). As you can notice, there is a strong positive correlation between petal width and length on one hand and sepal length on the other hand.
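If you want to check a single cell of the heatmap, the same coefficient can be computed directly from the dataframe (a minimal sketch using the columns defined above):

# Pearson correlation between petal length and petal width
r = df["PetalLengthCm"].corr(df["PetalWidthCm"])
print(round(r, 2))  # close to 1: a strong positive linear relationship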

Feature engineering: Data prep for the model


In this section we will make sure that the data is well prepared for training the model. We will:

  1. Separate the dependent variable from the independent ones.
  2. Perform a train/test split.
  3. Scale the data (feature scaling).
# 1. Separate the dependent variable from the independent ones.

X = df.drop("Species", axis = 1)
y = df["Species"]

Why train test split ?

We need to split the data into two parts:

  1. The training part: we will use it to train the model.
  2. The test part: this is unseen data (the model has never seen it before); we will use it to test the real performance of the model.

Why do we need to test on unseen data? Why not simply train the model on the whole dataset and then reuse some of it for evaluation? Because that would be like giving a student the answers before the exam: the model would already be familiar with the evaluation data, having seen it before, and would get a full mark. For the test to be real, the model has to be evaluated on unseen data.

# 2. Perform a train test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=101)

Why feature scaling?

Real-life datasets have many features with widely different ranges of values. For example, consider a house price prediction dataset: it will have features like the number of bedrooms and the square-foot area of the house.

As you can guess, the number of bedrooms will vary between, say, 1 and 5, while the square-foot area may range from 500 to 2000. This is a huge difference in the range of the two features.

Many machine learning algorithms that use Euclidean distance as a similarity metric will fail to give reasonable weight to the smaller-range feature, in this case the number of bedrooms, which in reality may turn out to be an important predictor.

To avoid this problem we need to scale the features so that they all have the same scale, i.e. the same range of values. We can normalize all features to the range (0, 1) with min-max scaling, or standardize them to have zero mean and unit variance.

The important thing to note here is that feature scaling does not affect the relative importance of features: scaled features still carry the same original information and importance relative to each other. A strawberry and an apple drawn at the same size are still a strawberry and an apple; they do not lose their meaning.
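Before scaling the iris features, here is a tiny sketch with made-up house-price numbers showing what standardization does to two features with very different ranges:

import numpy as np
from sklearn.preprocessing import StandardScaler

# hypothetical data: [number of bedrooms, square-foot area]
X_demo = np.array([[1, 500.0], [2, 900.0], [3, 1300.0], [5, 2000.0]])
X_demo_scaled = StandardScaler().fit_transform(X_demo)
print(X_demo_scaled.std(axis = 0))  # both columns now have unit standard deviation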

# 3. Feature scaling
from sklearn.preprocessing import StandardScaler # import the scaler
scaler = StandardScaler() # initiate it
Scaled_X_train = scaler.fit_transform(X_train) # fit the parameters and use them to transform the training data
Scaled_X_test = scaler.transform(X_test) #transform the test data

Have you noticed that we used .fit_transform() on the training data but only .transform() on the test data? We did that to avoid data leakage. Read more about it here.
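One way to see what is going on: after fitting, the scaler stores the training-set statistics, and .transform() reuses exactly those on any new data. A short sketch with the scaler fitted above:

print(scaler.mean_)   # per-feature means learned from X_train only
print(scaler.scale_)  # per-feature standard deviations learned from X_train only

# if we had called scaler.fit_transform(X_test) instead, the test set's own
# statistics would leak into the evaluation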

Model building

We will use logistic regression, but the same methodology can be applied to any other classifier.

# Logistic Regression
from sklearn.linear_model import LogisticRegression # import the classifier
log_model = LogisticRegression() #initiate it
log_model.fit(Scaled_X_train, y_train) #fit the model to the training data
LogisticRegression()
To evaluate the model we will use two tools:

  1. Accuracy score: the fraction of predictions our model got right (the number of correct predictions divided by the total number of predictions).
  2. Classification report: used to measure the quality of predictions from a classification algorithm. The report shows the main classification metrics (precision, recall, and f1-score) on a per-class basis. Precision: what percent of your positive predictions were correct? Recall: what percent of the actual positive cases did you catch? F1-score: the harmonic mean of precision and recall.

For more info, click here and here.
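In the binary case all of these metrics reduce to simple ratios over the four confusion-matrix counts. A sketch with made-up counts:

# hypothetical confusion-matrix counts
TP, FP, FN, TN = 40, 10, 5, 45

accuracy  = (TP + TN) / (TP + FP + FN + TN)  # 0.85
precision = TP / (TP + FP)                   # 0.80
recall    = TP / (TP + FN)                   # 0.89 (rounded)
f1        = 2 * precision * recall / (precision + recall)  # harmonic mean, ~0.84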

# creating predictions 
y_pred = log_model.predict(Scaled_X_test)
# import evaluation metrics
# (plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it)
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay, classification_report
# create the confusion matrix
confusion_matrix(y_test, y_pred)
array([[10,  0,  0],
       [ 0, 12,  0],
       [ 0,  1,  7]])
# plot the confusion matrix
fig, ax = plt.subplots(dpi = 120)
ConfusionMatrixDisplay.from_estimator(log_model, Scaled_X_test, y_test, ax = ax);
# measure the accuracy of our model
acc_score = accuracy_score(y_test, y_pred)
round(acc_score, 2)
0.97
# generate the classification report 
print(classification_report(y_test, y_pred)) # Hint: try it without using the print() method
               precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        10
Iris-versicolor       0.92      1.00      0.96        12
 Iris-virginica       1.00      0.88      0.93         8

       accuracy                           0.97        30
      macro avg       0.97      0.96      0.96        30
   weighted avg       0.97      0.97      0.97        30

As we expected, the model did a perfect job predicting Setosa. It only misclassified one observation as versicolor, where in fact it was virginica. Overall, the model's performance is near perfect.

Model optimization: hyperparameter tuning

Hyperparameter tuning is the process of determining the right combination of parameters that allows us to maximize model performance. We will try different values for each parameter and choose the ones that give us the best predictions.

# import GridSearchCV
from sklearn.model_selection import GridSearchCV 

# set the range of parameters
penalty = ['l1', 'l2', 'elasticnet']
C = np.logspace(0,20,50)
solver = ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']
multi_class = ['ovr', 'multinomial']
l1_ratio = np.linspace(0, 1, 20)

# build the parameter grid
param_grid = {
    'penalty': penalty,
    'C': C,
    'solver': solver,
    'multi_class': multi_class,
    'l1_ratio': l1_ratio
}

# initiate and fit the Grid Search Model
grid_model = GridSearchCV(log_model, param_grid = param_grid)
grid_model.fit(Scaled_X_train, y_train)
GridSearchCV(estimator=LogisticRegression(),
             param_grid={'C': array([1.00000000e+00, 2.55954792e+00, 6.55128557e+00, 1.67683294e+01,
       4.29193426e+01, 1.09854114e+02, 2.81176870e+02, 7.19685673e+02,
       1.84206997e+03, 4.71486636e+03, 1.20679264e+04, 3.08884360e+04,
       7.90604321e+04, 2.02358965e+05, 5.17947468e+05, 1.32571137e+06,
       3.39322177e+06, 8.68511374e+06, 2.22299648e+0...
                         'l1_ratio': array([0.        , 0.05263158, 0.10526316, 0.15789474, 0.21052632,
       0.26315789, 0.31578947, 0.36842105, 0.42105263, 0.47368421,
       0.52631579, 0.57894737, 0.63157895, 0.68421053, 0.73684211,
       0.78947368, 0.84210526, 0.89473684, 0.94736842, 1.        ]),
                         'multi_class': ['ovr', 'multinomial'],
                         'penalty': ['l1', 'l2', 'elasticnet'],
                         'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag',
                                    'saga']})
# best parameters 
grid_model.best_params_
{'C': 16.768329368110084,
 'l1_ratio': 0.10526315789473684,
 'multi_class': 'multinomial',
 'penalty': 'l1',
 'solver': 'saga'}
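GridSearchCV refits the best combination on the whole training set, so grid_model can be used directly for prediction. The cross-validated score of the winning combination is stored as well (a short follow-up sketch):

print(grid_model.best_score_)          # mean cross-validated accuracy of the best combination
best_clf = grid_model.best_estimator_  # the refitted LogisticRegression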

Model re-evaluation

We will evaluate the optimized version of our model and see if it does better than the base model.

# creating predictions 
y_pred = grid_model.predict(Scaled_X_test)

# plot the confusion matrix
fig, ax = plt.subplots(dpi = 120)
ConfusionMatrixDisplay.from_estimator(grid_model, Scaled_X_test, y_test, ax = ax);
# measure the accuracy of our model
acc_score = accuracy_score(y_test, y_pred)
round(acc_score, 2)
1.0
# generate the classification report 
print(classification_report(y_test, y_pred)) # Hint: try it without using the print() method
                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        10
Iris-versicolor       1.00      1.00      1.00        12
 Iris-virginica       1.00      1.00      1.00         8

       accuracy                           1.00        30
      macro avg       1.00      1.00      1.00        30
   weighted avg       1.00      1.00      1.00        30

The optimized model did a perfect job: it correctly classified all the examples in the test data. Accuracy improved from 97 percent for the base model to 100 percent for the optimized model.

Congratulations! You have made it to the end of the tutorial. Please leave your feedback and suggestions for improvement.

For more Python-related blogs, visit Geekycodes and follow us on Instagram.

If you’re a college student with programming skills and want to earn through blogging, mail us at geekycomail@gmail.com.

