CONTEXT
Access to safe drinking-water is essential to health, a basic human right and a component of effective policy for health protection. This is important as a health and development issue at a national, regional and local level. In some regions, it has been shown that investments in water supply and sanitation can yield a net economic benefit, since the reductions in adverse health effects and health care costs outweigh the costs of undertaking the interventions.
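import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the water potability dataset and preview the first rows
dataframed = pd.read_csv('/kaggle/input/water-potability/water_potability.csv')
dataframed.head()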
pH | Hardness | Solids | Chloramines | Sulfate | Conductivity | Organic_carbon | Trihalomethanes | Turbidity | Potability |
---|---|---|---|---|---|---|---|---|---|
NaN | 204.890455 | 20791.318981 | 7.300212 | 368.516441 | 564.308654 | 10.379783 | 86.990970 | 2.963135 | 0 |
3.716080 | 129.422921 | 18630.057858 | 6.635246 | NaN | 592.885359 | 15.180013 | 56.329076 | 4.500656 | 0 |
8.099124 | 224.236259 | 19909.541732 | 9.275884 | NaN | 418.606213 | 16.868637 | 66.420093 | 3.055934 | 0 |
8.316766 | 214.373394 | 22018.417441 | 8.059332 | 356.886136 | 363.266516 | 18.436524 | 100.341674 | 4.628771 | 0 |
9.092223 | 181.101509 | 17978.986339 | 6.546600 | 310.135738 | 398.410813 | 11.558279 | 31.997993 | 4.075075 | 0 |
The Dataset consists of the following :
- 3276 rows
- 10 columns
print('Shape of the Dataset =', dataframed.shape)
Shape of the Dataset = (3276, 10)
METADATA
1. pH value: pH is an important parameter in evaluating the acid–base balance of water. It also indicates whether the water is acidic or alkaline. WHO recommends a permissible pH range of 6.5 to 8.5. The ranges in the current investigation were 6.52–6.83, which are within the WHO standards.
2. Hardness: Hardness is mainly caused by calcium and magnesium salts. These salts are dissolved from geologic deposits through which water travels. The length of time water is in contact with hardness-producing material helps determine how much hardness there is in raw water. Hardness was originally defined as the capacity of water to precipitate soap, which is caused by calcium and magnesium.
3. Solids (Total dissolved solids – TDS): Water can dissolve a wide range of inorganic and some organic minerals or salts such as potassium, calcium, sodium, bicarbonates, chlorides, magnesium, sulfates etc. These minerals produce an unwanted taste and a diluted color in the water. TDS is an important parameter for the use of water. A high TDS value indicates that the water is highly mineralized. The desirable limit for TDS is 500 mg/L and the maximum limit prescribed for drinking purposes is 1000 mg/L.
4. Chloramines: Chlorine and chloramine are the major disinfectants used in public water systems. Chloramines are most commonly formed when ammonia is added to chlorine to treat drinking water. Chlorine levels up to 4 milligrams per liter (mg/L), or 4 parts per million (ppm), are considered safe in drinking water.
5. Sulfate: Sulfates are naturally occurring substances that are found in minerals, soil, and rocks. They are present in ambient air, groundwater, plants, and food. The principal commercial use of sulfate is in the chemical industry. Sulfate concentration in seawater is about 2,700 milligrams per liter (mg/L). It ranges from 3 to 30 mg/L in most freshwater supplies, although much higher concentrations (1000 mg/L) are found in some geographic locations.
6. Conductivity: Pure water is not a good conductor of electric current; rather, it is a good insulator. An increase in ion concentration enhances the electrical conductivity of water. Generally, the amount of dissolved solids in water determines the electrical conductivity. Electrical conductivity (EC) measures the ionic process of a solution that enables it to transmit current. According to WHO standards, the EC value should not exceed 400 μS/cm.
7. Organic_carbon: Total Organic Carbon (TOC) in source waters comes from decaying natural organic matter (NOM) as well as synthetic sources. TOC is a measure of the total amount of carbon in organic compounds in pure water. According to the US EPA, TOC should be < 2 mg/L in treated / drinking water and < 4 mg/L in source water which is used for treatment.
8. Trihalomethanes: THMs are chemicals which may be found in water treated with chlorine. The concentration of THMs in drinking water varies according to the level of organic material in the water, the amount of chlorine required to treat the water, and the temperature of the water that is being treated. THM levels up to 80 ppb (μg/L) are considered safe in drinking water.
9. Turbidity: The turbidity of water depends on the quantity of solid matter present in the suspended state. It is a measure of the light-scattering properties of water, and the test is used to indicate the quality of waste discharge with respect to colloidal matter. The mean turbidity value obtained for Wondo Genet Campus (0.98 NTU) is lower than the WHO recommended value of 5.00 NTU.
10. Potability: Indicates if water is safe for human consumption where 1 means Potable and 0 means Not potable.
dataframed.columns.values.tolist()
['ph', 'Hardness', 'Solids', 'Chloramines', 'Sulfate', 'Conductivity', 'Organic_carbon', 'Trihalomethanes', 'Turbidity', 'Potability']
dataframed.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3276 entries, 0 to 3275
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   ph               2785 non-null   float64
 1   Hardness         3276 non-null   float64
 2   Solids           3276 non-null   float64
 3   Chloramines      3276 non-null   float64
 4   Sulfate          2495 non-null   float64
 5   Conductivity     3276 non-null   float64
 6   Organic_carbon   3276 non-null   float64
 7   Trihalomethanes  3114 non-null   float64
 8   Turbidity        3276 non-null   float64
 9   Potability       3276 non-null   int64
dtypes: float64(9), int64(1)
memory usage: 256.1 KB
dataframed.describe()
 | ph | Hardness | Solids | Chloramines | Sulfate | Conductivity | Organic_carbon | Trihalomethanes | Turbidity | Potability |
---|---|---|---|---|---|---|---|---|---|---|
count | 2785.000000 | 3276.000000 | 3276.000000 | 3276.000000 | 2495.000000 | 3276.000000 | 3276.000000 | 3114.000000 | 3276.000000 | 3276.000000 |
mean | 7.080795 | 196.369496 | 22014.092526 | 7.122277 | 333.775777 | 426.205111 | 14.284970 | 66.396293 | 3.966786 | 0.390110 |
std | 1.594320 | 32.879761 | 8768.570828 | 1.583085 | 41.416840 | 80.824064 | 3.308162 | 16.175008 | 0.780382 | 0.487849 |
min | 0.000000 | 47.432000 | 320.942611 | 0.352000 | 129.000000 | 181.483754 | 2.200000 | 0.738000 | 1.450000 | 0.000000 |
25% | 6.093092 | 176.850538 | 15666.690297 | 6.127421 | 307.699498 | 365.734414 | 12.065801 | 55.844536 | 3.439711 | 0.000000 |
50% | 7.036752 | 196.967627 | 20927.833607 | 7.130299 | 333.073546 | 421.884968 | 14.218338 | 66.622485 | 3.955028 | 0.000000 |
75% | 8.062066 | 216.667456 | 27332.762127 | 8.114887 | 359.950170 | 481.792304 | 16.557652 | 77.337473 | 4.500320 | 1.000000 |
max | 14.000000 | 323.124000 | 61227.196008 | 13.127000 | 481.030642 | 753.342620 | 28.300000 | 124.000000 | 6.739000 | 1.000000 |
dataframed.dtypes
ph                 float64
Hardness           float64
Solids             float64
Chloramines        float64
Sulfate            float64
Conductivity       float64
Organic_carbon     float64
Trihalomethanes    float64
Turbidity          float64
Potability           int64
dtype: object
dataframed.isnull().sum()
ph                 491
Hardness             0
Solids               0
Chloramines          0
Sulfate            781
Conductivity         0
Organic_carbon       0
Trihalomethanes    162
Turbidity            0
Potability           0
dtype: int64
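To put these counts in proportion, the short sketch below divides each non-zero count by the 3276 rows reported earlier. The dictionary only restates numbers from the output above; the variable names are illustrative.

```python
# Missing-value counts copied from the isnull().sum() output above
missing_counts = {"ph": 491, "Sulfate": 781, "Trihalomethanes": 162}
total_rows = 3276  # from dataframed.shape

# Share of rows with a missing value, as a percentage
missing_percent = {col: 100 * n / total_rows for col, n in missing_counts.items()}

for col, pct in missing_percent.items():
    print("{} : {:.2f}%".format(col, pct))
```

Sulfate has the largest share of missing values, which is why dropping incomplete rows would discard a sizeable part of the dataset.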
Columns with missing values :
- ph : 491 missing values
- Sulfate : 781 missing values
- Trihalomethanes : 162 missing values
import missingno as msno
msno.matrix(dataframed)
plt.show()

Missing Values : PH Value
Scientists measure the hardness of water using a pH scale, which measures the hydrogen-ion concentration in the liquid. Water with a low pH is more acidic, while water with a higher pH is harder or more alkaline, meaning it is able to neutralize acids.
The pH scale measures substances on a scale from 1 to 14, with 7 being neutral.
Reference : Howell, E. (2013, March 26). What makes water hard?. LiveScience. https://www.livescience.com/34462-water-hard-ph.html
print('Conditional Statements to fill in the Missing Values of PH Value Column')
print("\n")
# Selecting 'ph' as a Series makes .mean() return a plain number
print('if Potability = 0 and Hardness <= 150')
condition_1_mean_ph = dataframed[(dataframed['Potability'] == 0) & (dataframed['Hardness'] <= 150)]['ph'].mean()
print("PH VALUE : {:.4f}".format(condition_1_mean_ph))
print("\n")
print('if Potability = 0 and Hardness > 150')
condition_2_mean_ph = dataframed[(dataframed['Potability'] == 0) & (dataframed['Hardness'] > 150)]['ph'].mean()
print("PH VALUE : {:.4f}".format(condition_2_mean_ph))
print("\n")
# The two potable groups must filter on Potability == 1
print('if Potability = 1 and Hardness <= 150')
condition_3_mean_ph = dataframed[(dataframed['Potability'] == 1) & (dataframed['Hardness'] <= 150)]['ph'].mean()
print("PH VALUE : {:.4f}".format(condition_3_mean_ph))
print("\n")
print('if Potability = 1 and Hardness > 150')
condition_4_mean_ph = dataframed[(dataframed['Potability'] == 1) & (dataframed['Hardness'] > 150)]['ph'].mean()
print("PH VALUE : {:.4f}".format(condition_4_mean_ph))
Note : The output below was recorded when conditions 3 and 4 still filtered on Potability == 0, so the two Potability = 1 means repeat the Potability = 0 values.
Conditional Statements to fill in the Missing Values of PH Value Column
if Potability = 0 and Hardness <= 150
PH VALUE : 6.7220
if Potability = 0 and Hardness > 150
PH VALUE : 7.1125
if Potability = 1 and Hardness <= 150
PH VALUE : 6.7220
if Potability = 1 and Hardness > 150
PH VALUE : 7.1125
# Fill each missing pH with the mean of its (Potability, Hardness) group
missing_ph = dataframed['ph'].isnull()
not_potable = dataframed['Potability'] == 0
potable = dataframed['Potability'] == 1
soft = dataframed['Hardness'] <= 150
dataframed.loc[missing_ph & not_potable & soft, 'ph'] = condition_1_mean_ph
dataframed.loc[missing_ph & not_potable & ~soft, 'ph'] = condition_2_mean_ph
dataframed.loc[missing_ph & potable & soft, 'ph'] = condition_3_mean_ph
dataframed.loc[missing_ph & potable & ~soft, 'ph'] = condition_4_mean_ph
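The group-wise fill above can also be written with a pandas groupby. The following is a minimal sketch on a small toy DataFrame with made-up values, not the notebook's own method; the names `toy`, `is_hard` and `group_mean` are assumptions for illustration.

```python
import pandas as pd

# Toy data (hypothetical values) with the same column names as the dataset
toy = pd.DataFrame({
    "ph": [6.0, None, 8.0, None, 9.0, 7.0],
    "Hardness": [120.0, 140.0, 200.0, 210.0, 220.0, 130.0],
    "Potability": [0, 0, 1, 1, 1, 0],
})

# Boolean key for the Hardness threshold of 150 used above
is_hard = (toy["Hardness"] > 150).rename("is_hard")

# Mean pH of each (Potability, is_hard) group, aligned row by row; NaN is skipped
group_mean = toy.groupby(["Potability", is_hard])["ph"].transform("mean")

# Only the missing entries are replaced
toy["ph"] = toy["ph"].fillna(group_mean)
print(toy)
```

Because the group key is built from the data, each group is filtered exactly once, which avoids writing the four conditions by hand.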
Missing Values : Sulfate
Seawater contains roughly 2,700 milligrammes per litre (mg/L) of sulphate. Most freshwater supplies have concentrations of 3 to 30 mg/L, while some geographic regions have significantly greater quantities (1000 mg/L).
Note : The dataset only contains Sulfate values ranging from 129 to 481 mg/L, so the missing values are filled with the mean Sulfate of the not potable rows and of the potable rows respectively. The two means differ by about 2 mg/L.
print('Conditional Statements to fill in the Missing Values of Sulfate Column')
print("\n")
# Selecting 'Sulfate' as a Series makes .mean() return a plain number
print('if Potability = 0')
condition_1_mean_sulfate = dataframed[(dataframed['Potability'] == 0)]['Sulfate'].mean()
print("Sulfate : {:.4f}".format(condition_1_mean_sulfate))
print("\n")
print('if Potability = 1')
condition_2_mean_sulfate = dataframed[(dataframed['Potability'] == 1)]['Sulfate'].mean()
print("Sulfate : {:.4f}".format(condition_2_mean_sulfate))
Conditional Statements to fill in the Missing Values of Sulfate Column
if Potability = 0
Sulfate : 334.5643
if Potability = 1
Sulfate : 332.5670
# Fill each missing Sulfate with the mean of its Potability group
missing_sulfate = dataframed['Sulfate'].isnull()
dataframed.loc[missing_sulfate & (dataframed['Potability'] == 0), 'Sulfate'] = condition_1_mean_sulfate
dataframed.loc[missing_sulfate & (dataframed['Potability'] != 0), 'Sulfate'] = condition_2_mean_sulfate
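To see how small the gap between the two group means is, the sketch below compares it with the Sulfate standard deviation from the describe() output (computed before any values were filled). All three numbers are copied from the outputs above; the variable names are illustrative.

```python
# Values copied from the outputs above
mean_not_potable = 334.5643   # Sulfate mean where Potability = 0
mean_potable = 332.5670       # Sulfate mean where Potability = 1
sulfate_std = 41.416840       # std of Sulfate from dataframed.describe()

difference = mean_not_potable - mean_potable
difference_in_std = difference / sulfate_std

print("Difference : {:.4f} mg/L".format(difference))
print("Difference in standard deviations : {:.3f}".format(difference_in_std))
```

The difference is a small fraction of one standard deviation, so filling by group behaves almost like filling with the overall mean for this column.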
Missing Values : Trihalomethanes
THMs are chemicals which may be found in water treated with chlorine. The concentration of THMs in drinking water varies according to the level of organic material in the water, the amount of chlorine required to treat the water, and the temperature of the water that is being treated. THM levels up to 80 ppb (μg/L) are considered safe in drinking water.
# Assign the result back instead of using inplace = True on a selected column
dataframed['Trihalomethanes'] = dataframed['Trihalomethanes'].fillna(dataframed['Trihalomethanes'].mean())
EDA
Types of Rows :
- Overall Water Samples
- Potable Water Samples
- Not Potable Water Samples
Types of Rows : Overall Water Samples
The Range of Values among different 10 columns
print("The Range of Values among different 10 columns")
print("Ph : {:.4f}".format(min(dataframed['ph'])) + " - {:.4f}".format(max(dataframed['ph'])))
print('\n')
print("Hardness : {:.4f}".format(min(dataframed['Hardness'])) + " - {:.4f}".format(max(dataframed['Hardness'])))
print('\n')
print("Solids : {:.4f}".format(min(dataframed['Solids'])) + " - {:.4f}".format(max(dataframed['Solids'])))
print('\n')
print("Chloramines : {:.4f}".format(min(dataframed['Chloramines'])) + " - {:.4f}".format(max(dataframed['Chloramines'])))
print('\n')
print("Sulfate : {:.4f}".format(min(dataframed['Sulfate'])) + " - {:.4f}".format(max(dataframed['Sulfate'])))
print('\n')
print("Conductivity : {:.4f}".format(min(dataframed['Conductivity'])) + " - {:.4f}".format(max(dataframed['Conductivity'])))
print('\n')
print("Organic Carbon : {:.4f}".format(min(dataframed['Organic_carbon'])) + " - {:.4f}".format(max(dataframed['Organic_carbon'])))
print('\n')
print("Trihalomethanes : {:.4f}".format(min(dataframed['Trihalomethanes'])) + " - {:.4f}".format(max(dataframed['Trihalomethanes'])))
print('\n')
print("Turbidity : {:.4f}".format(min(dataframed['Turbidity'])) + " - {:.4f}".format(max(dataframed['Turbidity'])))
print('\n')
print("Potability : {:.4f}".format(min(dataframed['Potability'])) + " - {:.4f}".format(max(dataframed['Potability'])))
The Range of Values among different 10 columns
Ph : 0.0000 - 14.0000
Hardness : 47.4320 - 323.1240
Solids : 320.9426 - 61227.1960
Chloramines : 0.3520 - 13.1270
Sulfate : 129.0000 - 481.0306
Conductivity : 181.4838 - 753.3426
Organic Carbon : 2.2000 - 28.3000
Trihalomethanes : 0.7380 - 124.0000
Turbidity : 1.4500 - 6.7390
Potability : 0.0000 - 1.0000
Distribution of Values among different 10 columns
plt.figure(figsize=(16,14))
# One density plot per column on a 4 x 3 grid
for i,col in enumerate(dataframed.columns):
    plt.subplot(4,3,i+1)
    sns.kdeplot(data=dataframed[col])
plt.tight_layout()

Pearson correlation coefficient Heatmap
The Pearson correlation coefficient ranges from -1 to 1 and expresses the strength and direction of the linear relationship between two variables.
sns.set(rc={'figure.figsize':(11.7,8.27)})
Heated = sns.heatmap(dataframed.corr("pearson"),vmin=-1, vmax=1,cmap='viridis',annot=True, square=True)
Heated.set(xlabel = "Pearson Correlation Coefficient Heatmap")
[Text(0.5, 56.016874999999985, 'Pearson Correlation Coefficient Heatmap')]
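As a small illustration of what each cell of the heatmap holds, the sketch below computes the Pearson coefficient by hand for two toy arrays and compares it with NumPy's built-in result. The data is made up and the variable names are assumptions.

```python
import numpy as np

# Toy measurements (hypothetical values)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

# Pearson r = sum of centered products / sqrt(product of centered sums of squares)
x_centered = x - x.mean()
y_centered = y - y.mean()
r_manual = (x_centered * y_centered).sum() / np.sqrt(
    (x_centered ** 2).sum() * (y_centered ** 2).sum()
)

# NumPy's correlation matrix; entry [0, 1] is r between x and y
r_numpy = np.corrcoef(x, y)[0, 1]

print("manual : {:.6f}".format(r_manual))
print("numpy  : {:.6f}".format(r_numpy))
```

A value near 1 or -1 indicates a strong linear relationship, while a value near 0 indicates little or no linear relationship.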

sns.pairplot(dataframed,hue='Potability')
<seaborn.axisgrid.PairGrid at 0x7aaf20f258d0>

Check for Outliers among the different 10 columns
plt.figure(figsize=(16,14))
# One box plot per column on a 4 x 3 grid
for i,col in enumerate(dataframed.columns):
    plt.subplot(4,3,i+1)
    sns.boxplot(data=dataframed[col])
plt.tight_layout()

Distribution of rows in terms of PH Level
count_plotted = sns.histplot(x = 'ph', data = dataframed, kde = True, color = 'dodgerblue')
count_plotted.set(xlabel = "PH Level")
count_plotted.set(xlim = (0.0000, 14.0000))
[(0.0, 14.0)]

Outliers : Ph Level
boxxer_plot = sns.boxplot(x = 'ph', data = dataframed, color = 'dodgerblue')
boxxer_plot.set(xlim = (0.0000, 14.0000))
boxxer_plot.set(xlabel = 'PH Level')
[Text(0.5, 0, 'PH Level')]

How Many Outliers in Ph Level Column?
numpy_array = np.array(dataframed['ph'])
percentile_1, percentile_2 = np.percentile(numpy_array, [25, 75])
bounds = percentile_2 - percentile_1
lower_bound = percentile_1 - 1.5 * bounds
upper_bound = percentile_2 + 1.5 * bounds
outliers = numpy_array[(numpy_array < lower_bound) | (numpy_array > upper_bound)]
print("The Outliers are values less than {:.4f}".format(lower_bound), "or greater than {:.4f}".format(upper_bound))
print("There are" ,len(outliers), "outliers.")
The Outliers are values less than 3.8891 or greater than 10.2586
There are 142 outliers.
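The same quartile steps are repeated for each column, so they can be wrapped in one function. The following is a minimal sketch of the 1.5 * IQR rule used above; the function name `iqr_outliers` and the toy data are assumptions for illustration.

```python
import numpy as np

def iqr_outliers(values):
    """Return (lower_bound, upper_bound, outliers) using the 1.5 * IQR rule."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1  # interquartile range
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    outliers = values[(values < lower_bound) | (values > upper_bound)]
    return lower_bound, upper_bound, outliers

# Toy data (hypothetical values) with one obvious outlier
lower, upper, found = iqr_outliers([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])
print(lower, upper, found)
```

With the notebook's data, a call such as `iqr_outliers(dataframed['ph'])` would return the bounds and the outlying values in one step.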
Distribution of rows in terms of Hardness
count_plotted = sns.histplot(x = 'Hardness', data = dataframed, kde = True, color = 'dodgerblue')
count_plotted.set(xlabel = "Hardness")
count_plotted.set(xlim = (47.4320, 323.1240))
[(47.432, 323.124)]

Outliers : Hardness
boxxer_plot = sns.boxplot(x = 'Hardness', data = dataframed, color = 'dodgerblue')
boxxer_plot.set(xlim = (47.4320, 323.1240))
[(47.432, 323.124)]

How Many Outliers in Hardness Column?
numpy_array = np.array(dataframed['Hardness'])
percentile_1, percentile_2 = np.percentile(numpy_array, [25, 75])
bounds = percentile_2 - percentile_1
lower_bound = percentile_1 - 1.5 * bounds
upper_bound = percentile_2 + 1.5 * bounds
outliers = numpy_array[(numpy_array < lower_bound) | (numpy_array > upper_bound)]
print("The Outliers are values less than {:.4f}".format(lower_bound), "or greater than {:.4f}".format(upper_bound))
print("There are" ,len(outliers), "outliers.")
The Outliers are values less than 117.1252 or greater than 276.3928
There are 83 outliers.
Column : Solids (Total dissolved solids – TDS)
Distribution of rows in terms of Solids
count_plotted = sns.histplot(x = 'Solids', data = dataframed, kde = True, color = 'dodgerblue')
count_plotted.set(xlabel = "Solids")
count_plotted.set(xlim = (320.9426, 61227.1960))
[(320.9426, 61227.196)]

Outliers : Solids
boxxer_plot = sns.boxplot(x = 'Solids', data = dataframed, color = 'dodgerblue')
boxxer_plot.set(xlim = (320.9426, 61227.1960))
[(320.9426, 61227.196)]

How Many Outliers in Solids Column?
numpy_array = np.array(dataframed['Solids'])
percentile_1, percentile_2 = np.percentile(numpy_array, [25, 75])
bounds = percentile_2 - percentile_1
lower_bound = percentile_1 - 1.5 * bounds
upper_bound = percentile_2 + 1.5 * bounds
outliers = numpy_array[(numpy_array < lower_bound) | (numpy_array > upper_bound)]
print("The Outliers are greater than {:.4f}".format(upper_bound))
print("There are" ,len(outliers), "outliers.")
The Outliers are greater than 44831.8699
There are 47 outliers.
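Only an upper limit is printed for Solids. The sketch below redoes the arithmetic with the Solids quartiles from the describe() output (Solids has no missing values, so these quartiles apply) and shows that the lower bound is negative. The variable names are illustrative.

```python
# Quartiles of Solids copied from the dataframed.describe() output
q1 = 15666.690297
q3 = 27332.762127

iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

print("Lower bound : {:.4f}".format(lower_bound))
print("Upper bound : {:.4f}".format(upper_bound))
# The smallest Solids value is 320.9426, so no row can fall below a negative bound
print("Lower bound is negative :", lower_bound < 0)
```

Since every Solids value is positive, all 47 outliers lie above the upper bound.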
Distribution of rows in terms of Chloramines
count_plotted = sns.histplot(x = 'Chloramines', data = dataframed, kde = True, color = 'dodgerblue')
count_plotted.set(xlabel = "Chloramines")
count_plotted.set(xlim = (0.3520, 13.1270))
[(0.352, 13.127)]

How Many Outliers in Chloramines Column?
numpy_array = np.array(dataframed['Chloramines'])
percentile_1, percentile_2 = np.percentile(numpy_array, [25, 75])
bounds = percentile_2 - percentile_1
lower_bound = percentile_1 - 1.5 * bounds
upper_bound = percentile_2 + 1.5 * bounds
outliers = numpy_array[(numpy_array < lower_bound) | (numpy_array > upper_bound)]
print("The Outliers are values less than {:.4f}".format(lower_bound), "or greater than {:.4f}".format(upper_bound))
print("There are" ,len(outliers), "outliers.")
The Outliers are values less than 3.1462 or greater than 11.0961
There are 61 outliers.
Distribution of rows in terms of Sulfate
count_plotted = sns.histplot(x = 'Sulfate', data = dataframed, kde = True, color = 'dodgerblue')
count_plotted.set(xlabel = "Sulfate")
count_plotted.set(xlim = (129.0000, 481.0306))
[(129.0, 481.0306)]

Defining X
Should contain the following :
- PH VALUE
- HARDNESS
- SOLIDS
- CHLORAMINES
- SULFATE
- CONDUCTIVITY
- ORGANIC_CARBON
- TRIHALOMETHANES
- TURBIDITY
X = dataframed.iloc[:, :-1]
X.head()
 | ph | Hardness | Solids | Chloramines | Sulfate | Conductivity | Organic_carbon | Trihalomethanes | Turbidity |
---|---|---|---|---|---|---|---|---|---|
0 | 7.112512 | 204.890455 | 20791.318981 | 7.300212 | 368.516441 | 564.308654 | 10.379783 | 86.990970 | 2.963135 |
1 | 3.716080 | 129.422921 | 18630.057858 | 6.635246 | 334.564290 | 592.885359 | 15.180013 | 56.329076 | 4.500656 |
2 | 8.099124 | 224.236259 | 19909.541732 | 9.275884 | 334.564290 | 418.606213 | 16.868637 | 66.420093 | 3.055934 |
3 | 8.316766 | 214.373394 | 22018.417441 | 8.059332 | 356.886136 | 363.266516 | 18.436524 | 100.341674 | 4.628771 |
4 | 9.092223 | 181.101509 | 17978.986339 | 6.546600 | 310.135738 | 398.410813 | 11.558279 | 31.997993 | 4.075075 |
Defining y
Should contain the following :
- POTABILITY
y = dataframed.iloc[:, -1:]
y.head()
Potability |
---|
0 |
0 |
0 |
0 |
0 |
Training and Testing
Supervised learning algorithms (AL) used :
- K-Nearest Neighbors
- Decision Tree
- Random Forest
- Logistic Regression
- Support Vector Machine
- Gaussian Naive Bayes
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score , classification_report, ConfusionMatrixDisplay,precision_score,recall_score, f1_score,roc_auc_score,roc_curve
Split The Data for Training and Testing Purposes
Test Sizes
- 20%
- 30%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
Shape of Data (20%)
print('Shape of Data (20%)')
print("X_train shape : ", X_train.shape)
print("y_train shape : ", y_train.shape)
print("X_test shape : ", X_test.shape)
print("y_test shape : ", y_test.shape)
Shape of Data (20%)
X_train shape : (2620, 9)
y_train shape : (2620, 1)
X_test shape : (656, 9)
y_test shape : (656, 1)
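The row counts can be reproduced with simple arithmetic. The sketch below assumes the test share is rounded up to a whole number of rows and the remainder goes to training, which is consistent with the shapes reported in this notebook; the helper `split_sizes` is hypothetical.

```python
import math

n_samples = 3276  # rows in the dataset

def split_sizes(n, test_size):
    """Row counts for a fractional test_size, rounding the test set up."""
    n_test = math.ceil(test_size * n)
    n_train = n - n_test
    return n_train, n_test

print(split_sizes(n_samples, 0.2))
print(split_sizes(n_samples, 0.3))
```

The first call corresponds to the 20% split and the second to the 30% split.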
AL : K-Nearest Neighbors (20%)
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 2)
knn.fit(X_train, y_train.values.ravel())
training_prediction = knn.predict(X_train)
testing_prediction = knn.predict(X_test)
#Training Metrics
training_accuracy = accuracy_score(y_train, training_prediction)
training_f1 = f1_score(y_train, training_prediction, average = 'weighted')
training_precision = precision_score(y_train, training_prediction, average = 'weighted')
training_recall = recall_score(y_train, training_prediction, average = 'weighted')
#Testing Metrics
testing_accuracy = accuracy_score(y_test, testing_prediction)
testing_f1 = f1_score(y_test, testing_prediction, average = 'weighted')
testing_precision = precision_score(y_test, testing_prediction, average = 'weighted')
testing_recall = recall_score(y_test, testing_prediction, average = 'weighted')
print('AL : K-Nearest Neighbors (20%)')
print('\n')
print('Training Model Performance Check')
print('Accuracy Score : {:.4f}'.format(training_accuracy))
print('F1 Score : {:.4f}'.format(training_f1))
print('Precision Score : {:.4f}'.format(training_precision))
print('Recall Score : {:.4f}'.format(training_recall))
print('\n')
print('Testing Model Performance Check')
print('Accuracy Score : {:.4f}'.format(testing_accuracy))
print('F1 Score : {:.4f}'.format(testing_f1))
print('Precision Score : {:.4f}'.format(testing_precision))
print('Recall Score : {:.4f}'.format(testing_recall))
AL : K-Nearest Neighbors (20%)
Training Model Performance Check
Accuracy Score : 0.7618
F1 Score : 0.7299
Precision Score : 0.8291
Recall Score : 0.7618
Testing Model Performance Check
Accuracy Score : 0.5899
F1 Score : 0.5284
Precision Score : 0.5284
Recall Score : 0.5899
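K-Nearest Neighbors relies on distances, and the columns here sit on very different scales (Solids is in the tens of thousands while Turbidity stays below 7). One common option, not used in this notebook, is to standardize the features first. The following is a minimal sketch on synthetic stand-in data; it makes no claim about the scores it would give on the water dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the water dataset): 9 features, binary target
X_demo, y_demo = make_classification(n_samples=300, n_features=9, random_state=42)
X_demo[:, 0] *= 10000  # exaggerate one feature's scale, similar to Solids vs. Turbidity

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# The scaler is fitted on the training split only, then applied to both splits
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=2))
scaled_knn.fit(X_tr, y_tr)

print("Test accuracy with scaling : {:.4f}".format(scaled_knn.score(X_te, y_te)))
```

Putting the scaler inside the pipeline keeps the test rows out of the scaling statistics.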
'''from sklearn.model_selection import GridSearchCV
leaf_size = list(range(1,50))
n_neighbors = list(range(1,30))
p=[1,2]
#Convert to dictionary
hyperparameters = dict(leaf_size=leaf_size, n_neighbors=n_neighbors, p=p)
#Create new KNN object
knn_2 = KNeighborsClassifier()
#Use GridSearch
clf = GridSearchCV(knn_2, hyperparameters, cv=10)
#Fit the model
best_model = clf.fit(X_train, y_train)
#Print The value of best Hyperparameters
print('Best leaf_size:', best_model.best_estimator_.get_params()['leaf_size'])
print('Best p:', best_model.best_estimator_.get_params()['p'])
print('Best n_neighbors:', best_model.best_estimator_.get_params()['n_neighbors'])
'''
AL : Decision Tree (20%)
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
Decision_Tree = DecisionTreeClassifier(max_depth = 5)
Decision_Tree.fit(X_train, y_train.values.ravel())
training_prediction = Decision_Tree.predict(X_train)
testing_prediction = Decision_Tree.predict(X_test)
# Visualize Decision Tree
plt.figure(figsize = (25,20))
tree.plot_tree(Decision_Tree,
feature_names = dataframed.columns.tolist()[:-1],
class_names = ["0", "1"],
filled = True,
precision = 5)
plt.show()
#Training Metrics
training_accuracy = accuracy_score(y_train, training_prediction)
training_f1 = f1_score(y_train, training_prediction, average = 'weighted')
training_precision = precision_score(y_train, training_prediction, average = 'weighted')
training_recall = recall_score(y_train, training_prediction, average = 'weighted')
#Testing Metrics
testing_accuracy = accuracy_score(y_test, testing_prediction)
testing_f1 = f1_score(y_test, testing_prediction, average = 'weighted')
testing_precision = precision_score(y_test, testing_prediction, average = 'weighted')
testing_recall = recall_score(y_test, testing_prediction, average = 'weighted')
print('AL : Decision Tree (20%)')
print('\n')
print('Training Model Performance Check')
print('Accuracy Score {:.4f}'.format(training_accuracy))
print('F1 Score {:.4f}'.format(training_f1))
print('Precision Score {:.4f}'.format(training_precision))
print('Recall Score {:.4f}'.format(training_recall))
print('\n')
print('Testing Model Performance Check')
print('Accuracy Score {:.4f}'.format(testing_accuracy))
print('F1 Score {:.4f}'.format(testing_f1))
print('Precision Score {:.4f}'.format(testing_precision))
print('Recall Score : {:.4f}'.format(testing_recall))

AL : Decision Tree (20%)
Training Model Performance Check
Accuracy Score 0.7485
F1 Score 0.7289
Precision Score 0.7650
Recall Score 0.7485
Testing Model Performance Check
Accuracy Score 0.7210
F1 Score 0.7005
Precision Score 0.7190
Recall Score : 0.7210
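In the K-Nearest Neighbors and Decision Tree results above, the Recall Score equals the Accuracy Score. This follows from average = 'weighted': each class's recall is weighted by its share of the true labels, so the sum reduces to the fraction of correct predictions. The sketch below checks this on toy labels with made-up values, using plain Python.

```python
from collections import Counter

# Toy labels (hypothetical)
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 0, 0]

n = len(y_true)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n

# Weighted recall: per-class recall, weighted by each class's share of the true labels
support = Counter(y_true)
weighted_recall = 0.0
for label, count in support.items():
    true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    weighted_recall += (count / n) * (true_positives / count)

print(accuracy, weighted_recall)
```

The weighted Recall Score therefore adds no information beyond Accuracy; the F1 and Precision scores are the ones that differ.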
AL : Random Forest (20%)
from sklearn.ensemble import RandomForestClassifier
Random_Forest = RandomForestClassifier()
Random_Forest.fit(X_train, y_train.values.ravel())
training_prediction = Random_Forest.predict(X_train)
testing_prediction = Random_Forest.predict(X_test)
#Training Metrics
training_accuracy = accuracy_score(y_train, training_prediction)
training_f1 = f1_score(y_train, training_prediction, average = 'weighted')
training_precision = precision_score(y_train, training_prediction, average = 'weighted')
training_recall = recall_score(y_train, training_prediction, average = 'weighted')
#Testing Metrics
testing_accuracy = accuracy_score(y_test, testing_prediction)
testing_f1 = f1_score(y_test, testing_prediction, average = 'weighted')
testing_precision = precision_score(y_test, testing_prediction, average = 'weighted')
testing_recall = recall_score(y_test, testing_prediction, average = 'weighted')
print('AL : Random Forest (20%)')
print('\n')
print('Training Model Performance Check')
print('Accuracy Score : {:.4f}'.format(training_accuracy))
print('F1 Score : {:.4f}'.format(training_f1))
print('Precision Score : {:.4f}'.format(training_precision))
print('Recall Score : {:.4f}'.format(training_recall))
print('\n')
print('Testing Model Performance Check')
print('Accuracy Score : {:.4f}'.format(testing_accuracy))
print('F1 Score : {:.4f}'.format(testing_f1))
print('Precision Score : {:.4f}'.format(testing_precision))
print('Recall Score : {:.4f}'.format(testing_recall))
AL : Random Forest (20%)
Training Model Performance Check
Accuracy Score : 1.0000
F1 Score : 1.0000
Precision Score : 1.0000
Recall Score : 1.0000
Testing Model Performance Check
Accuracy Score : 0.7591
F1 Score : 0.7488
Precision Score : 0.7570
Recall Score : 0.7591
AL : Logistic Regression (20%)
from sklearn.linear_model import LogisticRegression
Logistic_Regression = LogisticRegression()
Logistic_Regression.fit(X_train, y_train.values.ravel())
training_prediction = Logistic_Regression.predict(X_train)
testing_prediction = Logistic_Regression.predict(X_test)
# Training Metrics
training_accuracy = accuracy_score(y_train, training_prediction)
training_f1 = f1_score(y_train, training_prediction, average = 'weighted')
training_precision = precision_score(y_train, training_prediction, average = 'weighted')
training_recall = recall_score(y_train, training_prediction, average = 'weighted')
# Testing Metrics (stored in testing_* so the training metrics are not overwritten)
testing_accuracy = accuracy_score(y_test, testing_prediction)
testing_f1 = f1_score(y_test, testing_prediction, average = 'weighted')
testing_precision = precision_score(y_test, testing_prediction, average = 'weighted')
testing_recall = recall_score(y_test, testing_prediction, average = 'weighted')
print('AL : Logistic Regression (20%)')
print('\n')
print('Training Model Performance Check')
print('Accuracy Score : {:.4f}'.format(training_accuracy))
print('F1 Score : {:.4f}'.format(training_f1))
print('Precision Score : {:.4f}'.format(training_precision))
print('Recall Score : {:.4f}'.format(training_recall))
print('\n')
print('Testing Model Performance Check')
print('Accuracy Score : {:.4f}'.format(testing_accuracy))
print('F1 Score : {:.4f}'.format(testing_f1))
print('Precision Score : {:.4f}'.format(testing_precision))
print('Recall Score : {:.4f}'.format(testing_recall))
Note : The output below was recorded when the testing F1, Precision and Recall were stored in the training_* variables. The three values printed under Training are therefore the testing scores, and the three printed under Testing are carried over from the Random Forest cell.
AL : Logistic Regression (20%)
Training Model Performance Check
Accuracy Score : 0.6061
F1 Score : 0.4846
Precision Score : 0.3944
Recall Score : 0.6280
Testing Model Performance Check
Accuracy Score : 0.6280
F1 Score : 0.7488
Precision Score : 0.7570
Recall Score : 0.7591
AL : Support Vector Machine (20%)
from sklearn import svm
SVM = svm.SVC()
SVM.fit(X_train, y_train.values.ravel())
training_prediction = SVM.predict(X_train)
testing_prediction = SVM.predict(X_test)
# Training Metrics
training_accuracy = accuracy_score(y_train, training_prediction)
training_f1 = f1_score(y_train, training_prediction, average = 'weighted')
training_precision = precision_score(y_train, training_prediction, average = 'weighted')
training_recall = recall_score(y_train, training_prediction, average = 'weighted')
# Testing Metrics (stored in testing_* so the training metrics are not overwritten)
testing_accuracy = accuracy_score(y_test, testing_prediction)
testing_f1 = f1_score(y_test, testing_prediction, average = 'weighted')
testing_precision = precision_score(y_test, testing_prediction, average = 'weighted')
testing_recall = recall_score(y_test, testing_prediction, average = 'weighted')
print('AL : Support Vector Machine (20%)')
print('\n')
print('Training Model Performance Check')
print('Accuracy Score : {:.4f}'.format(training_accuracy))
print('F1 Score : {:.4f}'.format(training_f1))
print('Precision Score : {:.4f}'.format(training_precision))
print('Recall Score : {:.4f}'.format(training_recall))
print('\n')
print('Testing Model Performance Check')
print('Accuracy Score : {:.4f}'.format(testing_accuracy))
print('F1 Score : {:.4f}'.format(testing_f1))
print('Precision Score : {:.4f}'.format(testing_precision))
print('Recall Score : {:.4f}'.format(testing_recall))
Note : The output below was recorded when the testing F1, Precision and Recall were stored in the training_* variables. The three values printed under Training are therefore the testing scores, and the three printed under Testing are carried over from the Random Forest cell.
AL : Support Vector Machine (20%)
Training Model Performance Check
Accuracy Score : 0.6053
F1 Score : 0.4846
Precision Score : 0.3944
Recall Score : 0.6280
Testing Model Performance Check
Accuracy Score : 0.6280
F1 Score : 0.7488
Precision Score : 0.7570
Recall Score : 0.7591
AL : Gaussian Naive Bayes (20%)
from sklearn.naive_bayes import GaussianNB
NB = GaussianNB()
NB.fit(X_train, y_train.values.ravel())
training_prediction = NB.predict(X_train)
testing_prediction = NB.predict(X_test)
# Training Metrics
training_accuracy = accuracy_score(y_train, training_prediction)
training_f1 = f1_score(y_train, training_prediction, average = 'weighted')
training_precision = precision_score(y_train, training_prediction, average = 'weighted')
training_recall = recall_score(y_train, training_prediction, average = 'weighted')
# Testing Metrics
testing_accuracy = accuracy_score(y_test, testing_prediction)
# Store the test-set scores in testing_* so the test report does not reuse stale values
testing_f1 = f1_score(y_test, testing_prediction, average = 'weighted')
testing_precision = precision_score(y_test, testing_prediction, average = 'weighted')
testing_recall = recall_score(y_test, testing_prediction, average = 'weighted')
print('AL : Gaussian Naive Bayes (20%)')
print('\n')
print('Training Model Performance Check')
print('Accuracy Score : {:.4f}'.format(training_accuracy))
print('F1 Score : {:.4f}'.format(training_f1))
print('Precision Score : {:.4f}'.format(training_precision))
print('Recall Score : {:.4f}'.format(training_recall))
print('\n')
print('Testing Model Performance Check')
print('Accuracy Score : {:.4f}'.format(testing_accuracy))
print('F1 Score : {:.4f}'.format(testing_f1))
print('Precision Score : {:.4f}'.format(testing_precision))
print('Recall Score : {:.4f}'.format(testing_recall))
AL : Gaussian Naive Bayes (20%)

Training Model Performance Check
Accuracy Score : 0.6302
F1 Score : 0.5825
Precision Score : 0.5981
Recall Score : 0.6296

Testing Model Performance Check
Accuracy Score : 0.6296
F1 Score : 0.7488
Precision Score : 0.7570
Recall Score : 0.7591
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)
Shape of Data (30%)
X_train shape : (2293, 9)
y_train shape : (2293, 1)
X_test shape : (983, 9)
y_test shape : (983, 1)
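The cell that printed the shape summary above is not shown. A minimal sketch of how such a summary could be produced follows. It runs on random stand-in data with the same row and column counts as the dataset, and every name ending in `_demo` is hypothetical rather than taken from the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in frames (assumption): random values shaped like the 3276 x 9 feature
# matrix and the single-column Potability target.
rng = np.random.default_rng(42)
X_demo = pd.DataFrame(rng.normal(size=(3276, 9)))
y_demo = pd.DataFrame({"Potability": rng.integers(0, 2, size=3276)})

# Same split settings as the 30% run: 30% of the rows are held out for testing.
X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42)

print('Shape of Data (30%)')
print('X_train shape :', X_train_demo.shape)
print('y_train shape :', y_train_demo.shape)
print('X_test shape :', X_test_demo.shape)
print('y_test shape :', y_test_demo.shape)
```

Because the target is kept as a one-column DataFrame, its shape is reported as `(rows, 1)`, which is why the model cells call `y_train.values.ravel()` before fitting.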
AL : K-Nearest Neighbors (30%)
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 2)
knn.fit(X_train, y_train.values.ravel())
training_prediction = knn.predict(X_train)
testing_prediction = knn.predict(X_test)
#Training Metrics
training_accuracy = accuracy_score(y_train, training_prediction)
training_f1 = f1_score(y_train, training_prediction, average = 'weighted')
training_precision = precision_score(y_train, training_prediction, average = 'weighted')
training_recall = recall_score(y_train, training_prediction, average = 'weighted')
#Testing Metrics
testing_accuracy = accuracy_score(y_test, testing_prediction)
testing_f1 = f1_score(y_test, testing_prediction, average = 'weighted')
testing_precision = precision_score(y_test, testing_prediction, average = 'weighted')
testing_recall = recall_score(y_test, testing_prediction, average = 'weighted')
print('AL : K-Nearest Neighbors (30%)')
print('\n')
print('Training Model Performance Check')
print('Accuracy Score : {:.4f}'.format(training_accuracy))
print('F1 Score : {:.4f}'.format(training_f1))
print('Precision Score : {:.4f}'.format(training_precision))
print('Recall Score : {:.4f}'.format(training_recall))
print('\n')
print('Testing Model Performance Check')
print('Accuracy Score : {:.4f}'.format(testing_accuracy))
print('F1 Score : {:.4f}'.format(testing_f1))
print('Precision Score : {:.4f}'.format(testing_precision))
print('Recall Score : {:.4f}'.format(testing_recall))
AL : K-Nearest Neighbors (30%)

Training Model Performance Check
Accuracy Score : 0.7505
F1 Score : 0.7150
Precision Score : 0.8236
Recall Score : 0.7505

Testing Model Performance Check
Accuracy Score : 0.6002
F1 Score : 0.5405
Precision Score : 0.5464
Recall Score : 0.6002
AL : Decision Tree (30%)
from sklearn.tree import DecisionTreeClassifier
Decision_Tree = DecisionTreeClassifier(max_depth = 5)
Decision_Tree.fit(X_train, y_train.values.ravel())
training_prediction = Decision_Tree.predict(X_train)
testing_prediction = Decision_Tree.predict(X_test)
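# Visualize Decision Tree
# (`tree` and `plt` are imported here in case an earlier cell did not already do so)
from sklearn import tree
import matplotlib.pyplot as plt
plt.figure(figsize = (25, 20))
tree.plot_tree(Decision_Tree,
               feature_names = dataframed.columns.tolist()[:-1],
               class_names = ["0", "1"],
               filled = True,
               precision = 5)
plt.show()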
#Training Metrics
training_accuracy = accuracy_score(y_train, training_prediction)
training_f1 = f1_score(y_train, training_prediction, average = 'weighted')
training_precision = precision_score(y_train, training_prediction, average = 'weighted')
training_recall = recall_score(y_train, training_prediction, average = 'weighted')
#Testing Metrics
testing_accuracy = accuracy_score(y_test, testing_prediction)
testing_f1 = f1_score(y_test, testing_prediction, average = 'weighted')
testing_precision = precision_score(y_test, testing_prediction, average = 'weighted')
testing_recall = recall_score(y_test, testing_prediction, average = 'weighted')
print('AL : Decision Tree (30%)')
print('\n')
print('Training Model Performance Check')
print('Accuracy Score {:.4f}'.format(training_accuracy))
print('F1 Score {:.4f}'.format(training_f1))
print('Precision Score {:.4f}'.format(training_precision))
print('Recall Score {:.4f}'.format(training_recall))
print('\n')
print('Testing Model Performance Check')
print('Accuracy Score {:.4f}'.format(testing_accuracy))
print('F1 Score {:.4f}'.format(testing_f1))
print('Precision Score {:.4f}'.format(testing_precision))
print('Recall Score : {:.4f}'.format(testing_recall))

AL : Decision Tree (30%)

Training Model Performance Check
Accuracy Score 0.7719
F1 Score 0.7660
Precision Score 0.7717
Recall Score 0.7719

Testing Model Performance Check
Accuracy Score 0.7406
F1 Score 0.7330
Precision Score 0.7352
Recall Score : 0.7406
AL : Random Forest (30%)
from sklearn.ensemble import RandomForestClassifier
Random_Forest = RandomForestClassifier()
Random_Forest.fit(X_train, y_train.values.ravel())
training_prediction = Random_Forest.predict(X_train)
testing_prediction = Random_Forest.predict(X_test)
#Training Metrics
training_accuracy = accuracy_score(y_train, training_prediction)
training_f1 = f1_score(y_train, training_prediction, average = 'weighted')
training_precision = precision_score(y_train, training_prediction, average = 'weighted')
training_recall = recall_score(y_train, training_prediction, average = 'weighted')
#Testing Metrics
testing_accuracy = accuracy_score(y_test, testing_prediction)
testing_f1 = f1_score(y_test, testing_prediction, average = 'weighted')
testing_precision = precision_score(y_test, testing_prediction, average = 'weighted')
testing_recall = recall_score(y_test, testing_prediction, average = 'weighted')
print('AL : Random Forest (30%)')
print('\n')
print('Training Model Performance Check')
print('Accuracy Score : {:.4f}'.format(training_accuracy))
print('F1 Score : {:.4f}'.format(training_f1))
print('Precision Score : {:.4f}'.format(training_precision))
print('Recall Score : {:.4f}'.format(training_recall))
print('\n')
print('Testing Model Performance Check')
print('Accuracy Score : {:.4f}'.format(testing_accuracy))
print('F1 Score : {:.4f}'.format(testing_f1))
print('Precision Score : {:.4f}'.format(testing_precision))
print('Recall Score : {:.4f}'.format(testing_recall))
AL : Random Forest (30%)

Training Model Performance Check
Accuracy Score : 1.0000
F1 Score : 1.0000
Precision Score : 1.0000
Recall Score : 1.0000

Testing Model Performance Check
Accuracy Score : 0.7762
F1 Score : 0.7671
Precision Score : 0.7757
Recall Score : 0.7762
AL : Logistic Regression (30%)
from sklearn.linear_model import LogisticRegression
Logistic_Regression = LogisticRegression()
Logistic_Regression.fit(X_train, y_train.values.ravel())
training_prediction = Logistic_Regression.predict(X_train)
testing_prediction = Logistic_Regression.predict(X_test)
# Training Metrics
training_accuracy = accuracy_score(y_train, training_prediction)
training_f1 = f1_score(y_train, training_prediction, average = 'weighted')
training_precision = precision_score(y_train, training_prediction, average = 'weighted')
training_recall = recall_score(y_train, training_prediction, average = 'weighted')
# Testing Metrics
testing_accuracy = accuracy_score(y_test, testing_prediction)
# Store the test-set scores in testing_* so the test report does not reuse stale values
testing_f1 = f1_score(y_test, testing_prediction, average = 'weighted')
testing_precision = precision_score(y_test, testing_prediction, average = 'weighted')
testing_recall = recall_score(y_test, testing_prediction, average = 'weighted')
print('AL : Logistic Regression (30%)')
print('\n')
print('Training Model Performance Check')
print('Accuracy Score : {:.4f}'.format(training_accuracy))
print('F1 Score : {:.4f}'.format(training_f1))
print('Precision Score : {:.4f}'.format(training_precision))
print('Recall Score : {:.4f}'.format(training_recall))
print('\n')
print('Testing Model Performance Check')
print('Accuracy Score : {:.4f}'.format(testing_accuracy))
print('F1 Score : {:.4f}'.format(testing_f1))
print('Precision Score : {:.4f}'.format(testing_precision))
print('Recall Score : {:.4f}'.format(testing_recall))
AL : Logistic Regression (30%)

Training Model Performance Check
Accuracy Score : 0.6027
F1 Score : 0.4841
Precision Score : 0.3940
Recall Score : 0.6277

Testing Model Performance Check
Accuracy Score : 0.6277
F1 Score : 0.7671
Precision Score : 0.7757
Recall Score : 0.7762
AL : Support Vector Machine (30%)
from sklearn import svm
SVM = svm.SVC()
SVM.fit(X_train, y_train.values.ravel())
training_prediction = SVM.predict(X_train)
testing_prediction = SVM.predict(X_test)
# Training Metrics
training_accuracy = accuracy_score(y_train, training_prediction)
training_f1 = f1_score(y_train, training_prediction, average = 'weighted')
training_precision = precision_score(y_train, training_prediction, average = 'weighted')
training_recall = recall_score(y_train, training_prediction, average = 'weighted')
# Testing Metrics
testing_accuracy = accuracy_score(y_test, testing_prediction)
# Store the test-set scores in testing_* so the test report does not reuse stale values
testing_f1 = f1_score(y_test, testing_prediction, average = 'weighted')
testing_precision = precision_score(y_test, testing_prediction, average = 'weighted')
testing_recall = recall_score(y_test, testing_prediction, average = 'weighted')
print('AL : Support Vector Machine (30%)')
print('\n')
print('Training Model Performance Check')
print('Accuracy Score : {:.4f}'.format(training_accuracy))
print('F1 Score : {:.4f}'.format(training_f1))
print('Precision Score : {:.4f}'.format(training_precision))
print('Recall Score : {:.4f}'.format(training_recall))
print('\n')
print('Testing Model Performance Check')
print('Accuracy Score : {:.4f}'.format(testing_accuracy))
print('F1 Score : {:.4f}'.format(testing_f1))
print('Precision Score : {:.4f}'.format(testing_precision))
print('Recall Score : {:.4f}'.format(testing_recall))
AL : Support Vector Machine (30%)

Training Model Performance Check
Accuracy Score : 0.6023
F1 Score : 0.4841
Precision Score : 0.3940
Recall Score : 0.6277

Testing Model Performance Check
Accuracy Score : 0.6277
F1 Score : 0.7671
Precision Score : 0.7757
Recall Score : 0.7762
AL : Gaussian Naive Bayes (30%)
from sklearn.naive_bayes import GaussianNB
NB = GaussianNB()
NB.fit(X_train, y_train.values.ravel())
training_prediction = NB.predict(X_train)
testing_prediction = NB.predict(X_test)
# Training Metrics
training_accuracy = accuracy_score(y_train, training_prediction)
training_f1 = f1_score(y_train, training_prediction, average = 'weighted')
training_precision = precision_score(y_train, training_prediction, average = 'weighted')
training_recall = recall_score(y_train, training_prediction, average = 'weighted')
# Testing Metrics
testing_accuracy = accuracy_score(y_test, testing_prediction)
# Store the test-set scores in testing_* so the test report does not reuse stale values
testing_f1 = f1_score(y_test, testing_prediction, average = 'weighted')
testing_precision = precision_score(y_test, testing_prediction, average = 'weighted')
testing_recall = recall_score(y_test, testing_prediction, average = 'weighted')
print('AL : Gaussian Naive Bayes (30%)')
print('\n')
print('Training Model Performance Check')
print('Accuracy Score : {:.4f}'.format(training_accuracy))
print('F1 Score : {:.4f}'.format(training_f1))
print('Precision Score : {:.4f}'.format(training_precision))
print('Recall Score : {:.4f}'.format(training_recall))
print('\n')
print('Testing Model Performance Check')
print('Accuracy Score : {:.4f}'.format(testing_accuracy))
print('F1 Score : {:.4f}'.format(testing_f1))
print('Precision Score : {:.4f}'.format(testing_precision))
print('Recall Score : {:.4f}'.format(testing_recall))
AL : Gaussian Naive Bayes (30%)

Training Model Performance Check
Accuracy Score : 0.6263
F1 Score : 0.5812
Precision Score : 0.6023
Recall Score : 0.6328

Testing Model Performance Check
Accuracy Score : 0.6328
F1 Score : 0.7671
Precision Score : 0.7757
Recall Score : 0.7762
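Every model above repeats the same fit, predict and score steps, so the whole routine could be wrapped in one function and looped over the classifiers. The sketch below is one possible way to do that, not the notebook's own code: `evaluate_model`, `models`, `results` and the `X_demo`/`y_demo` stand-in data from `make_classification` are all hypothetical names introduced for illustration, and the synthetic data only mimics the nine-feature layout of the water potability set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier


def evaluate_model(model, X_train, X_test, y_train, y_test):
    """Fit `model`, then return weighted metrics for the training and testing splits."""
    model.fit(X_train, y_train)
    report = {}
    for split, X_part, y_part in (("training", X_train, y_train),
                                  ("testing", X_test, y_test)):
        prediction = model.predict(X_part)
        report[split] = {
            "accuracy": accuracy_score(y_part, prediction),
            "f1": f1_score(y_part, prediction, average="weighted"),
            "precision": precision_score(y_part, prediction, average="weighted",
                                         zero_division=0),
            "recall": recall_score(y_part, prediction, average="weighted"),
        }
    return report


# Stand-in data (assumption): 9 synthetic features, not the real dataset.
X_demo, y_demo = make_classification(n_samples=500, n_features=9, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

# A few of the classifiers used above; the others can be added the same way.
models = {
    "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Gaussian Naive Bayes": GaussianNB(),
}

results = {name: evaluate_model(model, X_tr, X_te, y_tr, y_te)
           for name, model in models.items()}

for name, report in results.items():
    print('AL : {} (30%)'.format(name))
    for split in ("training", "testing"):
        print('{} Model Performance Check'.format(split.capitalize()))
        for metric, value in report[split].items():
            print('{} Score : {:.4f}'.format(metric.capitalize(), value))
```

Because each split gets its own dictionary, the testing scores cannot be overwritten by, or mixed up with, the training scores. With `average = 'weighted'`, the recall always equals the accuracy, which is why those two lines match in the reports above.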