Machine Learning

How to handle categorical data in machine learning

Understanding Categorical Data and its Importance in Machine Learning

Categorical data is a type of data that can be divided into distinct groups or categories. In machine learning, it is common to encounter categorical data in the form of labels, such as a classification problem where the output is a set of predefined categories. Handling categorical data is an important step in preprocessing your data for machine learning, as the algorithms used in machine learning often require numerical input. One of the most common ways to handle categorical data is through encoding. Encoding involves converting categorical data into a numerical…
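As a minimal sketch of the encoding idea above (the toy `color` column and its values are invented for illustration), the two most common approaches look like this in pandas:

```python
import pandas as pd

# A small illustrative dataset (hypothetical values)
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One-hot encoding: one binary column per category
one_hot = pd.get_dummies(df["color"], prefix="color")
print(one_hot)

# Label encoding: map each category to an integer code
codes = df["color"].astype("category").cat.codes
print(codes.tolist())  # categories are coded in alphabetical order
```

One-hot encoding avoids implying an order between categories, while integer codes are more compact but can mislead distance-based algorithms.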
Random Forest Algorithm

Random Forest is a robust machine learning algorithm that is used for both classification and regression tasks. It is a type of ensemble learning method, which means that it combines multiple decision trees to create a more accurate and stable model. The mathematical intuition behind Random Forest is rooted in the concept of decision trees and bagging. A decision tree is a tree-like structure in which the internal nodes represent the feature(s) of the data, the branches represent the decision based on those features, and the leaves represent the output or class label. Each internal node in a decision tree represents…
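The bagging idea described above can be sketched with scikit-learn; note that the synthetic dataset and every parameter value here are illustrative assumptions, not taken from the article:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic toy data (illustrative only)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Bagging in action: each of the 100 trees is fit on a bootstrap sample
# of the training rows, and their votes are combined at prediction time
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print(f"test accuracy: {accuracy:.2f}")
```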
Decision Tree

Decision tree algorithms are a type of supervised learning algorithm used to solve both regression and classification problems. The goal is to create a model that predicts the value of a target variable based on several input variables. Decision trees use a tree-like model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. The model is based on a decision tree that can be used to map out all possible outcomes of a decision. A decision tree algorithm works by breaking down a dataset into smaller and smaller subsets while at the same time, an…
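The recursive splitting described above can be made visible with a small sketch; the dataset, depth limit, and printed rule format are my own choices for readability, not from the article:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Fit a shallow tree so the learned splits stay readable
iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Each printed line is one node: a feature threshold (internal node)
# that splits the data into smaller subsets, or a class label (leaf)
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```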
Support Vector Machine

Support Vector Machines (SVM) is a supervised machine learning algorithm that can be used for classification or regression tasks. The goal of the SVM algorithm is to find the hyperplane in an N-dimensional space that maximally separates the two classes.

Mathematical Intuition

Imagine we have two classes of data points, represented by circles and rectangles. The SVM algorithm will…
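The two-cluster picture above can be sketched numerically; the cluster positions, spread, and seed below are illustrative assumptions standing in for the article's circles and rectangles:

```python
import numpy as np
from sklearn.svm import SVC

# Two well-separated clusters playing the role of "circles" and "rectangles"
rng = np.random.default_rng(0)
circles = rng.normal(loc=[0, 0], scale=0.5, size=(50, 2))
rectangles = rng.normal(loc=[4, 4], scale=0.5, size=(50, 2))
X = np.vstack([circles, rectangles])
y = np.array([0] * 50 + [1] * 50)

# A linear SVM finds the maximum-margin hyperplane w.x + b = 0;
# only the support vectors (points nearest the margin) determine it
clf = SVC(kernel="linear")
clf.fit(X, y)
print("w =", clf.coef_[0], "b =", clf.intercept_[0])
print("number of support vectors:", len(clf.support_vectors_))
```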
Steps to Create a Tensorflow Model

There are 3 fundamental steps to creating a model:

1. Create a Model -> Connect the layers of the NN yourself by using the Sequential or Functional API, or import a previously built model (Transfer Learning)
2. Compile a Model -> Define how a model's performance should be measured (metrics) and how to improve it by using an optimizer (Adam, SGD, etc.)
3. Fit a Model -> The model tries to find a pattern in the data.

Sequential and Functional API

Sequential Model: A Sequential model is appropriate for a plain stack of layers where each layer has exactly one input tensor and one output tensor. A Sequential model is not…
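The three steps above can be sketched end to end with the Sequential API; the toy data, layer sizes, and epoch count here are illustrative choices, not the article's:

```python
import numpy as np
import tensorflow as tf

# Toy regression data: learn y = 2x (values are illustrative only)
X = np.arange(0, 10, dtype=np.float32).reshape(-1, 1)
y = 2 * X

# 1. Create a model: a plain stack of layers via the Sequential API
model = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1),
])

# 2. Compile the model: pick the loss, the optimizer, and the metrics to track
model.compile(optimizer=tf.keras.optimizers.Adam(), loss="mse", metrics=["mae"])

# 3. Fit the model: it searches for a pattern in the data
history = model.fit(X, y, epochs=5, verbose=0)
print("final training loss:", history.history["loss"][-1])
```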
How to deal with outliers

In this Notebook, we will describe how to deal with outliers.

#Importing the dataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_boston
import warnings
warnings.filterwarnings('ignore')

boston = load_boston() #it is stored as a dictionary
df = pd.DataFrame(boston['data'], columns=boston['feature_names'])
df.head()

sns.distplot(df['RM']) #as we can see, there are outliers
sns.boxplot(df['RM'])

#Trimming outliers from the dataset
def outliers(data):
    IQR = data.quantile(0.75) - data.quantile(0.25)
    lr = data.quantile(0.25) - (1.5 * IQR) #lower range
    hr = data.quantile(0.75) + (1.5 * IQR) #higher range
    return data.loc[~(np.where(data < lr, True, np.where(data > hr, True, False)))]

outliers(df['RM']) #as we can see, there are no outliers left
sns.boxplot(outliers(df['RM']))

#We can also find outliers using the mean and standard deviation instead of the IQR
def outliers(data, k):
    lr = data.mean() - (data.std() * k) #where k is the number of standard deviations
    hr = data.mean() + (data.std() * k)…
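The mean/standard-deviation function in the excerpt is cut off mid-definition; a hedged completion of that approach might look like the sketch below. Since `load_boston` was removed in scikit-learn 1.2, this sketch uses synthetic data with two planted outliers instead:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a column like df['RM']: 500 well-behaved values
# plus two planted outliers (12.0 and -3.0)
rng = np.random.default_rng(0)
s = pd.Series(np.concatenate([rng.normal(6, 0.5, 500), [12.0, -3.0]]))

def outliers_std(data, k=3):
    """Keep only the values within k standard deviations of the mean."""
    lr = data.mean() - data.std() * k  # lower range
    hr = data.mean() + data.std() * k  # higher range
    return data.loc[(data >= lr) & (data <= hr)]

trimmed = outliers_std(s, k=3)
print(len(s), "->", len(trimmed))  # the planted outliers are dropped
```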
What is data leakage in Machine Learning

When training a machine learning model, we normally prefer a generalized model that performs well on both the training and validation/test data. However, there can be a situation where the model performs well during testing but fails to achieve the same level of performance on real-world (production) data. For example, your model gives 95% accuracy on test data, but as soon as it is productionized and acts on real data, it fails to achieve the same or even similar performance. Such a discrepancy between test performance and real-world performance is often a sign of leakage. What is Train/Test bleed?…
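One common source of such leakage is fitting preprocessing (e.g. a scaler) on the full dataset before splitting, so the test rows influence the training transform. A minimal sketch of the leak-free ordering, on an invented synthetic dataset:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (illustrative only)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Leak-free ordering: split FIRST, then fit all preprocessing
# on the training rows only (the pipeline enforces this)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)  # scaler statistics come from X_train alone
acc = model.score(X_test, y_test)
print("test accuracy:", acc)
```

Calling `StandardScaler().fit(X)` on the whole dataset before splitting would be the leaky variant: the scaler would then "know" the test distribution.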
How to do Feature Encoding and Exploratory Data Analysis

Categorical variables are those values that are selected from a group of categories or labels. For example, the variable Gender with the values of male or female is categorical, and so is the variable marital status with the values of never married, married, divorced, or widowed. In some categorical variables, the labels have an intrinsic order, for example, in the variable Student's grade, the values of A, B, C, or Fail are ordered, A being the highest grade and Fail the lowest. These are called ordinal categorical variables. Variables in which the categories do not have an intrinsic order are…
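The ordinal case above can be encoded so that the numeric codes preserve the order; this sketch reuses the text's Student's grade example, with the rank mapping being my assumed ordering:

```python
import pandas as pd

# Ordinal variable: grades have an intrinsic order (hypothetical data)
grades = pd.Series(["A", "C", "B", "Fail", "A"])
order = ["Fail", "C", "B", "A"]  # lowest to highest

# Map each label to its rank so the numeric codes preserve the ordering
encoded = grades.map({label: rank for rank, label in enumerate(order)})
print(encoded.tolist())  # -> [3, 1, 2, 0, 3]
```

For nominal variables (no intrinsic order), one-hot encoding is usually preferred so no spurious ordering is introduced.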
8 Essential Machine Learning Terms You must Know

Data Wrangling

Data Wrangling is the process of gathering, selecting, cleaning, structuring, and enriching raw data into the desired format for better decision-making in less time. If you want to create an efficient ETL (Extract, Transform, and Load) pipeline or create beautiful data visualizations, you should be prepared to do a lot of data wrangling.

Data Imputation

Data Imputation is the substitution of estimated values for missing or inconsistent data items (fields). The substituted values are intended to create a data record that does not fail edits. The most common technique is mean imputation, where you take the mean of the existing…
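Mean imputation, as defined above, can be sketched in one line with pandas (the numbers below are invented for illustration):

```python
import numpy as np
import pandas as pd

# A column with missing entries (illustrative data)
s = pd.Series([10.0, np.nan, 14.0, np.nan, 12.0])

# Mean imputation: replace every missing value with the mean
# of the observed values (here, mean of 10, 14, 12 = 12)
imputed = s.fillna(s.mean())
print(imputed.tolist())  # -> [10.0, 12.0, 14.0, 12.0, 12.0]
```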
What are Bias and Variance in Machine Learning

As machine learning is increasingly used in applications, machine learning algorithms have gained more scrutiny. With larger data sets, various implementations, algorithms, and learning requirements, it has become even more complex to create and evaluate ML models, since all those factors directly impact the overall accuracy and learning outcome of the model. This is further skewed by false assumptions, noise, and outliers. Machine learning models cannot be black boxes. The user needs to be fully aware of their data and algorithms to trust the outputs and outcomes. Any issues in the algorithm or polluted data set can negatively impact the ML model. The main…