The motivation behind random forests, and ensemble models in general, in layman's terms: say we have a problem to solve. We bring in 100 people, ask each of them the question, and record their solutions. Then we combine all 100 solutions into one aggregated answer.
We will often find that this aggregated solution is close to the actual solution. This is known as the "Wisdom of the crowd," and it is the motivation behind Random Forests.
We take weak learners (specifically, decision trees in the case of Random Forest) and aggregate their results to get good predictions while removing dependency on any particular set of features. For regression we take the mean of the trees' outputs, and for classification we take the majority vote of the classifiers.
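The two aggregation rules can be sketched in plain Python. The `tree_predictions` lists below are hypothetical outputs from five individual trees, just to show the mechanics:

```python
from collections import Counter
from statistics import mean

def aggregate_regression(tree_predictions):
    """Random forest regression: average the trees' numeric outputs."""
    return mean(tree_predictions)

def aggregate_classification(tree_predictions):
    """Random forest classification: majority vote over the trees' labels."""
    votes = Counter(tree_predictions)
    return votes.most_common(1)[0][0]

# Hypothetical outputs from five individual decision trees:
print(aggregate_regression([3.1, 2.9, 3.4, 3.0, 3.2]))                # mean of the five
print(aggregate_classification(["cat", "dog", "cat", "cat", "dog"]))  # majority label
```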
A random forest is generally better than a single decision tree. However, note that no algorithm is universally better than another; it always depends on the use case and the dataset (see the No Free Lunch Theorem). Reasons why random forests allow for stronger predictions than individual decision trees:
- Decision trees are prone to overfitting, whereas a random forest generalizes better on unseen data because it injects randomness both in feature selection and when sampling the data. As a result, random forests have lower variance than a single decision tree without substantially increasing the error due to bias.
- Generally, ensemble models like Random Forest perform better because they aggregate many models (decision trees, in the case of Random Forest), applying the concept of the "Wisdom of the crowd."
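The two sources of randomness mentioned above, sampling the data with replacement (bagging) and restricting each split to a random feature subset, can be sketched with the standard library. The rows and feature names are made-up illustrations:

```python
import random

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows WITH replacement: the bagging step."""
    return [rng.choice(rows) for _ in rows]

def random_feature_subset(feature_names, k, rng):
    """Each tree/split considers only k randomly chosen features."""
    return rng.sample(feature_names, k)

rng = random.Random(42)  # fixed seed so the sketch is reproducible
rows = [(5.1, 3.5, 0), (4.9, 3.0, 0), (6.2, 3.4, 1), (5.9, 3.0, 1)]
features = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

sample = bootstrap_sample(rows, rng)           # some rows repeat, some are left out
subset = random_feature_subset(features, 2, rng)
print(len(sample), subset)
```

Because every tree sees a different sample and a different feature subset, the trees make partly independent errors, which is exactly why averaging them reduces variance.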
Clean data is subjective.
Data quality must be defined first in order to have clean data.
Before you clean, ask:
🧭 How does the business define clean data?
🧭 What are the use cases of the data? Is specific cleaning required for each?
🧭 How do data quality standards change at each stage in a data process?
🧭 What is considered a missing value?
🧭 Are NULLs acceptable, and what do they mean?
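The missing-value questions above are not academic: the same column gives different missingness counts depending on the definition you agree on. A small sketch, with made-up sentinel values for illustration:

```python
# Hypothetical raw values for an "age" column, as they might arrive from an export.
ages = ["34", "", "N/A", None, "41", "-1", "29"]

# Definition 1: only NULL (None) counts as missing.
missing_strict = sum(1 for v in ages if v is None)

# Definition 2: NULLs, empty strings, and agreed-upon sentinels also count.
SENTINELS = {"", "N/A", "-1"}
missing_broad = sum(1 for v in ages if v is None or v in SENTINELS)

print(missing_strict, missing_broad)  # the two definitions disagree: 1 vs 4
```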
Jumping into cleaning without first defining clean data? A quick way to make data dirtier. Feeding it into ML models only compounds the problem.
Define data quality first. Think from the business perspective. Then clean your data. You’ll help yourself later.