Decision tree algorithms are supervised learning methods that can solve both regression and classification problems. The goal is to build a model that predicts the value of a target variable from several input variables. A decision tree uses a tree-like model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility, and can be used to map out all possible outcomes of a decision.
A decision tree algorithm works by breaking a dataset down into smaller and smaller subsets while an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes: a decision node has two or more branches, each representing a value of the attribute tested at that node, while a leaf node represents a decision, i.e. the predicted outcome.
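Before digging into the math, here is a minimal sketch of what using a decision tree looks like in practice with scikit-learn's DecisionTreeClassifier (the dataset and parameter values are illustrative choices, not part of any prescribed recipe):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load a small example dataset and split it into train/test sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Each internal (decision) node tests one feature; each leaf stores a predicted class
clf = DecisionTreeClassifier(criterion="entropy", random_state=42)
clf.fit(X_train, y_train)

print("Test accuracy:", clf.score(X_test, y_test))
```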

Mathematical Intuition
The mathematical intuition behind decision trees is based on the concept of entropy and information gain. Entropy measures the impurity of the input set, while information gain measures how much the entropy decreases after splitting the set based on a certain feature.
More formally, let’s say we have a training set D of size N, with M classes or labels. Let’s also define p(i) as the proportion of data points in D belonging to class i. The entropy of D is then:
Entropy(D) = -sum(p(i) * log2(p(i)))
The entropy of a pure set is 0, since all the data points in the set belong to the same class. On the other hand, a set with an equal number of data points in each class has the highest possible entropy.
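To make the formula concrete, here is a small sketch of an entropy helper in NumPy (the function name and the example labels are mine, chosen only for illustration):

```python
import numpy as np

def entropy(labels):
    """Entropy(D) = -sum(p(i) * log2(p(i))) over the class proportions p(i)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -float(np.sum(p * np.log2(p)))

print(entropy([0, 0, 1, 1]))              # evenly split over two classes -> 1.0 bit
print(entropy([0, 0, 0, 1, 1, 2, 2, 2]))  # three mixed classes -> about 1.56 bits
```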
To split the set based on a certain feature, we need to calculate the entropy of each subset resulting from the split and weigh it by the size of the subset. The information gain of the split is then the difference between the entropy of the input set and the weighted average entropy of the subsets.
Information Gain = Entropy(D) - sum(Size(Di)/Size(D) * Entropy(Di))
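Putting the two formulas together, a rough sketch of computing the information gain of a split on a single categorical feature might look like this (the helper names and the toy data are illustrative, not taken from any library):

```python
import numpy as np

def entropy(labels):
    # Entropy helper from the previous sketch
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -float(np.sum(p * np.log2(p)))

def information_gain(labels, feature_values):
    """Entropy(D) minus the size-weighted entropy of the subsets induced by the feature."""
    n = len(labels)
    weighted = 0.0
    for value in np.unique(feature_values):
        subset = labels[feature_values == value]
        weighted += len(subset) / n * entropy(subset)
    return entropy(labels) - weighted

# Toy example: a binary feature that separates the two classes fairly well
y = np.array([0, 0, 0, 1, 1, 1])
x = np.array(["a", "a", "a", "a", "b", "b"])
print(information_gain(y, x))  # about 0.46 bits
```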
The feature with the highest information gain is chosen to split the set at each step of the tree-building process. This process is repeated recursively for each subset, until some stopping criteria are met, such as a maximum tree depth or a minimum number of data points in the subset.
There are several criteria that can be used to stop the splitting process of a decision tree, such as:
- Maximum tree depth: This criterion specifies a maximum depth for the tree. Once the tree reaches this depth, no further splits are performed. This is useful to prevent overfitting and to avoid creating a tree that is too complex.
- Minimum number of data points in a leaf node: This criterion specifies a minimum number of data points that a leaf node must contain. If a split would result in a leaf node with fewer data points than the specified minimum, the split is not performed. This helps prevent overfitting and ensures that each leaf node contains enough data to make a reliable prediction.
- Minimum improvement in entropy: This criterion specifies a minimum improvement in entropy that must be achieved by a split in order for it to be performed. If a split does not reduce the entropy by at least the specified amount (i.e. its information gain is too small), the split is not performed. This helps prevent the tree from overfitting to the training data.
- A maximum number of leaf nodes: This criterion specifies a maximum number of leaf nodes that the tree can contain. Once the tree reaches this number of leaf nodes, no further splits are performed.
There is no single criterion that works best for all cases, and the choice of stopping criteria will depend on the specific problem and the trade-off between model complexity and accuracy.
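In most implementations these stopping criteria show up as hyperparameters. For example, scikit-learn's DecisionTreeClassifier exposes them roughly as follows (the specific values here are arbitrary, chosen only for illustration):

```python
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(
    criterion="entropy",         # measure split quality with entropy / information gain
    max_depth=5,                 # maximum tree depth
    min_samples_leaf=10,         # minimum number of data points in a leaf node
    min_impurity_decrease=0.01,  # minimum improvement in impurity required to split
    max_leaf_nodes=20,           # maximum number of leaf nodes
    random_state=42,
)
```

Note that scikit-learn phrases the entropy-improvement criterion in terms of impurity decrease, which plays the same role as the information-gain threshold described above.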

Pruning
Pruning is a technique used to improve the generalization ability of a decision tree by reducing its size. The idea is to remove some of the branches of the tree that do not contribute much to the accuracy of the model, in order to avoid overfitting.
There are two main types of pruning techniques: pre-pruning and post-pruning.
Pre-pruning consists of setting some stopping criteria for the tree-building process, such as a maximum tree depth or a minimum number of data points in a leaf node. This prevents the tree from growing too large and helps reduce overfitting.
On the other hand, post-pruning consists of trimming the branches of an already-grown tree. There are several algorithms that can be used for this purpose, such as reduced error pruning and minimum error pruning. These algorithms remove the branches that contribute the least to the accuracy of the tree, based on some criteria such as the error rate or the complexity of the tree.
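Another common post-pruning algorithm is minimal cost-complexity pruning, which scikit-learn implements through the ccp_alpha parameter. The sketch below is illustrative only: it inspects the pruning path on a toy dataset and refits the tree with a non-zero alpha (the dataset and the chosen alpha value are arbitrary).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Compute the effective alphas along the cost-complexity pruning path
tree = DecisionTreeClassifier(random_state=42)
path = tree.cost_complexity_pruning_path(X_train, y_train)
print("Candidate alphas:", path.ccp_alphas)

# Refit with a non-zero alpha: larger alphas prune more branches away
pruned = DecisionTreeClassifier(random_state=42, ccp_alpha=0.01)
pruned.fit(X_train, y_train)
print("Leaves after pruning:", pruned.get_n_leaves())
print("Test accuracy:", pruned.score(X_test, y_test))
```

In practice, the alpha value is usually chosen by cross-validating over the candidate alphas returned by the pruning path.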
Pruning can be a computationally expensive process, especially for large trees. However, it can significantly improve the generalization ability of the model and make it more robust to unseen data.
Closing Remarks
I hope this helps! Let me know if you have any questions or would like further clarification on any of the concepts.
For the practical implementation of the Decision Tree algorithm for both classification and regression problem statements, please visit my GitHub.