Decision Trees are also capable of performing regression tasks. Let’s build a regression tree using Scikit-Learn’s DecisionTreeRegressor class, training it on a noisy quadratic dataset with max_depth=2:

from sklearn.tree import DecisionTreeRegressor
tree_reg = DecisionTreeRegressor(max_depth=2)
tree_reg.fit(X, y)

The resulting tree is represented below

A decision tree for regression tasks showing conditions and values at each node, including mean squared error (mse) and sample sizes.

This tree looks very similar to the classification tree you built earlier. The main difference is that instead of predicting a class in each node, it predicts a value. For example, suppose you want to make a prediction for a new instance with x1 = 0.6. You traverse the tree starting at the root, and you eventually reach the leaf node that predicts value=0.1106. This prediction is simply the average target value of the 110 training instances associated to this leaf node. This prediction results in a Mean Squared Error (MSE) equal to 0.0151 over these 110 instances.
This model’s predictions are represented on the left of Figure 6-5. If you set max_depth=3, you get the predictions represented on the right. Notice how the predicted value for each region is always the average target value of the instances in that region. The algorithm splits each region in a way that makes most training instances as close as possible to that predicted value.

Comparison of regression tree predictions with max_depth=2 on the left and max_depth=3 on the right, showing average target values represented by red lines and training instances as blue points.

The CART algorithm works mostly the same way as earlier, except that instead of trying to split the training set in a way that minimizes impurity, it now tries to split the training set in a way that minimizes the MSE. Equation below shows the cost function that the algorithm tries to minimize.

Mathematical representation of the cost function used in regression trees to minimize Mean Squared Error (MSE).
CART cost function for regression

Just like for classification tasks, Decision Trees are prone to overfitting when dealing with regression tasks. Without any regularization (i.e., using the default hyperparameters), you get the predictions on the left of Figure below. It is obviously overfitting the training set very badly. Just setting min_samples_leaf=10 results in a much more reasonable model, represented on the right of Figure below.

Comparison of regression predictions with no restrictions on the left and with min_samples_leaf=10 on the right, showing how changes in parameters influence the model's fit to the data.
Regularizing a Decision Tree regressor

Instability

Hopefully by now you are convinced that Decision Trees have a lot going for them: they are simple to understand and interpret, easy to use, versatile, and powerful. However they do have a few limitations. First, as you may have noticed, Decision Trees love orthogonal decision boundaries (all splits are perpendicular to an axis), which makes them sensitive to training set rotation. For example, Figure below shows asimple linearly separable dataset: on the left, a Decision Tree can split it easily, while on the right, after the dataset is rotated by 45°, the decision boundary looks unnecessarily convoluted. Although both Decision Trees fit the training set perfectly, it is very likely that the model on the right will not generalize well. One way to limit this problem is to use PCA (see Chapter 8), which often results in a better orientation of the training data.

Comparison of two decision boundary plots for a regression analysis. The left plot shows a simple linear decision boundary dividing two classes, while the right plot depicts a more complex, piecewise function illustrating non-linear predictions.
Sensitivity to training set rotation

More generally, the main issue with Decision Trees is that they are very sensitive to small variations in the training data. For example, if you just remove the widest IrisVersicolor from the iris training set (the one with petals 4.8 cm long and 1.8 cm wide) and train a new Decision Tree, you may get the model represented below. As you can see, it looks very different from the previous Decision Tree. Actually, since the training algorithm used by Scikit-Learn is stochastic6 you may get very different models even on the same training data (unless you set the random_state hyperparameter).

Scatter plot showing petal length on the x-axis and petal width on the y-axis, depicting different regions classified by a decision tree with depth levels labeled as Depth=0 and Depth=1.
Sensitivity to training set details

Random Forests can limit this instability by averaging predictions over many trees, as we will see in the next upcoming posts..

Leave a Reply

Discover more from Geeky Codes

Subscribe now to keep reading and get access to the full archive.

Continue reading