Introduction
In the dynamic landscape of machine learning, Multilayer Perceptrons (MLPs) emerge as formidable tools capable of handling both regression and classification tasks with finesse. Whether you’re predicting housing prices or sorting emails, understanding how to tailor MLP architectures and activations is pivotal for optimizing performance.
Regression MLPs
Crafting an MLP architecture for regression tasks demands careful consideration. A single output neuron suffices to predict one continuous value, such as a house price. For multivariate regression, where multiple values must be predicted at once, one output neuron per output dimension is required.
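As a minimal sketch of this architecture, the forward pass below uses one ReLU hidden layer feeding a single linear output neuron. The layer sizes and the randomly initialized weights are hypothetical stand-ins for trained parameters:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical layer sizes: 8 input features, 30 hidden units, 1 output neuron.
n_inputs, n_hidden, n_outputs = 8, 30, 1

# Random weights stand in for parameters learned during training.
W1 = rng.normal(scale=0.1, size=(n_inputs, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(n_hidden, n_outputs))
b2 = np.zeros(n_outputs)

def predict(X):
    """Forward pass: one ReLU hidden layer, one linear output neuron."""
    hidden = np.maximum(0.0, X @ W1 + b1)  # ReLU activation in the hidden layer
    return hidden @ W2 + b2                # no output activation: unrestricted values

X = rng.normal(size=(5, n_inputs))         # a batch of 5 instances
y_pred = predict(X)
print(y_pred.shape)                        # one continuous prediction per instance
```

For multivariate regression, setting `n_outputs` to the number of target dimensions is the only change needed.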
Tailoring MLPs for regression involves strategic choices regarding activation functions. While omitting activations allows for unrestricted value output, employing ReLU or softplus activations can ensure positivity in predictions. Alternatively, logistic or hyperbolic tangent functions, coupled with appropriate label scaling, confine predictions within desired ranges.
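The effect of these output activations can be seen directly. The sketch below compares ReLU, softplus, and the logistic (sigmoid) function on a few sample pre-activation values:

```python
import numpy as np

def softplus(z):
    """Smooth approximation of ReLU; output is always strictly positive."""
    return np.log1p(np.exp(z))

def sigmoid(z):
    """Logistic function; output lies in (0, 1), so labels must be scaled to match."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-3.0, 0.0, 3.0])
print(np.maximum(0.0, z))  # ReLU: negatives clipped to 0
print(softplus(z))         # strictly positive everywhere
print(sigmoid(z))          # bounded to (0, 1)
```

With tanh the outputs would instead lie in (-1, 1), so the labels would need to be scaled to that range before training.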
During training, selecting an appropriate loss function is crucial. While mean squared error is a common choice, mean absolute error or Huber loss can be advantageous in scenarios with outliers, offering faster convergence and enhanced robustness.
The Huber loss is quadratic when the error is smaller than a threshold δ (typically 1), but linear when the error is larger than δ. This makes it less sensitive to outliers than the mean squared error, and it is often more precise and converges faster than the mean absolute error.
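The two regimes of the Huber loss can be made concrete with a short NumPy implementation (one common formulation, using 0.5·error² in the quadratic regime so the two pieces join smoothly at δ):

```python
import numpy as np

def huber_loss(y_true, y_pred, delta=1.0):
    """Quadratic for |error| <= delta, linear beyond that threshold."""
    error = y_true - y_pred
    small = np.abs(error) <= delta
    quadratic = 0.5 * error ** 2
    linear = delta * (np.abs(error) - 0.5 * delta)
    return np.where(small, quadratic, linear)

# A small error falls in the quadratic regime; a large one grows only linearly.
losses = huber_loss(np.array([0.5, 10.0]), np.array([0.0, 0.0]))
print(losses)  # → [0.125, 9.5]
```

Note how the outlier with error 10 contributes a loss of 9.5 rather than the 50.0 that the 0.5·error² (squared-error) term alone would give, which is exactly why the Huber loss is less sensitive to outliers.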

Classification MLPs
MLPs seamlessly adapt to classification tasks, whether binary or multiclass. In binary classification, a single output neuron utilizing the logistic activation function provides probabilities, simplifying the prediction of positive class likelihood. Extending to multilabel binary classification, such as categorizing emails as spam or urgent, each relevant class necessitates an output neuron with logistic activation.
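In the multilabel case, each output neuron applies the logistic function to its own score independently, so the resulting probabilities are not constrained to sum to one. A sketch with hypothetical logits for the spam and urgent labels:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical logits for two independent labels: [spam, urgent].
logits = np.array([2.0, -1.0])
probs = sigmoid(logits)
print(probs)        # each probability is estimated independently, in (0, 1)
print(probs.sum())  # need not equal 1: the labels are not mutually exclusive
```

An email could thus score high on both labels at once (urgent spam) or low on both (non-urgent ham).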
For multiclass classification scenarios, where each instance belongs to exactly one class out of many options, employing one output neuron per class paired with softmax activation ensures the outputs form a valid probability distribution over the classes, which is exactly what the exclusivity requirement demands.
Note that in the multilabel case the output probabilities do not necessarily add up to one. This lets the model output any combination of labels: you can have non-urgent ham, urgent ham, non-urgent spam, and perhaps even urgent spam (although that would probably be an error). In contrast, if each instance can belong to only a single class, out of three or more possible classes (e.g., classes 0 through 9 for digit image classification), then you need one output neuron per class, and you should use the softmax activation function for the whole output layer. The softmax function ensures that all the estimated probabilities are between 0 and 1 and that they add up to one (which is required when the classes are exclusive). This is called multiclass classification.
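A minimal softmax implementation makes the contrast with the multilabel case visible, using hypothetical logits for three exclusive classes:

```python
import numpy as np

def softmax(z):
    z = z - z.max()       # subtract the max logit for numerical stability
    exp = np.exp(z)
    return exp / exp.sum()

# Hypothetical logits for 3 mutually exclusive classes.
logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(probs)              # each probability lies in (0, 1)
print(probs.sum())        # sums to 1: the classes are exclusive
```

Subtracting the maximum logit before exponentiating does not change the result (it cancels in the ratio) but prevents overflow for large logits.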

Regarding the loss function, since we are predicting probability distributions, the cross-entropy (also called the log loss) is generally a good choice.
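The cross-entropy loss simply penalizes the model for assigning a low probability to the true class. A sketch with one-hot targets and hypothetical predicted distributions:

```python
import numpy as np

def cross_entropy(y_true_onehot, y_proba, eps=1e-12):
    """Log loss: negative log-probability assigned to the true class.
    eps guards against taking log(0)."""
    return -np.sum(y_true_onehot * np.log(y_proba + eps), axis=-1)

# Hypothetical example: the true class is index 0.
y_true = np.array([1.0, 0.0, 0.0])
confident = np.array([0.9, 0.05, 0.05])  # high probability on the true class
unsure = np.array([0.4, 0.3, 0.3])       # probability spread across classes

print(cross_entropy(y_true, confident))  # low loss, roughly -log(0.9)
print(cross_entropy(y_true, unsure))     # higher loss, roughly -log(0.4)
```

The loss depends only on the probability given to the true class, so training pushes that probability toward 1.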
Optimizing Efficiency
Efficient use of MLPs comes down to matching the loss function to the task: mean squared error (or a robust alternative such as the Huber loss) for regression, and cross-entropy for classification, where the outputs are probability distributions.
Conclusion
Before we go on, I recommend you do some coding practice at the end of this chapter. You will play with various neural network architectures and visualize their outputs using the TensorFlow Playground. This will help you better understand MLPs, in particular the effects of the hyperparameters (number of layers and neurons, activation functions, and more).