An MLP is composed of one (passthrough) input layer, one or more layers of TLUs, called hidden layers, and one final layer of TLUs called the output layer (see the figure below). The layers close to the input layer are usually called the lower layers, and the ones close to the outputs are usually called the upper layers. Every layer except the output layer includes a bias neuron and is fully connected to the next layer.

The signal flows only in one direction (from the inputs to the outputs), so this architecture is an example of a feedforward neural network (FNN).
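To make the one-directional flow concrete, here is a minimal sketch in plain Python. The helper names (`step`, `layer_forward`, `mlp_forward`) and the hand-picked weights are illustrative assumptions, not something defined in this article; the weights happen to make a tiny two-layer network of TLUs compute XOR. The bias neuron of each layer is represented by one bias term per receiving neuron.

```python
def step(z):
    """Heaviside step function used by a TLU (assumed convention: 1 when z >= 0)."""
    return 1 if z >= 0 else 0


def layer_forward(inputs, weights, biases, activation=step):
    """Compute one fully connected layer: one weight row and one bias per neuron."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def mlp_forward(inputs, layers):
    """Feed the signal through the layers in order, from inputs to outputs."""
    signal = inputs  # the input layer just passes the values through
    for weights, biases in layers:
        signal = layer_forward(signal, weights, biases)
    return signal


# Hypothetical hand-picked parameters: 2 inputs -> 2 hidden TLUs -> 1 output TLU.
XOR_LAYERS = [
    ([[1.0, 1.0], [1.0, 1.0]], [-1.5, -0.5]),  # hidden layer
    ([[-1.0, 1.0]], [-0.5]),                   # output layer
]

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", mlp_forward([a, b], XOR_LAYERS)[0])
```

Each layer only ever reads the output of the layer before it, which is exactly what makes the network feedforward.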
When an ANN contains a deep stack of hidden layers, it is called a deep neural network (DNN). The field of Deep Learning studies DNNs, and more generally models containing deep stacks of computations. However, many people talk about Deep Learning whenever neural networks are involved (even shallow ones).
For many years researchers struggled to find a way to train MLPs, without success. But in 1986, David Rumelhart, Geoffrey Hinton, and Ronald Williams published a groundbreaking paper introducing the backpropagation training algorithm, which is still used today. In short, it is simply Gradient Descent using an efficient technique for computing the gradients automatically: in just two passes through the network (one forward, one backward), the backpropagation algorithm is able to compute the gradient of the network’s error with regard to every single model parameter. In other words, it can find out how each connection weight and each bias term should be tweaked in order to reduce the error. Once it has these gradients, it just performs a regular Gradient Descent step, and the whole process is repeated until the network converges to the solution.
Automatically computing gradients is called automatic differentiation, or autodiff. There are various autodiff techniques, with different pros and cons. The one used by backpropagation is called reverse-mode autodiff. It is fast and precise, and is well suited when the function to differentiate has many variables (e.g., connection weights) and few outputs (e.g., one loss). If you want to learn more about autodiff, check out ???.
Let’s run through this algorithm in a bit more detail:
- It handles one mini-batch at a time (for example, containing 32 instances each), and it goes through the full training set multiple times. Each pass is called an epoch.
- Each mini-batch is passed to the network’s input layer, which just sends it to the first hidden layer. The algorithm then computes the output of all the neurons in this layer (for every instance in the mini-batch). The result is passed on to the next layer, its output is computed and passed to the next layer, and so on until we get the output of the last layer, the output layer. This is the forward pass: it is exactly like making predictions, except all intermediate results are preserved since they are needed for the backward pass.
- Next, the algorithm measures the network’s output error (i.e., it uses a loss function that compares the desired output and the actual output of the network, and returns some measure of the error).
- Then it computes how much each output connection contributed to the error. This is done analytically by simply applying the chain rule (perhaps the most fundamental rule in calculus), which makes this step fast and precise.
- The algorithm then measures how much of these error contributions came from each connection in the layer below, again using the chain rule, and so on until the algorithm reaches the input layer. As we explained earlier, this reverse pass efficiently measures the error gradient across all the connection weights in the network by propagating the error gradient backward through the network (hence the name of the algorithm).
- Finally, the algorithm performs a Gradient Descent step to tweak all the connection weights in the network, using the error gradients it just computed.
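The steps above can be sketched in plain Python for a very small network. This is only an illustration under stated assumptions, not the implementation from the 1986 paper: the network shape (2 inputs, 3 logistic hidden neurons, 1 logistic output), the mean squared error loss, the OR toy dataset, the mini-batch size of 2, the learning rate, and every function name are choices made for this example.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_params(n_in, n_hidden, seed=42):
    """Random weights for an n_in -> n_hidden -> 1 network (assumed sizes)."""
    rng = random.Random(seed)
    return {
        "W1": [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)],
        "b1": [0.0] * n_hidden,
        "W2": [rng.uniform(-1, 1) for _ in range(n_hidden)],
        "b2": 0.0,
    }


def forward(params, x):
    """Forward pass for one instance; keeps the hidden activations."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(params["W1"], params["b1"])]
    y_hat = sigmoid(sum(w * hi for w, hi in zip(params["W2"], h)) + params["b2"])
    return h, y_hat


def batch_loss(params, batch):
    """Mean squared error over a mini-batch of (inputs, target) pairs."""
    return sum((forward(params, x)[1] - y) ** 2 for x, y in batch) / len(batch)


def backprop(params, batch):
    """Gradient of batch_loss with respect to every weight and bias."""
    n_hidden, n_in = len(params["W1"]), len(params["W1"][0])
    grads = {"W1": [[0.0] * n_in for _ in range(n_hidden)],
             "b1": [0.0] * n_hidden, "W2": [0.0] * n_hidden, "b2": 0.0}
    for x, y in batch:
        h, y_hat = forward(params, x)
        # Output neuron: chain rule through the loss and the logistic function.
        delta_out = 2 * (y_hat - y) / len(batch) * y_hat * (1 - y_hat)
        grads["b2"] += delta_out
        for j in range(n_hidden):
            grads["W2"][j] += delta_out * h[j]
            # Layer below: how much of the error came through hidden neuron j.
            delta_h = delta_out * params["W2"][j] * h[j] * (1 - h[j])
            grads["b1"][j] += delta_h
            for i in range(n_in):
                grads["W1"][j][i] += delta_h * x[i]
    return grads


def gradient_descent_step(params, grads, lr):
    """Move every parameter a small step against its gradient (in place)."""
    for j, row in enumerate(params["W1"]):
        for i in range(len(row)):
            row[i] -= lr * grads["W1"][j][i]
        params["b1"][j] -= lr * grads["b1"][j]
        params["W2"][j] -= lr * grads["W2"][j]
    params["b2"] -= lr * grads["b2"]


# Toy data: the OR function, split into mini-batches of 2 instances.
data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]
params = init_params(n_in=2, n_hidden=3)
initial_loss = batch_loss(params, data)
shuffler = random.Random(0)
for epoch in range(1000):                      # one epoch = one full pass
    shuffler.shuffle(data)
    for start in range(0, len(data), 2):
        mini_batch = data[start:start + 2]
        grads = backprop(params, mini_batch)   # forward + reverse pass
        gradient_descent_step(params, grads, lr=0.5)
final_loss = batch_loss(params, data)
print(f"loss before: {initial_loss:.4f}, after: {final_loss:.4f}")
```

A useful sanity check for any hand-written `backprop` is to compare its gradients with finite differences of the loss: nudge one weight by a tiny amount, recompute the loss, and confirm the slope matches.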
This algorithm is so important that it is worth summarizing again: for each training instance, the backpropagation algorithm first makes a prediction (forward pass), measures the error, then goes through each layer in reverse to measure the error contribution from each connection (reverse pass), and finally slightly tweaks the connection weights to reduce the error (Gradient Descent step).
In order for this algorithm to work properly, the authors made a key change to the MLP’s architecture: they replaced the step function with the logistic function, σ(z) = 1 / (1 + exp(–z)). This was essential because the step function contains only flat segments, so there is no gradient to work with (Gradient Descent cannot move on a flat surface), while the logistic function has a well-defined nonzero derivative everywhere, allowing Gradient Descent to make some progress at every step. In fact, the backpropagation algorithm works well with many other activation functions, not just the logistic function. Two other popular activation functions are:
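The contrast between the two functions can be checked numerically. The sketch below is illustrative only; it relies on the standard identity σ′(z) = σ(z)(1 – σ(z)) for the logistic derivative, and the helper names (`step`, `sigmoid_derivative`, `numerical_derivative`) are assumptions made for this example.

```python
import math


def step(z):
    """Step function (assumed convention: 1 when z >= 0, else 0)."""
    return 1.0 if z >= 0 else 0.0


def sigmoid(z):
    """Logistic function: 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_derivative(z):
    """Closed-form derivative of the logistic function."""
    s = sigmoid(z)
    return s * (1.0 - s)


def numerical_derivative(func, z, eps=1e-6):
    """Central finite difference, used here only as a cross-check."""
    return (func(z + eps) - func(z - eps)) / (2 * eps)


for z in (-4.0, -1.0, 0.5, 1.0, 4.0):
    print(f"z={z:+.1f}  step slope={numerical_derivative(step, z):.1f}  "
          f"sigmoid slope={sigmoid_derivative(z):.4f}")
```

Away from z = 0 the step function's slope is always 0, so Gradient Descent gets no signal, while the logistic slope is small for large |z| but never exactly zero mathematically.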
The hyperbolic tangent function: tanh(z) = 2σ(2z) – 1
Just like the logistic function, it is S-shaped, continuous, and differentiable, but its output value ranges from –1 to 1 (instead of 0 to 1 in the case of the logistic function), which tends to make each layer’s output more or less centered around 0 at the beginning of training. This often helps speed up convergence.
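The relation between tanh and the logistic function is easy to verify numerically. This is a small sketch using only Python's `math` module; `tanh_via_sigmoid` is a name invented for the example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def tanh_via_sigmoid(z):
    """tanh written as a rescaled, shifted logistic function: 2*sigma(2z) - 1."""
    return 2.0 * sigmoid(2.0 * z) - 1.0


# Compare against the library implementation on a grid over [-5, 5].
zs = [i / 10 for i in range(-50, 51)]
max_gap = max(abs(math.tanh(z) - tanh_via_sigmoid(z)) for z in zs)
print(f"largest difference on [-5, 5]: {max_gap:.2e}")
```

The two versions should agree up to floating-point rounding, and the output at z = 0 is 0, which is what keeps the layer outputs roughly centered.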
The Rectified Linear Unit function: ReLU(z) = max(0, z)
It is continuous but unfortunately not differentiable at z = 0 (the slope changes abruptly, which can make Gradient Descent bounce around), and its derivative is 0 for z < 0. However, in practice it works very well and has the advantage of being fast to compute. Most importantly, the fact that it does not have a maximum output value also helps reduce some issues during Gradient Descent.
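As a quick illustration, ReLU and its slope fit in a few lines of Python. Since the derivative is undefined at z = 0, an implementation has to pick a value there; returning 0 in `relu_derivative` below is an assumption made for this sketch.

```python
def relu(z):
    """Rectified Linear Unit: passes positive values through, clips the rest to 0."""
    return max(0.0, z)


def relu_derivative(z):
    """Slope of ReLU; the value at z = 0 is a convention (0 here, an assumption)."""
    return 1.0 if z > 0 else 0.0


for z in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"z={z:+.1f}  relu={relu(z):.1f}  slope={relu_derivative(z):.1f}")
```

There is no exponential to evaluate, which is why ReLU is cheap to compute, and the output grows without bound for positive inputs.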
These popular activation functions and their derivatives are represented in the figure below.
But wait! Why do we need activation functions in the first place? Well, if you chain several linear transformations, all you get is a linear transformation. For example, say f(x) = 2x + 3 and g(x) = 5x – 1, then chaining these two linear functions gives you another linear function: f(g(x)) = 2(5x – 1) + 3 = 10x + 1. So if you don’t have some nonlinearity between layers, then even a deep stack of layers is equivalent to a single layer: you cannot solve very complex problems with that.
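The arithmetic in this example can be checked directly. The sketch below uses the same f and g, and then, as an added illustration that is not part of the example above, places a ReLU between them to show that the result is no longer a single straight line. The names `chained`, `with_relu`, and `slope` are made up for this snippet.

```python
def f(x):
    return 2 * x + 3


def g(x):
    return 5 * x - 1


def relu(z):
    return max(0, z)


def chained(x):
    """Two linear functions back to back: still linear, equal to 10x + 1."""
    return f(g(x))


def with_relu(x):
    """Same two functions with a nonlinearity in between."""
    return f(relu(g(x)))


def slope(func, x0, x1):
    """Average slope of func between x0 and x1."""
    return (func(x1) - func(x0)) / (x1 - x0)


print([chained(x) == 10 * x + 1 for x in range(-2, 3)])
print("chained slopes:  ", slope(chained, -1, 0), slope(chained, 1, 2))
print("with ReLU slopes:", slope(with_relu, -1, 0), slope(with_relu, 1, 2))
```

The chained function has the same slope everywhere, whereas the version with ReLU is flat on one side and rising on the other, something no single linear layer can represent.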

Conclusion
Okay! So now you know where neural nets came from, what their architecture is, and how to compute their outputs, and you have also learned about the backpropagation algorithm. But what exactly can you do with them? We will see that in upcoming tutorials.