Neural activation units
Several kinds of neural activation units are used in neural networks, depending on the architecture and the problem at hand. We will discuss the most commonly used activation functions, since they play an important role in determining the network architecture and performance. Linear and sigmoid activation functions were the primary choices in artificial neural networks until rectified linear units (ReLUs), popularized by Hinton and his collaborators, revolutionized the performance of neural networks.
Linear activation units
A linear activation unit outputs the total input to the neuron, attenuated by a constant factor, as shown in the following graph:
If x is the total input to the linear activation unit, then the output, y, can be represented as follows:

$y = kx$

Here, k is a constant factor.
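As a minimal sketch of this (assuming NumPy; the function name linear_activation and the sample inputs are illustrative, and k may simply be 1), a linear unit just scales its input:

```python
import numpy as np

def linear_activation(x, k=1.0):
    """Linear activation: the output is the total input scaled by a constant k."""
    return k * x

# The gradient with respect to the input is the constant k everywhere.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
print(linear_activation(x))         # [-2. -1.  0.  1.  2.]
print(linear_activation(x, k=0.5))  # [-1.  -0.5  0.   0.5  1. ]
```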
Sigmoid activation units
The output of the sigmoid activation unit, y, as a function of its total input, x, is expressed as follows:

$y = \frac{1}{1 + e^{-x}}$
Since the sigmoid activation unit response is a nonlinear function, as shown in the following graph, it is used to introduce nonlinearity in the neural network:
Any complex process in nature is generally nonlinear in its input-output relationship; hence, we need nonlinear activation functions to model such processes through neural networks. For a two-class classification, the output probability of a neural network is generally given by the output of a sigmoid unit, since it produces values between zero and one. The output probability can be represented as follows:

$P(\text{class} = 1 \mid x) = \frac{1}{1 + e^{-x}}$
Here, x represents the total input to the sigmoid unit in the output layer.
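A minimal sketch of this idea, assuming NumPy and an arbitrary decision threshold of 0.5 (both are illustrative choices, not taken from the text):

```python
import numpy as np

def sigmoid(x):
    """Squash the total input x into the (0, 1) range."""
    return 1.0 / (1.0 + np.exp(-x))

# Total inputs to the output unit for a few examples.
logits = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
probs = sigmoid(logits)           # interpreted as P(class = 1 | x)
predictions = (probs >= 0.5)      # threshold at 0.5 for a two-class decision
print(probs)        # [0.047 0.378 0.5   0.622 0.953] (approximately)
print(predictions)  # [False False  True  True  True]
```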
The hyperbolic tangent activation function
The output, y, of a hyperbolic tangent activation function (tanh) as a function of its total input, x, is given as follows:

$y = \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$
The tanh activation function outputs values in the range [-1, 1], as you can see in the following graph:
One thing to note is that both the sigmoid and the tanh activation functions are approximately linear only within a small range of the input, beyond which the output saturates. In the saturation zone, the gradients of the activation functions (with respect to the input) are very small or close to zero; this makes them very prone to the vanishing gradient problem. As you will see later on, neural networks learn through the backpropagation method, where the gradient of a layer depends on the gradients of the activation units in the succeeding layers, all the way up to the final output layer. Therefore, if the activation units are operating in the saturation region, much less of the error is backpropagated to the early layers of the neural network. Neural networks learn their weights and biases (W) by minimizing the prediction error using these gradients, which means that, if the gradients are small or vanish to zero, the network will fail to learn these weights properly.
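To make the saturation effect concrete, here is a small numerical sketch (assuming NumPy; the sample inputs are arbitrary) that evaluates the sigmoid and tanh gradients at moderate and large input values:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x))
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_grad(x):
    # d/dx tanh(x) = 1 - tanh(x)^2
    return 1.0 - np.tanh(x) ** 2

x = np.array([0.0, 2.0, 5.0, 10.0])
print(sigmoid_grad(x))  # approx [0.25  0.105  0.0066  0.00005] -- shrinks rapidly
print(tanh_grad(x))     # approx [1.    0.071  0.00018 8e-9  ]  -- shrinks even faster
```

The further the input moves into the saturation zone, the closer the gradient gets to zero, which is exactly what starves the early layers of error signal during backpropagation.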
Rectified linear unit (ReLU)
The output of a ReLU is linear in the total input to the neuron when that input is greater than zero, and the output is zero when the total input is negative. This simple activation function provides nonlinearity to a neural network and, at the same time, provides a constant gradient of one with respect to the total input whenever the input is positive. This constant gradient helps to keep the network from developing the saturation and vanishing gradient problems seen with activation functions such as sigmoid and tanh. The ReLU function output (as shown in Figure 1.8) can be expressed as follows:

$y = \max(0, x)$
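A minimal NumPy sketch of the ReLU and its gradient (the function names and sample inputs are illustrative):

```python
import numpy as np

def relu(x):
    """ReLU: passes positive inputs through unchanged, clamps negatives to zero."""
    return np.maximum(0.0, x)

def relu_grad(x):
    """The gradient is 1 for positive inputs and 0 for negative inputs."""
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))       # [0.  0.  0.  0.5 2. ]
print(relu_grad(x))  # [0. 0. 0. 1. 1.]
```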
The ReLU activation function can be plotted as follows:
One of the constraints of ReLU is its zero gradient for negative values of the input. This may slow down training, especially at the initial phase. Leaky ReLU activation functions (as shown in Figure 1.9) can be useful in this scenario, since their output and gradients are nonzero even for negative values of the input. A leaky ReLU output function can be expressed as follows:

$y = \begin{cases} x, & x > 0 \\ \alpha x, & x \leq 0 \end{cases}$
The parameter $\alpha$ is to be provided for leaky ReLU activation functions, whereas for a parametric ReLU, $\alpha$ is a parameter that the neural network learns through training. The following graph shows the output of the leaky ReLU activation function:
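A short sketch of a leaky ReLU (assuming NumPy; the slope value of 0.01 is a common default rather than one prescribed here, and the function names are illustrative):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: small nonzero slope alpha for negative inputs."""
    return np.where(x > 0, x, alpha * x)

def leaky_relu_grad(x, alpha=0.01):
    """The gradient is 1 for positive inputs and alpha (not 0) for negative inputs."""
    return np.where(x > 0, 1.0, alpha)

x = np.array([-2.0, -0.5, 0.5, 2.0])
print(leaky_relu(x))       # [-0.02  -0.005  0.5    2.   ]
print(leaky_relu_grad(x))  # [ 0.01   0.01   1.     1.   ]
```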
The softmax activation unit
The softmax activation unit is generally used to output class probabilities in multi-class classification problems. Suppose that we are dealing with an n-class classification problem, and that the total inputs corresponding to the n classes are given by the following:

$x_1, x_2, \ldots, x_n$
In this case, the output probability of the kth class from the softmax activation unit is given by the following formula:

$p_k = \frac{e^{x_k}}{\sum_{i=1}^{n} e^{x_i}}$
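A minimal NumPy sketch of the softmax computation (the max-subtraction is a standard numerical-stability trick, not something introduced in the text, and the example inputs are arbitrary):

```python
import numpy as np

def softmax(x):
    """Convert a vector of total class inputs into probabilities that sum to one."""
    shifted = x - np.max(x)          # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / np.sum(exps)

logits = np.array([2.0, 1.0, 0.1])   # total inputs for a 3-class problem
probs = softmax(logits)
print(probs)          # [0.659 0.242 0.099] (approximately)
print(np.sum(probs))  # 1.0
```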
There are several other activation functions, mostly variations of these basic versions. We will discuss them as we encounter them in the different projects that we will cover in the following chapters.