Getting started with activation functions
If we only used linear activation functions, a neural network would reduce to a stack of linear combinations, which is itself just one linear transformation. The power of neural networks lies in their ability to model complex nonlinear behavior. We briefly introduced the nonlinear activation functions sigmoid and ReLU in the previous recipes, and there are many more popular nonlinear functions, such as ELU, Leaky ReLU, TanH, and Maxout.
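To make these functions concrete, here is a minimal NumPy sketch of several of them. The alpha values are illustrative defaults, not tuned choices, and the function names are our own:

```python
import numpy as np

def sigmoid(x):
    # Squashes inputs to the range (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Squashes inputs to the range (-1, 1)
    return np.tanh(x)

def relu(x):
    # Zero for negative inputs, identity for positive inputs
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # Like ReLU, but lets a small slope through for negative inputs
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    # Smooth ReLU variant that saturates at -alpha for large negative inputs
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

x = np.linspace(-3, 3, 7)
print(relu(x))
print(leaky_relu(x))
```

Plotting these functions over a range of inputs is a quick way to build intuition for how each one shapes the signal passed to the next layer.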
There is no general rule as to which activation function works best for the hidden units. Deep learning is a relatively young field, and most results are obtained by trial and error rather than by mathematical proof. For the output layer, the choice is more settled: for regression tasks, we use a single output unit with a linear activation. For classification tasks with n mutually exclusive classes, we use n output nodes and a softmax activation function; softmax forces the network to output probabilities between 0 and 1 that sum up to 1. For binary classification, we can use a single output unit with a sigmoid activation, which gives the probability of the positive class.
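The following sketch illustrates these output-layer conventions. The softmax is implemented in NumPy in a numerically stable way, and the Keras layer definitions assume the `tensorflow.keras` API; treat them as one possible way to wire this up rather than the only one:

```python
import numpy as np
from tensorflow.keras.layers import Dense

def softmax(z):
    # Subtract the max for numerical stability; the outputs sum to 1
    e = np.exp(z - np.max(z))
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(probs, probs.sum())  # probabilities for 3 classes, summing to 1

# Typical output layers (illustrative layer sizes):
out_regression = Dense(1, activation='linear')    # regression: 1 unit, linear
out_multiclass = Dense(10, activation='softmax')  # 10 mutually exclusive classes
out_binary = Dense(1, activation='sigmoid')       # binary: probability of the positive class
```

The loss function should match the output activation: mean squared error pairs with the linear output, categorical cross-entropy with softmax, and binary cross-entropy with the sigmoid output.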