We will now look at how to implement a softmax layer. As we have already discussed, a sigmoid layer is used for multi-label classification: if you want to infer multiple nonexclusive characteristics from an input, you should use a sigmoid layer. A softmax layer is used when you want to assign exactly one class to a sample; it does this by computing a probability for each possible class, with the probabilities over all classes, of course, summing to 100%. We then select the class with the highest probability as the final classification.
Now, let's see exactly what the softmax layer does. Given a collection of N real numbers (c0, ..., cN-1), we first compute the sum of the exponentials of those numbers, S = exp(c0) + exp(c1) + ... + exp(cN-1), and then calculate the exponential of each number divided by this sum: softmax(ci) = exp(ci) / S. Each output lies between 0 and 1, and the outputs sum to 1, so they can be interpreted as probabilities.
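The computation above can be sketched in a few lines of Python. Note that this is an illustrative implementation, not the one from the text; in particular, the subtraction of the maximum value before exponentiating is a standard numerical-stability trick (it cancels out in the ratio, so the result is unchanged, but it prevents overflow for large inputs):

```python
import math

def softmax(values):
    # Shift by the maximum before exponentiating; this avoids
    # overflow and does not change the result, since the common
    # factor exp(-m) cancels in the ratio.
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)  # this is the sum S from the formula above
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])
print(probs)
```

Running this on the inputs (1.0, 2.0, 3.0) produces three probabilities that sum to 1, with the largest probability assigned to the largest input, so taking the argmax of the output recovers the predicted class.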