In this section, we will discuss how to train a multi-layer perceptron. Recall from Chapter 5, From Simple Linear Regression to Multiple Linear Regression, that we can use gradient descent to minimize a real-valued function, C, of many variables. Assume that C is a function of two variables, v1 and v2. To understand how to change the variables in order to minimize C, we need to know how a small change in the variables changes the value of the output. We will represent a change in the value of v1 with Δv1, a change in the value of v2 with Δv2, and a change in the value of C with ΔC. The relation between ΔC and the changes to the variables is given by:

ΔC ≈ ∂C/∂v1 Δv1 + ∂C/∂v2 Δv2
∂C/∂v1 is the partial derivative of C with respect to v1. For convenience, we will represent Δv1 and...
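As a sketch of the idea, gradient descent repeatedly chooses Δv1 and Δv2 opposite in sign to the partial derivatives, so that ΔC is negative on each step. The cost function, learning rate, and starting point below are illustrative choices, not taken from the text:

```python
# Minimal gradient descent sketch for a two-variable function C(v1, v2).
# Here C(v1, v2) = v1**2 + v2**2, whose minimum is 0 at (0, 0);
# this choice of C is an assumption for illustration only.

def C(v1, v2):
    return v1 ** 2 + v2 ** 2

def grad_C(v1, v2):
    # Partial derivatives ∂C/∂v1 and ∂C/∂v2 of this particular C
    return 2 * v1, 2 * v2

def gradient_descent(v1, v2, learning_rate=0.1, steps=100):
    for _ in range(steps):
        d1, d2 = grad_C(v1, v2)
        # Set Δv1 = -learning_rate * ∂C/∂v1 and Δv2 = -learning_rate * ∂C/∂v2,
        # so that ΔC ≈ ∂C/∂v1 Δv1 + ∂C/∂v2 Δv2 is negative.
        v1 -= learning_rate * d1
        v2 -= learning_rate * d2
    return v1, v2

v1, v2 = gradient_descent(3.0, -4.0)
print(C(v1, v2))  # a value very close to the minimum, 0
```

Each iteration moves the variables a small step in the direction of steepest descent; the learning rate controls the size of Δv1 and Δv2, and must be small enough that the linear approximation of ΔC remains accurate.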