Besides the state of a neuron, we also consider the synaptic weight, which determines how strongly one neuron influences another within the network. Each weight has a numerical value denoted by Wij, which is the synaptic weight connecting neuron i to neuron j.
Synaptic weight: This concept evolved from biology and refers to the strength or amplitude of a connection between two nodes, corresponding in biology to the amount of influence the firing of one neuron has on another.
For each neuron (also known as a unit) i, an input vector can be defined as xi = (x1, x2, ..., xn) and a weight vector as wi = (wi1, wi2, ..., win). Depending on the position of a neuron in the network, the weights and the output function determine the behavior of that individual neuron. During forward propagation, each unit i in the hidden layer gets the following signal:

$$net_i = \sum_{j=1}^{n} w_{ij} x_j$$
Among the weights, there is also a special type of weight called the bias unit, b. Technically, bias units aren't connected to any previous layer, so they don't have true activity. Still, the value of the bias b allows the neural network to shift the activation function to the left or right. Taking the bias unit into consideration, the modified signal for unit i can be formulated as follows:

$$net_i = \sum_{j=1}^{n} w_{ij} x_j + b$$
The preceding equation signifies that each hidden unit gets the sum of its inputs, each multiplied by the corresponding weight, plus the bias; this is the summing junction. The result of the summing junction is then passed through the activation function, which squashes the output as depicted in the following figure:
Artificial neuron model
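The summing junction and the squashing step can be sketched in a few lines of plain Python. This is only an illustration: the function names, the example vectors, and the choice of the logistic sigmoid as the activation function are assumptions made here, not something the text prescribes.

```python
import math

def sigmoid(z):
    """Logistic activation (an assumed choice): squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(x, w, b=0.0, activation=sigmoid):
    """Weighted sum of the inputs plus the bias, passed through the activation."""
    net = sum(w_j * x_j for w_j, x_j in zip(w, x)) + b  # summing junction
    return activation(net)

x = [0.5, -1.0, 2.0]   # hypothetical input vector
w = [0.4, 0.3, -0.2]   # hypothetical weight vector

no_bias = neuron_output(x, w)           # summing junction gives about -0.5
with_bias = neuron_output(x, w, b=1.0)  # the bias moves it to about +0.5
```

With these values the bias moves the summing-junction result from one side of the sigmoid's midpoint to the other, which is the shift of the activation function described above.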
Now, a tricky question: how do we initialize the weights? If we initialize all weights to the same value (for example, 0 or 1), each hidden neuron will get exactly the same signal, so all of them compute the same output and receive the same update, and the network cannot learn distinct features. Let's break it down:
- If all weights are initialized to 1, then each unit gets a signal equal to the sum of the inputs
- If all weights are 0, which is even worse, every neuron in a hidden layer will get zero signal
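The two cases above can be checked with a small sketch. The helper `hidden_signals`, the layer size, and the input values are hypothetical and chosen only for illustration.

```python
def hidden_signals(x, weights):
    """Signal received by each hidden unit; weights[i] is the weight vector of unit i."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(w_i, x)) for w_i in weights]

x = [0.2, 0.7, -0.4]   # hypothetical input vector
n_hidden = 4           # hypothetical number of hidden units

all_ones = [[1.0] * len(x) for _ in range(n_hidden)]
all_zeros = [[0.0] * len(x) for _ in range(n_hidden)]

signals_ones = hidden_signals(x, all_ones)    # every unit receives sum(x)
signals_zeros = hidden_signals(x, all_zeros)  # every unit receives zero
```

In both cases every entry of the result is identical, so the hidden units are indistinguishable from one another.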
For network weight initialization, Xavier initialization is now widely used. It is similar to random initialization but often works much better, since it automatically sets the scale of the initial weights based on the number of input and output neurons of a layer.
Interested readers should refer to this publication for details: Xavier Glorot and Yoshua Bengio, Understanding the difficulty of training deep feedforward neural networks, Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS) 2010, Chia Laguna Resort, Sardinia, Italy; Volume 9 of JMLR: W&CP.
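As a rough sketch of how the scale depends on the layer size, the uniform variant proposed in that paper draws each weight from the interval [-limit, limit], with limit = sqrt(6 / (fan_in + fan_out)). The function name `xavier_uniform`, the layer sizes, and the seed below are assumptions made for illustration; deep learning libraries ship their own implementations.

```python
import math
import random

def xavier_uniform(fan_in, fan_out, rng=random):
    """Sample a fan_out x fan_in weight matrix from U(-limit, limit)."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_in)]
            for _ in range(fan_out)]

rng = random.Random(0)   # fixed seed so the sketch is repeatable
W = xavier_uniform(fan_in=256, fan_out=128, rng=rng)
limit = math.sqrt(6.0 / (256 + 128))   # 0.125 for this layer size
```

A wider layer gives a smaller limit, so the variance of the weights, 2 / (fan_in + fan_out), shrinks as the number of connections grows.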
You may be wondering whether you can do better than plain random initialization when training a regular DNN (for example, an MLP or a DBN). Recently, some researchers have reported that random orthogonal matrix initialization performs better than arbitrary random initialization for training DNNs.
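One way to obtain such a matrix, shown here only as a minimal sketch, is to orthonormalize Gaussian random vectors with the Gram-Schmidt procedure. The function name `random_orthogonal`, the matrix size, and the seed are hypothetical; in practice a QR or SVD routine from a numerical library would normally be used instead.

```python
import math
import random

def random_orthogonal(n, rng=random):
    """Return an n x n matrix with orthonormal rows, built by Gram-Schmidt
    on Gaussian random vectors."""
    rows = []
    while len(rows) < n:
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # Remove the components along the rows accepted so far.
        for q in rows:
            proj = sum(a * b for a, b in zip(v, q))
            v = [a - proj * b for a, b in zip(v, q)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-10:   # resample in the unlikely degenerate case
            rows.append([a / norm for a in v])
    return rows

rng = random.Random(42)   # fixed seed so the sketch is repeatable
W = random_orthogonal(4, rng)

# W times its transpose should be (numerically) the identity matrix.
gram = [[sum(a * b for a, b in zip(r1, r2)) for r2 in W] for r1 in W]
```

Because the rows are orthonormal, multiplying a vector by W preserves its length.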
When it comes to initializing the biases, we can initialize them to zero. Setting all biases to a small constant value such as 0.01 instead ensures that all Rectified Linear Unit (ReLU) units fire at the start and can therefore propagate some gradient. However, in practice this neither performs well nor shows consistent improvement. Therefore, sticking with zero is recommended.