In this section, we will look at older and newer techniques for organizing data. We will also gain some intuition about the results our model may produce before it is ready for production. By the end of this section, we will have explored how neural networks can be implemented efficiently to achieve high performance.
Organizing data and applications
Organizing your data
Like any other machine learning model, a neural network depends on data. Previously, we used datasets containing 1,000 to 100,000 rows of data. Even when more data was available, the low computational power of our systems did not allow us to process data of that size efficiently.
We always begin by training our network, which means we in fact need a training dataset; traditionally, this consists of 60% of the total data. This is a very important step, as this is where the neural network learns the values of its weights from the data. The second phase is to see how well the network does with data that it has never seen before, which consists of 20% of the data; this dataset is known as the cross-validation dataset. The aim of this phase is to see how the network generalizes to data that it was not trained on.
Based on the performance in this phase, we can vary the hyperparameters to attain the best possible output, and this phase continues until we have achieved optimal performance. The remaining 20% of the data can then be used as the test dataset. The reason for having this dataset is to have a completely unbiased evaluation of the network: to understand how the network behaves toward data that it has not seen before and has not been optimized for.
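As a minimal sketch of this classic 60/20/20 split (the `split` helper, the plain-array data representation, and the lack of shuffling are simplifying assumptions of our own, not a library API), the organization could look like this:

```java
import java.util.Arrays;

public class DataSplit {

    // Splits rows into training, cross-validation, and test portions.
    // Ratios are fractions of the total; the test set takes the remainder.
    // In practice the rows should be shuffled before splitting.
    static double[][][] split(double[][] rows, double trainRatio, double devRatio) {
        int trainEnd = (int) (rows.length * trainRatio);
        int devEnd = trainEnd + (int) (rows.length * devRatio);
        return new double[][][] {
            Arrays.copyOfRange(rows, 0, trainEnd),        // training set
            Arrays.copyOfRange(rows, trainEnd, devEnd),   // cross-validation set
            Arrays.copyOfRange(rows, devEnd, rows.length) // test set
        };
    }

    public static void main(String[] args) {
        double[][] data = new double[1000][];
        for (int i = 0; i < data.length; i++) {
            data[i] = new double[] { i }; // dummy single-feature row
        }
        // The classic 60/20/20 split described above.
        double[][][] parts = split(data, 0.60, 0.20);
        System.out.printf("train=%d, dev=%d, test=%d%n",
                parts[0].length, parts[1].length, parts[2].length);
    }
}
```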
If we were to represent the organization of data described previously visually, it would look like the following diagram:
This configuration was well known and widely used until recent years, and multiple variations on the percentages allocated to each dataset existed. In the more recent era of deep learning, two things have changed substantially:
- Millions of rows of data are present in the datasets that we are currently using.
- The computational power of our processing systems has increased drastically because of advanced GPUs.
For these reasons, neural networks now have deeper, bigger, and more complex architectures. Here is how our data will now be organized:
The training dataset has increased immensely: observe that 96% of the data will be used for training, while 2% will be required for the development dataset and the remaining 2% for the test dataset. In fact, it is even possible to use 99% of the data to train the model and divide the remaining 1% between the development and the test datasets. In some cases, it is OK not to have a test dataset at all; the only time we need one is when we need a completely unbiased evaluation. Through the course of this chapter, we shall hardly use the test dataset.
Notice how the cross-validation dataset becomes the development dataset; its function does not change.
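Using the same hypothetical split helper from the earlier sketch, the modern allocation is just a change of ratios:

```java
// A 96/2/2 split for a large, modern dataset (same illustrative helper as before).
double[][][] parts = DataSplit.split(data, 0.96, 0.02);
```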
Bias and variance
During the training of the neural network, the model may exhibit various symptoms. One of them is high bias. This leads to a high error rate on our training dataset and, consequently, a similarly high error rate on the development dataset. What this tells us is that our network has not learned how to solve the problem or to find the pattern. Graphically, we can represent the model like this:
This graph depicts the decision boundary and the errors caused by the model, which misclassifies red dots as green squares and vice versa.
We may also have to worry about the high variance problem. Assume that the neural network does a great job during the training phase and figures out a really complex decision boundary, almost perfectly. But when the same model is evaluated on the development dataset, it performs poorly, with a high error rate and an output that looks no better than the high-bias error graph:
If we look at the previous graph, it appears that the neural network learned to be overly specific to the training dataset, and when it encounters examples it hasn't seen before, it doesn't know how to categorize them.
The final unfortunate case is when our neural network has both of these symptoms together. This is a case where we see a high error rate on the training dataset, and a double or even higher error rate on the development dataset. This can be depicted graphically as follows:
The first line is the decision boundary from the training dataset, and the second line is the decision boundary for the development dataset, which is even worse.
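To make these symptoms concrete, here is a small illustrative sketch of how training and development error rates are often read together; the `diagnose` method, the 0.15 threshold, and the factor of 2 are assumptions chosen for this example, not standard constants:

```java
public class BiasVarianceCheck {

    // Illustrative diagnosis from error rates; the 0.15 threshold and the
    // factor of 2 are arbitrary example values, not standard cutoffs.
    static String diagnose(double trainError, double devError) {
        boolean highBias = trainError > 0.15;             // poor fit on training data
        boolean highVariance = devError > 2 * trainError; // much worse on unseen data
        if (highBias && highVariance) return "high bias and high variance";
        if (highBias) return "high bias";
        if (highVariance) return "high variance";
        return "looks fine";
    }

    public static void main(String[] args) {
        System.out.println(diagnose(0.20, 0.22)); // high bias
        System.out.println(diagnose(0.02, 0.15)); // high variance
        System.out.println(diagnose(0.20, 0.45)); // both symptoms together
    }
}
```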
Computational model efficiency
Neural networks currently learn millions of weights, and millions of weights mean millions of multiplications. This makes it essential to find a highly efficient way to carry out these multiplications, and that is done by using matrices. The following diagram depicts how weights are placed in a matrix:
The weight matrix here has one row and four columns, and the inputs are in another matrix. These inputs can be the outputs of the previous hidden layer.
To find the output, we simply need to multiply these two matrices. This means that z1 is the multiplication of the weight row with the input column.
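As a minimal sketch of this row-times-column product (plain Java arrays; the class, method name, and sample values are illustrative assumptions):

```java
public class MatrixDemo {

    // z1 = w (1x4 row of weights) times x (4x1 column of inputs): one dot product.
    static double rowTimesColumn(double[] w, double[] x) {
        double z = 0.0;
        for (int i = 0; i < w.length; i++) {
            z += w[i] * x[i]; // accumulate weight * input pairwise
        }
        return z;
    }

    public static void main(String[] args) {
        // Example: z1 = 0.1*1 + 0.2*2 + 0.3*3 + 0.4*4 = 3.0
        double z1 = rowTimesColumn(new double[]{0.1, 0.2, 0.3, 0.4},
                                   new double[]{1.0, 2.0, 3.0, 4.0});
        System.out.println("z1 = " + z1);
    }
}
```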
To make it more complex, let us vary our neural network to have one more hidden layer.
Having a new hidden layer will change our matrix as well: all the weights of hidden layer 2 are added as a second row to the matrix. The value z2 is the multiplication of this second row of the matrix with the column containing the input values:
Notice now how z1 and z2 can actually be calculated in parallel, because they have no dependencies on each other; the multiplication of the first row with the inputs column is not dependent on the multiplication of the second row with the inputs column.
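Here is a hedged sketch of that row-level parallelism using Java parallel streams (illustrative only; a production system would delegate this to an optimized linear algebra library or a GPU):

```java
import java.util.stream.IntStream;

public class ParallelLayerDemo {

    // Each output z[i] is row i of W times the input column x; the rows are
    // independent, so their dot products can be computed in parallel.
    static double[] matrixTimesColumn(double[][] w, double[] x) {
        return IntStream.range(0, w.length)
                .parallel() // row products have no dependencies on each other
                .mapToDouble(i -> {
                    double z = 0.0;
                    for (int j = 0; j < x.length; j++) {
                        z += w[i][j] * x[j];
                    }
                    return z;
                })
                .toArray();
    }

    public static void main(String[] args) {
        double[][] w = { {0.1, 0.2, 0.3, 0.4},   // row producing z1
                         {0.5, 0.6, 0.7, 0.8} }; // row producing z2
        double[] x = {1.0, 2.0, 3.0, 4.0};
        double[] z = matrixTimesColumn(w, x);
        System.out.println("z1 = " + z[0] + ", z2 = " + z[1]);
    }
}
```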
To make this more complex, we can feed in more sets of input examples, which affect the input matrix as follows:
We now have four sets of inputs, and we can actually calculate each of the outputs in parallel. Consider z11, which is the result of the multiplication of the first row of weights with the first input column, while z22 is the multiplication of the second row of weights with the second column of inputs.
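Extending the earlier sketch to several examples turns this into an ordinary matrix-matrix product, where every entry z[i][k] (weight row i, example column k) is an independent dot product; again, this is a sketch under our own naming assumptions:

```java
public class BatchDemo {

    // Z = W (2x4) times X (4 x numExamples): every entry z[i][k] is an
    // independent dot product of weight row i with input column k.
    static double[][] matrixTimesMatrix(double[][] w, double[][] x) {
        int rows = w.length, cols = x[0].length, inner = x.length;
        double[][] z = new double[rows][cols];
        for (int i = 0; i < rows; i++) {
            for (int k = 0; k < cols; k++) {
                for (int j = 0; j < inner; j++) {
                    z[i][k] += w[i][j] * x[j][k];
                }
            }
        }
        return z;
    }

    public static void main(String[] args) {
        double[][] w = { {0.1, 0.2, 0.3, 0.4},
                         {0.5, 0.6, 0.7, 0.8} };
        double[][] x = { {1, 0, 2, 1},   // each column is one input example
                         {2, 1, 0, 1},
                         {3, 0, 1, 1},
                         {4, 1, 0, 1} };
        double[][] z = matrixTimesMatrix(w, x);
        System.out.println("z11 = " + z[0][0] + ", z22 = " + z[1][1]);
    }
}
```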
In standard computers, we can currently have 16 of these operations carried out in parallel. But the biggest gain comes when we use GPUs, because GPUs enable us to execute from 100 to 1,000 of these operations in parallel. One of the reasons deep learning has taken off recently is the really great computational power that GPUs offer.