By building a handwritten digit recognizer as a Java application, we will put into practice most of the techniques and optimizations learned so far. The application is built with the open source Java framework Deeplearning4j. The dataset is the classic MNIST database of handwritten digits (http://yann.lecun.com/exdb/mnist/). The training set is large, containing 60,000 images, while the test set contains 10,000 images. Each image is 28 x 28 pixels in grayscale.
As a part of the application that we will be creating in this section, we will implement a graphical user interface, where you can draw digits and get a neural network to recognize the digit.
Jumping straight into the code, let's observe how to implement a neural network in Java. We begin with the parameters; the first one is the output. Since we have 0 to 9 digits, we have 10 classes:
/**
* Number prediction classes.
* We have 0-9 digits so 10 classes in total.
*/
private static final int OUTPUT = 10;
We have the mini-batch size, which is the number of images we see before updating the weights or the number of images we'll process in parallel:
/**
* Mini batch gradient descent size or number of matrices processed in parallel.
* For a Core i7 CPU, 16 works well; for a GPU, increase to 128 or more.
*/
private static final int MINI_BATCH_SIZE = 16;
/**
* Number of total traverses through data.
* With 5 epochs we traverse the full training set 5 times; each epoch
* performs (training set size / MINI_BATCH_SIZE) weight updates.
*/
private static final int EPOCHS = 5;
For a CPU, a batch size of 16 is fine, but for GPUs, this should be increased according to the GPU's capacity. One epoch is completed when we have traversed all the training data once.
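As a rough sanity check of these numbers, the amount of weight updates implied by the constants can be computed directly. This is a standalone sketch; the figures are the standard MNIST sizes, not values read from Deeplearning4j:

```java
// One weight update happens per mini-batch, so an epoch performs
// (training set size / batch size) updates.
public class IterationMath {
    static int updatesPerEpoch(int trainingImages, int miniBatchSize) {
        return trainingImages / miniBatchSize;
    }

    static int totalUpdates(int trainingImages, int miniBatchSize, int epochs) {
        return updatesPerEpoch(trainingImages, miniBatchSize) * epochs;
    }

    public static void main(String[] args) {
        // 60,000 MNIST training images, batch size 16, 5 epochs.
        System.out.println(updatesPerEpoch(60_000, 16)); // 3750
        System.out.println(totalUpdates(60_000, 16, 5)); // 18750
    }
}
```

So with a batch size of 16, five epochs mean 18,750 weight updates in total.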
The learning rate is quite important: a very low value slows down learning, while too large a value can cause the neural network to diverge:
/**
* The alpha learning rate defining the size of step towards the minimum
*/
private static final double LEARNING_RATE = 0.01;
To understand this in detail, later in this section we will simulate a case of divergence by changing the learning rate. Fortunately, as part of this example, we need not handle the reading, transferring, or normalizing of the pixels from the MNIST dataset, nor do we need to concern ourselves with transforming the data into a one-dimensional vector to feed to the neural network. All of this is encapsulated and offered by Deeplearning4j.
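The divergence effect can be demonstrated without a neural network at all. The following sketch runs plain gradient descent on the one-dimensional cost f(w) = w², whose gradient is 2w; the two learning rates are illustrative only, not the ones used for MNIST:

```java
// Gradient descent on f(w) = w^2: a small learning rate converges
// towards the minimum at w = 0, while a large one makes |w| blow up.
public class LearningRateDemo {
    static double descend(double learningRate, int steps) {
        double w = 1.0; // arbitrary starting weight
        for (int i = 0; i < steps; i++) {
            double gradient = 2 * w;          // df/dw
            w = w - learningRate * gradient;  // standard update step
        }
        return Math.abs(w);
    }

    public static void main(String[] args) {
        // A small rate shrinks the weight towards the minimum...
        System.out.println(descend(0.01, 100) < 1.0); // true
        // ...while too large a rate overshoots further on every step.
        System.out.println(descend(1.1, 100) > 1.0);  // true
    }
}
```

Each step multiplies the weight by (1 - 2 * learningRate), so any rate above 1.0 makes the magnitude grow without bound here.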
Under the object dataset iterator, we need to specify the batch size and whether we are going to use it for training or testing, which will help classify whether we need to load 60,000 images from the training dataset, or 10,000 from the testing dataset:
public void train() throws Exception {
/*
Create an iterator using the batch size for one iteration
*/
log.info("Load data....");
DataSetIterator mnistTrain = new MnistDataSetIterator(MINI_BATCH_SIZE, true, SEED);
/*
Construct the neural network
*/
log.info("Build model....");
MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
.seed(SEED)
.learningRate(LEARNING_RATE)
.weightInit(WeightInit.XAVIER)
//NESTEROVS is referring to gradient descent with momentum
.updater(Updater.NESTEROVS)
.list()
Let's get started with building the neural network. We've specified the learning rate and initialized the weights according to the Xavier scheme, which we learned about in the previous sections. The updater in the code is the optimization algorithm used to update the weights via gradient descent; NESTEROVS is gradient descent with momentum, which we are already familiar with.
To understand the updater better, let's look at its two update formulas, which are no different from what we have already explored.
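The two formulas are the familiar momentum update: a velocity term accumulates past gradients, and the weight then moves along that velocity. Here is a minimal plain-Java sketch of those two lines on the toy cost J(w) = w²; the momentum coefficient 0.9 and the learning rate are illustrative values, not taken from the application's configuration:

```java
// Gradient descent with momentum, the update that NESTEROVS builds on.
public class MomentumDemo {
    // Runs `steps` momentum updates on J(w) = w^2 starting from w = 1
    // and returns the final weight.
    static double run(int steps) {
        double w = 1.0;      // weight
        double v = 0.0;      // velocity accumulating past gradients
        double alpha = 0.1;  // learning rate
        double mu = 0.9;     // momentum coefficient
        for (int i = 0; i < steps; i++) {
            double gradient = 2 * w;        // dJ/dw
            v = mu * v - alpha * gradient;  // formula 1: v <- mu*v - alpha*grad
            w = w + v;                      // formula 2: w <- w + v
        }
        // Nesterov's variant differs only in evaluating the gradient at
        // the look-ahead point w + mu*v instead of at w itself.
        return w;
    }

    public static void main(String[] args) {
        System.out.println(Math.abs(run(200)) < 1e-3); // true: near the minimum
    }
}
```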
We configure the input layer, the hidden layers, and the output layer. Configuring the input layer is quite easy; we just multiply the image width by its height to get the size of the one-dimensional input vector (28 x 28 = 784). The next step in the code is to define the hidden layers. We actually have two hidden layers: one with 128 neurons and one with 64 neurons, both using an activation function chosen for its high efficiency.
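Continuing the builder shown above, the layer definitions might look like the following sketch. This assumes ReLU for the hidden layers and Deeplearning4j's DenseLayer and OutputLayer builders; exact class and method names can vary between DL4j versions, so treat this as an outline rather than the application's definitive code:

```java
.layer(0, new DenseLayer.Builder()
        .nIn(28 * 28)          // 784-pixel input vector
        .nOut(128)             // first hidden layer
        .activation(Activation.RELU)
        .build())
.layer(1, new DenseLayer.Builder()
        .nIn(128)
        .nOut(64)              // second hidden layer
        .activation(Activation.RELU)
        .build())
.layer(2, new OutputLayer.Builder(LossFunctions.LossFunction.NEGATIVELOGLIKELIHOOD)
        .nIn(64)
        .nOut(OUTPUT)          // 10 digit classes
        .activation(Activation.SOFTMAX)
        .build())
.build();
```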
Just to switch things up a bit, we could try out different values, especially those reported on the MNIST dataset web page. Even so, the values chosen here are quite efficient, giving a short training time and good accuracy.
The output layer uses softmax, because we have ten classes rather than two, together with a cost function whose details may vary from what we have seen previously. This function measures how well the predicted values match the real labels.
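To make the output layer's behavior concrete, here is a plain-Java sketch of softmax together with the negative log likelihood cost it is typically paired with. This is a standalone illustration, not DL4j code:

```java
// Softmax turns raw scores into a probability distribution over the
// classes; the negative log likelihood then penalizes assigning low
// probability to the true class.
public class SoftmaxDemo {
    static double[] softmax(double[] scores) {
        double max = Double.NEGATIVE_INFINITY;
        for (double s : scores) max = Math.max(max, s); // numerical stability
        double sum = 0.0;
        double[] out = new double[scores.length];
        for (int i = 0; i < scores.length; i++) {
            out[i] = Math.exp(scores[i] - max);
            sum += out[i];
        }
        for (int i = 0; i < out.length; i++) out[i] /= sum;
        return out;
    }

    // Cost for a single example whose true class index is `label`.
    static double negativeLogLikelihood(double[] probabilities, int label) {
        return -Math.log(probabilities[label]);
    }

    public static void main(String[] args) {
        double[] p = softmax(new double[]{2.0, 1.0, 0.1});
        double sum = 0.0;
        for (double x : p) sum += x;
        System.out.println(Math.abs(sum - 1.0) < 1e-9); // probabilities sum to 1
        // The cost is low when the true class gets high probability and
        // grows as that probability shrinks.
        System.out.println(negativeLogLikelihood(p, 0) < negativeLogLikelihood(p, 2));
    }
}
```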
We then initialize the model and register a listener, as we want to see the cost function every 100 iterations. The call model.fit(mnistTrain) is very important, because it works iteration by iteration until it traverses all the data. After this, we have executed one epoch and the neural network has learned a good set of weights.