In this article, we will see how convolutional layers work and how to use them. We will also see how you can build your own convolutional neural network in Keras to create deeper, more powerful networks and solve computer vision problems, and how to improve that network using data augmentation.
For a better understanding of the concepts, we will work with a well-known dataset, CIFAR-10. This dataset was created by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton.
The following article has been taken from the book Deep Learning Quick Reference, written by Mike Bernico.
The CIFAR-10 dataset is made up of 60,000 32 x 32 color images that belong to 10 classes, with 6,000 images per class. We'll be using 50,000 images as a training set, 5,000 images as a validation set, and 5,000 images as a test set.
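Later code in this article refers to a data dictionary with keys such as train_X, train_y, val_X, and val_y. That dictionary isn't built in this excerpt, so here is a minimal sketch of how such a split might be loaded with keras.datasets; the load_data helper, the scaling, and the exact 5,000/5,000 split of the test images are my assumptions, not the book's code:

from keras.datasets import cifar10
from keras.utils import to_categorical

def load_data():
    # load CIFAR-10 and split its 10,000 test images into validation and test halves
    (train_X, train_y), (test_X, test_y) = cifar10.load_data()
    train_X = train_X.astype("float32") / 255.0
    test_X = test_X.astype("float32") / 255.0
    train_y = to_categorical(train_y, 10)
    test_y = to_categorical(test_y, 10)
    val_X, val_y = test_X[:5000], test_y[:5000]
    test_X, test_y = test_X[5000:], test_y[5000:]
    return {"train_X": train_X, "train_y": train_y,
            "val_X": val_X, "val_y": val_y,
            "test_X": test_X, "test_y": test_y}

data = load_data()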
The input tensor for the convolutional neural network will have shape (N, 32, 32, 3), which we will pass to the build_network function. The following code is used to build the network:
def build_network(num_gpu=1, input_shape=None):
    inputs = Input(shape=input_shape, name="input")
The output of this model will be a class prediction, from 0 to 9, so we will use a 10-node softmax layer. We will use the following code to define the output:
output = Dense(10, activation="softmax", name="softmax")(d2)
Earlier, we used categorical cross-entropy as the loss function for a multi-class classifier. This is just another multi-class classifier, so we can continue using categorical cross-entropy as our loss function and accuracy as a metric. We've moved on to using images as input, but luckily our cost function and metrics remain unchanged.
We're going to use two convolutional layers, with batch normalization, and max pooling. This is going to require us to make quite a few choices, which of course we could choose to search as hyperparameters later. It's always better to get something working first though. As the popular computer scientist and mathematician Donald Knuth would say, premature optimization is the root of all evil. We will use the following code snippet to define the two convolutional blocks:
# convolutional block 1
conv1 = Conv2D(64, kernel_size=(3,3), activation="relu", name="conv_1")(inputs)
batch1 = BatchNormalization(name="batch_norm_1")(conv1)
pool1 = MaxPooling2D(pool_size=(2, 2), name="pool_1")(batch1)
# convolutional block 2
conv2 = Conv2D(32, kernel_size=(3,3), activation="relu", name="conv_2")(pool1)
batch2 = BatchNormalization(name="batch_norm_2")(conv2)
pool2 = MaxPooling2D(pool_size=(2, 2), name="pool_2")(batch2)
So, clearly, we have two convolutional blocks here, each consisting of a convolutional layer, a batch normalization layer, and a pooling layer.
In the first block, I'm using 64 3 x 3 filters with relu activations, valid (no) padding, and a stride of 1. Batch normalization doesn't require us to choose any hyperparameters here (its scale and shift parameters are learned during training). The pooling layer uses 2 x 2 pooling windows, valid padding, and a stride of 2 (the dimension of the window).
The second block is very much the same; however, I'm halving the number of filters to 32.
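To make those choices visible in code, here is the first block again with the Keras defaults for padding and strides spelled out explicitly. This is equivalent to conv_1 and pool_1 above; the _explicit names are just for illustration:

conv1_explicit = Conv2D(64, kernel_size=(3, 3), strides=(1, 1), padding="valid",
                        activation="relu", name="conv_1_explicit")(inputs)
pool1_explicit = MaxPooling2D(pool_size=(2, 2), strides=(2, 2), padding="valid",
                              name="pool_1_explicit")(conv1_explicit)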
While there are many knobs we could turn in this architecture, the one I would tune first is the kernel size of the convolutions. Kernel size tends to be an important choice. In fact, some modern neural network architectures, such as Google's Inception, allow us to use multiple filter sizes in the same convolutional block.
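For illustration, here is a rough sketch of an Inception-style block, in which the same input passes through 1 x 1, 3 x 3, and 5 x 5 convolutions and the results are concatenated along the channel axis. This block is not part of the model in this article; the function name and filter counts are my own:

from keras.layers import Conv2D, Concatenate

def naive_inception_block(x, filters=32):
    # apply several kernel sizes to the same input in parallel
    conv_1x1 = Conv2D(filters, kernel_size=(1, 1), padding="same", activation="relu")(x)
    conv_3x3 = Conv2D(filters, kernel_size=(3, 3), padding="same", activation="relu")(x)
    conv_5x5 = Conv2D(filters, kernel_size=(5, 5), padding="same", activation="relu")(x)
    # stack the feature maps along the channel dimension
    return Concatenate(axis=-1)([conv_1x1, conv_3x3, conv_5x5])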
After two rounds of convolution and pooling, our tensors have gotten relatively small and deep. After pool_2, the output dimension is (n, 6, 6, 32).
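If you'd like to verify those shapes yourself, one way (assuming inputs and pool2 from the blocks above) is to build a temporary model ending at pool_2 and print its summary:

from keras.models import Model

shape_check = Model(inputs=inputs, outputs=pool2)
shape_check.summary()
# conv_1:  (None, 30, 30, 64)  - a 3x3 valid convolution shrinks 32 to 30
# pool_1:  (None, 15, 15, 64)  - 2x2 pooling halves each spatial dimension
# conv_2:  (None, 13, 13, 32)
# pool_2:  (None, 6, 6, 32)    - 13 // 2 = 6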
In these convolutional layers, we have hopefully extracted the relevant image features, which this 6 x 6 x 32 tensor now represents. To classify images using these features, we will connect this tensor to a few fully connected layers before we go to our final output layer.
In this example, I'll use a 512-neuron fully connected layer, a 256-neuron fully connected layer, and finally, the 10-neuron output layer. I'll also be using dropout to help prevent overfitting, but only a very little bit! The code for this process is given as follows for your reference:
from keras.layers import Flatten, Dense, Dropout

# fully connected layers
flatten = Flatten()(pool2)
fc1 = Dense(512, activation="relu", name="fc1")(flatten)
d1 = Dropout(rate=0.2, name="dropout1")(fc1)
fc2 = Dense(256, activation="relu", name="fc2")(d1)
d2 = Dropout(rate=0.2, name="dropout2")(fc2)
One layer I haven't mentioned before is the Flatten layer. It does exactly what its name suggests: it flattens the n x 6 x 6 x 32 tensor into an n x 1152 vector, which serves as the input to the fully connected layers.
Many cloud computing platforms can provision instances that include multiple GPUs. As our models grow in size and complexity you might want to be able to parallelize the workload across multiple GPUs. This can be a somewhat involved process in native TensorFlow, but in Keras, it's just a function call.
Build your model, as normal, as shown in the following code:
model = Model(inputs=inputs, outputs=output)
Then, we just pass that model to keras.utils.multi_gpu_model, with the help of the following code:
model = multi_gpu_model(model, num_gpu)
In this example, num_gpu is the number of GPUs we want to use.
Putting the model together, and incorporating our new cool multi-GPU feature, we come up with the following architecture:
from keras.layers import Input, Conv2D, BatchNormalization, MaxPooling2D, Flatten, Dense, Dropout
from keras.models import Model
from keras.utils import multi_gpu_model

def build_network(num_gpu=1, input_shape=None):
    inputs = Input(shape=input_shape, name="input")

    # convolutional block 1
    conv1 = Conv2D(64, kernel_size=(3, 3), activation="relu", name="conv_1")(inputs)
    batch1 = BatchNormalization(name="batch_norm_1")(conv1)
    pool1 = MaxPooling2D(pool_size=(2, 2), name="pool_1")(batch1)

    # convolutional block 2
    conv2 = Conv2D(32, kernel_size=(3, 3), activation="relu", name="conv_2")(pool1)
    batch2 = BatchNormalization(name="batch_norm_2")(conv2)
    pool2 = MaxPooling2D(pool_size=(2, 2), name="pool_2")(batch2)

    # fully connected layers
    flatten = Flatten()(pool2)
    fc1 = Dense(512, activation="relu", name="fc1")(flatten)
    d1 = Dropout(rate=0.2, name="dropout1")(fc1)
    fc2 = Dense(256, activation="relu", name="fc2")(d1)
    d2 = Dropout(rate=0.2, name="dropout2")(fc2)

    # output layer
    output = Dense(10, activation="softmax", name="softmax")(d2)

    # finalize and compile
    model = Model(inputs=inputs, outputs=output)
    if num_gpu > 1:
        model = multi_gpu_model(model, num_gpu)
    model.compile(optimizer='adam', loss='categorical_crossentropy',
                  metrics=["accuracy"])
    return model
We can use this to build our model:
model = build_network(num_gpu=1, input_shape=(IMG_HEIGHT, IMG_WIDTH, CHANNELS))
And then we can fit it, as you'd expect:
model.fit(x=data["train_X"], y=data["train_y"],
          batch_size=32,
          epochs=200,
          validation_data=(data["val_X"], data["val_y"]),
          verbose=1,
          callbacks=callbacks)
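The callbacks list passed above isn't defined in this excerpt. A minimal setup might look like the following sketch, which logs training curves for TensorBoard and checkpoints the best weights; the log directory, filename pattern, and choice of callbacks are my own assumptions, not necessarily the book's configuration:

from keras.callbacks import TensorBoard, ModelCheckpoint

def create_callbacks():
    # write training metrics for TensorBoard and save the best model seen so far
    tensorboard = TensorBoard(log_dir="./logs", histogram_freq=0, write_graph=True)
    checkpoint = ModelCheckpoint(filepath="./model-weights.{epoch:02d}-{val_loss:.4f}.hdf5",
                                 monitor="val_loss", verbose=0, save_best_only=True)
    return [tensorboard, checkpoint]

callbacks = create_callbacks()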
As you train this model, you will notice that overfitting is an immediate concern. Even with a relatively modest two convolutional layers, we're already overfitting a bit.
You can see the effects of overfitting from the following graphs:
It's no surprise: 50,000 observations is not a lot of data, especially for a computer vision problem. In practice, computer vision problems benefit from very large datasets. In fact, Chen Sun et al. showed that additional data tends to help computer vision models linearly with the log of the data volume (https://arxiv.org/abs/1707.02968). Unfortunately, we can't really go find more data in this case. But maybe we can make some. Let's talk about data augmentation next.
Data augmentation is a technique where we apply transformations to an image and use both the original image and the transformed images to train on. Imagine we had a training set with a cat in it:
If we were to apply a horizontal flip to this image, we'd get something that looks like this:
This is exactly the same image, of course, but we can use both the original and the transformed version as training examples. This isn't quite as good as having two separate cats in our training set; however, it does allow us to teach the computer that a cat is a cat regardless of the direction it's facing.
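A horizontal flip is simply a reversal along the width axis of the image array. Here is a minimal NumPy sketch; the cat array below is a random stand-in for an image, not data from CIFAR-10:

import numpy as np

# a hypothetical 32 x 32 RGB image as a (height, width, channels) array
cat = np.random.randint(0, 256, size=(32, 32, 3), dtype=np.uint8)

# reverse the width axis to flip the image horizontally
flipped_cat = cat[:, ::-1, :]
# equivalently: flipped_cat = np.fliplr(cat)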
In practice, we can do a lot more than just a horizontal flip. We can vertically flip, when it makes sense, shift, and randomly rotate images as well. This allows us to artificially amplify our dataset and make it seem bigger than it is. Of course, you can only push this so far, but it's a very powerful tool in the fight against overfitting when little data exists.
Not so long ago, the only way to do image augmentation was to code up the transforms and apply them randomly to the training set, saving the transformed images to disk as we went (uphill, both ways, in the snow). Luckily for us, Keras now provides an ImageDataGenerator class that can apply transformations on the fly as we train, without having to hand code the transformations.
We can create a data generator object from ImageDataGenerator by instantiating it like this:
from keras.preprocessing.image import ImageDataGenerator

def create_datagen(train_X):
    data_generator = ImageDataGenerator(
        rotation_range=20,
        width_shift_range=0.02,
        height_shift_range=0.02,
        horizontal_flip=True)
    data_generator.fit(train_X)
    return data_generator
In this example, I'm using width and height shifts, rotation, and horizontal flips. I'm using only very small shifts; through experimentation, I found that larger shifts were too much and my network wasn't actually able to learn anything. Your experience will vary as your problem does, but I would expect larger images to be more tolerant of shifting. In this case, we're using 32-pixel images, which are quite small.
If you haven't used a generator before, it works like an iterator. The ImageDataGenerator's .flow() method returns an iterator that produces a new training minibatch on every step, with random transformations applied to the images it was fed.
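To see what the generator produces, you can pull a single batch from it by hand. This sketch assumes the data dictionary and create_datagen function from earlier in this article:

data_generator = create_datagen(data["train_X"])
batch_iterator = data_generator.flow(data["train_X"], data["train_y"], batch_size=32)

batch_X, batch_y = next(batch_iterator)  # one randomly transformed minibatch
print(batch_X.shape, batch_y.shape)      # (32, 32, 32, 3) (32, 10)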
The Keras Model class comes with a .fit_generator() method that allows us to fit with a generator rather than a given dataset:
model.fit_generator(data_generator.flow(data["train_X"], data["train_y"], batch_size=32),
                    steps_per_epoch=len(data["train_X"]) // 32,
                    epochs=200,
                    validation_data=(data["val_X"], data["val_y"]),
                    verbose=1,
                    callbacks=callbacks)
Here, we've replaced the traditional x and y parameters with the generator. Most importantly, notice the steps_per_epoch parameter. You can sample with replacement any number of times from the training set, and you can apply random transformations each time. This means that we can use more minibatches each epoch than we have data. Here, I'm sampling only enough batches per epoch to cover the training set once, but that isn't required. We can and should push this number higher if we can, as sketched below.
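For example, to draw twice as many augmented minibatches per epoch, you could double steps_per_epoch and pass it to fit_generator as before. This is an illustrative tweak of mine, not a setting from the original text:

# each "epoch" now sees roughly two augmented passes over the training set
steps_per_epoch = 2 * (len(data["train_X"]) // 32)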
Before we wrap things up, let's look at how beneficial image augmentation is in this case:
As you can see, just a little bit of image augmentation really helped us out. Not only is our overall accuracy higher, but our network is also overfitting much more slowly. If you have a computer vision problem with just a little bit of data, image augmentation is something you'll want to do.
We saw the benefits and ease of training a convolutional neural network from scratch using Keras and then improving that network using data augmentation.
If you found the above article to be useful, make sure you check out the book Deep Learning Quick Reference for more information on modeling and training various different types of deep neural networks with ease and efficiency.