An image can be considered a combination of style and content. The artistic style transfer technique transforms an image so that it looks like a painting rendered in a specific painting style. We will see how to code this idea up. The loss function compares the generated image with the content of the photo and the style of the painting. Hence, the optimization is carried out over the image pixels, rather than over the weights of the network. Two values are calculated: one by comparing the content of the photo with the generated image, and the other by comparing the style of the painting with the generated image.
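Conceptually, the optimization minimizes a weighted sum of these two terms, roughly total loss = alpha * content loss + beta * style loss, where alpha and beta control how strongly the generated image follows the photo's content and the painting's style. The concrete weights used later in this section are one instance of this idea.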
Since raw pixels are not a good choice for comparing content, we will use the CNN features of various layers, as they are a better representation of the content. The initial layers capture high-frequency detail such as edges, corners, and textures, whereas the later layers represent objects, and hence are better for content. The later layers can compare object to object better than a pixel-wise comparison can. But for this, we need to first import the required libraries, using the following code:
import numpy as np
from PIL import Image
from scipy.optimize import fmin_l_bfgs_b
from scipy.misc import imsave
from vgg16_avg import VGG16_Avg
from keras import metrics
from keras.models import Model
from keras import backend as K
Now, let's load the required image, using the following command:
content_image = Image.open(work_dir + 'bird_orig.png')
We will use the following image for this instance:
As we are using the VGG architecture for extracting the features, the mean of all the ImageNet images has to be subtracted from all the images, as shown in the following code:
imagenet_mean = np.array([123.68, 116.779, 103.939], dtype=np.float32)
def subtract_imagenet_mean(image):
    return (image - imagenet_mean)[:, :, :, ::-1]
Note that the channel order is different. The preprocess function takes the generated image, subtracts the mean, and then reverses the channels. The deprocess function reverses the effect of the preprocessing step, as shown in the following code:
def add_imagenet_mean(image, s):
    return np.clip(image.reshape(s)[:, :, :, ::-1] + imagenet_mean, 0, 255)
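To make the channel handling concrete, here is a minimal round-trip sketch; it simply reuses the bird image loaded above and is purely illustrative:
# Illustrative round trip: PIL image -> batched array -> preprocess -> deprocess
example = np.expand_dims(np.array(content_image), 0)
preprocessed = subtract_imagenet_mean(example)             # subtract the mean, RGB -> BGR
restored = add_imagenet_mean(preprocessed, example.shape)  # BGR -> RGB, add the mean back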
First, we will see how to create an image with the content of another image. This is a process of creating an image from random noise. The content used here is the activation of some layer of the network. We will minimize the difference between these activations for the random noise image and the content image, which is termed the content loss. This loss is similar to a pixel-wise loss but is applied on layer activations, and hence captures the content while leaving out the noise. Any CNN architecture can be used to do forward inference on the content image and the random noise. The activations are taken and the mean squared error is calculated by comparing the activations of these two outputs.
The pixels of the random image are updated while the CNN weights are frozen. We will freeze the VGG network for this case. Now, the VGG model can be loaded. Generated images are very sensitive to subsampling techniques such as max pooling, because the pixel values lost by max pooling cannot be recovered. Hence, average pooling is a smoother choice than max pooling.
The VGG16_Avg function, which builds the VGG model with average pooling, is used for loading the model, as shown here:
vgg_model = VGG16_Avg(include_top=False)
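If the vgg16_avg helper module is not available, a rough equivalent can be assembled from Keras's own pretrained VGG16 by swapping each max pooling layer for average pooling while reusing the ImageNet convolutional weights. The function name build_vgg16_avg and its details below are our own sketch of the idea, not the book's helper:
from keras.applications.vgg16 import VGG16
from keras.layers import Input, Conv2D, AveragePooling2D

def build_vgg16_avg(input_shape=(None, None, 3)):
    # Rebuild the convolutional base of VGG16, replacing max pooling with
    # average pooling, and copy the pretrained ImageNet weights across.
    pretrained = VGG16(include_top=False, weights='imagenet')
    inputs = Input(shape=input_shape)
    x = inputs
    for layer in pretrained.layers[1:]:
        if 'pool' in layer.name:
            x = AveragePooling2D((2, 2), strides=(2, 2), name=layer.name)(x)
        else:
            new_layer = Conv2D.from_config(layer.get_config())
            x = new_layer(x)
            new_layer.set_weights(layer.get_weights())
    return Model(inputs, x)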
Note that the weights of this model are the same as those of the original, even though the pooling type has been changed. The ResNet and Inception models are not well suited for this because they do not provide the same range of abstractions. We will take the activations from a late convolutional layer of the frozen VGG model, namely block5_conv1. This is the third convolutional layer from the end of the VGG, with a wide receptive field. The code for the same is given here for your reference:
content_layer = vgg_model.get_layer('block5_conv1').output
Now, a new model is created by truncating the VGG at the layer that gives good content features. Hence, the content image can now be loaded and passed through the model to get the activations of that layer. A TensorFlow variable is created to capture these activations, using the following code:
content_model = Model(vgg_model.input, content_layer)
content_image_array = subtract_imagenet_mean(np.expand_dims(np.array(content_image), 0))
content_image_shape = content_image_array.shape
target = K.variable(content_model.predict(content_image_array))
Let's define an evaluator class to compute the loss and gradients of the image. The following class returns the loss and gradient values at any point of the iteration. The loss and gradients are computed together in a single pass, and the gradients are cached so that they can be returned by a separate call, which is how the L-BFGS solver expects them:
class ConvexOptimiser(object):
    def __init__(self, cost_function, tensor_shape):
        self.cost_function = cost_function
        self.tensor_shape = tensor_shape
        self.gradient_values = None

    def loss(self, point):
        loss_value, self.gradient_values = self.cost_function([point.reshape(self.tensor_shape)])
        return loss_value.astype(np.float64)

    def gradients(self, point):
        return self.gradient_values.flatten().astype(np.float64)
The loss function can be defined as the mean squared error between the activation values at the chosen convolutional layer. The loss is computed between the activations of the generated image and those of the original content photo, as shown here:
mse_loss = metrics.mean_squared_error(content_layer, target)
The gradients of the loss with respect to the input of the model can be computed as shown:
grads = K.gradients(mse_loss, vgg_model.input)
The input to this function is the input of the model, and the output is the list of loss and gradient values, as shown:
cost_function = K.function([vgg_model.input], [mse_loss]+grads)
The loss is deterministic with respect to the image pixels, and hence a stochastic optimizer such as SGD is not required:
optimiser = ConvexOptimiser(cost_function, content_image_shape)
This loss can be minimized with a standard deterministic solver. We will also save the image at every step of the iteration. The routine is defined in such a way that the gradients are accessible, as we are using SciPy's L-BFGS optimizer (fmin_l_bfgs_b, imported earlier) for the final optimization. The optimization routine can be defined using the following code:
def optimise(optimiser, iterations, point, tensor_shape, file_name):
    for i in range(iterations):
        point, min_val, info = fmin_l_bfgs_b(optimiser.loss, point.flatten(),
                                             fprime=optimiser.gradients, maxfun=20)
        point = np.clip(point, -127, 127)
        print('Current loss value:', min_val)
        imsave(work_dir + 'gen_' + file_name + '_{}.png'.format(i),
               add_imagenet_mean(point.copy(), tensor_shape)[0])
    return point
The solver takes the loss function, the starting point, and the gradient function, and returns the updated point. A random image needs to be generated as a starting point so that the content loss can be minimized from it, using the following code:
def generate_rand_img(shape):
    return np.random.uniform(-2.5, 2.5, shape)
generated_image = generate_rand_img(content_image_shape)
Here is the random image that is created:
The optimization can be run for 10 iterations to see the results, as shown:
iterations = 10
generated_image = optimise(optimiser, iterations, generated_image,
content_image_shape, 'content')
If everything goes well, the loss should print as shown here, over the iterations:
Current loss value: 73.2010421753
Current loss value: 22.7840042114
Current loss value: 12.6585302353
Current loss value: 8.53817081451
Current loss value: 6.64649534225
Current loss value: 5.56395864487
Current loss value: 4.83072710037
Current loss value: 4.32800722122
Current loss value: 3.94804215431
Current loss value: 3.66387653351
Here is the image that is generated; it now almost looks like a bird. The optimization can be run for further iterations to refine it:

The optimizer took the random image and updated its pixels so that the content matches that of the original. Though the result is not perfect, it reproduces the content of the image to a reasonable extent. The images produced across the iterations give a good intuition of how the image is generated. There is no batching involved in this process. In the next section, we will see how to create an image in the style of a painting.
After creating an image that has the content of the original image, we will see how to create an image with just the style. Style can be thought of as a mix of colour and texture of an image. For that purpose, we will define style loss. First, we will load the image and convert it to an array, as shown in the following code:
style_image = Image.open(work_dir + 'starry_night.png')
style_image = style_image.resize(np.divide(style_image.size,
3.5).astype('int32'))
Here is the style image we have loaded:
Now, we will preprocess this image by changing the channels, using the following code:
style_image_array = subtract_imagenet_mean(np.expand_dims(style_image,
0)[:, :, :, :3])
style_image_shape = style_image_array.shape
For computing the style loss, we will consider the activations of several layers. First, a new VGG model is created with the shape of the style image, and its layer outputs are collected by name, as shown in the following code:
model = VGG16_Avg(include_top=False, input_shape=style_image_shape[1:])
outputs = {l.name: l.output for l in model.layers}
Now, we will take the first convolutional layer of the first few blocks as an array of outputs, using the following code:
layers = [outputs['block{}_conv1'.format(o)] for o in range(1,3)]
A new model is now created that outputs all those layers, and the target variables are assigned from the style image's activations, using the following code:
layers_model = Model(model.input, layers)
targs = [K.variable(o) for o in layers_model.predict(style_image_array)]
Style loss is calculated using the Gram matrix. The Gram matrix is the product of a matrix and its transpose. The activation values are simply transposed and multiplied. This matrix is then used for computing the error between the style and random images. The Gram matrix loses the location information but will preserve the texture information. We will define the Gram matrix using the following code:
def grammian_matrix(matrix):
    flattened_matrix = K.batch_flatten(K.permute_dimensions(matrix, (2, 0, 1)))
    matrix_transpose_dot = K.dot(flattened_matrix, K.transpose(flattened_matrix))
    element_count = matrix.get_shape().num_elements()
    return matrix_transpose_dot / element_count
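To build some intuition for what this computes, here is a small NumPy-only illustration of the same operation on a toy activation map; the numbers are random and purely illustrative:
# Toy Gram matrix: channel-by-channel correlations, with spatial positions flattened away
toy_activations = np.random.rand(4, 4, 3)                      # height, width, channels
flattened = toy_activations.reshape(-1, 3).T                    # channels x (height * width)
toy_gram = flattened.dot(flattened.T) / toy_activations.size    # 3 x 3 matrix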
The Gram matrix is a measure of the correlation between pairs of feature channels. The height and width dimensions are flattened out, so no local information is included, as the coordinate information is disregarded. The style loss computes the mean squared error between the Gram matrices of the input image and the target, as shown in the following code:
def style_mse_loss(x, y):
    return metrics.mse(grammian_matrix(x), grammian_matrix(y))
Now, let's compute the total style loss by summing the losses over the various layers, using the following code:
style_loss = sum(style_mse_loss(l1[0], l2[0]) for l1, l2 in zip(layers, targs))
grads = K.gradients(style_loss, model.input)
style_fn = K.function([model.input], [style_loss] + grads)
optimiser = ConvexOptimiser(style_fn, style_image_shape)
We then solve it the same way we did before, by creating a random image. But this time, we will also apply a Gaussian filter to smooth the random noise along the spatial axes (the sigma values used here are a reasonable choice), as shown in the following code:
from scipy.ndimage.filters import gaussian_filter

generated_image = generate_rand_img(style_image_shape)
generated_image = gaussian_filter(generated_image, [0, 2, 2, 0])
The random image generated will look like this:
The optimization can be run for 10 iterations to see the results, as shown below:
generated_image = optimise(optimiser, iterations, generated_image, style_image_shape, 'style')
If everything goes well, the solver should print the loss values similar to the following:
Current loss value: 5462.45556641
Current loss value: 189.738555908
Current loss value: 82.4192581177
Current loss value: 55.6530838013
Current loss value: 37.215713501
Current loss value: 24.4533748627
Current loss value: 15.5914745331
Current loss value: 10.9425945282
Current loss value: 7.66888141632
Current loss value: 5.84042310715
Here is the image that is generated:
Here, from random noise, we have created an image with a particular painting style, without using any location information. In the next section, we will see how to combine both the content and style losses.
Now we know how to reconstruct an image, as well as how to construct an image that captures the style of an original image. The obvious idea may be to just combine these two approaches by weighting and adding the two loss functions, as shown in the following code:
w, h = style_image.size
src = content_image_array[:, :h, :w]
Like before, we will grab a sequence of layer outputs to compute the style loss. However, we still need only one layer output to compute the content loss. How do we know which layer to grab? As we discussed earlier, the lower the layer, the more exact the content reconstruction will be. When merging content reconstruction with style, a looser reconstruction of the content leaves more room for the style to take effect. Furthermore, a later layer ensures that the image still looks like the same subject, even if it doesn't have the same details. The following code is used for this process:
style_layers = [outputs['block{}_conv2'.format(o)] for o in range(1,6)]
content_name = 'block4_conv2'
content_layer = outputs[content_name]
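As a quick, purely illustrative experiment, an earlier layer could be tried for the content instead; the variable name and layer choice below are hypothetical and not the setting used for the results that follow:
# Illustrative alternative: an earlier layer keeps more of the photo's exact detail
alternative_content_layer = outputs['block3_conv2']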
Now, a separate model for style is created with required output layers, using the following code:
style_model = Model(model.input, style_layers)
style_targs = [K.variable(o) for o in style_model.predict(style_image_array)]
We will also create another model for the content with the content layer, using the
following code:
content_model = Model(model.input, content_layer)
content_targ = K.variable(content_model.predict(src))
Now, merging the two approaches is as simple as merging their respective loss functions. Note that, as opposed to our previous setups, we are now dealing with three images: the content image, the style image, and the generated image whose pixels are being optimized.
One way for us to tune how the reconstructions mix is by changing the factor on the content
loss, which we have here as 1/10. If we increase that denominator, the style will have a
larger effect on the image, and if it's too large, the original content of the image will be
obscured by an unstructured style. Likewise, if it is too small then the image will not have
enough style. We will use the following code for this process:
style_wgts = [0.05,0.2,0.2,0.25,0.3]
The loss function takes both style and content layers, as shown here:
loss = sum(style_mse_loss(l1[0], l2[0]) * w
           for l1, l2, w in zip(style_layers, style_targs, style_wgts))
loss += metrics.mse(content_layer, content_targ) / 10
grads = K.gradients(loss, model.input)
transfer_fn = K.function([model.input], [loss] + grads)
optimiser = ConvexOptimiser(transfer_fn, src.shape)
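To make the mixing factor easier to experiment with, the combined loss could be wrapped in a small helper that exposes the content weight as a parameter. The helper name build_transfer_loss and its default value are purely illustrative, not from the book:
def build_transfer_loss(content_weight=0.1):
    # A smaller content_weight lets the style dominate; a larger one keeps more of the photo
    total = sum(style_mse_loss(l1[0], l2[0]) * w
                for l1, l2, w in zip(style_layers, style_targs, style_wgts))
    total += metrics.mse(content_layer, content_targ) * content_weight
    return total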
We will run the solver for 10 iterations as before, using the following code:
iterations = 10
generated_image = generate_rand_img(src.shape)
generated_image = optimise(optimiser, iterations, generated_image, src.shape, 'transfer')
The loss values should be printed as shown here:
Current loss value: 2557.953125
Current loss value: 732.533630371
Current loss value: 488.321166992
Current loss value: 385.827178955
Current loss value: 330.915924072
Current loss value: 293.238189697
Current loss value: 262.066864014
Current loss value: 239.34185791
Current loss value: 218.086700439
Current loss value: 203.045211792
These results are remarkable. Each one of them does a fantastic job of recreating the original image in the style of the artist. The generated image will look like the following:
We will now conclude the style transfer section. This operation is really slow, but it can work with any pair of images. In the next section, we will see how to use a similar idea to create a super-resolution network. There are several ways to make this better, such as:
Any image can be converted to an artistic style by training a CNN to output such an image directly.
To summarize, we learned how to transfer the style from one image to another while preserving the content as is.
You read an excerpt from a book written by Rajalingappaa Shanmugamani titled Deep Learning for Computer Vision. In this book, you will learn how to model and train advanced neural networks to implement a variety of Computer Vision tasks.