We now have all the essential components and utilities needed to build our model. As mentioned earlier, we will use an encoder-decoder deep learning architecture to build our image-captioning system.
The following code builds this architecture. The model takes pairs of image features and caption sequences as input and predicts the next possible word in the caption at each time-step:
from keras.models import Sequential, Model
from keras.layers import (LSTM, Embedding, TimeDistributed, Dense,
                          RepeatVector, Activation, Flatten, concatenate)

DENSE_DIM = 256
EMBEDDING_DIM = 256
MAX_CAPTION_SIZE = max_caption_size
VOCABULARY_SIZE = vocab_size

# Image (encoder) branch: project the 4096-dim image feature vector
# down to DENSE_DIM, then repeat it once per caption time-step so it
# can be joined with the caption sequence
image_model = Sequential()
image_model.add(Dense(DENSE_DIM, input_dim=4096, activation='relu'))
image_model.add(RepeatVector(MAX_CAPTION_SIZE))
...
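To make the shapes involved concrete, here is a minimal NumPy sketch of the data flow the encoder branch sets up: the image feature vector is repeated across every caption time-step (what `RepeatVector` does) so it can be concatenated feature-wise with the caption's word embeddings. The batch size and `MAX_CAPTION_SIZE` value below are illustrative assumptions, not values from the original code.

```python
import numpy as np

# Hypothetical sizes mirroring the model above; MAX_CAPTION_SIZE and
# BATCH are assumed values for illustration only
DENSE_DIM = 256
EMBEDDING_DIM = 256
MAX_CAPTION_SIZE = 39
BATCH = 2

# Encoder side: one DENSE_DIM feature vector per image, repeated once
# per caption time-step (the effect of RepeatVector)
image_features = np.random.rand(BATCH, DENSE_DIM)
repeated = np.repeat(image_features[:, np.newaxis, :],
                     MAX_CAPTION_SIZE, axis=1)
assert repeated.shape == (BATCH, MAX_CAPTION_SIZE, DENSE_DIM)

# Decoder side: one EMBEDDING_DIM vector per word of the padded caption
caption_embeddings = np.random.rand(BATCH, MAX_CAPTION_SIZE, EMBEDDING_DIM)

# The two streams are concatenated feature-wise at every time-step,
# giving the per-step input the caption decoder consumes
merged = np.concatenate([repeated, caption_embeddings], axis=-1)
print(merged.shape)  # (2, 39, 512)
```

Repeating the image vector at every time-step is what lets the decoder condition each next-word prediction on the image as well as on the words generated so far.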