English words have to be converted into embeddings for caption generation. An embedding is a vector, that is, a numerical representation of a word or an image. Converting words to vector form is useful because arithmetic can then be performed directly on the vectors.
Such an embedding can be learned by two methods, as shown in the following figure:

The CBOW (Continuous Bag of Words) method learns the embedding by predicting a target word from its surrounding words. The Skip-gram method does the reverse, predicting the surrounding words from a given word. In both cases, the model is trained to predict a target word based on its context, as shown in the following figure:

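As a minimal sketch of both training objectives, assuming the gensim library (the gensim 4.x API is used here; the chapter's own code may use a different toolkit), either variant can be trained on a toy corpus by toggling the sg flag:

```python
from gensim.models import Word2Vec

# A toy corpus; each sentence is a list of tokens.
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "lay", "on", "the", "rug"],
    ["a", "cat", "chased", "a", "dog"],
]

# CBOW (sg=0): predict the target word from its surrounding words.
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

# Skip-gram (sg=1): predict the surrounding words from the target word.
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

# Each word in the vocabulary now maps to a 50-dimensional vector.
print(cbow.wv["cat"].shape)  # (50,)
```

In practice, Skip-gram tends to represent infrequent words better, while CBOW is faster to train.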
Once trained, the embedding can be visualized as follows:

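A common way to produce such a visualization, sketched here with scikit-learn's t-SNE and matplotlib (both are assumptions, not necessarily the tools behind the figure), is to project the learned vectors down to two dimensions and label each point with its word:

```python
import matplotlib.pyplot as plt
from gensim.models import Word2Vec
from sklearn.manifold import TSNE

# Retrain a small Skip-gram model on the same toy corpus as before.
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "lay", "on", "the", "rug"],
    ["a", "cat", "chased", "a", "dog"],
]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

# Gather every word in the vocabulary and its learned vector.
words = list(model.wv.index_to_key)
vectors = model.wv[words]

# t-SNE projects the 50-dimensional vectors down to 2D for plotting;
# perplexity must stay below the (tiny) vocabulary size.
points = TSNE(n_components=2, perplexity=3, random_state=0).fit_transform(vectors)

plt.figure(figsize=(6, 6))
plt.scatter(points[:, 0], points[:, 1])
for word, (x, y) in zip(words, points):
    plt.annotate(word, (x, y))
plt.show()
```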
This type of embedding can be used to perform vector arithmetic on words; for example, the vector for king - man + woman lies close to the vector for queen. This concept of word embedding will be helpful throughout this chapter.
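As a sketch of this arithmetic, assuming gensim's downloader module and the pretrained word2vec-google-news-300 vectors (a toy corpus is far too small to reproduce the effect), the classic analogy can be checked with most_similar:

```python
import gensim.downloader as api

# Meaningful word arithmetic needs embeddings trained on a large corpus;
# word2vec-google-news-300 is assumed here for illustration (a large download).
wv = api.load("word2vec-google-news-300")

# king - man + woman: the closest remaining vector should be queen.
result = wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)  # e.g. [('queen', 0.71...)]
```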
...