For caption generation, English words have to be converted into embeddings. An embedding is simply a vector, or numerical representation, of a word or an image. Representing words as vectors is useful because arithmetic can then be performed on them.
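Conceptually, an embedding is just a lookup table that maps each vocabulary word to a dense vector. The following minimal sketch illustrates this in plain NumPy; the toy vocabulary, dimension, and random initialization are illustrative assumptions, since real embeddings are learned from data:

```python
import numpy as np

# Toy vocabulary mapping each word to a row index; illustrative only.
vocabulary = {'cat': 0, 'dog': 1, 'sat': 2, 'mat': 3}
embedding_dim = 4

# Randomly initialized lookup table; training would adjust these values
# so that similar words end up with similar vectors.
embedding_matrix = np.random.randn(len(vocabulary), embedding_dim)

def embed(word):
    """Return the vector representation of a word."""
    return embedding_matrix[vocabulary[word]]

print(embed('cat'))  # a 4-dimensional vector representing 'cat'
```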
Such an embedding can be learned using one of two methods, as shown in the following figure:
![](https://static.packt-cdn.com/products/9781788398060/graphics/assets/6d11a280-a213-4e63-80a6-2ec7971c36b3.png)
The CBOW method learns the embedding by predicting a word from its surrounding words. The Skip-gram method predicts the surrounding words from a single word, which is the reverse of CBOW. Given the history of surrounding words, the model can be trained to predict a target word, as shown in the following figure:
![](https://static.packt-cdn.com/products/9781788398060/graphics/assets/918d0423-dbd7-4279-87d3-9574d66af009.png)
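The difference between the two objectives is easiest to see in how training pairs are generated from raw text. The following sketch builds both kinds of pairs; the sentence and window size are illustrative assumptions:

```python
# Toy sentence and context window; both are illustrative choices.
sentence = ['the', 'cat', 'sat', 'on', 'the', 'mat']
window = 1

cbow_pairs = []      # (context words -> target word), as in CBOW
skipgram_pairs = []  # (input word -> one context word), as in Skip-gram

for i, target in enumerate(sentence):
    # Words within `window` positions of the target, excluding the target.
    context = [sentence[j]
               for j in range(max(0, i - window), min(len(sentence), i + window + 1))
               if j != i]
    cbow_pairs.append((context, target))
    for context_word in context:
        skipgram_pairs.append((target, context_word))

print(cbow_pairs[1])       # (['the', 'sat'], 'cat')
print(skipgram_pairs[:3])  # [('the', 'cat'), ('cat', 'the'), ('cat', 'sat')]
```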
Once trained, the embedding can be visualized as follows:
![](https://static.packt-cdn.com/products/9781788398060/graphics/assets/63a42ddb-81c2-4564-9f58-f9fd2d53ce4b.png)
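A plot like this is typically produced by projecting the high-dimensional word vectors down to two dimensions, for example with t-SNE. Here is a minimal sketch; the random matrix stands in for a trained embedding, which is an assumption for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Random stand-in for a trained embedding; a real plot would use vectors
# learned with CBOW or Skip-gram over a much larger vocabulary.
words = ['cat', 'dog', 'mat', 'sat', 'king', 'queen']
embedding_matrix = np.random.randn(len(words), 50)

# t-SNE projects the 50-dimensional vectors down to two dimensions.
points = TSNE(n_components=2, perplexity=3,
              random_state=0).fit_transform(embedding_matrix)

for (x, y), word in zip(points, words):
    plt.scatter(x, y)
    plt.annotate(word, (x, y))
plt.show()
```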
This type of embedding can be used to perform vector arithmetic on words: for example, in a well-trained embedding, the vector for *king* minus the vector for *man*, plus the vector for *woman*, lies close to the vector for *queen*. This concept of word embedding will be helpful throughout this chapter.
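This analogy can be checked with pretrained vectors. The following sketch uses gensim's downloader and one of its hosted GloVe models; the library and model choice are assumptions for illustration, not part of the text:

```python
import gensim.downloader as api

# Downloads a small pretrained GloVe model on first use (requires internet).
vectors = api.load('glove-wiki-gigaword-50')

# king - man + woman should rank 'queen' as the closest word.
result = vectors.most_similar(positive=['king', 'woman'],
                              negative=['man'], topn=1)
print(result)  # typically [('queen', ...)]
```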
...