English words have to be converted into embeddings for caption generation. An embedding is a vector, that is, a numerical representation of a word or an image. Converting words to vector form is useful because arithmetic can then be performed directly on the vectors.
Such an embedding can be learned by two methods, as shown in the following figure:

The CBOW (Continuous Bag of Words) method learns the embedding by predicting a target word from its surrounding words. The Skip-gram method does the reverse, predicting the surrounding words from a given word. In both cases, the model is trained to predict a target word based on its context, as shown in the following figure:

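As a minimal sketch of both training objectives, assuming the gensim library (the gensim 4.x API is used here; the chapter's own code may use a different toolkit), either variant can be trained on a toy corpus by toggling the sg flag:

```python
from gensim.models import Word2Vec

# A toy corpus; each sentence is a list of tokens.
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "lay", "on", "the", "rug"],
    ["a", "cat", "chased", "a", "dog"],
]

# CBOW (sg=0): predict the target word from its surrounding words.
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

# Skip-gram (sg=1): predict the surrounding words from the target word.
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

# Each word in the vocabulary now maps to a 50-dimensional vector.
print(cbow.wv["cat"].shape)  # (50,)
```

In practice, Skip-gram tends to represent infrequent words better, while CBOW is faster to train.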
Once trained, the embedding can be visualized as follows:

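A common way to produce such a visualization, sketched here with scikit-learn's t-SNE and matplotlib (both are assumptions, not necessarily the tools behind the figure), is to project the learned vectors down to two dimensions and label each point with its word:

```python
import matplotlib.pyplot as plt
from gensim.models import Word2Vec
from sklearn.manifold import TSNE

# Retrain a small Skip-gram model on the same toy corpus as before.
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "lay", "on", "the", "rug"],
    ["a", "cat", "chased", "a", "dog"],
]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

# Gather every word in the vocabulary and its learned vector.
words = list(model.wv.index_to_key)
vectors = model.wv[words]

# t-SNE projects the 50-dimensional vectors down to 2D for plotting;
# perplexity must stay below the (tiny) vocabulary size.
points = TSNE(n_components=2, perplexity=3, random_state=0).fit_transform(vectors)

plt.figure(figsize=(6, 6))
plt.scatter(points[:, 0], points[:, 1])
for word, (x, y) in zip(words, points):
    plt.annotate(word, (x, y))
plt.show()
```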
This type of embedding can be used to perform vector arithmetic on words; for example, the vector for king - man + woman lies close to the vector for queen. This concept of word embedding will be helpful throughout this chapter.
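As a sketch of this arithmetic, assuming gensim's downloader module and the pretrained word2vec-google-news-300 vectors (a toy corpus is far too small to reproduce the effect), the classic analogy can be checked with most_similar:

```python
import gensim.downloader as api

# Meaningful word arithmetic needs embeddings trained on a large corpus;
# word2vec-google-news-300 is assumed here for illustration (a large download).
wv = api.load("word2vec-google-news-300")

# king - man + woman: the closest remaining vector should be queen.
result = wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)  # e.g. [('queen', 0.71...)]
```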
...