In the last section, we briefly covered word embeddings without implementing them. In this section, we will download the IMDB dataset, which contains movie reviews, and build a sentiment classifier that predicts whether a review's sentiment is positive, negative, or unknown. Along the way, we will also train word embeddings for the words in the IMDB dataset. We will use a library called torchtext, which makes steps such as downloading, text vectorization, and batching much easier. Training a sentiment classifier involves the following steps:
- Downloading IMDB data and performing text tokenization
- Building a vocabulary
- Generating batches of vectors
- Creating a network model with embeddings
- Training the model
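
To make the first three steps concrete before handing them to torchtext, here is a minimal sketch in plain Python. It is not torchtext's API: the toy `reviews` list, the `tokenize`, `build_vocab`, and `batches` helpers, and the `<unk>`/`<pad>` conventions are all assumptions made for illustration. It shows only how text becomes padded batches of integer indices. The model and training steps are left out.

```python
from collections import Counter

# Hypothetical toy reviews standing in for the IMDB data (not the real dataset).
reviews = [
    ("a great movie with a great cast", "pos"),
    ("a dull and boring movie", "neg"),
    ("great fun", "pos"),
]

def tokenize(text):
    """Lowercase and split on whitespace; a stand-in for a real tokenizer."""
    return text.lower().split()

def build_vocab(token_lists, min_freq=1):
    """Map each token to an integer index; indices 0 and 1 are reserved."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    kept = sorted(tok for tok, count in counts.items() if count >= min_freq)
    itos = ["<unk>", "<pad>"] + kept
    return {tok: idx for idx, tok in enumerate(itos)}

def batches(examples, vocab, batch_size):
    """Yield (padded index vectors, labels) for each batch of examples."""
    for start in range(0, len(examples), batch_size):
        chunk = examples[start:start + batch_size]
        ids = [[vocab.get(tok, vocab["<unk>"]) for tok in toks] for toks, _ in chunk]
        width = max(len(seq) for seq in ids)
        # Pad shorter sequences so every vector in the batch has equal length.
        padded = [seq + [vocab["<pad>"]] * (width - len(seq)) for seq in ids]
        labels = [1 if label == "pos" else 0 for _, label in chunk]
        yield padded, labels

tokenized = [(tokenize(text), label) for text, label in reviews]
vocab = build_vocab(toks for toks, _ in tokenized)
all_batches = list(batches(tokenized, vocab, batch_size=2))
```

Each batch pairs a list of equal-length index vectors with a list of 0/1 labels. An embedding layer would then look up one trainable vector per index, which is why the vocabulary has to be fixed before the model is created.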