Training a deep learning language model with a standalone text dataset
In the previous sections, we trained a language model and a text classifier using the curated text dataset IMDb. In this section and the next, we will train a language model and a text classifier using a standalone text dataset, the Kaggle Coronavirus tweets NLP – Text Classification dataset described here: https://www.kaggle.com/datatattle/covid-19-nlp-text-classification. This dataset contains a selection of tweets related to the Covid-19 pandemic, each labeled with one of the following five categories:
- Extremely negative
- Negative
- Neutral
- Positive
- Extremely positive
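To make the label set concrete, the five categories can be treated as an ordered scale running from most negative to most positive. The following is a minimal sketch, not code from the dataset or from this chapter: the sample tweets are invented for illustration, the names `SENTIMENT_LABELS`, `SAMPLE_ROWS`, `encode_labels`, and `label_counts` are hypothetical, and the exact spelling and capitalization of the labels in the downloaded files may differ from the list above.

```python
from collections import Counter

# The five categories listed above, ordered from most negative to most positive.
# Assumption: the spelling in the actual dataset files may differ (for example in case).
SENTIMENT_LABELS = [
    "Extremely negative",
    "Negative",
    "Neutral",
    "Positive",
    "Extremely positive",
]

# Map each label to an integer index that reflects its position on the scale.
LABEL_TO_INDEX = {label: index for index, label in enumerate(SENTIMENT_LABELS)}

# Hypothetical (tweet text, label) pairs, invented purely for illustration.
SAMPLE_ROWS = [
    ("Shelves are empty again and nobody seems to care", "Extremely negative"),
    ("Queues at the supermarket were long today", "Negative"),
    ("The store opens at 9 am tomorrow", "Neutral"),
    ("Neighbours are sharing groceries with each other", "Positive"),
    ("So grateful to every delivery driver out there", "Extremely positive"),
    ("Prices went up again this week", "Negative"),
]


def encode_labels(rows):
    """Return the integer index of the label in each (text, label) row."""
    return [LABEL_TO_INDEX[label] for _, label in rows]


def label_counts(rows):
    """Count rows per category, reporting zero for categories that do not appear."""
    counts = Counter(label for _, label in rows)
    return {label: counts.get(label, 0) for label in SENTIMENT_LABELS}


if __name__ == "__main__":
    print(encode_labels(SAMPLE_ROWS))
    print(label_counts(SAMPLE_ROWS))
```

Checking the per-category counts in this way is a quick sanity test before training a classifier, since a heavily imbalanced label distribution would affect how accuracy should be interpreted.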
The goal of the language model trained on this dataset is to predict the subsequent words in a Covid-related tweet given a starting phrase. The goal of the text classification model trained on this dataset, as described in the Training a deep learning text classifier with a...