Performing text preprocessing
As I described in the Preface, this book uses a sample of AG’s corpus of news articles, made public by Zhang, Zhao, and LeCun [3]. This dataset is a smaller collection of news articles sampled from the topics “world,” “sports,” “business,” and “science” [4]. It has been used extensively in NLP modeling projects and is available on Kaggle and through PyTorch, Hugging Face, and TensorFlow. The data has four classes: class “1” is news about “world affairs,” class “2” is news about “sports,” class “3” is news about “business,” and class “4” is news about “science/tech.” Let’s print out two records for each class to get a feel for the text data. The code looks like this:
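import pandas as pd
# Show the full text of each cell; None removes the width limit (-1 is rejected by recent pandas versions)
pd.set_option('display.max_colwidth', None)
path = "/content/gdrive/My Drive/data/gensim...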
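The idea of printing two records per class can be sketched on its own, without the data file. The following is a minimal illustration, not the book's exact code: the DataFrame `df`, its column names `class` and `text`, and the placeholder article strings are all assumptions made for the example, standing in for the real AG news data.

```python
import pandas as pd

# Show the full text of each cell instead of truncating it (None = no limit).
pd.set_option("display.max_colwidth", None)

# Hypothetical stand-in for the AG news data: the column names "class" and
# "text" and the placeholder strings are assumptions for illustration only.
df = pd.DataFrame({
    "class": [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "text": [f"placeholder article {i}" for i in range(12)],
})

# Keep the first two rows of each class, preserving the original row order.
two_per_class = df.groupby("class").head(2)
print(two_per_class)
```

Here `groupby("class").head(2)` keeps the first two rows it encounters for each class label, so the result has eight rows for the four classes. Replacing `head(2)` with `sample(2)` would pick two random records per class instead.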