Providing a quick overview of a dataset
To show how to process a corpus of documents and extract relevant information from it, we will use a dataset derived from a well-known benchmark in the field of NLP: Reuters-21578. The original dataset contains 21,578 news articles that were published on the Reuters financial newswire in 1987 and then assembled and indexed by category. The original dataset has a very skewed distribution, with some categories appearing only in the training set or only in the test set. For this reason, we will use a modified version known as ApteMod, which is derived from Reuters-21578 Distribution 1.0 and has a less skewed distribution and consistent labels between the training and test datasets.
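To make the label-consistency issue concrete, the following minimal sketch checks which categories occur in only one of two splits and how often each category appears overall. The function name split_label_report and the toy labels are hypothetical and chosen only for illustration; they are not taken from the actual corpus.

```python
from collections import Counter


def split_label_report(train_labels, test_labels):
    """Summarize how categories are distributed across two splits.

    Each argument is a list of per-document category lists.
    """
    train_counts = Counter(label for doc in train_labels for label in doc)
    test_counts = Counter(label for doc in test_labels for label in doc)
    return {
        # Categories that a model could learn but never be evaluated on.
        "train_only": sorted(set(train_counts) - set(test_counts)),
        # Categories that appear at test time without any training example.
        "test_only": sorted(set(test_counts) - set(train_counts)),
        # Overall frequency ranking, useful for spotting a skewed distribution.
        "most_common": (train_counts + test_counts).most_common(),
    }


# Toy, made-up labels (an assumption for illustration, not real corpus data).
toy_train = [["earn"], ["earn", "acq"], ["earn"], ["grain"]]
toy_test = [["earn"], ["acq"], ["cocoa"]]

report = split_label_report(toy_train, toy_test)
print(report["train_only"])      # ['grain']
print(report["test_only"])       # ['cocoa']
print(report["most_common"][0])  # ('earn', 4)
```

In a dataset with consistent labels between the two splits, both the train_only and test_only lists would be empty.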
Even though these articles are somewhat dated, the dataset has been used in a large number of NLP papers and is still widely used for benchmarking algorithms.
Indeed, Reuters-21578 contains enough...