Getting the dataset and evaluation baseline ready
Text classification is a classic NLP problem: it assigns a label to a text, such as a topic or a sentiment. Any such task requires evaluation. In this recipe, we will load a dataset, prepare it for processing, and create an evaluation baseline. The recipe builds on some of the recipes from Chapter 3, Representing Text: Capturing Semantics, where we used different tools to represent text in a computer-readable form.
Getting ready
For most recipes in this chapter, we will use the BBC News dataset, which contains text from five topics: business, entertainment, politics, sport, and tech. The dataset is located in the bbc-text.csv file in this chapter's GitHub directory.
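Before going further, it can help to check how the rows are spread across the topics. The following is a minimal sketch that uses only the standard library. It assumes the CSV has a header row with a label column named category and a text column named text; these column names, the count_topics helper, and the sample rows are illustrative assumptions, not something the recipe specifies.

```python
import csv
import io
from collections import Counter


def count_topics(csv_file, label_column="category"):
    """Count how many rows carry each label in an open CSV file.

    The default column name "category" is an assumption about the file layout.
    """
    reader = csv.DictReader(csv_file)
    return Counter(row[label_column] for row in reader)


# Tiny in-memory stand-in for bbc-text.csv (made-up rows, assumed column names).
sample = io.StringIO(
    "category,text\n"
    "sport,the team won the final\n"
    "business,shares rose sharply\n"
    "sport,a record crowd attended\n"
)
topic_counts = count_topics(sample)
print(topic_counts)
```

To run the same check on the real file, open it with open("bbc-text.csv", newline="", encoding="utf-8") and pass the file object to count_topics, adjusting label_column if the header differs.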
In this recipe, we will need two additional packages, numpy and sklearn. Install them using pip (the sklearn package is published on PyPI under the name scikit-learn):
pip install numpy
pip install scikit-learn
How to do it…
In this recipe, we will be classifying just two of the five topics...