In the previous case studies, we analyzed structured datasets, that is, datasets made up of variables and their corresponding values. In this case study, we will work with a dataset in which the input is text and the expected output is one of 46 possible topics that the text relates to.
Categorizing news articles into topics
Getting ready
To build intuition for text analysis, let's consider the Reuters dataset, in which each news article is classified into one of 46 possible topics.
We will adopt the following strategy to perform our analysis:
- Because a dataset can contain thousands of unique words, we will shortlist the words that we consider.
- For this specific exercise, we shall...
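To make the idea of shortlisting words concrete, here is a minimal sketch in plain Python. It assumes that "shortlisting" means keeping the most frequent words, which is one common choice and not necessarily the criterion used in this recipe. The helper names (`build_vocabulary`, `encode`), the `max_words` parameter, and the toy articles are all hypothetical and are not taken from the Reuters dataset.

```python
from collections import Counter


def build_vocabulary(documents, max_words):
    """Keep only the `max_words` most frequent words across all documents."""
    counts = Counter(word for doc in documents for word in doc.lower().split())
    # Sort by descending frequency, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    # Index 0 is reserved for words outside the shortlist.
    return {word: i + 1 for i, (word, _) in enumerate(ranked[:max_words])}


def encode(document, vocabulary):
    """Map each word to its index, using 0 for words that were not shortlisted."""
    return [vocabulary.get(word, 0) for word in document.lower().split()]


# Toy stand-ins for news articles (assumed data for illustration only).
articles = [
    "oil prices rise as oil supply falls",
    "wheat prices fall",
    "oil exports rise",
]

vocab = build_vocabulary(articles, max_words=3)
encoded = [encode(article, vocab) for article in articles]

print(vocab)
print(encoded)
```

With a shortlist of three words, every other word collapses into the shared index 0, so each article becomes a short list of integers drawn from a small, fixed vocabulary.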