Creating a categorized text corpus
If you have a large corpus of text, you might want to categorize it into separate sections. This can be helpful for organization, or for text classification, which is covered in Chapter 7, Text Classification. The brown corpus, for example, has a number of different categories, as shown in the following code:
>>> from nltk.corpus import brown
>>> brown.categories()
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']
In this recipe, we'll learn how to create our own categorized text corpus.
Getting ready
The easiest way to categorize a corpus is to have one file for each category. The following are two excerpts from the movie_reviews corpus:
movie_pos.txt:
the thin red line is flawed but it provokes .
movie_neg.txt:
a big-budget and glossy production can not make up for a lack of spontaneity that permeates their tv show...
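To make the one-file-per-category layout concrete, here is a minimal sketch that uses only the Python standard library. It writes the two excerpts above into a temporary directory and derives each file's category from its name. The helper names (write_corpus, categorize, build_demo) and the movie_(\w+)\.txt pattern are assumptions made for illustration, not part of NLTK.

```python
import re
import tempfile
from pathlib import Path

# Excerpts copied from the text above, keyed by file name.
EXCERPTS = {
    "movie_pos.txt": "the thin red line is flawed but it provokes .",
    "movie_neg.txt": (
        "a big-budget and glossy production can not make up for a lack "
        "of spontaneity that permeates their tv show..."
    ),
}

# Assumed naming convention: the category is the part after "movie_".
CATEGORY_PATTERN = re.compile(r"movie_(\w+)\.txt")


def write_corpus(root):
    """Write one text file per category into the directory `root`."""
    for name, text in EXCERPTS.items():
        (root / name).write_text(text + "\n", encoding="utf-8")


def categorize(root):
    """Map each category name to the list of file names that belong to it."""
    mapping = {}
    for path in sorted(root.glob("movie_*.txt")):
        match = CATEGORY_PATTERN.fullmatch(path.name)
        if match:
            mapping.setdefault(match.group(1), []).append(path.name)
    return mapping


def build_demo():
    """Create the files in a temporary directory and return the mapping."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        write_corpus(root)
        return categorize(root)


if __name__ == "__main__":
    print(build_demo())
```

A regular expression of this shape is also the kind of pattern that NLTK's CategorizedPlaintextCorpusReader accepts through its cat_pattern argument, so the same file naming can carry over when the corpus is loaded with NLTK.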