Setting up a custom corpus
A corpus is a collection of text documents; the plural is corpora. The word comes from the Latin for body, in this case a body of text. A custom corpus is therefore just a set of text files in a directory, often alongside many other directories of text files.
Getting ready
You should already have the NLTK data package installed, following the instructions at http://www.nltk.org/data. We'll assume that the data is installed to C:\nltk_data on Windows, and /usr/share/nltk_data on Linux, Unix, and Mac OS X.
How to do it...
NLTK defines a list of data directories, or paths, in nltk.data.path. Our custom corpora must be within one of these paths so that NLTK can find them. To avoid conflict with the official data package, we'll create a custom nltk_data directory in our home directory. The following is some Python code to create this directory and verify that it is in the list of known paths specified by nltk.data.path:
>>>...
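The same steps can also be sketched as a small standalone helper. This is only an illustration under stated assumptions: the function name ensure_custom_data_dir and its parameters are hypothetical and not part of NLTK, and the search paths are passed in as a plain list so the logic does not depend on NLTK being importable.

```python
import os


def ensure_custom_data_dir(search_paths, base=None):
    """Create <base>/nltk_data if it is missing and report whether that
    directory appears in search_paths.

    Hypothetical helper for illustration; not part of NLTK.
    `base` defaults to the current user's home directory.
    """
    if base is None:
        base = os.path.expanduser("~")
    data_dir = os.path.join(base, "nltk_data")

    # exist_ok=True makes this safe to call repeatedly.
    os.makedirs(data_dir, exist_ok=True)

    # Compare normalized absolute paths so that trailing slashes
    # (and letter case on Windows) do not cause a false mismatch.
    def norm(p):
        return os.path.normcase(os.path.abspath(p))

    known = norm(data_dir) in {norm(p) for p in search_paths}
    return data_dir, known
```

With NLTK installed, calling ensure_custom_data_dir(nltk.data.path) after import nltk.data should return the path of the nltk_data directory in your home directory together with a flag saying whether NLTK will search it. On a typical installation that flag is expected to be True, but it is worth checking rather than assuming.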