Tokenizing text into sentences
Tokenization is the process of splitting a string into a list of pieces or tokens. A token is a piece of a whole, so a word is a token in a sentence, and a sentence is a token in a paragraph. We'll start with sentence tokenization, or splitting a paragraph into a list of sentences.
Getting ready
Installation instructions for NLTK are available at http://nltk.org/install.html; the latest version at the time of writing is 3.0b1. This version of NLTK is built for Python 3.0 or higher, but it is backwards compatible with Python 2.6 and higher. In this book, we will be using Python 3.3.2. If you've used earlier versions of NLTK (such as version 2.0), note that some of the APIs have changed in version 3 and are not backwards compatible.
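If you're not sure which version of NLTK you have installed, you can check from a Python console; the version string shown here is just an example and will reflect whatever you actually have installed:
>>> import nltk
>>> nltk.__version__  # prints your installed version, for example:
'3.0b1'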
Once you've installed NLTK, you'll also need to install the data following the instructions at http://nltk.org/data.html. I recommend installing everything, as we'll be using a number of corpora and pickled objects. The data is installed in a data directory, which on Mac and Linux/Unix is usually /usr/share/nltk_data, or on Windows is C:\nltk_data. Make sure that tokenizers/punkt.zip is in the data directory and has been unpacked so that there's a file at tokenizers/punkt/PY3/english.pickle.
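If you'd rather not install everything right away, a minimal alternative (my own sketch, not part of the original instructions) is to fetch just the punkt models with the NLTK downloader and then verify that the English pickle can be found:
>>> import nltk
>>> nltk.download('punkt')  # downloads and unpacks the punkt sentence tokenizer models
True
>>> path = nltk.data.find('tokenizers/punkt/PY3/english.pickle')  # raises LookupError if the data is missing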
Finally, to run the code examples, you'll need to start a Python console. Instructions on how to do so are available at http://nltk.org/install.html. For Mac and Linux/Unix users, you can open a terminal and type python.
How to do it...
Once NLTK is installed and you have a Python console running, we can start by creating a paragraph of text:
>>> para = "Hello World. It's good to see you. Thanks for buying this book."
Now we want to split the paragraph into sentences. First we need to import the sentence tokenization function, and then we can call it with the paragraph as an argument:
>>> from nltk.tokenize import sent_tokenize
>>> sent_tokenize(para)
['Hello World.', "It's good to see you.", 'Thanks for buying this book.']
So now we have a list of sentences that we can use for further processing.
How it works...
The sent_tokenize function uses an instance of PunktSentenceTokenizer from the nltk.tokenize.punkt module. This instance has already been trained and works well for many European languages, so it knows what punctuation and characters mark the end of a sentence and the beginning of a new one.
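To see what that training buys you, here's a small illustrative example of my own (not from the recipe): with the pretrained English model, the period after a common abbreviation such as Mr. typically does not trigger a sentence break, so you would usually get output like this:
>>> sent_tokenize("Mr. Smith bought a book. He liked it.")
['Mr. Smith bought a book.', 'He liked it.']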
There's more...
The instance used in sent_tokenize() is actually loaded on demand from a pickle file. So if you're going to be tokenizing a lot of sentences, it's more efficient to load the pickled PunktSentenceTokenizer once and call its tokenize() method instead:
>>> import nltk.data
>>> tokenizer = nltk.data.load('tokenizers/punkt/PY3/english.pickle')
>>> tokenizer.tokenize(para)
['Hello World.', "It's good to see you.", 'Thanks for buying this book.']
Tokenizing sentences in other languages
If you want to tokenize sentences in languages other than English, you can load one of the other pickle files in tokenizers/punkt/PY3 and use it just like the English sentence tokenizer. Here's an example for Spanish:
>>> spanish_tokenizer = nltk.data.load('tokenizers/punkt/PY3/spanish.pickle')
>>> spanish_tokenizer.tokenize('Hola amigo. Estoy bien.')
['Hola amigo.', 'Estoy bien.']
You can see a list of all the available language tokenizers in /usr/share/nltk_data/tokenizers/punkt/PY3 (or C:\nltk_data\tokenizers\punkt\PY3).
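If you prefer not to browse the filesystem by hand, here's a small sketch of my own (assuming the data was installed into one of NLTK's standard data directories) that prints the available language pickles from the Python console:
>>> import os
>>> import nltk.data
>>> for directory in nltk.data.path:
...     punkt_dir = os.path.join(directory, 'tokenizers', 'punkt', 'PY3')
...     if os.path.isdir(punkt_dir):
...         # print one .pickle filename per available language
...         print(sorted(name for name in os.listdir(punkt_dir) if name.endswith('.pickle')))
...         break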
See also
In the next recipe, we'll learn how to split sentences into individual words. After that, we'll cover how to use regular expressions to tokenize text. Training your own sentence tokenizer is covered in an upcoming recipe, Training a sentence tokenizer.