Python 3 Text Processing with NLTK 3 Cookbook

Author: Jacob Perkins
Second Edition | Paperback, 304 pages | Published August 2014 | ISBN-13 9781782167853

Table of Contents

Preface
1. Tokenizing Text and WordNet Basics
2. Replacing and Correcting Words
3. Creating Custom Corpora
4. Part-of-speech Tagging
5. Extracting Chunks
6. Transforming Chunks and Trees
7. Text Classification
8. Distributed Processing and Handling Large Datasets
9. Parsing Specific Data Types
A. Penn Treebank Part-of-speech Tags
Index

Tokenizing text into sentences

Tokenization is the process of splitting a string into a list of pieces or tokens. A token is a piece of a whole, so a word is a token in a sentence, and a sentence is a token in a paragraph. We'll start with sentence tokenization, or splitting a paragraph into a list of sentences.

Getting ready

Installation instructions for NLTK are available at http://nltk.org/install.html, and the latest version at the time of writing is Version 3.0b1. This version of NLTK is built for Python 3.0 or higher, but it is backwards compatible with Python 2.6 and higher. In this book, we will be using Python 3.3.2. If you've used earlier versions of NLTK (such as version 2.0), note that some of the APIs have changed in Version 3 and are not backwards compatible.

Once you've installed NLTK, you'll also need to install the data following the instructions at http://nltk.org/data.html. I recommend installing everything, as we'll be using a number of corpora and pickled objects. The data is installed in a data directory, which on Mac and Linux/Unix is usually /usr/share/nltk_data, or on Windows is C:\nltk_data. Make sure that tokenizers/punkt.zip is in the data directory and has been unpacked so that there's a file at tokenizers/punkt/PY3/english.pickle.
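
If you'd rather fetch the data from within Python instead of using the graphical downloader, NLTK's download() function can install individual packages. This is just a convenience sketch rather than part of the original instructions, and it assumes a working network connection:

>>> import nltk
>>> nltk.download('punkt')

The downloader prints progress messages and returns True once the package is installed; pass 'all' instead of 'punkt' to install everything, as recommended above.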

Finally, to run the code examples, you'll need to start a Python console. Instructions on how to do so are available at http://nltk.org/install.html. For Mac and Linux/Unix users, you can open a terminal and type python.
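
As a quick sanity check (not part of the original recipe), you can verify in the console that NLTK imports correctly and that the Punkt data is where the loader expects it. nltk.data.find() raises a LookupError if the resource is missing; the exact version string and path shown here will depend on your installation:

>>> import nltk
>>> nltk.__version__
'3.0b1'
>>> nltk.data.find('tokenizers/punkt/PY3/english.pickle')
FileSystemPathPointer('/usr/share/nltk_data/tokenizers/punkt/PY3/english.pickle')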

How to do it...

Once NLTK is installed and you have a Python console running, we can start by creating a paragraph of text:

>>> para = "Hello World. It's good to see you. Thanks for buying this book."

Tip

Downloading the example code

You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

Now we want to split the paragraph into sentences. First we need to import the sentence tokenization function, and then we can call it with the paragraph as an argument:

>>> from nltk.tokenize import sent_tokenize
>>> sent_tokenize(para)
['Hello World.', "It's good to see you.", 'Thanks for buying this book.']

So now we have a list of sentences that we can use for further processing.

How it works...

The sent_tokenize function uses an instance of PunktSentenceTokenizer from the nltk.tokenize.punkt module. This instance has already been trained and works well for many European languages, so it knows which punctuation and characters mark the end of a sentence and the beginning of a new one.
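
To see that the trained model is doing more than splitting on every period, try a sentence containing an abbreviation. This is a small illustrative check rather than an example from the book; with the bundled English model, the period after "Mr." typically does not trigger a sentence break:

>>> sent_tokenize("Mr. Smith bought a book. It was good.")
['Mr. Smith bought a book.', 'It was good.']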

There's more...

The instance used in sent_tokenize() is actually loaded on demand from a pickle file. So if you're going to be tokenizing a lot of sentences, it's more efficient to load the PunktSentenceTokenizer instance once yourself, and then call its tokenize() method directly:

>>> import nltk.data
>>> tokenizer = nltk.data.load('tokenizers/punkt/PY3/english.pickle')
>>> tokenizer.tokenize(para)
['Hello World.', "It's good to see you.", 'Thanks for buying this book.']
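
Because the loaded tokenizer is just an ordinary object, you can reuse it across as many texts as you like without paying the loading cost again. Here is a minimal sketch, where the second paragraph is invented purely for illustration:

>>> paragraphs = [para, "Another paragraph. It has two sentences."]
>>> [tokenizer.tokenize(p) for p in paragraphs]
[['Hello World.', "It's good to see you.", 'Thanks for buying this book.'], ['Another paragraph.', 'It has two sentences.']]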

Tokenizing sentences in other languages

If you want to tokenize sentences in languages other than English, you can load one of the other pickle files in tokenizers/punkt/PY3 and use it just like the English sentence tokenizer. Here's an example for Spanish:

>>> spanish_tokenizer = nltk.data.load('tokenizers/punkt/PY3/spanish.pickle')
>>> spanish_tokenizer.tokenize('Hola amigo. Estoy bien.')
['Hola amigo.', 'Estoy bien.']

You can see a list of all the available language tokenizers in /usr/share/nltk_data/tokenizers/punkt/PY3 (or C:\nltk_data\tokenizers\punkt\PY3).
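
If you'd rather not hunt for that directory by hand, you can ask nltk.data where it found the Punkt models and list the pickle files from Python. This is a small convenience sketch, not from the book; the path attribute used here and the exact set of languages shown depend on your NLTK and data versions:

>>> import os
>>> import nltk.data
>>> punkt_dir = nltk.data.find('tokenizers/punkt/PY3').path
>>> sorted(name for name in os.listdir(punkt_dir) if name.endswith('.pickle'))
['czech.pickle', 'danish.pickle', 'dutch.pickle', 'english.pickle', ...]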

See also

In the next recipe, we'll learn how to split sentences into individual words. After that, we'll cover how to use regular expressions to tokenize text. And in the upcoming Training a sentence tokenizer recipe, we'll learn how to train your own sentence tokenizer.
