You're reading from Python 3 Text Processing with NLTK 3 Cookbook

Product type Paperback

Published in Aug 2014

Publisher

ISBN-13 9781782167853

Length 304 pages

Edition 2nd Edition

Languages

Python

Tools

NLTK

Concepts

Data Processing

Author (1):

Jacob Perkins

View More author details

Table of Contents (12) Chapters

Preface

1. Tokenizing Text and WordNet Basics FREE CHAPTER

2. Replacing and Correcting Words

3. Creating Custom Corpora

4. Part-of-speech Tagging

5. Extracting Chunks

6. Transforming Chunks and Trees

7. Text Classification

8. Distributed Processing and Handling Large Datasets

9. Parsing Specific Data Types

A. Penn Treebank Part-of-speech Tags

Index

Training a sentence tokenizer

NLTK's default sentence tokenizer is general purpose, and usually works quite well. But sometimes it is not the best choice for your text. Perhaps your text uses nonstandard punctuation, or is formatted in a unique way. In such cases, training your own sentence tokenizer can result in much more accurate sentence tokenization.

Getting ready

For this example, we'll be using the webtext corpus, specifically the overheard.txt file, so make sure you've downloaded this corpus. The text in this file is formatted as dialog that looks like this:

White guy: So, do you have any plans for this evening?
Asian girl: Yeah, being angry!
White guy: Oh, that sounds good.

As you can see, this isn't your standard paragraph of sentences formatting, which makes it a perfect case for training a sentence tokenizer.

How to do it...

NLTK provides a PunktSentenceTokenizer class that you can train on raw text to produce a custom sentence tokenizer. You can get raw text either by reading in a file, or from an NLTK corpus using the raw() method. Here's an example of training a sentence tokenizer on dialog text, using overheard.txt from the webtext corpus:

>>> from nltk.tokenize import PunktSentenceTokenizer
>>> from nltk.corpus import webtext
>>> text = webtext.raw('overheard.txt')
>>> sent_tokenizer = PunktSentenceTokenizer(text)

Let's compare the results to the default sentence tokenizer, as follows:

>>> sents1 = sent_tokenizer.tokenize(text)
>>> sents1[0]
'White guy: So, do you have any plans for this evening?'

>>> from nltk.tokenize import sent_tokenize
>>> sents2 = sent_tokenize(text)
>>> sents2[0]
'White guy: So, do you have any plans for this evening?'
>>> sents1[678]
'Girl: But you already have a Big Mac...'
>>> sents2[678]
'Girl: But you already have a Big Mac...\\nHobo: Oh, this is all theatrical.'

While the first sentence is the same, you can see that the tokenizers disagree on how to tokenize sentence 679 (this is the first sentence where the tokenizers diverge). The default tokenizer includes the next line of dialog, while our custom tokenizer correctly thinks that the next line is a separate sentence. This difference is a good demonstration of why it can be useful to train your own sentence tokenizer, especially when your text isn't in the typical paragraph-sentence structure.

How it works...

The PunktSentenceTokenizer class uses an unsupervised learning algorithm to learn what constitutes a sentence break. It is unsupervised because you don't have to give it any labeled training data, just raw text. You can read more about these kinds of algorithms at https://en.wikipedia.org/wiki/Unsupervised_learning. The specific technique used in this case is called sentence boundary detection and it works by counting punctuation and tokens that commonly end a sentence, such as a period or newline, then using the resulting frequencies to decide what the sentence boundaries should actually look like.

This is a simplified description of the algorithm—if you'd like more details, take a look at the source code of the nltk.tokenize.punkt.PunktTrainer class, which can be found online at http://www.nltk.org/_modules/nltk/tokenize/punkt.html#PunktSentenceTokenizer.

There's more...

The PunktSentenceTokenizer class learns from any string, which means you can open a text file and read its content. Here is an example of reading overheard.txt directly instead of using the raw() corpus method. This assumes that the webtext corpus is located in the standard directory at /usr/share/nltk_data/corpora. We also have to pass a specific encoding to the open() function, as follows, because the file is not in ASCII:

>>> with open('/usr/share/nltk_data/corpora/webtext/overheard.txt', encoding='ISO-8859-2') as f:
...   text = f.read()
>>> sent_tokenizer = PunktSentenceTokenizer(text)
>>> sents = sent_tokenizer.tokenize(text)
>>> sents[0]
'White guy: So, do you have any plans for this evening?'
>>> sents[678]
'Girl: But you already have a Big Mac...'

Once you have a custom sentence tokenizer, you can use it for your own corpora. Many corpus readers accept a sent_tokenizer parameter, which lets you override the default sentence tokenizer object with your own sentence tokenizer. Corpus readers are covered in more detail in Chapter 3, Creating Custom Corpora.