Python 3 Text Processing with NLTK 3 Cookbook

Tokenizing sentences using regular expressions

Regular expressions can be used if you want complete control over how to tokenize text. As regular expressions can get complicated very quickly, I only recommend using them if the word tokenizers covered in the previous recipe are unacceptable.

Getting ready

First, you need to decide how you want to tokenize a piece of text, as this will determine how you construct your regular expression. The choices are:

  • Match on the tokens
  • Match on the separators or gaps

We'll start with an example of the first approach: matching alphanumeric tokens plus single quotes so that we don't split up contractions.

How to do it...

We'll create an instance of RegexpTokenizer, giving it a regular expression string to use for matching tokens:

>>> from nltk.tokenize import RegexpTokenizer
>>> tokenizer = RegexpTokenizer("[\w']+")
>>> tokenizer.tokenize("Can't is a contraction.")
["Can't", 'is', 'a', 'contraction']

There's also a simple helper function you can use if you don't want to instantiate the class, as shown in the following code:

>>> from nltk.tokenize import regexp_tokenize
>>> regexp_tokenize("Can't is a contraction.", r"[\w']+")
["Can't", 'is', 'a', 'contraction']

Now we finally have something that can treat contractions as whole words, instead of splitting them into separate tokens.

How it works...

The RegexpTokenizer class works by compiling your pattern, then calling re.findall() on your text. You could do all this yourself using the re module, but RegexpTokenizer implements the TokenizerI interface, just like all the word tokenizers from the previous recipe. This means it can be used by other parts of the NLTK package, such as corpus readers, which we'll cover in detail in Chapter 3, Creating Custom Corpora. Many corpus readers need a way to tokenize the text they're reading, and can take optional keyword arguments specifying an instance of a TokenizerI subclass. This way, you have the ability to provide your own tokenizer instance if the default tokenizer is unsuitable.
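
If you're curious, here's a minimal sketch of doing the same thing with the re module directly; it mirrors the findall-based behavior described above, minus the pattern compilation and TokenizerI interface that RegexpTokenizer provides:

>>> import re
>>> re.findall(r"[\w']+", "Can't is a contraction.")
["Can't", 'is', 'a', 'contraction']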

There's more...

RegexpTokenizer can also work by matching the gaps, as opposed to the tokens. Instead of using re.findall(), the RegexpTokenizer class will use re.split(). This is how the BlanklineTokenizer class in nltk.tokenize is implemented.
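
As a quick illustration of gap matching, the following snippet shows BlanklineTokenizer splitting a string wherever a blank line occurs:

>>> from nltk.tokenize import BlanklineTokenizer
>>> BlanklineTokenizer().tokenize("First paragraph.\n\nSecond paragraph.")
['First paragraph.', 'Second paragraph.']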

Simple whitespace tokenizer

The following is a simple example of using RegexpTokenizer to tokenize on whitespace:

>>> tokenizer = RegexpTokenizer(r'\s+', gaps=True)
>>> tokenizer.tokenize("Can't is a contraction.")
["Can't", 'is', 'a', 'contraction.']

Notice that punctuation still remains in the tokens. The gaps=True parameter means that the pattern is used to identify gaps to tokenize on. If we used gaps=False, then the pattern would be used to identify tokens.
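
To make the contrast concrete, here's the same whitespace pattern with the default gaps=False; the whitespace runs themselves become the tokens, which is rarely what you want:

>>> tokenizer = RegexpTokenizer(r'\s+', gaps=False)
>>> tokenizer.tokenize("Can't is a contraction.")
[' ', ' ', ' ']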

See also

For simpler word tokenization, see the previous recipe.
