What do you get with eBook?

Instant access to your Digital eBook purchase

Download this book in EPUB and PDF formats

Access this title in our online reader with advanced features

DRM FREE - Read whenever, wherever and however you want

Python 3 Text Processing with NLTK 3 Cookbook

Chapter 1. Tokenizing Text and WordNet Basics

In this chapter, we will cover the following recipes:

Tokenizing text into sentences
Tokenizing sentences into words
Tokenizing sentences using regular expressions
Training a sentence tokenizer
Filtering stopwords in a tokenized sentence
Looking up Synsets for a word in WordNet
Looking up lemmas and synonyms in WordNet
Calculating WordNet Synset similarity
Discovering word collocations

Tokenizing text into sentences

Tokenization is the process of splitting a string into a list of pieces or tokens. A token is a piece of a whole, so a word is a token in a sentence, and a sentence is a token in a paragraph. We'll start with sentence tokenization, or splitting a paragraph into a list of sentences.

Getting ready

Installation instructions for NLTK are available at http://nltk.org/install.html and the latest version at the time of writing this is Version 3.0b1. This version of NLTK is built for Python 3.0 or higher, but it is backwards compatible with Python 2.6 and higher. In this book, we will be using Python 3.3.2. If you've used earlier versions of NLTK (such as version 2.0), note that some of the APIs have changed in Version 3 and are not backwards compatible.

Once you've installed NLTK, you'll also need to install the data following the instructions at http://nltk.org/data.html. I recommend installing everything, as we'll be using a number of corpora and pickled objects. The data is installed in a data directory, which on Mac and Linux/Unix is usually /usr/share/nltk_data, or on Windows is C:\nltk_data. Make sure that tokenizers/punkt.zip is in the data directory and has been unpacked so that there's a file at tokenizers/punkt/PY3/english.pickle.

Finally, to run the code examples, you'll need to start a Python console. Instructions on how to do so are available at http://nltk.org/install.html. For Mac and Linux/Unix users, you can open a terminal and type python.

How to do it...

Once NLTK is installed and you have a Python console running, we can start by creating a paragraph of text:

>>> para = "Hello World. It's good to see you. Thanks for buying this book."

Tip

Downloading the example code

You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

Now we want to split the paragraph into sentences. First we need to import the sentence tokenization function, and then we can call it with the paragraph as an argument:

>>> from nltk.tokenize import sent_tokenize
>>> sent_tokenize(para)
['Hello World.', "It's good to see you.", 'Thanks for buying this book.']

So now we have a list of sentences that we can use for further processing.

How it works...

The sent_tokenize function uses an instance of PunktSentenceTokenizer from the nltk.tokenize.punkt module. This instance has already been trained and works well for many European languages. So it knows what punctuation and characters mark the end of a sentence and the beginning of a new sentence.

There's more...

The instance used in sent_tokenize() is actually loaded on demand from a pickle file. So if you're going to be tokenizing a lot of sentences, it's more efficient to load the PunktSentenceTokenizer class once, and call its tokenize() method instead:

>>> import nltk.data
>>> tokenizer = nltk.data.load('tokenizers/punkt/PY3/english.pickle')
>>> tokenizer.tokenize(para)
['Hello World.', "It's good to see you.", 'Thanks for buying this book.']

Tokenizing sentences in other languages

If you want to tokenize sentences in languages other than English, you can load one of the other pickle files in tokenizers/punkt/PY3 and use it just like the English sentence tokenizer. Here's an example for Spanish:

>>> spanish_tokenizer = nltk.data.load('tokenizers/punkt/PY3/spanish.pickle')
>>> spanish_tokenizer.tokenize('Hola amigo. Estoy bien.')
['Hola amigo.', 'Estoy bien.']

You can see a list of all the available language tokenizers in /usr/share/nltk_data/tokenizers/punkt/PY3 (or C:\nltk_data\tokenizers\punkt\PY3).

Tokenizing sentences into words

In this recipe, we'll split a sentence into individual words. The simple task of creating a list of words from a string is an essential part of all text processing.

How to do it...

Basic word tokenization is very simple; use the word_toke nize() function:

>>> from nltk.tokenize import word_tokenize
>>> word_tokenize('Hello World.')
['Hello', 'World', '.']

How it works...

The word_tokenize() function is a wrapper function that calls tokenize() on an instance of the TreebankWordTokenizer class. It's equivalent to the following code:

>>> from nltk.tokenize import TreebankWordTokenizer
>>> tokenizer = TreebankWordTokenizer()
>>> tokenizer.tokenize('Hello World.')
['Hello', 'World', '.']

It works by separating words using spaces and punctuation. And as you can see, it does not discard the punctuation, allowing you to decide what to do with it.

There's more...

Ignoring the obviously named WhitespaceTokenizer and SpaceTokenizer, there are two other word tokenizers worth looking at: PunktWordTokenizer and WordPunctTokenizer. These differ from TreebankWordTokenizer by how they handle punctuation and contractions, but they all inherit from TokenizerI. The inheritance tree looks like what's shown in the following diagram:

Separating contractions

The TreebankWordTokenizer class uses conventions found in the Penn Treebank corpus. This corpus is one of the most used corpora for natural language processing, and was created in the 1980s by annotating articles from the Wall Street Journal. We'll be using this later in Chapter 4, Part-of-speech Tagging, and Chapter 5, Extracting Chunks.

One of the tokenizer's most significant conventions is to separate contractions. For example, consider the following code:

>>> word_tokenize("can't")
['ca', "n't"]

If you find this convention unacceptable, then read on for alternatives, and see the next recipe for tokenizing with regular expressions.

PunktWordTokenizer

An alternative word tokenizer is PunktWordTokenizer. It splits on punctuation, but keeps it with the word instead of creating separate tokens, as shown in the following code:

>>> from nltk.tokenize import PunktWordTokenizer
>>> tokenizer = PunktWordTokenizer()
>>> tokenizer.tokenize("Can't is a contraction.")
['Can', "'t", 'is', 'a', 'contraction.']

WordPunctTokenizer

Another alternative word tokenizer is WordPunctTokenizer. It splits all punctuation into separate tokens:

>>> from nltk.tokenize import WordPunctTokenizer
>>> tokenizer = WordPunctTokenizer()
>>> tokenizer.tokenize("Can't is a contraction.")
['Can', "'", 't', 'is', 'a', 'contraction', '.']

Tokenizing sentences using regular expressions

Regular expressions can be used if you want complete control over how to tokenize text. As regular expressions can get complicated very quickly, I only recommend using them if the word tokenizers covered in the previous recipe are unacceptable.

Getting ready

First you need to decide how you want to tokenize a piece of text as this will determine how you construct your regular expression. The choices are:

Match on the tokens
Match on the separators or gaps

We'll start with an example of the first, matching alphanumeric tokens plus single quotes so that we don't split up contractions.

How to do it...

We'll create an instance of RegexpTokenizer, giving it a regular expression string to use for matching tokens:

>>> from nltk.tokenize import RegexpTokenizer
>>> tokenizer = RegexpTokenizer("[\w']+")
>>> tokenizer.tokenize("Can't is a contraction.")
["Can't", 'is', 'a', 'contraction']

There's also a simple helper function you can use if you don't want to instantiate the class, as shown in the following code:

>>> from nltk.tokenize import regexp_tokenize
>>> regexp_tokenize("Can't is a contraction.", "[\w']+")
["Can't", 'is', 'a', 'contraction']

Now we finally have something that can treat contractions as whole words, instead of splitting them into tokens.

How it works...

The RegexpTokenizer class works by compiling your pattern, then calling re.findall() on your text. You could do all this yourself using the re module, but RegexpTokenizer implements the TokenizerI interface, just like all the word tokenizers from the previous recipe. This means it can be used by other parts of the NLTK package, such as corpus readers, which we'll cover in detail in Chapter 3, Creating Custom Corpora. Many corpus readers need a way to tokenize the text they're reading, and can take optional keyword arguments specifying an instance of a TokenizerI subclass. This way, you have the ability to provide your own tokenizer instance if the default tokenizer is unsuitable.

There's more...

RegexpTokenizer can also work by matching the gaps, as opposed to the tokens. Instead of using re.findall(), the RegexpTokenizer class will use re.split(). This is how the BlanklineTokenizer class in nltk.tokenize is implemented.

Simple whitespace tokenizer

The following is a simple example of using RegexpT okenizer to tokenize on whitespace:

>>> tokenizer = RegexpTokenizer('\s+', gaps=True)
>>> tokenizer.tokenize("Can't is a contraction.")
["Can't", 'is', 'a', 'contraction.']

Notice that punctuation still remains in the tokens. The gaps=True parameter means that the pattern is used to identify gaps to tokenize on. If we used gaps=False, then the pattern would be used to identify tokens.