Tokenizing sentences using regular expressions
Regular expressions can be used if you want complete control over how to tokenize text. As regular expressions can get complicated very quickly, I only recommend using them if the word tokenizers covered in the previous recipe are unacceptable.
Getting ready
First, you need to decide how you want to tokenize a piece of text, as this will determine how you construct your regular expression. The choices are:
- Match on the tokens
- Match on the separators or gaps
We'll start with an example of the first, matching alphanumeric tokens plus single quotes so that we don't split up contractions.
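To make the distinction concrete before bringing in NLTK, here is a minimal sketch using only Python's re module on the example sentence used throughout this recipe: findall() implements the first choice, split() the second.
>>> import re
>>> text = "Can't is a contraction."
>>> # Match on the tokens: findall() returns every substring the pattern matches
>>> re.findall(r"[\w']+", text)
["Can't", 'is', 'a', 'contraction']
>>> # Match on the separators or gaps: split() returns whatever lies between matches
>>> re.split(r'\s+', text)
["Can't", 'is', 'a', 'contraction.']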
How to do it...
We'll create an instance of RegexpTokenizer, giving it a regular expression string to use for matching tokens:
>>> from nltk.tokenize import RegexpTokenizer
>>> tokenizer = RegexpTokenizer("[\w']+")
>>> tokenizer.tokenize("Can't is a contraction.")
["Can't", 'is', 'a', 'contraction']
There's also a simple helper function you can use if you don't want to instantiate the class, as shown in the following code:
>>> from nltk.tokenize import regexp_tokenize
>>> regexp_tokenize("Can't is a contraction.", "[\w']+")
["Can't", 'is', 'a', 'contraction']
Now we finally have something that can treat contractions as whole words, instead of splitting them into separate tokens.
How it works...
The RegexpTokenizer class works by compiling your pattern and then calling re.findall() on your text. You could do all this yourself using the re module, but RegexpTokenizer implements the TokenizerI interface, just like all the word tokenizers from the previous recipe. This means it can be used by other parts of the NLTK package, such as corpus readers, which we'll cover in detail in Chapter 3, Creating Custom Corpora. Many corpus readers need a way to tokenize the text they're reading, and can take optional keyword arguments specifying an instance of a TokenizerI subclass. This way, you can provide your own tokenizer instance if the default tokenizer is unsuitable.
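As a rough sketch of both points, the non-gaps behaviour is close to calling re.findall() yourself, and a TokenizerI instance can be passed to a corpus reader; PlaintextCorpusReader accepts one through its word_tokenizer keyword argument. The cookbook/ directory and the file pattern below are placeholders for wherever your own plain-text corpus lives.
>>> import re
>>> # Roughly what RegexpTokenizer does in its default (non-gaps) mode
>>> re.findall(r"[\w']+", "Can't is a contraction.")
["Can't", 'is', 'a', 'contraction']
>>> # Supplying a custom tokenizer to a corpus reader
>>> from nltk.tokenize import RegexpTokenizer
>>> from nltk.corpus.reader import PlaintextCorpusReader
>>> reader = PlaintextCorpusReader('cookbook/', r'.*\.txt',
...     word_tokenizer=RegexpTokenizer(r"[\w']+"))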
There's more...
RegexpTokenizer can also work by matching the gaps, as opposed to the tokens. Instead of using re.findall(), the RegexpTokenizer class will use re.split(). This is how the BlanklineTokenizer class in nltk.tokenize is implemented.
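For example, BlanklineTokenizer splits text wherever a blank line occurs, and you can approximate the same behaviour yourself with a gaps-mode RegexpTokenizer (the pattern below is a simplified stand-in for the one the class actually uses):
>>> from nltk.tokenize import BlanklineTokenizer, RegexpTokenizer
>>> para = "First paragraph.\n\nSecond paragraph."
>>> BlanklineTokenizer().tokenize(para)
['First paragraph.', 'Second paragraph.']
>>> RegexpTokenizer(r'\n\s*\n', gaps=True).tokenize(para)
['First paragraph.', 'Second paragraph.']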
Simple whitespace tokenizer
The following is a simple example of using RegexpTokenizer to tokenize on whitespace:
>>> tokenizer = RegexpTokenizer('\s+', gaps=True)
>>> tokenizer.tokenize("Can't is a contraction.")
["Can't", 'is', 'a', 'contraction.']
Notice that punctuation still remains in the tokens. The gaps=True parameter means that the pattern is used to identify gaps to tokenize on. If we used gaps=False, then the pattern would be used to identify tokens.
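To see the difference, the same pattern with gaps=False treats the matched whitespace itself as the tokens, which is rarely what you want; the expected output is simply the whitespace runs that re.findall() would return:
>>> tokenizer = RegexpTokenizer('\s+', gaps=False)
>>> tokenizer.tokenize("Can't is a contraction.")
[' ', ' ', ' ']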
See also
For simpler word tokenization, see the previous recipe.