Training a sentence tokenizer
NLTK's default sentence tokenizer is general purpose, and usually works quite well. But sometimes it is not the best choice for your text. Perhaps your text uses nonstandard punctuation, or is formatted in a unique way. In such cases, training your own sentence tokenizer can result in much more accurate sentence tokenization.
Getting ready
For this example, we'll be using the webtext corpus, specifically the overheard.txt file, so make sure you've downloaded this corpus. The text in this file is formatted as dialog that looks like this:
White guy: So, do you have any plans for this evening?
Asian girl: Yeah, being angry!
White guy: Oh, that sounds good.
As you can see, this isn't the standard paragraph-of-sentences format, which makes it a perfect case for training a sentence tokenizer.
How to do it...
NLTK provides a PunktSentenceTokenizer class that you can train on raw text to produce a custom sentence tokenizer. You can get raw text either by reading in a file, or from an NLTK corpus using the raw() method. Here's an example of training a sentence tokenizer on dialog text, using overheard.txt from the webtext corpus:
>>> from nltk.tokenize import PunktSentenceTokenizer
>>> from nltk.corpus import webtext
>>> text = webtext.raw('overheard.txt')
>>> sent_tokenizer = PunktSentenceTokenizer(text)
Let's compare the results to the default sentence tokenizer, as follows:
>>> sents1 = sent_tokenizer.tokenize(text)
>>> sents1[0]
'White guy: So, do you have any plans for this evening?'
>>> from nltk.tokenize import sent_tokenize
>>> sents2 = sent_tokenize(text)
>>> sents2[0]
'White guy: So, do you have any plans for this evening?'
>>> sents1[678]
'Girl: But you already have a Big Mac...'
>>> sents2[678]
'Girl: But you already have a Big Mac...\nHobo: Oh, this is all theatrical.'
While the first sentence is the same, you can see that the tokenizers disagree on how to tokenize sentence 679 (index 678, the first sentence where the tokenizers diverge). The default tokenizer lumps the next line of dialog into the same sentence, while our custom tokenizer correctly treats it as a separate sentence. This difference is a good demonstration of why it can be useful to train your own sentence tokenizer, especially when your text isn't in the typical paragraph-sentence structure.
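If you'd rather discover divergence points like this than know the index in advance, a short loop will find the first one. This is a minimal sketch; the exact index it prints depends on your copy of the corpus:

>>> for i, (s1, s2) in enumerate(zip(sents1, sents2)):
...     if s1 != s2:
...         print(i)  # first index where the two tokenizers disagree
...         break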
How it works...
The PunktSentenceTokenizer class uses an unsupervised learning algorithm to learn what constitutes a sentence break. It is unsupervised because you don't have to give it any labeled training data, just raw text. You can read more about these kinds of algorithms at https://en.wikipedia.org/wiki/Unsupervised_learning. The specific technique used in this case is called sentence boundary detection, and it works by counting punctuation and tokens that commonly end a sentence, such as a period or newline, then using the resulting frequencies to decide what the sentence boundaries should actually look like.
This is a simplified description of the algorithm; if you'd like more details, take a look at the source code of the nltk.tokenize.punkt.PunktTrainer class, which can be found online at http://www.nltk.org/_modules/nltk/tokenize/punkt.html#PunktSentenceTokenizer.
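If you want more control over training, you can also drive the learning step yourself with the PunktTrainer class and then build a tokenizer from the learned parameters. This is a minimal sketch, and it should be equivalent to passing the raw text to the PunktSentenceTokenizer constructor, since the constructor performs the same training internally:

>>> from nltk.corpus import webtext
>>> from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer
>>> text = webtext.raw('overheard.txt')
>>> trainer = PunktTrainer()
>>> trainer.train(text)
>>> sent_tokenizer = PunktSentenceTokenizer(trainer.get_params())
>>> sent_tokenizer.tokenize(text)[0]
'White guy: So, do you have any plans for this evening?'

Going through PunktTrainer is useful when you want to feed in text incrementally; its train() method accepts a finalize argument, so you can train on several texts before finalizing the parameters.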
There's more...
The PunktSentenceTokenizer class learns from any string, which means you can open a text file and read its content. Here is an example of reading overheard.txt directly instead of using the raw() corpus method. This assumes that the webtext corpus is located in the standard directory at /usr/share/nltk_data/corpora. We also have to pass a specific encoding to the open() function, as follows, because the file is not in ASCII:
>>> with open('/usr/share/nltk_data/corpora/webtext/overheard.txt', encoding='ISO-8859-2') as f:
...     text = f.read()
>>> sent_tokenizer = PunktSentenceTokenizer(text)
>>> sents = sent_tokenizer.tokenize(text)
>>> sents[0]
'White guy: So, do you have any plans for this evening?'
>>> sents[678]
'Girl: But you already have a Big Mac...'
Once you have a custom sentence tokenizer, you can use it for your own corpora. Many corpus readers accept a sent_tokenizer parameter, which lets you override the default sentence tokenizer object with your own. Corpus readers are covered in more detail in Chapter 3, Creating Custom Corpora.
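For example, the PlaintextCorpusReader class accepts a sent_tokenizer keyword argument. The following is a minimal sketch of plugging in the tokenizer trained in this recipe; the directory path and fileid pattern are hypothetical placeholders for your own plain text files:

>>> from nltk.corpus.reader import PlaintextCorpusReader
>>> # '/path/to/corpus' and the fileid pattern are hypothetical;
>>> # point them at your own plain text files
>>> reader = PlaintextCorpusReader('/path/to/corpus', r'.*\.txt',
...     sent_tokenizer=sent_tokenizer)
>>> sents = reader.sents()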
See also
Most of the time, the default sentence tokenizer will be sufficient. This is covered in the first recipe, Tokenizing text into sentences.