Pre-processing data using tokenization
Pre-processing converts raw text into a form that a learning algorithm can work with.
Tokenization is the process of dividing text into a set of meaningful pieces. These pieces are called tokens.
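To make the idea of tokens concrete, here is a minimal sketch that uses only Python's standard library. It is not the NLTK implementation used in this recipe. The names `TOKEN_PATTERN` and `simple_tokenize` and the sample sentence are hypothetical, chosen for illustration. The sketch treats every run of word characters, and every run of punctuation, as one token:

```python
import re

# Assumed, illustrative rule: a token is either a run of word characters
# or a run of characters that are neither word characters nor whitespace.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]+")


def simple_tokenize(text):
    """Split text into a list of word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


# Placeholder sentence, not taken from the recipe.
sample = "Tokens are pieces of text, aren't they?"
print(simple_tokenize(sample))
# ['Tokens', 'are', 'pieces', 'of', 'text', ',', 'aren', "'", 't', 'they', '?']
```

Note how a rule this simple breaks the contraction "aren't" into three tokens. Handling such cases sensibly is one reason to prefer a library tokenizer over a hand-written pattern.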
How to do it...
- Import the sentence tokenizer:
from nltk.tokenize import sent_tokenize
- Split the input into sentences (text holds the string to be tokenized):
tokenize_list_sent = sent_tokenize(text)
print("\nSentence tokenizer:")
print(tokenize_list_sent)
- Split the input into words with the word tokenizer:
from nltk.tokenize import word_tokenize
print("\nWord tokenizer:")
print(word_tokenize(text))
- Create a WordPunct tokenizer, which splits punctuation into separate tokens:
from nltk.tokenize import WordPunctTokenizer
word_punct_tokenizer = WordPunctTokenizer()
print("\nWord punct tokenizer:")
print(word_punct_tokenizer.tokenize(text))
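The steps above can be combined into a single script, sketched below under a few assumptions. The sample `text` is a placeholder, not the text used in the recipe. The helper names `ensure_sentence_models` and `show_tokenizers` are hypothetical. `sent_tokenize` and `word_tokenize` rely on NLTK's Punkt sentence models, which are downloaded separately; depending on the NLTK version the resource is named `punkt` or `punkt_tab`, so the sketch requests both.

```python
import nltk
from nltk.tokenize import WordPunctTokenizer, sent_tokenize, word_tokenize

# WordPunctTokenizer is regular-expression based and needs no extra data.
word_punct_tokenizer = WordPunctTokenizer()


def ensure_sentence_models():
    """Request the Punkt data that sent_tokenize and word_tokenize need.

    Assumption: the resource is called 'punkt' or 'punkt_tab', depending
    on the installed NLTK version, so both names are requested.
    """
    for resource in ("punkt", "punkt_tab"):
        nltk.download(resource, quiet=True)


def show_tokenizers(text):
    """Print the output of the three tokenizers used in this recipe."""
    print("\nSentence tokenizer:")
    print(sent_tokenize(text))

    print("\nWord tokenizer:")
    print(word_tokenize(text))

    print("\nWord punct tokenizer:")
    print(word_punct_tokenizer.tokenize(text))


if __name__ == "__main__":
    # Placeholder input; substitute the text you want to tokenize.
    text = "Tokenization looks simple. Isn't it? Let's try a few sentences!"
    ensure_sentence_models()
    show_tokenizers(text)
```

The `print(...)` calls use a single argument, so the script runs unchanged whether `print` is a statement or a function.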
The output of the tokenizers is shown here. The sentence tokenizer divides the text into sentences, and the word tokenizers divide it into words and punctuation marks: