Tokenizing text data
When we work with text, we need to break it down into smaller pieces for analysis. This is where tokenization comes into the picture. Tokenization is the process of dividing text into a set of pieces, such as words or sentences; these pieces are called tokens. Depending on the task at hand, we can define our own methods for dividing the text into tokens. Let's look at how to tokenize the input text using NLTK.
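For instance, a quick sketch of a do-it-yourself tokenizer (for illustration only, not part of the recipe that follows) might simply split on whitespace. Notice how punctuation stays attached to the words, which is one reason dedicated tokenizers are useful:
# Naive whitespace tokenizer -- punctuation stays glued to the words
text = "Tokenization is interesting, isn't it?"
print(text.split())
# ['Tokenization', 'is', 'interesting,', "isn't", 'it?']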
Create a new Python file and import the following packages:
from nltk.tokenize import sent_tokenize, \
word_tokenize, WordPunctTokenizer
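Note that NLTK's sentence and word tokenizers rely on the pre-trained Punkt models. If this data is not already installed on your machine, a one-time download from a Python shell should take care of it (on very recent NLTK releases the resource may be named punkt_tab instead):
import nltk
nltk.download('punkt')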
Define the input text that will be used for tokenization:
# Define input text
input_text = "Do you know how tokenization works? It's actually \
quite interesting! Let's analyze a couple of sentences and \
figure it out."
Divide the input text into sentence tokens:
# Sentence tokenizer
print("\nSentence tokenizer:")
print(sent_tokenize(input_text))
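Assuming the Punkt data is available, the output should contain one token per sentence, roughly as follows:
Sentence tokenizer:
['Do you know how tokenization works?', "It's actually quite interesting!", "Let's analyze a couple of sentences and figure it out."]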