Tokenization is the first step in text analysis. It is the process of breaking a paragraph of text down into smaller units, or tokens, such as sentences or words, sometimes discarding punctuation marks. Tokenization is of two types: sentence tokenization and word tokenization. A sentence tokenizer splits a paragraph into sentences, while a word tokenizer splits a text into words or tokens.
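For instance, here is a quick sketch of the difference on a short illustrative string (assuming NLTK and its punkt tokenizer data are already installed; the download steps appear below):
# Sentence vs. word tokenization on a short example
from nltk.tokenize import sent_tokenize, word_tokenize
text = "Hello World. NLTK is a Python library."
print(sent_tokenize(text))
# ['Hello World.', 'NLTK is a Python library.']
print(word_tokenize(text))
# ['Hello', 'World', '.', 'NLTK', 'is', 'a', 'Python', 'library', '.']
Note that the word tokenizer keeps punctuation marks as separate tokens rather than silently dropping them.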
Let's tokenize a paragraph using NLTK and spaCy:
- Before tokenization, import NLTK and download the required files:
# Loading NLTK module
import nltk
# downloading punkt
nltk.download('punkt')
# downloading stopwords
nltk.download('stopwords')
# downloading wordnet
nltk.download('wordnet')
# downloading averaged_perceptron_tagger
nltk.download('averaged_perceptron_tagger')
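spaCy handles both kinds of tokenization through its language pipeline rather than separate functions. The following is a minimal setup sketch, assuming the small English model has been installed with python -m spacy download en_core_web_sm:
# Loading spaCy and its small English model
import spacy
nlp = spacy.load('en_core_web_sm')
# Running the pipeline on a short illustrative string
doc = nlp("Hello World. spaCy is a Python library.")
# Sentence tokenization
print([sent.text for sent in doc.sents])
# Word tokenization
print([token.text for token in doc])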
- Now, we will tokenize paragraphs into sentences using the sent_tokenize() method of NLTK:
# Sentence Tokenization
from nltk.tokenize import sent_tokenize
paragraph="""Taj...