Tokenization
Next, we will learn about tokenization, a way of pre-processing text before it enters our models. Tokenization splits text up into smaller parts: this could mean splitting a sentence into its individual words, or splitting a whole document into individual sentences (a sentence-level sketch appears at the end of this section). This is an essential pre-processing step for NLP and can be done fairly simply in Python:
- We first take a basic sentence and split this up into individual words using the word tokenizer in NLTK:
from nltk.tokenize import word_tokenize

# word_tokenize requires the 'punkt' tokenizer data: nltk.download('punkt')
text = 'This is a single sentence.'
tokens = word_tokenize(text)
print(tokens)
This results in the following output:

['This', 'is', 'a', 'single', 'sentence', '.']
- Note how the period (.) is considered a token, as it is a part of natural language. Depending on what we want to do with the text, we may wish to keep or discard the punctuation:

# Keep only purely alphabetic tokens, lowercased
no_punctuation = [word.lower() for word in tokens if word.isalpha()]
print(no_punctuation)
This results in the following output:

['this', 'is', 'a', 'single', 'sentence']
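As mentioned at the start of this section, tokenization can also operate at the sentence level. As a minimal sketch, NLTK's sent_tokenize can split a document into individual sentences; the sample document here is an invented example:

from nltk.tokenize import sent_tokenize

# A small sample document (an invented example) containing several sentences
document = 'This is the first sentence. This is the second. Here is a third!'

# sent_tokenize returns a list of sentence strings; like word_tokenize,
# it relies on the 'punkt' tokenizer data
sentences = sent_tokenize(document)
print(sentences)
# ['This is the first sentence.', 'This is the second.', 'Here is a third!']

Each sentence can then be passed to word_tokenize in turn, giving a list of tokens per sentence, which is a common shape for document-level pre-processing.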