Dividing sentences into words – tokenization
Many NLP tasks rely on individual words. This happens, for example, when we build semantic models of texts from the semantics of individual words, or when we look for words with a specific part of speech. To divide text into words, we can use NLTK and spaCy.
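As a quick illustration of what tokenization produces, here is a minimal sketch using NLTK's word_tokenize function on the famous opening sentence of the book (it assumes NLTK and its punkt tokenizer models are already set up, as described in the following sections):

from nltk.tokenize import word_tokenize

# Tokenization splits a sentence into word and punctuation tokens
sample = "To Sherlock Holmes she is always the woman."
print(word_tokenize(sample))
# ['To', 'Sherlock', 'Holmes', 'she', 'is', 'always', 'the', 'woman', '.']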
Getting ready
For this recipe, we will be using the same text of the book The Adventures of Sherlock Holmes. You can find the whole text in the book's GitHub repository; we will need just the beginning of the book, which can be found in the sherlock_holmes_1.txt file.
In order to do this task, you will need the nltk package, described in the Technical requirements section.
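If you have not completed the setup yet, a minimal sketch of the one-time model download is below (it assumes nltk is already installed, for example with pip, and uses NLTK's standard download function to fetch the punkt tokenizer models that the word tokenizer relies on):

import nltk

# One-time download of the punkt tokenizer models used by word_tokenize
nltk.download("punkt")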
How to do it…
- Import the nltk package:
import nltk
- Read in the book text (a with-based variant is sketched after these steps):
filename = "sherlock_holmes_1.txt"
file = open(filename, "r", encoding="utf-8")
text = file.read()
- Replace...
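As a side note on step 2, a with block is a slightly safer way to read the file, since Python then closes the file handle automatically; here is a minimal equivalent sketch:

filename = "sherlock_holmes_1.txt"
# The with statement closes the file automatically, even if an error occurs
with open(filename, "r", encoding="utf-8") as file:
    text = file.read()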