Dividing sentences into words – tokenization
In many NLP tasks, we rely on individual words. This happens, for example, when we build semantic models of texts from the semantics of individual words, or when we look for words with a specific part of speech. To divide text into words, we can have NLTK or spaCy do the work for us, as the sketch below shows.
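As a quick preview, here is a minimal sketch of word tokenization with both libraries, using a short sample sentence; it assumes the required NLTK tokenizer data and the spaCy en_core_web_sm model have already been downloaded (see Getting ready):

import nltk
import spacy

text = "It is a capital mistake to theorize before one has data."

# NLTK: rule-based word tokenization (requires the punkt tokenizer data)
nltk_words = nltk.word_tokenize(text)
print(nltk_words)

# spaCy: tokenization happens as part of running the language pipeline
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
spacy_words = [token.text for token in doc]
print(spacy_words)

Note that the two tokenizers can disagree on edge cases such as contractions and hyphenated words, which is one reason to compare their output on your own text.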
Getting ready
For this part, we will be using the same text from the book The Adventures of Sherlock Holmes. You can find the whole text in the book's GitHub repository. For this recipe, we will need just the beginning of the book, which can be found in the sherlock_holmes_1.txt file.
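For instance, the file can be read into a string like this (assuming it sits in the current working directory; adjust the path to wherever you placed it):

with open("sherlock_holmes_1.txt", encoding="utf-8") as f:
    text = f.read()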
To do this task, you will need the NLTK and spaCy packages, which are included in the Poetry file. Directions for installing Poetry are given in the Technical requirements section.
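Beyond installing the packages, both libraries need some downloaded resources before tokenization works. A minimal setup sketch (the exact data package names may vary slightly between NLTK versions) might be:

import nltk

# Tokenizer data used by nltk.word_tokenize
nltk.download("punkt")

# For spaCy, download the small English model from the command line:
#   python -m spacy download en_core_web_sm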
(Notebook reference: https://github.com/PacktPublishing/Python-Natural-Language-Processing-Cookbook-Second-Edition/blob...)