Dividing text into sentences
When we work with text, we can work with units at different scales: the document itself (for example, a newspaper article), the paragraph, the sentence, or the word. Sentences are the main unit of processing in many NLP tasks. In this section, I will show you how to divide text into sentences.
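To see why a dedicated sentence tokenizer is worth using, it can help to look at a naive baseline first. The sketch below is not the method used in this recipe: it is a plain regular-expression splitter, and both the naive_split function and the sample string are made up for illustration.

```python
import re


def naive_split(text):
    """Naive baseline: split after '.', '!' or '?' when whitespace follows.

    Hypothetical helper for illustration only; it is not part of nltk.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


# Made-up sample text (not a quotation from the book).
sample = "Dr. Watson arrived at noon. He was late! Was Holmes annoyed?"

for sentence in naive_split(sample):
    print(sentence)
```

The baseline breaks the text after the abbreviation "Dr.", so the first sentence comes out as two pieces. Handling cases like this is the main reason to rely on a trained sentence tokenizer rather than a simple rule.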
Getting ready
For this part, we will be using the text of the book The Adventures of Sherlock Holmes. You can find the whole text in the book's GitHub repository (see the sherlock_holmes.txt file). For this recipe, we will need just the beginning of the book, which can be found in the sherlock_holmes_1.txt file.
In order to do this task, you will need the nltk package and its sentence tokenizers, described in the Technical requirements section.
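If you are unsure whether these prerequisites are in place, a quick check along the following lines may help. This is a minimal sketch, not part of the recipe: the check_sentence_tokenizer function is a hypothetical helper, and the resource names it looks for ("tokenizers/punkt", and "tokenizers/punkt_tab" for more recent NLTK releases) are assumptions about how the Punkt data is stored on your machine.

```python
def check_sentence_tokenizer():
    """Return a short status string about nltk and its Punkt tokenizer data.

    Hypothetical helper; the resource names below are assumptions and may
    differ depending on the installed NLTK version.
    """
    try:
        import nltk
    except ImportError:
        return "nltk is not installed"

    for resource in ("tokenizers/punkt", "tokenizers/punkt_tab"):
        try:
            # nltk.data.find raises LookupError when the resource is missing.
            nltk.data.find(resource)
            return f"found {resource}"
        except LookupError:
            continue
    return "nltk is installed, but no Punkt data was found"


print(check_sentence_tokenizer())
```

If the check reports that something is missing, follow the setup steps in the Technical requirements section.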
How to do it…
We will now divide the text of The Adventures of Sherlock Holmes, outputting a list of sentences:
- Import the nltk package:
  import nltk
- Read in the book text...