Estimating text complexity by counting sentences
One aspect of a piece of text that we can capture in features is its complexity. Longer descriptions that contain multiple sentences spread over several paragraphs tend to provide more information than descriptions with very few sentences. Therefore, capturing the number of sentences may give us some insight into the amount of information provided by the text. To count sentences, we first need to split the text into sentences, a process called sentence tokenization. Tokenization is the process of splitting a string into a list of pieces or tokens. In the Counting characters, words, and vocabulary recipe, we performed word tokenization – that is, we divided the string into words. In this recipe, we will divide the string into sentences and then count them. We will use the NLTK Python library, which provides this functionality.
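As a preview, here is a minimal sketch of sentence tokenization and sentence counting with NLTK's sent_tokenize function; the sample text is made up for illustration, and it assumes the Punkt tokenizer models have been downloaded:

```python
import nltk
from nltk.tokenize import sent_tokenize

# Download the Punkt sentence tokenizer models (only needed once).
# Note: very recent NLTK versions may ask for "punkt_tab" instead.
nltk.download("punkt")

# A small, made-up description to tokenize.
text = (
    "This product is made of stainless steel. "
    "It is dishwasher safe. "
    "The handle is ergonomically designed for comfort."
)

# Split the text into a list of sentences.
sentences = sent_tokenize(text)

# The number of sentences is a simple proxy for the text's complexity.
print(sentences)
print(f"Number of sentences: {len(sentences)}")
```

In the recipe that follows, we will apply this kind of tokenization to a column of text and capture the sentence count as a new feature.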
Getting ready
In this recipe, we will use the NLTK Python library. For guidelines on how to install NLTK, check out the Technical requirements section...