Counting characters, words, and vocabulary
One of the salient characteristics of a text is its complexity. Long descriptions are likely to contain more information than short ones, and texts rich in different, unique words are likely to carry more detail than texts that repeat the same words over and over. Similarly, when we speak, we use many short words, such as articles and prepositions, to build the sentence structure, yet the main concept is often carried by the nouns and adjectives, which tend to be longer words. So, even without reading the text, we can start inferring how much information it provides by determining the number of words, the number of unique words (counting each distinct word once, however often it occurs), the lexical diversity, and the length of those words. In this recipe, we will learn how to extract these features from a text variable using pandas.
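To make these features concrete before we work with the real data, here is a minimal sketch on a toy DataFrame. The column name `text`, the sample sentences, and the new column names are assumptions made for illustration only. Lexical diversity is computed here as unique words divided by total words (the type-token ratio), which is one common definition; the inverse ratio is also used.

```python
import pandas as pd

# Toy data for illustration; the column name "text" is an assumption.
df = pd.DataFrame({"text": ["the cat sat on the mat", "Dogs bark"]})

# Number of characters, including whitespace.
df["num_char"] = df["text"].str.len()

# Split each text on whitespace to obtain a list of words per row.
words = df["text"].str.lower().str.split()

# Number of words.
df["num_words"] = words.apply(len)

# Number of unique words (the vocabulary of each text).
df["num_vocab"] = words.apply(lambda ws: len(set(ws)))

# Lexical diversity as unique words / total words (an assumed definition).
df["lexical_div"] = df["num_vocab"] / df["num_words"]

# Average word length: characters in words divided by number of words.
df["avg_word_len"] = words.apply(
    lambda ws: sum(len(w) for w in ws) / len(ws) if ws else 0.0
)

print(df)
```

Splitting on whitespace is a deliberate simplification: punctuation stays attached to words, so a real text would typically be cleaned before counting.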
Getting ready
We are going to use the 20 Newsgroups dataset that comes with...