Summary
Applying data science techniques to text spans several areas, and the broad field of working with language and text using computers is called natural language processing (NLP). We first saw how to clean text data using Python and the spaCy package by removing items such as punctuation, stop words, and numbers. Lowercasing collapses differently capitalized forms of the same word into a single count for word frequency analysis. We can also use stemming or lemmatizing to reduce words to a stem or root, which further groups related words when measuring word frequencies. The spaCy package makes cleaning and lemmatizing straightforward, taking only a few lines of code.
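As a quick reminder of why these cleaning steps matter, here is a minimal sketch of how lowercasing changes word counts. It deliberately uses only the Python standard library rather than spaCy, so it is a stand-in for the idea and not the spaCy pipeline itself. The `clean_tokens` function, the tiny `STOP_WORDS` set, and the sample sentence are all hypothetical names and data made up for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list for illustration only;
# spaCy ships a much larger one (exposed per token as `token.is_stop`).
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "on"}


def clean_tokens(text, lowercase=True):
    """Return word tokens with punctuation, numbers, and stop words removed."""
    # Keep only runs of letters (allowing internal apostrophes), which
    # drops punctuation and digits in a single step.
    tokens = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)*", text)
    if lowercase:
        tokens = [t.lower() for t in tokens]
    # Compare against the stop-word list case-insensitively.
    return [t for t in tokens if t.lower() not in STOP_WORDS]


text = "The cat sat on the mat. The Cat saw 2 cats, and the cats ran!"

cased = Counter(clean_tokens(text, lowercase=False))
lowered = Counter(clean_tokens(text, lowercase=True))

print(cased.most_common(3))
print(lowered.most_common(3))
```

Without lowercasing, "cat" and "Cat" are counted separately; with lowercasing, they share one count. Note that "cat" and "cats" still remain separate entries, which is the gap that stemming or lemmatizing (for example, via spaCy's `token.lemma_` with a trained pipeline) is meant to close.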
We then saw how basic analyses, such as word frequency plots, part-of-speech (POS) tags, and word collocations, help us get an understanding of a text. Zipf's law can also be used to analyze text: fitting a Zipfian distribution to the word frequencies gives a shape parameter that characterizes the text. Although wordclouds...