Basic text analysis
The first step in analysis is to explore the data. Common exploratory data analysis (EDA) with text includes frequency and TF-IDF bar plots as well as plots of word counts. In this section, we'll also look at Zipf's law, word collocations, and the part-of-speech (POS) tags in our data.
Word frequency plots
A simple way to explore data is with a word frequency or word count plot. There are a few ways to generate this: we could use the CountVectorizer from sklearn, NLTK's FreqDist, pycaret, and more.
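As a minimal sketch of the counting step behind such a plot, the following uses Python's built-in collections.Counter rather than any of the libraries named above (NLTK's FreqDist is a subclass of Counter, so most_common works the same way there). The toy corpus docs and the helper tokenize are hypothetical, introduced only for illustration, and the regex tokenizer is a simplifying assumption rather than a recommended preprocessing step.

```python
import re
from collections import Counter

# Hypothetical toy corpus, used only for illustration.
docs = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
    "A cat and a dog met on the mat.",
]


def tokenize(text):
    """Lowercase the text and extract simple word tokens (assumed tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


# Count every token across all documents.
word_counts = Counter(token for doc in docs for token in tokenize(doc))

# The five most frequent words as (word, count) pairs, highest count first.
top_words = word_counts.most_common(5)

# Text-only bar chart; the same pairs could be handed to a plotting library.
for word, count in top_words:
    print(f"{word:<5} {'#' * count} ({count})")
```

The (word, count) pairs in top_words are what a bar plot would consume, for example as the x and y values of a bar chart. With real text, stop words such as "the" usually dominate the counts, so they are often filtered out before plotting.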
Note that at the time of writing, pycaret installs spacy version 2.x.x, while the latest is 3.x.x. One solution is to install pycaret, then reinstall spacy at the latest version with conda install spacy=3.1.2 (substituting the latest version at the time of reading for 3.1.2), although this could cause problems with some pycaret functionality. It may be useful to create a separate conda environment for this chapter to deal with...