Our goal here is to build a classifier that predicts the President's party affiliation, either Democrat or Republican, for the period since 1900. We will turn the word counts per year into features by creating a document-term matrix (DTM), weight those features using term frequency-inverse document frequency (tf-idf), and use them in our model. As you can imagine, this produces thousands of features, so we will prepare the data differently from how we did in prior sections, and we will use the text2vec package for feature creation and modeling.
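To make the DTM and tf-idf ideas concrete before we turn to text2vec, here is a minimal base R sketch on a toy corpus. The three documents and every object name (docs, tokens, vocab, dtm, tf, idf, tfidf) are invented for illustration, and the weighting is the plain textbook formula, tf times log(N / df); text2vec's own tf-idf may differ in smoothing and normalization details.

```r
# Toy corpus: three hypothetical documents (not taken from the sotu data)
docs <- c(d1 = "tax cuts and jobs",
          d2 = "jobs and health care",
          d3 = "health care and tax reform")

# Lower-case and split on whitespace to get one token vector per document
tokens <- strsplit(tolower(docs), "\\s+")

# Vocabulary: every distinct term, sorted alphabetically
vocab <- sort(unique(unlist(tokens)))

# Document-term matrix: one row per document, one column per term,
# each cell holding the raw count of that term in that document
dtm <- t(vapply(tokens,
                function(tok) as.numeric(table(factor(tok, levels = vocab))),
                numeric(length(vocab))))
colnames(dtm) <- vocab

# Term frequency: counts scaled by document length
tf <- dtm / rowSums(dtm)

# Inverse document frequency: log(number of docs / docs containing the term)
idf <- log(nrow(dtm) / colSums(dtm > 0))

# tf-idf: multiply each column of tf by that term's idf
tfidf <- sweep(tf, 2, idf, `*`)

print(round(tfidf, 3))
```

A term that appears in every document, such as "and" here, gets an idf of log(1) = 0, so it drops out as a feature, while terms confined to a single document receive the largest weights. That is the behavior we want when thousands of candidate features compete for the model's attention.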
Classifying text
Data preparation
We'll start by restricting the data to the period of interest. Then, we'll take a look at a table of the labels:
> sotu_party <- sotu_meta %>%
    dplyr::filter(year > 1899)
> table...