The first step we will take to begin analyzing text is loading text files and then tokenizing our data by transforming the text from sentences into smaller pieces, such as words or terms. A text object can be tokenized in a number of ways. In this chapter, we will tokenize text into words, although terms of other sizes can also be produced. These are referred to as n-grams: two-word terms are 2-grams (bigrams), three-word terms are 3-grams (trigrams), and so on for a term of any arbitrary size.
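As a minimal sketch of the distinction between word tokens and n-grams, the following uses tidytext's `unnest_tokens()` on an invented one-sentence example (the sentence and variable names are illustrative, not from the chapter's data):

```r
library(dplyr)
library(tidytext)

# A tiny, invented text object for illustration
text_df <- tibble(line = 1, text = "Text mining turns raw text into data")

# One row per word (1-gram)
words <- text_df %>%
  unnest_tokens(word, text)

# One row per two-word term (2-gram)
bigrams <- text_df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)
```

Here `words` has one row per word, while `bigrams` has one row per overlapping word pair (e.g. "text mining", "mining turns").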
To get started with the process of creating one-word tokens from our text objects, we will use the following steps:
- Let's load the libraries that we will need. For this project, we will use tidyverse for data manipulation, tidytext for functions specialized for working with text data, spacyr for extracting text metadata, and textmineR for word embeddings. To load these libraries, we run...
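The library-loading step above can be sketched as follows; this assumes the four packages have already been installed (for example, via `install.packages()`):

```r
library(tidyverse)  # data manipulation
library(tidytext)   # tidy tools for text data
library(spacyr)     # text metadata via spaCy
library(textmineR)  # word embeddings
```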