Text mining
NLP is a wide and active field of study. You probably interact with NLP algorithms at least once a day when using voice assistants, translators, or speech-to-text converters. As you may have already noticed in this chapter, there is a lot of information to be extracted from a text, especially when we can count values and measure the importance of words.
For this section, we will use the tidytext library, which helps us to mine text very easily with just a few functions. Before we dive into the following subsections, let’s load a dataset for our examples. It is the book The Time Machine, by H. G. Wells, downloaded from Project Gutenberg with the open source gutenbergr library for R:
library(gutenbergr)
# Downloading "The Time Machine" by H. G. Wells
book <- gutenberg_download(gutenberg_id = 35)
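To get a feel for the object we will be working with, here is a minimal sketch that runs without an internet connection. It builds a small stand-in with the same two columns that gutenberg_download() returns (gutenberg_id and text, one row per line of the book). The name book_sample and its rows are made up for illustration; with the real download, the same calls can be run on book.

```r
library(tibble)

# Hypothetical stand-in for the downloaded book: same columns, made-up rows
book_sample <- tibble(
  gutenberg_id = 35L,
  text = c("The Time Machine", "", "An Invention", "", "by H. G. Wells")
)

# Keep only the lines that actually contain text
non_empty <- book_sample[book_sample$text != "", ]

# Number of rows before and after dropping the blank lines
nrow(book_sample)
nrow(non_empty)
```

Blank lines are common in Project Gutenberg texts, so checking how many rows carry actual text is a cheap sanity check before any mining starts.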
Let’s move on.
Tokenization
We should start with the definition of a token. A token is the smallest meaningful unit of a text. It is most common to find words as tokens,...