Sentiment analysis from movie reviews
Let's continue with the IMDb data and put the ideas from the previous sections into practice. In this section, we will use a few familiar packages, like tidytext, plyr, and dplyr, as well as the excellent text2vec by Dmitriy Selivanov, which was released in 2017, and the well-known caret package by Max Kuhn.
Data preprocessing
We need to prepare our data for the algorithm.
First, a few imports that will be necessary:
library(plyr)
library(dplyr)
library(text2vec)
library(tidytext)
library(caret)
We will use the IMDb data as before:
imdb <- read.csv("./data/labeledTrainData.tsv", encoding = "UTF-8", quote = "", sep = "\t", stringsAsFactors = FALSE)
And create an iterator over the tokens:
tokens <- space_tokenizer(imdb$review)
token_iterator <- itoken(tokens)
The tokens are simple words, also known as unigrams. This constitutes our vocabulary:
vocab <- create_vocabulary(token_iterator)
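To see what these steps produce, it can help to run them on a tiny input first. The following is a minimal sketch, assuming a recent version of text2vec; the two toy reviews and the toy_ variable names are made up for illustration and are not part of the IMDb data. The exact column names of the printed vocabulary may differ between text2vec versions.

```r
library(text2vec)

# Two made-up reviews, used only to illustrate the steps (hypothetical data)
toy_reviews <- c("a great great movie", "a dull movie")

# Split each review on single spaces; the result is a list of character vectors
toy_tokens <- space_tokenizer(toy_reviews)

# Wrap the token list in an iterator, as done for the IMDb reviews
toy_iterator <- itoken(toy_tokens, progressbar = FALSE)

# Collect the unique unigrams together with their counts
toy_vocab <- create_vocabulary(toy_iterator)

print(toy_vocab)
```

Each distinct word becomes one entry of the vocabulary, along with how often it occurs and in how many documents it appears. Note that splitting on spaces keeps punctuation attached to words, so on real reviews a token such as "movie." is counted separately from "movie".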
It's important for the co-occurrence matrix to include only words that appear...