Cleaning the corpus
One of the nicest features of the tm package is the variety of bundled transformations that can be applied to corpora (corpuses). The tm_map function provides a convenient way of running these transformations on a corpus to filter out all the data that is irrelevant to the research at hand. To see the list of available transformation methods, simply call the getTransformations function:
> getTransformations()
[1] "as.PlainTextDocument" "removeNumbers"
[3] "removePunctuation"    "removeWords"
[5] "stemDocument"         "stripWhitespace"
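As a minimal sketch of how the transformations listed above can be chained with tm_map, the following builds a tiny corpus and applies a few of them in turn. The two example sentences and the object names (docs, cleaned) are made up for illustration, and the VCorpus and content_transformer helpers assume a reasonably recent tm release (0.6 or later); older versions may need a slightly different call.

```r
library(tm)

# Two made-up documents containing digits, punctuation and extra spaces
docs <- c("Hello,   World! 123",
          "The tm package   has 2 vignettes.")

# Build an in-memory corpus from the character vector
cleaned <- VCorpus(VectorSource(docs))

# Base R functions such as tolower have to be wrapped so that
# the result is still a text document
cleaned <- tm_map(cleaned, content_transformer(tolower))

# Bundled transformations can be passed to tm_map directly
cleaned <- tm_map(cleaned, removeNumbers)
cleaned <- tm_map(cleaned, removePunctuation)
cleaned <- tm_map(cleaned, stripWhitespace)

# Inspect the text of the first document after cleaning
content(cleaned[[1]])
```

Note that stripWhitespace only collapses runs of whitespace into a single space, so a leading or trailing space may still remain after the other steps.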
We should usually start by removing the most frequently used words, the so-called stopwords, from the corpus. These are the most common, short function words, which usually carry less meaning than the other expressions in the corpus, especially the keywords. The package already includes such lists of words for different languages:
> stopwords("english")
[1] "i" "me" "my" "myself" "we"
[6] "our...
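A short sketch of how such a stopword list might be used: the removeWords transformation takes the words to drop as an extra argument to tm_map. The example sentence and the object names are hypothetical. Since the bundled English list is in lower case and the matching is case-sensitive, the text is lowercased first in this sketch.

```r
library(tm)

# A made-up one-document corpus
sw_corpus <- VCorpus(VectorSource("This is my favourite package for text mining"))

# Lowercase first, so that capitalised stopwords are matched as well
sw_corpus <- tm_map(sw_corpus, content_transformer(tolower))

# Drop every word found in the bundled English stopword list
sw_corpus <- tm_map(sw_corpus, removeWords, stopwords("english"))

# Removing words leaves gaps behind, so collapse the extra spaces
sw_corpus <- tm_map(sw_corpus, stripWhitespace)

# Inspect the remaining text
sw_result <- content(sw_corpus[[1]])
sw_result
```

Any character vector can be passed in place of stopwords("english"), for instance to add domain-specific terms that should also be dropped.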