Importing the corpus
A corpus is a collection of text documents that you want to include in the analysis. Use the getSources function to see the available options for importing a corpus with the tm package:
> library(tm)
> getSources()
[1] "DataframeSource" "DirSource"       "ReutersSource"   "URISource"
[5] "VectorSource"
So, we can import text documents from a data.frame, a vector, or directly from a uniform resource identifier with the URISource function. The latter stands for a collection of hyperlinks or file paths, although this is somewhat easier to handle with DirSource, which imports all the textual documents found in the referenced directory on our hard drive. By calling the getReaders function in the R console, you can see the supported text file formats:
> getReaders()
[1] "readDOC"                 "readPDF"
[3] "readPlain"               "readRCV1"
[5] "readRCV1asPlain"         "readReut21578XML"
[7] "readReut21578XMLasPlain" ...
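To make the sources and readers above more concrete, the following is a minimal sketch of building a corpus in two ways: from a character vector with VectorSource, and from a folder of plain-text files with DirSource and the readPlain reader. The example texts, the object names (docs, corpus_vec, corpus_dir), and the temporary folder are assumptions made for illustration only; in practice, you would point DirSource to your own directory.

```r
library(tm)

# Hypothetical example texts; any character vector would work here
docs <- c(first  = "R is a language for statistical computing.",
          second = "The tm package provides a text mining framework.")

# VectorSource treats each element of the vector as one document
corpus_vec <- VCorpus(VectorSource(docs))

# DirSource reads every file found in a directory; a temporary
# folder stands in for a real directory of documents
corpus_path <- file.path(tempdir(), "tm_corpus_example")
dir.create(corpus_path, showWarnings = FALSE)
writeLines(docs[["first"]],  file.path(corpus_path, "first.txt"))
writeLines(docs[["second"]], file.path(corpus_path, "second.txt"))

# readPlain is one of the readers listed by getReaders()
corpus_dir <- VCorpus(DirSource(corpus_path, encoding = "UTF-8"),
                      readerControl = list(reader = readPlain,
                                           language = "en"))

# Number of documents in each corpus, and the text of one document
length(corpus_vec)
length(corpus_dir)
as.character(corpus_dir[[1]])
```

Other file formats follow the same pattern with a different reader in readerControl, although some of those readers depend on external tools being installed on your system.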