Summary
In this chapter, we discussed several methods for mining a text source. We took a raw document, cleaned it up using built-in R functions, and produced a corpus suitable for analysis. We removed stop words and sparse terms so that the analysis could focus on the meaningful content of the text.
From the corpus, we generated a document-term matrix, which records how often each term occurs in each document of the source.
Once the matrix was available, we organized the words into clusters and plotted the results. With the words grouped in this way, we could also apply standard R clustering techniques to the data.
Finally, we looked at using raw XML as the text source for our processing and examined some of the XML processing features available in R.
In the next chapter, we will cover regression analysis.