In this chapter, we showed how to perform predictive analysis using text data. We covered how to tokenize text to extract relevant words and how to build and work with document-feature matrices (DFMs). We then applied transformations to DFMs to explore different predictive models using term frequency-inverse document frequency (TF-IDF) weights, n-grams, partial singular value decompositions (SVDs), and cosine similarities, and used the resulting data structures within random forests to produce predictions. Along the way, we discussed why these techniques may be important for some problems and how to combine them. We also showed how to include sentiment inferred from text to increase the predictive power of our models. Finally, we showed how to retrieve live data from Twitter, which can be used to analyze what people are saying on the social network shortly after they have said it.
We encourage...