Chapter 2. Processing Text
A significant part of the time spent on any modeling or analysis activity goes into accessing, preprocessing, and cleaning the data. We need to be able to access data from diverse sources, load it into our statistical analysis environment, and process it in a manner conducive to advanced analysis.
In this chapter, we will learn to access data from a wide variety of sources and load it into our R environment. We will also learn to perform some standard text processing.
By the time you finish the chapter, you should know enough to retrieve data from most data sources and process it into a custom corpus for further analysis. The chapter covers the following topics:
- Accessing texts from diverse sources
- Processing texts using regular expressions
- Normalizing texts
- Lexical diversity
- Language detection