Working with proprietary files and databases
Unlike the examples in this book, real-world data is rarely packaged in a simple CSV form that can be downloaded from a website. Instead, significant effort is needed to prepare data for analysis. Data must be collected, merged, sorted, filtered, or reformatted to meet the requirements of the learning algorithm. This process is informally known as data munging or data wrangling.
Data preparation has become even more important as typical datasets have grown from megabytes to gigabytes and as data is gathered from unrelated and messy sources, much of it stored in massive databases. Several packages and resources for retrieving and working with proprietary data formats and databases are listed in the following sections.
Reading from and writing to Microsoft Excel, SAS, SPSS, and Stata files
A frustrating aspect of data analysis is the large amount of work required to pull and combine data from various proprietary formats. Vast troves...