Chapter 2. Data Munging
It is often said that around 50% of a data scientist's time goes into transforming raw data into a usable format. Raw data can come in any format or size. It can be structured, as in an RDBMS; semi-structured, as in CSV files; or unstructured, as in regular text files. All of these contain valuable information, but to extract it, the data must first be converted into a data structure or a usable format from which an algorithm can find insights. A usable format, therefore, is data organized in a model that can be consumed in the data science process, and it differs from use case to use case.
This chapter will guide you through data munging, the process of preparing data. It covers the following topics:
- What is data munging?
- DataFrames.jl
- Uploading data from a file
- Finding the required data
- Joins and indexing
- Split-Apply-Combine strategy
- Reshaping the data
- Formula (ModelFrame and ModelMatrix)
- PooledDataArray
- Web scraping