Data acquisition
Data acquisition, or data collection, is the first step in any data science project. You will rarely find the complete set of required data in one place, because it is usually distributed across line-of-business (LOB) applications and systems.
Much of this topic was covered in the previous chapter, which showed how to source data from various data sources and load it into DataFrames for easier analysis. Spark has built-in support for the most common data sources, and the Data Source API is provided for those that are not supported out of the box.
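As a quick illustration (a minimal sketch, not from the original text), the snippet below reads two common formats with the built-in DataFrameReader and then plugs in an external source through the Data Source API's format() method. It assumes a SparkSession named spark, hypothetical file paths, and that the spark-xml package is available on the classpath.

Python

# Built-in sources: the CSV and Parquet readers ship with Spark
df_csv = spark.read.csv("/data/employees.csv", header=True, inferSchema=True)
df_parquet = spark.read.parquet("/data/employees.parquet")

# Data Source API: an external source (here, spark-xml) plugged in via format();
# the package providing it must be added to the Spark classpath
df_xml = spark.read.format("com.databricks.spark.xml") \
    .option("rowTag", "employee") \
    .load("/data/employees.xml")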
To better understand the data acquisition and preparation phases, let us work through a scenario and address each step with example code snippets. Suppose the employee data is spread across native RDDs, JSON files, and a SQL Server database. Let's see how we can bring each of these into Spark DataFrames:
Python
# From RDD: Create an RDD and convert...
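# (Illustrative sketch, not the book's exact listing: it assumes a PySpark shell
#  where `spark` is the SparkSession and `sc` the SparkContext, plus a
#  hypothetical employees.json file and SQL Server endpoint.)
from pyspark.sql import Row

emp_rdd = sc.parallelize([Row(id=1, name="Alice", dept="HR"),
                          Row(id=2, name="Bob", dept="IT")])
emp_from_rdd = spark.createDataFrame(emp_rdd)   # RDD of Rows -> DataFrame

# From JSON: the built-in JSON source infers the schema from the file
emp_from_json = spark.read.json("employees.json")

# From SQL Server: read the table over JDBC (the Microsoft JDBC driver jar
# must be available to Spark; connection details below are placeholders)
emp_from_sql = spark.read.format("jdbc") \
    .option("url", "jdbc:sqlserver://dbhost:1433;databaseName=hr") \
    .option("dbtable", "dbo.employees") \
    .option("user", "spark_user") \
    .option("password", "****") \
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
    .load()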