Chapter 4. Unified Data Access
Integrating data from disparate sources has always been a daunting task. The three V's of big data (volume, velocity, and variety) and ever-shrinking processing time frames have made it even more challenging. Delivering a clear view of well-curated data in near real time is extremely important for business. However, what is emerging as a key business differentiator is real-time curated data combined with the ability to perform different operations, such as ETL, ad hoc querying, and machine learning, in a unified fashion.
Apache Spark was created to offer a single general-purpose engine that can process data from a variety of sources and support large-scale data processing for many different kinds of operations. Spark enables developers to combine SQL, streaming, graph processing, and machine learning algorithms in a single workflow!
In the previous chapters, we discussed Resilient Distributed Datasets (RDDs) as well as DataFrames. In Chapter 3, Introduction to DataFrames, we...