Introduction
If you have been part of the data industry for a while, you will appreciate the challenge of working with different data sources, analyzing them, and presenting them in consumable business reports. When using Spark with Python (PySpark), you may need to read data from a variety of sources, such as flat files, JSON responses from REST APIs, and so on.
In the real world, getting data into the right format is always a challenge, and several SQL operations are often required to gather it. It is therefore essential for any data scientist to know how to handle different file formats and data sources, carry out basic SQL operations, and present the results in a consumable format.
This chapter covers common methods for reading different types of data, carrying out SQL operations on them, performing descriptive statistical analysis, and generating a complete analysis report. We will start by understanding how to read different kinds of data into PySpark and then generate various analyses and plots from it.