Spark can work with a wide variety of data sources and formats. It integrates well with the Hadoop Distributed File System (HDFS) and with several other popular storage systems, such as Amazon S3. In this section, we focus on the data sources and formats that Spark supports.
Sourcing data using Spark
Parquet file format
Apache Parquet (https://parquet.apache.org/) is an open source project that defines the specification of a columnar data storage format. This storage format is extremely popular in the big data world for the following reasons:
- It supports nested data structures, which is good because most real-world data fits more naturally into a nested structure...
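To make the idea of a columnar layout over nested records more concrete, the following is a toy sketch in plain Python. It is not Parquet's actual encoding and does not use Spark or any Parquet library; the helper names `flatten` and `to_columns`, the dotted column paths, and the sample records are all hypothetical and chosen only for illustration. It simply shows how nested records might be regrouped so that all values of one leaf field sit together.

```python
def flatten(record, prefix=""):
    """Yield (dotted_path, value) pairs for every leaf of a nested dict."""
    for key, value in record.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into the nested structure, extending the column path.
            yield from flatten(value, path)
        else:
            yield path, value


def to_columns(records):
    """Regroup a list of nested records into one list of values per leaf path."""
    # Collect every leaf path, in order of first appearance.
    paths = []
    for record in records:
        for path, _ in flatten(record):
            if path not in paths:
                paths.append(path)

    # Store the values column by column instead of record by record.
    columns = {path: [] for path in paths}
    for record in records:
        leaves = dict(flatten(record))
        for path in paths:
            # None marks a field that is absent from this record (an assumption
            # of this sketch, not how a real columnar format encodes it).
            columns[path].append(leaves.get(path))
    return columns


# Hypothetical sample data with a nested "address" structure.
records = [
    {"name": "Ann", "address": {"city": "Oslo", "zip": "0150"}},
    {"name": "Bob", "address": {"city": "Pune"}},
]

columns = to_columns(records)
print(list(columns))            # column paths derived from the nested fields
print(columns["address.city"])  # all city values stored together
```

In this sketch, a query that needs only `address.city` can read that one list and ignore the rest, which is the basic intuition behind storing data by column rather than by row.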