Spark SQL
Spark version 1.3 introduced DataFrames to Apache Spark, so that Spark data can be processed in tabular form and tabular functions (such as select, filter, and groupBy) can be used to process it. The Spark SQL module also integrates with the Parquet and JSON formats, allowing data to be stored in formats that better represent it. This offers more options for integrating with external systems as well.
Apache Spark can also be integrated with Apache Hive, the Hadoop big data warehouse. Hive context-based Spark applications can be used to manipulate Hive table data, bringing Spark's fast in-memory distributed processing to Hive's big data storage capabilities. It effectively lets Hive use Spark as a processing engine.
Additionally, many connectors are available that let Apache Spark access NoSQL databases outside the Hadoop ecosystem directly.