History of Spark SQL
To address the performance problems of Hive queries, a project called Shark was introduced into the Spark ecosystem in Spark's early versions. Shark used Spark, rather than MapReduce, as the execution engine for Hive queries. It was built on the Hive codebase: the Hive query compiler parsed a HiveQL query into an abstract syntax tree, which was converted to a logical plan with some basic optimizations. Shark then applied additional optimizations, created a physical plan of RDD operations, and executed that plan on Spark. This gave Hive queries in-memory performance. However, Shark had three major problems:
- Shark could query only Hive tables; running relational queries on RDDs was not possible
- Running HiveQL as a string within Spark programs was error-prone
- The Hive optimizer was designed for the MapReduce paradigm, which made it difficult to extend Spark to new data sources and new processing models
Shark was...