Introduction
In this chapter, we will explore the current fundamental data structure: DataFrames. DataFrames take advantage of the developments in Project Tungsten and the Catalyst Optimizer. These two improvements bring the performance of PySpark on par with that of either Scala or Java.
Project Tungsten is a set of improvements to the Spark engine aimed at bringing its execution process closer to the bare metal. The main deliverables include:
- Code generation at runtime: This aims at leveraging the optimizations implemented in modern compilers
- Taking advantage of the memory hierarchy: The algorithms and data structures exploit the memory hierarchy for fast execution
- Direct-memory management: Removes the overhead associated with Java garbage collection and JVM object creation and management
- Low-level programming: Speeds up memory access by loading immediate data to CPU registers
- Elimination of virtual function dispatches: This removes the need for multiple CPU calls
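To make the first and last items in the list above more concrete, here is a minimal pure-Python sketch. It is a toy illustration and not Spark's actual implementation: the `Col`, `Lit`, `Add`, `Mul` classes and the `compile_expr` helper are hypothetical names invented for this example. It contrasts evaluating an expression tree node by node (one dynamic method call per node, per row) with generating the source for the whole expression once and compiling it into a single function.

```python
# Toy illustration only: hypothetical classes, NOT Spark's code generator.

class Col:
    """Reference to a column in a row (a dict here)."""
    def __init__(self, name):
        self.name = name

    def eval(self, row):           # interpreted path: dynamic call per node
        return row[self.name]

    def gen(self):                 # code generation path: emit source text
        return f"row[{self.name!r}]"


class Lit:
    """Literal constant."""
    def __init__(self, value):
        self.value = value

    def eval(self, row):
        return self.value

    def gen(self):
        return repr(self.value)


class Add:
    def __init__(self, left, right):
        self.left, self.right = left, right

    def eval(self, row):
        return self.left.eval(row) + self.right.eval(row)

    def gen(self):
        return f"({self.left.gen()} + {self.right.gen()})"


class Mul:
    def __init__(self, left, right):
        self.left, self.right = left, right

    def eval(self, row):
        return self.left.eval(row) * self.right.eval(row)

    def gen(self):
        return f"({self.left.gen()} * {self.right.gen()})"


def compile_expr(expr):
    """Generate source for the whole tree once and compile it at runtime."""
    source = f"lambda row: {expr.gen()}"
    return eval(source), source


# price * qty + 5
expr = Add(Mul(Col("price"), Col("qty")), Lit(5))
rows = [{"price": 2, "qty": 3}, {"price": 10, "qty": 4}]

# Interpreted: five eval() calls per row, dispatched through the tree.
interpreted = [expr.eval(r) for r in rows]

# Generated: one flat function per row, with no per-node dispatch.
compiled_fn, source = compile_expr(expr)
generated = [compiled_fn(r) for r in rows]

print(source)
print(interpreted, generated)
```

Both paths compute the same values; the difference is that the generated function collapses the tree into one flat expression, which is the general idea behind fusing operators into a single compiled unit. In Spark itself the generated code is JVM code rather than Python, so this sketch only mirrors the concept.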
Note
Check this blog from Databricks...