Understanding the SparkR architecture
SparkR provides a distributed DataFrame API whose syntax is familiar to R users. This high-level DataFrame API integrates the R API with Spark's optimized SQL execution engine.
SparkR's architecture consists primarily of two components: an R-to-JVM binding on the driver, which enables R programs to submit jobs to a Spark cluster, and support for running R processes on the Spark executors.
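As a minimal sketch of the driver-side binding, the following R session starts SparkR (establishing the R-to-JVM connection) and runs DataFrame queries that Spark's SQL engine executes on the cluster. The app name and the use of the built-in `faithful` dataset are illustrative choices, not from the source.

```r
library(SparkR)

# Start a SparkR session; this creates the R-to-JVM binding on the driver.
sparkR.session(appName = "SparkRArchitectureDemo")

# Create a distributed DataFrame from a local R data.frame.
df <- createDataFrame(faithful)

# Familiar R-style DataFrame syntax; the query plan is optimized
# and executed by Spark's SQL engine, not by the local R process.
head(select(df, df$eruptions))
head(filter(df, df$waiting > 70))

sparkR.session.stop()
```

Only the driver runs this R script; Spark distributes the actual computation across the executors.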
To execute R code in parallel, SparkR supports launching R processes on the Spark executor machines. This comes with an overhead: each query must be serialized for the JVM, and the results must be deserialized back into R after they have been computed. As the amount of data transferred between R and the JVM grows, these overheads become more significant. Caching, however, can enable efficient interactive query processing in SparkR by avoiding repeated data movement.
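The effect of caching can be sketched as follows; the Parquet path is a hypothetical placeholder, and the example assumes a running Spark installation. The first action materializes the cache inside the JVM, so subsequent interactive queries avoid re-reading the data and repeating the R-to-JVM transfer overhead.

```r
library(SparkR)
sparkR.session()

# Hypothetical input path for illustration.
df <- read.df("data/events.parquet", source = "parquet")

# Mark the DataFrame for caching in executor memory.
cache(df)

count(df)  # First action: reads the data and populates the cache.
count(df)  # Later queries reuse the cached data, making
           # interactive exploration much faster.

sparkR.session.stop()
```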
Note
For a detailed description of SparkR's design and implementation, refer to "SparkR: Scaling R Programs with Spark" by Shivaram Venkataraman, Zongheng Yang, et al., available...