Creating SparkR DataFrames
A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R, but with richer optimizations under the hood. SparkR DataFrames scale to large datasets by leveraging Spark's distributed computation engine. In this recipe, we'll see how to create SparkR DataFrames from different sources, such as JSON files, CSV files, local R data frames, and Hive tables.
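To make the idea concrete, here is a minimal sketch of turning an ordinary R data frame into a distributed SparkR DataFrame with `createDataFrame`. It assumes the Spark 2.x session API and the built-in `faithful` dataset; in Spark 1.6 you would instead initialize a `sqlContext` and pass it as the first argument.

```r
library(SparkR)

# Start a SparkR session (Spark 2.x API); in Spark 1.6 you would
# call sparkR.init() and sparkRSQL.init() to obtain a sqlContext.
sparkR.session(appName = "CreateDataFrames")

# Distribute the built-in 'faithful' R data frame as a SparkR
# DataFrame, then inspect its schema and first rows.
df <- createDataFrame(faithful)
printSchema(df)
head(df)

sparkR.session.stop()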
Getting ready
To step through this recipe, you will need a running Spark cluster, either in pseudo-distributed mode or in one of the distributed modes, that is, standalone, YARN, or Mesos. Also, install RStudio. Please refer to the Installing R recipe for details on the installation of R, and to the Creating a SparkR standalone application from RStudio recipe for details on working with the SparkR package.
How to do it…
In this recipe, we'll see how to create SparkR DataFrames in Spark 1.6.0 as well as Spark 2.0.2:
- Use...
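The steps above are truncated here; as a hedged sketch of the sources named in the introduction (Spark 2.x API, with hypothetical file paths `people.json` and `people.csv`, and a Hive table `src` assumed to exist), the four creation paths look like this:

```r
library(SparkR)

# Enable Hive support so that existing Hive tables can be queried.
sparkR.session(enableHiveSupport = TRUE)

# 1. From a JSON file (one JSON object per line).
jsonDF <- read.json("people.json")
# Spark 1.6 equivalent: read.df(sqlContext, "people.json", "json")

# 2. From a CSV file. Spark 2.x bundles a CSV data source; in
# Spark 1.6 you needed the external spark-csv package
# (com.databricks.spark.csv) passed via --packages.
csvDF <- read.df("people.csv", source = "csv",
                 header = "true", inferSchema = "true")

# 3. From a local R data frame.
localDF <- createDataFrame(mtcars)

# 4. From a Hive table (assumes a table named 'src' exists).
hiveDF <- sql("SELECT key, value FROM src")

printSchema(jsonDF)
```

In Spark 1.6.0, each of these calls takes a `sqlContext` (or `hiveContext`, created with `sparkRHive.init(sc)`) as its first argument; the Spark 2.0.2 session API shown here carries that context implicitly.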