Creating DataFrames
A Spark DataFrame is an immutable collection of data distributed across a cluster. The data inside a DataFrame is organized into named columns, much like a table in a relational database.
In this recipe, we will learn how to create Spark DataFrames.
Getting ready
To execute this recipe, you need to have a working Spark 2.3 environment. If you do not have one, you might want to go back to Chapter 1, Installing and Configuring Spark, and follow the recipes you find there.
All the code that you will need for this chapter can be found in the GitHub repository we set up for the book: http://bit.ly/2ArlBck; go to Chapter 3 and open the 3. Abstracting data with DataFrames.ipynb notebook.
There are no other requirements.
How to do it...
There are many ways to create a DataFrame, but the simplest is to create an RDD and convert it into a DataFrame:
sample_data = sc.parallelize([ (1, 'MacBook Pro', 2015, '15"', '16GB', '512GB SSD' , 13.75, 9.48, 0.61, 4...