Importing and saving data
We wanted to add this section about importing and saving data here, even though it is not purely about Spark SQL, so that concepts such as the Parquet and JSON file formats could be introduced. This section also lets us cover, conveniently in one place, how to access data saved as loose text as well as in the CSV, Parquet, and JSON formats.
Processing the text files
Using SparkContext, it is possible to load a text file into an RDD using the textFile method. Additionally, the wholeTextFiles method can read the contents of an entire directory into an RDD of (filename, content) pairs.
The following examples show you how a file on the local filesystem (file://) or in HDFS (hdfs://) can be read into a Spark RDD. In each case, the second argument requests a minimum of six partitions so that the data can be processed in parallel. The first two examples are equivalent, as both load the file from the Linux filesystem, whereas the file in the last one resides in HDFS:
sc.textFile("/data/spark/tweets.txt", 6)
sc.textFile("file:///data/spark/tweets.txt", 6)
sc.textFile("hdfs:///data/spark/tweets.txt", 6) // resolves against the cluster's configured default namenode
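The wholeTextFiles method mentioned above can be sketched in the same way. This is a minimal illustration, assuming a Spark shell where sc is a predefined SparkContext and reusing the example directory from above:

```scala
// Sketch, assuming a Spark shell where sc is a SparkContext.
// wholeTextFiles reads every file under the directory and returns an RDD
// of (filename, content) pairs, one pair per file -- unlike textFile,
// which returns one record per line across all matched files.
val files = sc.wholeTextFiles("file:///data/spark/")

// Inspect which files were picked up and how large each one is.
files.collect().foreach { case (name, content) =>
  println(s"$name: ${content.length} characters")
}
```

Because each file becomes a single record, wholeTextFiles is best suited to many small files; a single very large file should be read with textFile instead so it can be split across partitions.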