Analytics with DataFrames
Let's learn how to create and use DataFrames for Big Data Analytics. For ease of understanding and quick examples, the pyspark shell is used for the code in this chapter. The data needed for the exercises in this chapter can be found at https://github.com/apache/spark/tree/master/examples/src/main/resources. You can always produce multiple data formats from a single input file: for example, once you read a .json file, you can write the data out in Parquet, ORC, or other formats.
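As a quick illustration of this idea in the pyspark shell (where a sqlContext is already created for you), the following sketch reads a JSON file and writes the same data back out as Parquet and ORC; the file paths here are assumptions for illustration:

# Read a JSON file into a DataFrame (input path is illustrative).
df = sqlContext.read.json("people.json")

# Write the same data out in other formats (output paths are illustrative).
df.write.parquet("people.parquet")
df.write.orc("people.orc")  # ORC output needs Hive support (HiveContext) in Spark 1.x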
Note
All programs in this chapter are executed on the CDH 5.8 VM, except the programs in the DataFrame-based Spark-on-HBase connector section, which are executed on HDP 2.5. In other environments, file paths might change, but the concepts are the same.
Creating SparkSession
In Spark versions 1.6 and below, the entry point into all relational functionality in Spark is the SQLContext class. To create a SQLContext in an application, we need to create a SparkContext and wrap a SQLContext around it.
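Since this section is about creating SparkSession, which Spark 2.0 introduced as the unified entry point subsuming SQLContext and HiveContext, here is a minimal sketch showing both the legacy SQLContext creation and the newer SparkSession builder; the application name is an assumption for illustration:

# Spark 1.6 and below: create a SparkContext first, then wrap a SQLContext around it.
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="DataFrameExample")  # app name is illustrative
sqlContext = SQLContext(sc)

# Spark 2.0 and above: SparkSession is the single, unified entry point.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DataFrameExample").getOrCreate()

Note that in the pyspark shell these entry points are created for you automatically, so the explicit creation above is only needed in standalone applications.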