Spark SQL programming
Let's now get our hands dirty and work through various examples. We will start with a simple Dataset and then progressively perform more sophisticated SQL statements. We will use the NorthWind Dataset.
Datasets/DataFrames
Datasets are strongly typed, domain-specific collections of objects: they carry rich compile-time type information and support the functional operations familiar from RDDs, giving us the best of both worlds. A DataFrame is an untyped view of a Dataset, essentially a collection of Row objects. This view is useful for generic operations on a Dataset, that is, operations that depend only on column names and positions rather than on a specific domain type. We will learn more in later sections.
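To make the distinction concrete, here is a minimal sketch assuming a Spark 2.x SparkSession named spark is in scope (as it is in spark-shell); the Order case class and its sample values are hypothetical and are not part of the NorthWind data.

import org.apache.spark.sql.{DataFrame, Dataset}
import spark.implicits._                       // already imported automatically in spark-shell

case class Order(orderId: Int, customerId: String, amount: Double)

// Typed: the compiler knows each element is an Order
val orders: Dataset[Order] = Seq(
  Order(1, "ALFKI", 120.0),
  Order(2, "ANATR", 75.5)
).toDS()

// RDD-style functional operation on the domain object
val bigOrders = orders.filter(_.amount > 100.0)

// Untyped view: a DataFrame is a Dataset[Row], operated on by column name/position
val df: DataFrame = orders.toDF()
df.select("customerId", "amount").show()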
Tip
As languages such as Python and R do not have compile-time type checking, Datasets and DataFrames are collapsed into a single abstraction and simply called DataFrames.
Another change in 2.0 is SparkSession, which replaces SQLContext, HiveContext, and others. The SparkSession instance has a very rich and flexible read method...
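The following is a minimal sketch of creating a SparkSession in Spark 2.x and using its read method; the application name and the CSV path are hypothetical placeholders, not values from the book.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("spark-sql-programming")     // hypothetical application name
  .master("local[*]")                   // run locally for experimentation
  .enableHiveSupport()                  // optional: replaces the old HiveContext capabilities
  .getOrCreate()

// spark.read returns a DataFrameReader that supports many formats and options
val orders = spark.read
  .option("header", "true")             // treat the first line as column names
  .option("inferSchema", "true")        // let Spark infer column types
  .csv("/path/to/northwind/orders.csv") // hypothetical path to the NorthWind data

orders.printSchema()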