DataFrames
We have already used DataFrames in previous examples; they are based on a columnar format. Temporary tables can be created from them, and we will expand on this in the next section. A data frame offers many methods for data manipulation and processing.
Let's start with a simple example and load some JSON data coming from an IoT sensor on a washing machine. We are again using the Apache Spark DataSource API under the hood to read and parse the JSON data. The result of the parser is a data frame. A data frame's schema can be displayed by calling the printSchema method on it.
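A minimal sketch of loading JSON and displaying the resulting schema might look as follows. It assumes a local SparkSession on Spark 2.2 or later; the record layout, the field names (id, key, temperature, voltage), and the in-memory sample are hypothetical stand-ins for the sensor data, not the original data set.

```scala
import org.apache.spark.sql.SparkSession

object WashingSchema {
  def main(args: Array[String]): Unit = {
    // Local session for experimentation; a cluster deployment would configure this differently.
    val spark = SparkSession.builder()
      .appName("washing-schema")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical records: meta fields at the top level, sensor readings nested under "doc".
    val sample = Seq(
      """{"id":"r1","key":"r1","doc":{"temperature":40,"voltage":230}}""",
      """{"id":"r2","key":"r2","doc":{"temperature":60,"voltage":228}}"""
    ).toDS()

    // Reading from a file works the same way, for example
    // spark.read.json("/path/to/washing.json") (path is a placeholder).
    val washing = spark.read.json(sample)

    // Prints the inferred schema as a tree; "doc" appears as a nested struct.
    washing.printSchema()

    spark.stop()
  }
}
```

Reading from an in-memory Dataset of strings keeps the sketch self-contained; with real data, the argument to json would typically be a file or directory path.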
The schema shows a nested data structure. The doc field contains all the information that we are interested in, and we want to get rid of the meta information that Cloudant/Apache CouchDB added to the original JSON file. This can be accomplished by a call to the select method on the DataFrame.
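One way to keep only the contents of the nested doc field is to select it with a star expansion. The sketch below again assumes a local SparkSession on Spark 2.2 or later and hypothetical sample records; only the doc field name comes from the text above.

```scala
import org.apache.spark.sql.SparkSession

object WashingFlatten {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("washing-flatten")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical records with meta fields (id, key) around the nested "doc" payload.
    val washing = spark.read.json(Seq(
      """{"id":"r1","key":"r1","doc":{"temperature":40,"voltage":230}}""",
      """{"id":"r2","key":"r2","doc":{"temperature":60,"voltage":228}}"""
    ).toDS())

    // "doc.*" expands every field of the doc struct into a top-level column,
    // which drops the surrounding meta fields.
    val washingFlat = washing.select("doc.*")

    washingFlat.printSchema()
    washingFlat.show()

    spark.stop()
  }
}
```

Because data frames are immutable, select returns a new data frame and leaves the original nested one unchanged.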
This is the first time that we are using the DataFrame API for data processing. Similar to RDDs, a set of methods...