Data preprocessing in Spark
So far, we've seen how to load text data from the local filesystem and HDFS. Text files can contain either unstructured data (like a text document) or structured data (like a CSV file). For semi-structured data, such as files containing JSON objects, Spark provides special routines that transform a file into a DataFrame, similar to the DataFrames in R and Python pandas. DataFrames are very similar to RDBMS tables, in that a schema is defined for them.
JSON files and Spark DataFrames
To import JSON-compliant files, we should first create a SQL context by instantiating a SQLContext object from the local Spark context:
In:from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
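Note that from Spark 2.0 onward, SparkSession has replaced SQLContext as the recommended entry point for DataFrame operations. As a minimal sketch (assuming a Spark 2.x installation; the variable name spark is our choice), the equivalent setup would be:

In:from pyspark.sql import SparkSession
# getOrCreate() returns the existing session or builds a new one
spark = SparkSession.builder.getOrCreate()
# the underlying SparkContext remains available as spark.sparkContext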
Now, let's see the content of a small JSON file (it's provided in the Vagrant virtual machine). It's a JSON representation of a table with six rows and three columns, where some attributes are missing (such as the gender attribute for the user with user_id=0):
In:!cat /home/vagrant/datasets/users.json
...
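With the SQL context in place, we can read this file into a DataFrame with the standard DataFrameReader API. Here is a minimal sketch (the variable name df and the printSchema/show calls are our additions for illustration):

In:df = sqlContext.read.json("/home/vagrant/datasets/users.json")
# print the schema Spark inferred from the JSON attributes
df.printSchema()
# display the DataFrame contents as a table
df.show()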