So far, we've seen how to load text data from the local filesystem and from HDFS. Text files can contain either unstructured data (like a text document) or structured data (like a CSV file). For semi-structured data, such as files containing JSON objects, Spark has special routines that can transform a file into a DataFrame, similar to the data frame in R and the DataFrame in the Python package pandas. DataFrames closely resemble RDBMS tables in that each one has a defined schema.
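As a minimal sketch of the JSON case mentioned above, the following assumes Spark 2.0 or later, where `SparkSession` is the entry point (older releases expose the same reader through a `SQLContext`). The helper names `write_sample_json_lines` and `load_json_as_dataframe`, as well as the sample records, are hypothetical and exist only for illustration:

```python
import json
import os
import tempfile


def write_sample_json_lines(path):
    """Write a few hypothetical records, one JSON object per line."""
    records = [
        {"name": "alice", "age": 31},
        {"name": "bob", "age": 27},
        {"name": "carol"},  # "age" is deliberately missing here
    ]
    with open(path, "w") as fh:
        for record in records:
            fh.write(json.dumps(record) + "\n")
    return records


def load_json_as_dataframe(path):
    """Read a JSON Lines file into a Spark DataFrame.

    Assumes pyspark is installed; it is imported lazily so that the
    rest of this sketch can be used without a Spark installation.
    """
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[1]")      # assumption: a single local worker
        .appName("json-demo")    # hypothetical application name
        .getOrCreate()
    )
    # By default, spark.read.json expects one JSON object per line
    # and infers the schema from the data.
    return spark.read.json(path)
```

With a working Spark installation, calling `load_json_as_dataframe(path)` on the file produced by `write_sample_json_lines(path)` and then `printSchema()` on the result should show the inferred columns. The record without an `age` field should appear with a null in that column rather than causing an error.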
Data preprocessing in Spark
CSV files and Spark DataFrames
We start by showing how to read CSV files and transform them into Spark DataFrames. Follow the steps in this example:
- In order to import CSV-compliant files, we need to first create...