DataFrame APIs and the SQL API
A DataFrame can be created in several ways; some of them are as follows:
- Executing SQL queries
- Loading external data such as Parquet, JSON, CSV, text, Hive, JDBC, and so on
- Converting RDDs to DataFrames (see the sketch after this list)
- Loading a CSV file
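To illustrate the RDD conversion route, here is a minimal sketch, assuming a spark-shell session; the StatePop case class and its two sample rows (figures taken from the table below) are our own, purely for illustration:

scala> import spark.implicits._
scala> case class StatePop(State: String, Year: Int, Population: Long)
scala> val popRDD = spark.sparkContext.parallelize(Seq(StatePop("Alabama", 2010, 4785492L), StatePop("Alaska", 2010, 714031L)))
scala> val popDF = popRDD.toDF() // the case class fields become the column names and types
popDF: org.apache.spark.sql.DataFrame = [State: string, Year: int ... 1 more field]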
We will take a look at statesPopulation.csv here, which we will then load as a DataFrame. The CSV file holds the population of US states for the years 2010 to 2016, in the following format:
| State      | Year | Population |
|------------|------|------------|
| Alabama    | 2010 | 4,785,492  |
| Alaska     | 2010 | 714,031    |
| Arizona    | 2010 | 6,408,312  |
| Arkansas   | 2010 | 2,921,995  |
| California | 2010 | 37,332,685 |
Since this CSV file has a header row, we can quickly load it into a DataFrame with implicit schema detection:
scala> val statesDF = spark.read.option("header", "true").option("inferSchema", "true").option("sep", ",").csv("statesPopulation.csv")
statesDF: org.apache.spark.sql.DataFrame = [State: string, Year: int ... 1 more field]
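As a quick sanity check (this step is our addition, not part of the original walkthrough), we can peek at the loaded data:

scala> statesDF.show(5) // prints the first five rows in tabular form
scala> statesDF.count() // returns the total number of rows as a Long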
Once the DataFrame is loaded, we can examine its schema:
scala> statesDF.printSchema
root
 |-- State: string (nullable = true)
 |-- Year: integer (nullable = true)
 |-- Population: integer (nullable = true)
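To round out the SQL side of the API, the DataFrame can be registered as a temporary view and queried with plain SQL; this is a sketch, and the view name myStates is our own choice:

scala> statesDF.createOrReplaceTempView("myStates") // registers a session-scoped temporary view
scala> spark.sql("SELECT State, Population FROM myStates WHERE Year = 2010 ORDER BY Population DESC").show(5)

Note that spark.sql returns an ordinary DataFrame, so SQL results can be chained with further DataFrame operations.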