A DataFrame can be created in several ways, some of which are as follows:
- Executing SQL queries or loading external data such as Parquet, JSON, CSV, text, Hive, JDBC, and so on
- Converting RDDs to DataFrames (a minimal sketch follows this list)
- Loading a CSV file
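To illustrate the RDD route from the list above, the following is a minimal sketch, assuming a SparkSession named spark is in scope (as it is in spark-shell); the StatePopulation case class and its sample rows are hypothetical:

scala> case class StatePopulation(state: String, year: Int, population: Long)
scala> import spark.implicits._
scala> val rdd = spark.sparkContext.parallelize(Seq(StatePopulation("Alabama", 2010, 4785492L), StatePopulation("Alaska", 2010, 714031L)))
scala> val df = rdd.toDF()  // column names come from the case class fields
scala> df.show()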
We will take a look at statesPopulation.csv here, which we will then load as a DataFrame.
The CSV has the following format of the population of US states from the years 2010 to 2016:
State | Year | Population
Alabama | 2010 | 4,785,492
Alaska | 2010 | 714,031
Arizona | 2010 | 6,408,312
Arkansas | 2010 | 2,921,995
California | 2010 | 37,332,685
Since this CSV has a header row, we can quickly load it into a DataFrame with automatic schema inference:
scala> val statesDF = spark.read.option("header", "true").option("inferSchema", "true").option("sep", ",").csv("statesPopulation.csv")