A DataFrame can be created in several ways:
- By executing SQL queries
- By loading external data such as Parquet, JSON, CSV, text, Hive, or JDBC sources
- By converting RDDs to DataFrames (a short sketch of this approach follows the list)
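As a quick illustration of the RDD route, the following spark-shell snippet converts an RDD of case class instances into a DataFrame with toDF(). This is a minimal sketch: the State case class and the two sample rows are chosen here only for illustration, matching the data shown below.

scala> import spark.implicits._
scala> case class State(name: String, year: Int, population: Long)
scala> val stateRDD = spark.sparkContext.parallelize(Seq(State("Alabama", 2010, 4785492L), State("Alaska", 2010, 714031L)))
scala> val stateDF = stateRDD.toDF() // column names are taken from the case class fields
scala> stateDF.show()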
A DataFrame can be created by loading a CSV file. We will load statesPopulation.csv, which contains US state populations for the years 2010 through 2016, in the following format:
| State      | Year | Population |
|------------|------|------------|
| Alabama    | 2010 | 4785492    |
| Alaska     | 2010 | 714031     |
| Arizona    | 2010 | 6408312    |
| Arkansas   | 2010 | 2921995    |
| California | 2010 | 37332685   |
Since this CSV has a header row, we can load it into a DataFrame quickly, letting Spark infer the schema from the data:
scala> val statesDF = spark.read.option("header", "true").option("inferSchema", "true").option("sep", ",").csv("statesPopulation.csv")
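To confirm that the header row and schema inference were applied, we can inspect the resulting DataFrame. The exact inferred types depend on the data, but with schema inference enabled the Year and Population columns should come back as numeric types rather than strings:

scala> statesDF.printSchema() // Year and Population should be inferred as integer types
scala> statesDF.show(5)       // display the first five rows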