Loading datasets
Spark SQL can read data from external storage systems such as files, Hive tables, and JDBC databases through the DataFrameReader interface. The general form of the API call is spark.read.<inputtype>, where <inputtype> is one of the supported input types:
- Parquet
- CSV
- Hive table
- JDBC
- ORC
- Text
- JSON
Let's look at a couple of simple examples of reading CSV files into DataFrames:
scala> val statesPopulationDF = spark.read.option("header", "true").option("inferSchema", "true").option("sep", ",").csv("statesPopulation.csv")
statesPopulationDF: org.apache.spark.sql.DataFrame = [State: string, Year: int ... 1 more field]

scala> val statesTaxRatesDF = spark.read.option("header", "true").option("inferSchema", "true").option("sep", ",").csv("statesTaxRates.csv")
statesTaxRatesDF: org.apache.spark.sql.DataFrame = [State: string, TaxRate: double]
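The other input types listed above follow the same pattern on spark.read. The sketch below shows what those calls look like; the file names, table name, and JDBC connection details are hypothetical placeholders, not files used elsewhere in this chapter:

```scala
// Sketch: reading the other supported input types through DataFrameReader.
// All paths, table names, and connection details below are hypothetical.
val parquetDF = spark.read.parquet("statesPopulation.parquet")
val jsonDF    = spark.read.json("statesPopulation.json")
val orcDF     = spark.read.orc("statesPopulation.orc")
val textDF    = spark.read.text("statesPopulation.txt") // yields a single "value" column
val hiveDF    = spark.table("states_population")        // Hive table, referenced by name

// JDBC uses the generic format(...).load() form with connection options
val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://localhost:5432/statesdb") // hypothetical database
  .option("dbtable", "states_population")
  .option("user", "spark")
  .option("password", "spark")
  .load()
```

In every case the result is a DataFrame whose schema is taken from the source itself (Parquet, ORC, and Hive carry their own schemas; JSON and CSV can infer one from the data).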