Introducing the SQL, Data Sources, DataFrame, and Dataset APIs
Let's look at the four components of Spark SQL: SQL itself, the Data Sources API, the DataFrame API, and the Dataset API.
Spark SQL can read data from and write data to Hive tables using the SQL language. SQL can be used from the Java, Scala, Python, and R languages, over JDBC/ODBC, or through the command-line interface. When SQL is embedded in a programming language, the results are returned as DataFrames.
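As a minimal sketch of SQL embedded in a program, the following Scala snippet runs a query through a `SparkSession` and gets a DataFrame back. The `people` table and the column names are hypothetical, and `enableHiveSupport()` assumes a Hive-enabled build:

```scala
import org.apache.spark.sql.SparkSession

// Build a session with Hive support so SQL can reach Hive tables
val spark = SparkSession.builder()
  .appName("SqlExample")
  .enableHiveSupport()
  .getOrCreate()

// SQL embedded in Scala: the result comes back as a DataFrame
val df = spark.sql("SELECT name, age FROM people WHERE age > 21")
df.show()
```

The same pattern applies in Python and R: `spark.sql(...)` always returns a DataFrame, so the query result can be transformed further with the DataFrame API.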
The advantages of SQL are:
- It works easily with Hive tables
- BI tools can connect to the distributed SQL engine through the Thrift server and submit SQL or HiveQL queries over the JDBC or ODBC interfaces
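To illustrate the second point, here is a hedged sketch of a client connecting to the Spark Thrift Server over JDBC. It assumes the Thrift server is already running on `localhost:10000` (the default port) and that the Hive JDBC driver is on the classpath:

```scala
import java.sql.DriverManager

// Connect to the Thrift server's HiveServer2-compatible endpoint
val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default")
val stmt = conn.createStatement()

// Any SQL or HiveQL statement can be submitted over this connection
val rs = stmt.executeQuery("SHOW TABLES")
while (rs.next()) {
  println(rs.getString(1))
}
conn.close()
```

BI tools such as Tableau use exactly this JDBC/ODBC path, which is why the Thrift server makes Spark SQL usable as a shared, distributed SQL engine.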
The Data Sources API provides a single interface for reading and writing data with Spark SQL. In addition to the built-in sources that come prepackaged with the Apache Spark distribution, the Data Sources API allows external developers to add custom data sources. All external data sources and other packages can be viewed at http:/...
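The unified interface can be sketched as follows. The file paths are hypothetical, and `spark` is assumed to be an existing `SparkSession`; built-in formats (`json`, `parquet`, `csv`) and externally packaged sources are all addressed through the same `format(...)` call:

```scala
// Read from one built-in source...
val people = spark.read.format("json").load("/tmp/people.json")

// ...and write to another, using the same interface
people.write.format("parquet").mode("overwrite").save("/tmp/people.parquet")

// External or built-in sources alike are selected by their format name
val csv = spark.read
  .format("csv")
  .option("header", "true")
  .load("/tmp/people.csv")
```

Because every source sits behind the same `DataFrameReader`/`DataFrameWriter` interface, swapping storage formats is a one-line change rather than a rewrite.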