Advanced concepts of Spark Streaming
Let's go through some of the important advanced concepts in Spark Streaming.
Using DataFrames
We learned about Spark SQL and DataFrames in Chapter 4, Big Data Analytics with Spark SQL, DataFrames, and Datasets. There are many use cases where you want to convert a DStream to a DataFrame for interactive analytics. The RDDs generated by a DStream can be converted to DataFrames and queried with SQL internally within the program, or from external SQL clients as well. Refer to the sql_network_wordcount.py program in /usr/lib/spark/examples/lib/streaming for an example of using SQL in a Spark Streaming application. You can also start a JDBC server within the application with the following code:
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2
HiveThriftServer2.startWithContext(hiveContext)
The application's temporary tables can now be accessed and queried from any SQL client, such as Beeline.
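The DStream-to-DataFrame pattern mentioned above can be sketched in Scala as follows. This is a minimal sketch modeled on the sql_network_wordcount.py example; the socket source, host and port, column name, and table name are illustrative assumptions, and `sc` is assumed to be an existing SparkContext:

val ssc = new StreamingContext(sc, Seconds(5))
// Assumed source: lines of text from a socket (host/port are placeholders)
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" "))

words.foreachRDD { rdd: RDD[String] =>
  // Get or create a singleton SQLContext for this batch's RDD
  val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
  import sqlContext.implicits._
  // Convert the current batch's RDD to a DataFrame
  val wordsDF = rdd.toDF("word")
  // Register a temporary table so the batch can be queried with SQL
  wordsDF.registerTempTable("words")
  sqlContext.sql("SELECT word, COUNT(*) AS total FROM words GROUP BY word").show()
}

ssc.start()
ssc.awaitTermination()

Because `foreachRDD` runs once per batch interval, the SQL query here is re-executed on each micro-batch; the temporary table only holds the data of the current batch.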
MLlib operations
It is easy to implement machine learning algorithms in Spark Streaming applications. The following Scala code trains a KMeans clustering model...