Orchestrating data workloads
With the setup work done, let’s jump right into organizing and running our workloads in Databricks. We will cover a variety of topics, the first of which is incrementally ingesting newly arriving files.
Making life easier with Autoloader
Spark Streaming isn’t new, and many data platforms already rely on it. It has some rough edges, though, and Autoloader resolves several of them for file-based ingestion. Autoloader is an efficient way to have Databricks detect new files as they arrive and process them incrementally. It is built on Spark Structured Streaming, so once it’s set up there is little difference in how you use it.
Reading
To create a streaming DataFrame using Autoloader, use the cloudFiles format along with the options you need. In the following case, we set the schema, delimiter, and format for a CSV load:
spark.readStream.format("cloudFiles") \
  .option("cloudFiles...
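As a minimal sketch of how such a read might be assembled end to end, the helper below wires together the three pieces mentioned above: the file format, the delimiter, and the schema. The function names, the path, and the example schema are hypothetical and chosen only for illustration. The sketch assumes it runs on Databricks, where a `spark` session is available and the cloudFiles source is supported.

```python
from typing import Any, Dict


def autoloader_csv_options(delimiter: str = ",") -> Dict[str, str]:
    """Build the reader options for a CSV load (hypothetical helper)."""
    return {
        "cloudFiles.format": "csv",  # tells Autoloader the incoming files are CSV
        "delimiter": delimiter,      # CSV field separator
    }


def read_csv_stream(spark: Any, path: str, schema: str, delimiter: str = ","):
    """Return a streaming DataFrame over the files that land under `path`.

    `schema` is passed as a DDL string such as "id INT, amount DOUBLE".
    """
    return (
        spark.readStream.format("cloudFiles")
        .options(**autoloader_csv_options(delimiter))
        .schema(schema)
        .load(path)
    )


# Example call on Databricks (path and schema are assumptions, not from the text):
# df = read_csv_stream(spark, "/tmp/landing/sales", "id INT, amount DOUBLE", "|")
```

Keeping the options in a small function makes it easy to reuse the same settings across several streams; whether that is worth the indirection depends on how many sources you load.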