DataFrame scenario – on-time flight performance
To showcase the types of queries you can run with DataFrames, let's look at on-time flight performance as a use case. We will analyze the Airline On-Time Performance and Causes of Flight Delays: On-Time Data (http://bit.ly/2ccJPPM) and join it with the airports dataset, obtained from the Open Flights Airport, airline, and route data (http://bit.ly/2ccK5hw), to better understand the variables associated with flight delays.
Tip
For this section, we will be using Databricks Community Edition (a free offering of the Databricks product), which you can get at https://databricks.com/try-databricks. We will be using visualizations and pre-loaded datasets within Databricks to make it easier for you to focus on writing the code and analyzing the results.
If you would prefer to run this in your own environment, you can find the datasets in our GitHub repository for this book at https://github.com/drabastomek/learningPySpark.