Working with batch processing
We will now begin learning the essential PySpark code to read, transform, and write data. Any ETL script begins with reading from a source, transforming the data, and writing it to a sink. Let's begin by reading data from DBFS (Databricks File System) for a batch process.
Reading data
Run the following command in a new cell in a notebook:
%fs ls dbfs:/databricks-datasets/
This will display a list of sample datasets mounted by the Databricks team for learning and testing purposes. The dataset that we will be working with resides in the DBFS path dbfs:/databricks-datasets/asa/airlines/. This dataset describes different airlines' on-time performance and consists of about 120 million records!
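If you prefer to stay in Python rather than use the %fs magic, the same listing can be done with the dbutils file system utilities that Databricks notebooks provide by default; the following is a minimal sketch:

# List the sample datasets with dbutils instead of the %fs magic;
# dbutils.fs.ls returns FileInfo objects with name and size attributes
files = dbutils.fs.ls("dbfs:/databricks-datasets/")
for f in files:
    print(f.name, f.size)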
- Run the %fs ls dbfs:/databricks-datasets/asa/airlines/ command, and we can see that the path contains 22 CSV files, with their corresponding sizes listed in bytes. We will now read all the CSV files at once by specifying... (one possible way is sketched below).
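One common way to read an entire directory of CSV files into a single DataFrame is to point spark.read at the folder path; the following is a minimal sketch, where the header and inferSchema options are assumptions on my part rather than settings confirmed above:

# Read all 22 CSV files at once by pointing the reader at the directory;
# header/inferSchema are assumed options here (inferSchema triggers an
# extra pass over the data to determine column types)
airlines_df = (
    spark.read.format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load("dbfs:/databricks-datasets/asa/airlines/")
)
airlines_df.printSchema()

Because the dataset holds roughly 120 million records, inferring the schema forces a scan of the data, so supplying an explicit schema is usually faster for a dataset of this size.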