Cleaning data using Airflow
Now that you can clean your data in Python, you can write a function for each cleaning task. By combining those functions, you can build a data pipeline in Airflow. The following example cleans the data, filters it, and then writes the result to disk.
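Before bringing Airflow into it, it may help to see the idea of combining small functions in plain pandas. The following is only a minimal sketch under stated assumptions: the function names (clean, filter_rows, run_pipeline), the sample columns, and the threshold are hypothetical and are not taken from the example in this chapter.

```python
import pandas as pd


def clean(df: pd.DataFrame) -> pd.DataFrame:
    # Normalize column names and drop rows with no value (hypothetical columns).
    out = df.copy()
    out.columns = [c.strip().lower() for c in out.columns]
    return out.dropna(subset=["value"])


def filter_rows(df: pd.DataFrame, minimum: float) -> pd.DataFrame:
    # Keep only the rows whose value is at least `minimum`.
    return df[df["value"] >= minimum]


def run_pipeline(df: pd.DataFrame, path: str, minimum: float = 10.0) -> pd.DataFrame:
    # Clean, then filter, then write the result to disk as CSV.
    result = filter_rows(clean(df), minimum)
    result.to_csv(path, index=False)
    return result


# Made-up sample data, for illustration only.
raw = pd.DataFrame({" Name ": ["a", "b", "c"], "VALUE": [5.0, None, 20.0]})
```

Calling run_pipeline(raw, 'out.csv') would run the three steps in order. In an Airflow pipeline, each step would typically become its own task rather than a nested function call, so that a failed step can be retried on its own.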
Starting with the same Airflow code you have used in the previous examples, set up the imports and the default arguments, as shown:
import datetime as dt
from datetime import timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator

import pandas as pd

default_args = {
    'owner': 'paulcrickard',
    'start_date': dt.datetime(2020, 4, 13),
    'retries': 1,
    'retry_delay': dt.timedelta(minutes=5),
}
Now you can write the functions that will perform the cleaning tasks. First...