Introduction to Data Ingestion
Welcome to the fantastic world of data! Are you ready to embark on a thrilling journey into data ingestion? If so, this is the perfect book to start with! Ingesting data is the first step into the big data world.
Data ingestion is the process of gathering, importing, and properly storing data so that the subsequent extract, transform, and load (ETL) pipeline can use it. To make this happen, we must be careful about which tools we use and how we configure them.
On our journey through this book, we will use Python and PySpark to retrieve data from different data sources and learn how to store it properly. To orchestrate all of this, we will apply the basic concepts of Airflow, along with efficient monitoring to ensure our pipelines run as expected.
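To give a flavor of what a single ingestion step looks like before we dive into the recipes, here is a minimal sketch (not one of the book's recipes) that uses PySpark to read from a hypothetical CSV source and store the data in a columnar format for downstream ETL steps; the file and output paths are placeholders.

```python
from pyspark.sql import SparkSession

# Start a local Spark session; the setup recipes in this chapter cover
# installing and configuring PySpark properly.
spark = SparkSession.builder.appName("ingestion-preview").getOrCreate()

# Gather/import: read a hypothetical CSV source (path is a placeholder).
events = spark.read.csv("data/events.csv", header=True, inferSchema=True)

# Store properly: write the data as Parquet so the downstream ETL
# pipeline can consume it efficiently.
events.write.mode("overwrite").parquet("storage/events")

spark.stop()
```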
This chapter introduces some basic concepts of data ingestion and shows how to set up your environment so you can start working on the recipes.
In this chapter, we will cover the following recipes:
- Setting up Python and the environment
- Installing PySpark
- Configuring Docker for MongoDB
- Configuring Docker for Airflow
- Logging libraries
- Creating schemas
- Applying data governance in ingestion
- Implementing data replication