Building a Data Pipeline in PyCharm
The term data pipeline generally denotes a step-by-step process for collecting, processing, and analyzing data. It is widely used in industry to describe a reliable workflow that takes raw data and converts it into actionable insights. Some data pipelines operate at massive scale, such as a marketing technology (MarTech) company ingesting millions of data points from Kafka streams, storing them in large data stores such as Hadoop or ClickHouse, and then cleansing, enriching, and visualizing that data. Other pipelines handle far smaller datasets that are no less impactful, such as the project we’ll be working on in this chapter.
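To make the idea concrete, here is a minimal sketch of the collect, clean, and analyze stages of a pipeline in plain Python. The records, field names, and the `clean`/`analyze` helpers are all hypothetical, chosen purely for illustration; real pipelines would typically replace each stage with a library or service.

```python
# Hypothetical raw data, as it might arrive from a collection stage.
# One record has a missing value that the cleaning stage must handle.
raw_records = [
    {"user": "a", "clicks": "12"},
    {"user": "b", "clicks": ""},   # missing value, to be dropped
    {"user": "c", "clicks": "7"},
]

def clean(records):
    """Processing stage: drop records with missing clicks and
    convert the click counts from strings to integers."""
    return [
        {"user": r["user"], "clicks": int(r["clicks"])}
        for r in records
        if r["clicks"]
    ]

def analyze(records):
    """Analysis stage: compute the average clicks per valid record."""
    return sum(r["clicks"] for r in records) / len(records)

cleaned = clean(raw_records)
print(analyze(cleaned))  # average over the two valid records
```

Each function maps to one pipeline stage, so stages can be tested and swapped independently; the same shape scales up when the list is replaced by a Kafka stream and the functions by distributed jobs.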
In this chapter, we will learn about the following topics:
- How to work with and maintain datasets
- How to clean and preprocess data
- How to visualize data
- How to utilize machine learning (ML)
Throughout this chapter, you will be able to apply what you have learned about the...