Chapter 1: Introduction to Data Processing with Apache Beam
Data. Big data. Real-time data. Data streams. These buzzwords describe different things, yet they share many common properties. Mind-blowing applications can be built by successfully applying (theoretically) simple logic: take data and produce knowledge. However, a simple-sounding task can turn out to be difficult when the amount of data needed to produce that knowledge is huge (and still growing). Given the vast volumes of data produced by humanity every day, which tools should we choose to turn our simple logic into scalable solutions? That is, solutions that protect our investment in creating the data extraction logic, even when new requirements arise or change on a daily basis and new data processing technologies keep being created. This book focuses on why Apache Beam might be a good solution to these challenges, and it will guide you through the Beam learning process.
In this chapter, we will cover the following topics:
- Why Apache Beam?
- Writing your first pipeline
- Running a pipeline against streaming data
- Exploring the key properties of unbounded data
- Measuring event-time progress inside data streams
- Assigning data to windows
- Unifying batch and streaming data processing