Differentiating stream processing from batch processing
While the processing tools don't change whether you are processing streams or batches, there are two things you should keep in mind while processing streams: unbounded data and time.
Data can be bounded or unbounded. Bounded data has an end, whereas unbounded data is constantly created and is possibly infinite. Last year's sales of widgets are an example of bounded data. A traffic sensor counting cars and recording their speeds on the highway produces unbounded data.
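The difference can be sketched in a few lines of Python. This is only an illustration: the sales figures, the traffic_sensor generator, and its speed range are hypothetical and stand in for a real data source. A bounded dataset behaves like a finite list, while an unbounded one behaves like a generator that never stops yielding.

```python
import itertools
import random

# Bounded: last year's widget sales, a finite collection with a known end.
# The figures are made up for illustration.
last_year_sales = [120, 95, 143, 110]


def traffic_sensor(seed=0):
    """Hypothetical sensor: yields one speed reading per car, forever."""
    rng = random.Random(seed)
    while True:
        yield rng.randint(40, 90)


# Aggregating the whole dataset works because the data ends.
total_sales = sum(last_year_sales)

# sum(traffic_sensor()) would never return, so take a finite slice instead.
first_five = list(itertools.islice(traffic_sensor(), 5))
```

Any operation that needs to see all of the data, such as a total or a maximum, terminates on the list but not on the generator, which is why streams are usually processed one record or one slice at a time.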
Why is this important in building data pipelines? Because with bounded data, you can know everything about the data. You can see it all at once. You can query it, put it in a staging environment, and then run Great Expectations on it to get a sense of the ranges, values, or other metrics to use in validation as you process your data.
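As a rough sketch of that idea, the snippet below profiles a small staged dataset to learn its range and then uses that range as a validation rule. It uses plain Python rather than Great Expectations, and the staged_speeds values and the is_valid helper are hypothetical names chosen for illustration.

```python
# Hypothetical bounded dataset: speed readings already collected and staged.
staged_speeds = [52, 61, 58, 70, 49, 66]

# Because all of the data is available, the full range can be computed at once.
low, high = min(staged_speeds), max(staged_speeds)


def is_valid(speed, low=low, high=high):
    """Return True if a reading falls inside the range observed in staging."""
    return low <= speed <= high


print(low, high)                    # 49 70
print(is_valid(60), is_valid(120))  # True False
```

The range here comes from having seen every record first, which is exactly what bounded data allows.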
Unbounded data streams in, and you don't know what the next piece of data will look like. This doesn't mean...