Batch data pipelines
Now that we’ve used our Jupyter notebook to explore our data and figure out what kinds of transformations we want to perform on our dataset, let’s imagine that we want to turn this into a production workload that can run automatically on very large files, without any further human effort. As I mentioned earlier, this is essential in any company that implements large-scale data analytics and AI/ML workloads. It’s simply not feasible to have somebody manually perform those transformations every time, and for very large data volumes, the transformations cannot be performed on a notebook instance at all. For example, imagine we get thousands of new postings every day and we want to automatically prepare that data for an ML model. We can do this by creating an automated pipeline that performs the data transformations every night (or however often we wish), as sketched below.
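As a rough sketch of what such a nightly job might look like, consider a small Python script that a scheduler (cron, or a workflow orchestrator) triggers once a day. The file paths and the transform_postings function here are hypothetical placeholders standing in for the transformations we worked out in the notebook, not part of any specific tool we’ve introduced:

from datetime import date
from pathlib import Path

import pandas as pd

# Hypothetical locations for the raw daily drop and the prepared output.
RAW_DIR = Path("/data/raw/postings")
OUT_DIR = Path("/data/prepared/postings")


def transform_postings(df: pd.DataFrame) -> pd.DataFrame:
    """Placeholder for the transformations developed in the notebook."""
    # For example: drop incomplete rows and normalize a text column.
    df = df.dropna(subset=["title", "description"])
    df["title"] = df["title"].str.strip().str.lower()
    return df


def run_nightly_batch() -> None:
    """Read today's raw file, transform it, and write the prepared output."""
    today = date.today().isoformat()
    raw_file = RAW_DIR / f"postings_{today}.csv"
    df = pd.read_csv(raw_file)
    prepared = transform_postings(df)
    OUT_DIR.mkdir(parents=True, exist_ok=True)
    prepared.to_parquet(OUT_DIR / f"postings_{today}.parquet", index=False)


if __name__ == "__main__":
    # A scheduler would invoke this script every night, with no human involved.
    run_nightly_batch()

A single script like this works for modest data volumes, but once the data no longer fits on one machine, or the pipeline has multiple dependent steps, we need purpose-built batch processing and orchestration tools, which is what the rest of this section covers.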
Batch data pipeline concepts and tools
Before we start diving in and building our batch...