Building Batch Pipelines Using Spark and Scala
The goal of this chapter is to combine everything we’ve learned so far to build a batch pipeline. The ability to handle large volumes of data efficiently and reliably in batch mode is an essential skill for data engineers. A batch pipeline is simply a process that ingests, transforms, and stores a set of data on a schedule or in an ad hoc fashion. Apache Spark, with its powerful capabilities for distributed data processing, and Scala, as a versatile and expressive programming language, provide an ideal foundation for constructing robust batch pipelines. This chapter will equip you with the knowledge and tools to harness the full potential of batch processing in the big data landscape.
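To make the ingest-transform-store shape concrete before we dive in, here is a minimal sketch of a batch pipeline in Spark and Scala. The paths, column names, and date are hypothetical placeholders for illustration only, not the chapter’s actual use case:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object MinimalBatchPipeline {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("minimal-batch-pipeline")
      .getOrCreate()

    // Ingest: read one day's worth of raw CSV files (hypothetical path)
    val raw = spark.read
      .option("header", "true")
      .csv("s3://my-bucket/raw/orders/2023-01-01/")

    // Transform: drop malformed rows and stamp each record with a load time
    val cleaned = raw
      .filter(col("order_id").isNotNull)
      .withColumn("loaded_at", current_timestamp())

    // Store: persist the result as Parquet, replacing any previous run
    cleaned.write
      .mode("overwrite")
      .parquet("s3://my-bucket/bronze/orders/2023-01-01/")

    spark.stop()
  }
}
```

A job like this would typically be triggered by a scheduler; the rest of the chapter builds out each of these three stages in depth.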
In this chapter, we’re going to cover the following main topics:
- Understanding our use case and data
- Understanding the medallion architecture
- Ingesting data in batch
- Transforming data and checking quality ...