Apache Spark Deep Dive
One of the most fundamental questions an architect faces is how to store data and which methodology to use. For example, should you use a relational database or object storage? This chapter explains which storage pattern best fits your scenario. We will then walk through setting up Delta Lake, a hybrid approach to data storage in an object store. We will mostly use the Python API, switching to SQL only where necessary. Lastly, we will cover the essential Apache Spark theory you need to know to build a data platform effectively.
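As a taste of what is ahead, here is a minimal sketch of creating and reading a Delta Lake table with the Python API. It assumes the `delta-spark` package is installed locally; the application name and the `/tmp/delta/events` path are illustrative choices, not the configuration used later in the chapter.

```python
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

# Configure a local Spark session with the Delta Lake extensions enabled.
builder = (
    SparkSession.builder.appName("delta-intro")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Write a small DataFrame as a Delta table, then read it back.
df = spark.createDataFrame([(1, "bronze"), (2, "silver")], ["id", "layer"])
df.write.format("delta").mode("overwrite").save("/tmp/delta/events")

spark.read.format("delta").load("/tmp/delta/events").show()
```

Everything in this snippet, from session configuration to table management, is unpacked step by step in the sections that follow.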
In this chapter, we’re going to cover the following main topics:
- How Spark manages its cluster
- How Spark processes data
- How cloud storage options vary and what is available
- How to create and manage Delta Lake tables and databases