Introduction to Dataproc
Dataproc is a Google-managed service for running Hadoop environments. It manages the underlying virtual machines (VMs), operating systems, and Hadoop software installations. With Dataproc, Hadoop developers can focus on developing jobs and submitting them to the service rather than on maintaining the infrastructure.
From a data engineering perspective, understanding Dataproc largely comes down to understanding Hadoop and the data lake concept. If you are not familiar with Hadoop, let’s learn about it in the next section.
A brief history of the data lake and Hadoop ecosystem
The data lake rose in popularity in the 2010s. Companies started to talk about it much more often, in contrast to the data warehouse, which is similar but different in principle. Storing data as files in a centralized system makes a lot of sense in the modern era, compared to the old days when companies typically stored and processed data only for regular reporting. In the modern era, people use data for exploration from...