Understanding distributed file systems
The Google File System (GFS) paper showed the technology world how to store and access data on a massive (for the time) scale: hundreds of terabytes spread across thousands of commodity servers. Not only could such a system store and access vast amounts of data, it could also hold non-traditional types of data.
The rise of the internet brought exactly such data: video files, images, audio, email, and HTML. Data warehouses had no way to store or use these types of data, so the new distributed file system was a perfect fit. The approach quickly took hold in the industry through Apache Hadoop, created in 2005, which paired the first widely adopted distributed file system, the Hadoop Distributed File System (HDFS), with a processing framework, MapReduce. This newfound scalability in storage and compute at commodity prices gave rise to data lakes.
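To make the MapReduce model concrete, here is a minimal sketch in plain Python (not Hadoop's actual Java API) of the two phases it popularized: a map step that emits key/value pairs and a reduce step that aggregates values by key. The word-count task, the function names, and the shuffle helper are illustrative assumptions, not details from the original text; in a real cluster the framework distributes these steps across the nodes that hold the data.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in the input."""
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group all values by key, as the framework would do
    between the map and reduce phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: sum the counts emitted for one word."""
    return (key, sum(values))

# Usage: count words across a handful of "documents". On a real cluster,
# each map task would run on the node storing its block of data.
documents = ["the quick brown fox", "the lazy dog", "the fox"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```

The key design idea is that map tasks move to where the data already lives, so only the much smaller intermediate key/value pairs cross the network, which is what made processing at commodity prices practical.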
Now, let’s dive into data lakes and explore...