Arguably, the first time big data was being talked about in a context we know now was in July, 1997. Michael Cox and David Ellsworth, scientists/researchers from NASA, described the problem they faced when processing humongous amounts of data with the traditional computers of that time. In the early 2000s, Lexis Nexis designed a proprietary system, which later went on to become the High-Performance Computing Cluster (HPCC), to address the growing need of processing data on a cluster. It was later open sourced in 2011.
It was an era of dot coms and Google was challenging the limits of the internet by crawling and indexing the entire internet. With the rate at which the internet was expanding, Google knew it would be difficult if not impossible to scale vertically to process data of that size. Distributed computing, though still in its infancy, caught Google's attention. They not only developed a distributed fault tolerant filesystem, Google File System (GFS), but also a distributed processing engine/system called MapReduce. It was then in 2003-2004 that Google released the white paper titled The Google File System by Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, and shortly thereafter they released another white paper titled MapReduce: Simplified Data Processing on Large Clusters by Jeffrey Dean and Sanjay Ghemawat.
Doug Cutting, an open source contributor, around the same time was looking for ways to make an open source search engine and like Google was failing to process the data at the internet scale. By 1999, Doug Cutting had developed Lucene, a Java library with the capability of text/web searching among other things. Nutch, an open source web crawler and data indexer built by Doug Cutting along with Mike Cafarella, was not scaling well. As luck would have it, Google's white paper caught Doug Cutting's attention. He began working on similar concepts calling them Nutch Distributed File System (NDFS) and Nutch MapReduce. By 2005, he was able to scale Nutch, which could index from 100 million pages to multi-billion pages using the distributed platform.
However, it wasn't just Doug Cutting but Yahoo! too who became interested in the development of the MapReduce computing framework to serve its processing capabilities. It is here that Doug Cutting refactored the distributed computing framework of Nutch and named it after his kid's elephant toy, Hadoop. By 2008, Yahoo! was using Hadoop in its production cluster to build its search index and metadata called web map. Despite being a direct competitor to Google, one distinct strategic difference that Yahoo! took while co-developing Hadoop was the nature in which the project was to be developed: they open sourced it. And the rest, as we know is history!
In this chapter, we will cover the following topics:
- What is big data?
- Why Apache Spark?
- RDD the first citizen of Spark
- Spark ecosystem -- Spark SQL, Spark Streaming, Milb, Graphx
- What's new in Spark 2.X?