Introduction to Apache Hadoop
Apache Hadoop is an open source, Java-based project from the Apache Software Foundation. The core purpose of this software has been to provide a platform that is scalable, extensible, and fault tolerant for the distributed storage and processing of big data. Please refer to Chapter 2, Machine learning and Large-scale Datasets for more information on what data qualifies as big data. The following image is the standard logo of Hadoop:
At the heart of it, it leverages clusters of nodes that can be commodity servers and facilitates parallel processing. The name Hadoop was given by its creator Doug Cutting, naming it after his child's yellow stuffed toy elephant. Till date, Yahoo! has been the largest contributor and an extensive user of Hadoop. More details of Hadoop, its architecture, and download links are accessible at http://hadoop.apache.org/.
Hadoop is an industry standard platform for big data, and it comes with extensive support for all the popular...