Apache Hadoop is an open-source software platform for distributed storage and distributed processing of very large datasets, typically running on clusters of low-cost commodity hardware. It traces its origins to Google's paper on the Google File System, published in October 2003, and to a follow-up Google research paper on MapReduce, which described simplified data processing on large clusters.
Apache Nutch is a highly scalable and extensible open-source web crawler that implemented a MapReduce facility and a distributed file system based on Google's research papers. These facilities were later spun off into a separate sub-project, which became Apache Hadoop.
Apache Hadoop is designed to scale easily from a few servers to thousands, each offering local storage and computation. Rather than shipping data to a central processor, Hadoop moves computation to the nodes where the data is stored and processes it in parallel across the cluster. One of the benefits of Hadoop...
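The parallel processing model described above is MapReduce. As a minimal sketch (an in-process illustration only, not Hadoop's actual API), the classic word-count example shows the two phases: a map step that emits key-value pairs from each input line, and a reduce step that groups pairs by key and aggregates them.

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in one line of input.
    return [(word, 1) for word in line.split()]

def reduce_phase(pairs):
    # Shuffle + reduce: group pairs by key (the word) and sum the counts.
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

# Hypothetical input; in a real cluster each line would live in an HDFS block
# and the map tasks would run on the nodes holding those blocks.
lines = ["hadoop stores data", "hadoop processes data in parallel"]
result = reduce_phase(chain.from_iterable(map_phase(l) for l in lines))
print(result["hadoop"], result["data"])  # 2 2
```

Because each map call depends only on its own line, the map phase can run on many machines at once; only the reduce phase needs the grouped intermediate results, which is what Hadoop's shuffle stage provides.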