Mastering Hadoop

Go beyond the basics and master the next generation of Hadoop data processing platforms

Product type: Paperback
Published in: Dec 2014
ISBN-13: 9781783983643
Length: 374 pages
Edition: 1st Edition
Author: Sandeep Karanth
Table of Contents (15 chapters)

Preface
1. Hadoop 2.X
2. Advanced MapReduce
3. Advanced Pig
4. Advanced Hive
5. Serialization and Hadoop I/O
6. YARN – Bringing Other Paradigms to Hadoop
7. Storm on YARN – Low Latency Processing in Hadoop
8. Hadoop on the Cloud
9. HDFS Replacements
10. HDFS Federation
11. Hadoop Security
12. Analytics Using Hadoop
A. Hadoop for Microsoft Windows
Index

The inception of Hadoop

The birth and evolution of the Internet led to the World Wide Web (WWW), a huge set of documents written in the HTML markup language and linked with one another via hyperlinks. Clients, known as browsers, became the user's window to the WWW. The ease of creating, editing, and publishing these web documents led to an explosion in the volume of documents on the Web.

In the latter half of the 1990s, the huge volume of web documents led to discoverability problems. Users found it hard to locate the right document for their information needs, which set off a gold rush among web companies in web discovery and search. Search engines and directory services for the Web, such as Lycos, AltaVista, Yahoo!, and Ask Jeeves, became commonplace.

These search engines started ingesting and summarizing the Web. The process of traversing the Web and ingesting its documents is known as crawling. Smart crawlers were developed: crawlers that can download documents quickly, avoid link cycles, and detect document updates.
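To make the idea of avoiding link cycles concrete, here is a minimal sketch of a crawler's traversal logic. It is an illustration under stated assumptions, not how any particular search engine works: the `TOY_WEB` dictionary and the `crawl` function are hypothetical names, and the "web" is an in-memory link graph rather than documents fetched over HTTP.

```python
from collections import deque

# Hypothetical in-memory "web": page name -> list of outbound links.
# A real crawler would download and parse documents instead.
TOY_WEB = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html", "a.html"],  # links back to a.html, forming a cycle
    "c.html": ["a.html"],
    "d.html": ["a.html"],            # not reachable from a.html
}


def crawl(seed, web):
    """Breadth-first traversal that visits each reachable page exactly once."""
    visited = {seed}        # pages already queued; this is what breaks cycles
    order = []              # pages in the order they were "downloaded"
    queue = deque([seed])
    while queue:
        page = queue.popleft()
        order.append(page)
        for link in web.get(page, []):
            if link not in visited:
                visited.add(link)
                queue.append(link)
    return order


print(crawl("a.html", TOY_WEB))
```

Without the `visited` set, the cycle between `a.html` and `b.html` would keep the traversal running forever.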

In the early part of this century, Google emerged as the torchbearer of search technology. Its success was attributed not only to the introduction of robust, spam-resistant relevance technology, but also to its minimalistic approach, speed, and quick data processing. It achieved the former by developing novel concepts such as PageRank, and the latter by innovatively adapting and applying existing techniques, such as MapReduce, to large-scale parallel and distributed data processing.

Note

PageRank is an algorithm named after Google's cofounder Larry Page. It is one of the algorithms used to rank web search results for a user. Search engines use keyword matching on websites to determine relevance to a search query. This prompts spammers to include many keywords, relevant or irrelevant, on their websites to trick these search engines and appear in the results for almost all queries. For example, a car dealer can include keywords related to shopping or movies and appear in a wider range of search queries. The user experience suffers because of the irrelevant results.

PageRank thwarted this kind of fraud by analyzing the quality and quantity of links pointing to a particular web page. The underlying assumption is that important pages receive more inbound links.
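The link-analysis idea can be sketched with a simplified power-iteration version of PageRank. This is a minimal illustration, not Google's production algorithm: the `pagerank` function, the `TOY_LINKS` graph, the damping factor of 0.85, and the fixed iteration count are all assumptions chosen for the example.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Simplified PageRank by power iteration.

    links maps each page to the list of pages it links to.
    damping and iterations are assumed values for this sketch.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if not targets:
                # A page with no outbound links spreads its rank evenly.
                targets = pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical link graph: "home" has the most inbound links.
TOY_LINKS = {
    "home": ["about"],
    "about": ["home"],
    "blog": ["home", "about"],
    "spam": ["home"],
}

ranks = pagerank(TOY_LINKS)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page}: {score:.4f}")
```

In this toy graph, `spam` can stuff its own content with keywords, but because no page links to it, its score stays at the baseline. A page's rank depends on who links to it, not on what it says about itself.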

In 2004, Google published its MapReduce technique and implementation. The Google File System (GFS), the distributed filesystem that accompanies the MapReduce engine, had been described in a 2003 paper. Since then, the MapReduce paradigm has become the most popular technique for processing massive datasets in parallel and distributed settings, and it has been adopted by many other companies. Hadoop is an open source implementation of the MapReduce framework; Hadoop and its associated filesystem, HDFS, are inspired by Google's MapReduce and GFS, respectively.
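For readers new to the paradigm, the following single-process sketch shows the shape of a MapReduce computation using the classic word count example. It does not use the Hadoop API. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the `DOCS` sample are hypothetical, and a real framework would run the map and reduce steps in parallel across many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


# Hypothetical sample input.
DOCS = ["the web is big", "the web grows"]
print(word_count(DOCS))
```

Because each call to `map_phase` sees only one document and each call to `reduce_phase` sees only one key, both steps can be distributed without coordination between workers. This independence is what makes the paradigm suitable for massive datasets.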

Since their inception, Hadoop and other MapReduce-based systems have run a diverse set of workloads from different verticals, web search being one of them. For example, Hadoop is used extensively at Last.fm (http://www.last.fm/) to generate charts and track usage statistics. The cloud provider Rackspace uses it for log processing. Yahoo!, one of the biggest proponents of Hadoop, uses Hadoop clusters not only to build web indexes for search, but also to run sophisticated advertisement placement and content optimization algorithms.
