Preface
We are in an age where data is the primary driver of decision-making. With storage costs declining, network speeds increasing, and everything around us becoming digital, we do not hesitate to download, store, or share data with others. About 20 years ago, a camera was a device used to capture pictures on film. Every photograph had to be captured almost perfectly. Film negatives had to be stored carefully lest they get damaged. Taking prints of these photographs came at a higher cost. The time between clicking a picture and viewing it was almost a day. As a result, less data was captured: these factors deterred people from recording each and every moment of their lives unless it was very significant.
However, with cameras becoming digital, this has changed. We do not hesitate to click a photograph of almost anything at any time. We do not worry about storage, as our external disks of terabyte capacity always provide a reliable backup. We seldom take our cameras anywhere, as we have mobile devices that we can use to take photographs. We have applications such as Instagram that can be used to add effects to our pictures and share them. We gather opinions and information about the pictures we click, and we base some of our decisions on them. We capture almost every moment, of great significance or not, and push it into our memory books. The era of Big Data has arrived!
This era of Big Data has brought similar changes to businesses as well. Almost everything in a business is logged. Every action taken by a user on an e-commerce page is recorded to improve quality of service, and every item bought by the user is recorded to cross-sell or up-sell other items. Businesses want to understand the DNA of their customers and try to infer it by extracting every possible piece of data they can get about these customers. Businesses are not worried about the format of the data. They are ready to accept speech, images, natural language text, or structured data. These data points are used to drive business decisions and personalize experiences for the user. The more data there is, the higher the degree of personalization and the better the experience for the user.
We saw that we are ready, in some aspects, to take on this Big Data challenge. However, what about the tools used to analyze this data? Can they handle the volume, velocity, and variety of the incoming data? Theoretically, all this data can reside on a single machine, but what is the cost of such a machine? Will it be able to cater to variations in load? We know that supercomputers are available, but there are only a handful of them in the world. Supercomputers don't scale. The alternative is to build a team of machines, that is, a cluster of individual computing units that work in tandem to achieve a task. The machines in a cluster are interconnected via a very fast network and provide better scaling and elasticity, but that is not enough. These clusters have to be programmed. A greater number of machines, just like a larger team of human beings, requires more coordination and synchronization. The higher the number of machines, the greater the possibility of failures in the cluster. How do we handle synchronization and fault tolerance in a simple way that eases the burden on the programmer? The answer is systems such as Hadoop.
Hadoop is synonymous with Big Data processing. Its simple programming model, "code once and deploy at any scale" paradigm, and ever-growing ecosystem make Hadoop an inclusive platform for programmers with different levels of expertise and breadth of knowledge. Today, it is the most sought-after job skill in the data sciences space. Hadoop has become the go-to tool for handling and analyzing Big Data. Hadoop 2.0 is spreading its wings to cover a variety of application paradigms and solve a wider range of data problems. It is rapidly becoming a general-purpose cluster platform for all data processing needs, and will soon become a mandatory skill for every engineer across verticals.
This book covers optimizations and advanced features of MapReduce, Pig, and Hive. It also covers Hadoop 2.0 and illustrates how it can be used to extend the capabilities of Hadoop.
Hadoop, in its 2.0 release, has evolved to become a general-purpose cluster-computing platform. The book explains the platform-level changes that enable this. It covers industry guidelines for optimizing MapReduce jobs and higher-level abstractions such as Pig and Hive in Hadoop 2.0. Some advanced job patterns and their applications are also discussed. These topics will empower the Hadoop user to optimize existing jobs and migrate them to Hadoop 2.0. Subsequently, the book dives deeper into Hadoop 2.0-specific features such as YARN (Yet Another Resource Negotiator) and HDFS Federation, along with examples. Replacing HDFS with other filesystems is another topic covered in the latter half of the book. Understanding these topics will enable Hadoop users to extend Hadoop to other application paradigms and data stores, making efficient use of the available cluster resources.
This book is a guide focusing on advanced concepts and features in Hadoop. Foundations of every concept are explained with code fragments or schematic illustrations. The data processing flow dictates the order of the concepts in each chapter.