Scaling Big Data with Hadoop and Solr, Second Edition
Understand, design, build, and optimize your big data search engine with Hadoop and Apache Solr

Author: Hrishikesh Vijay Karambelkar
Product type: Paperback
Published: April 2015
Publisher: Packt Publishing
ISBN-13: 9781783553396
Length: 166 pages

Chapter 1. Processing Big Data Using Hadoop and MapReduce

Continuous evolution in computer science has enabled the world to work in a faster, more reliable, and more efficient manner. Many businesses have transformed themselves to use electronic media, employing information technologies to innovate in how they communicate with customers, partners, and suppliers. This shift has also given birth to new industries such as social media and e-commerce. The resulting rapid increase in the amount of data has led to an "information explosion." To manage such huge volumes of information, computational capabilities have evolved as well, with a focus on optimizing hardware cost, giving rise to distributed systems. In today's world, the problem has multiplied: information is generated from disparate sources such as social media, sensors/embedded systems, and machine logs, in both structured and unstructured forms. Processing such large and complex data with traditional systems and methods is a challenging task. Big Data is an umbrella term for the management and processing of such data.

Big data is usually associated with high-volume, rapidly growing data with unpredictable content. The IT advisory firm Gartner defines big data using three Vs: high volume of data, high velocity of processing, and high variety of information. IBM adds a fourth V, high veracity, to emphasize that the data must be accurate enough to support business decisions. While the potential benefits of big data are real and significant, many challenges remain. Organizations that deal with such high volumes of data must address the following areas:

  • Data capture/acquisition from various sources
  • Data massaging or curating
  • Organization and storage
  • Big data processing such as search, analysis, and querying
  • Information sharing or consumption
  • Information security and privacy

Big data poses many challenges to the technologies in use today, and many organizations have started investing in these areas. Even so, as per Gartner, through 2015, 85% of Fortune 500 organizations will be unable to exploit big data for a competitive advantage.

To handle the problem of storing and processing such large and complex data, many software frameworks have been created. Among them, Apache Hadoop is one of the most widely used open source frameworks for the storage and processing of big data. In this chapter, we are going to get acquainted with Apache Hadoop. We will be covering the following topics:

  • Apache Hadoop's ecosystem
  • Configuring Apache Hadoop
  • Running Apache Hadoop
  • Setting up a Hadoop cluster
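
Before diving into those topics, it helps to see what a MapReduce program actually looks like. The following is a minimal sketch of the canonical word-count job, written against the standard org.apache.hadoop.mapreduce API and assuming a Hadoop 2.x (or later) installation; the class and variable names are illustrative, not a listing from this book.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emits (word, 1) for every whitespace-separated token
    // in its input split.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reducer: sums the counts emitted for each distinct word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values,
                Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        // The reducer doubles as a combiner here, since summation is
        // associative; this pre-aggregates counts on each mapper node.
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a JAR, such a job would typically be launched with something like hadoop jar wordcount.jar WordCount /input /output, where the input and output paths are hypothetical HDFS locations. The framework handles splitting the input, shuffling the mapper output by key, and running the reducers in parallel, which is exactly the division of labor the rest of this chapter explores.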