What do you get with eBook?

Instant access to your Digital eBook purchase

Download this book in EPUB and PDF formats

Access this title in our online reader with advanced features

DRM FREE - Read whenever, wherever and however you want

Apache Hive Essentials

Chapter 1. Overview of Big Data and Hive

This chapter is an overview of big data and Hive, especially in the Hadoop ecosystem. It briefly introduces the evolution of big data so that readers know where they are in the journey of big data and find their preferred areas in future learning. This chapter also covers how Hive has become one of the leading tools in big data warehousing and why Hive is still competitive.

In this chapter, we will cover the following topics:

A short history from database and data warehouse to big data
Introducing big data
Relational and NoSQL databases versus Hadoop
Batch, real-time, and stream processing
Hadoop ecosystem overview
Hive overview

A short history

In the 1960s, when computers became a more cost-effective option for businesses, people started to use databases to manage data. Later on, in the 1970s, relational databases became popular for business use because they mapped physical data onto logical business concepts easily and closely. In the next decade, around the 1980s, Structured Query Language (SQL) became the standard query language for databases. The effectiveness and simplicity of SQL motivated many people to use databases and brought databases closer to a wide range of users and developers. Databases then remained the main tool for data applications and data management for a long period of time.

Once plenty of data was collected, people started to think about how to deal with the old data. Then, the term data warehousing came up in the 1990s. From that time onwards, people started to discuss how to evaluate current performance by reviewing historical data. Various data models and tools were created at that time to help enterprises effectively manage, transform, and analyze historical data. Traditional relational databases also evolved to provide more advanced aggregation and analytic functions as well as optimizations for data warehousing. The leading query language was still SQL, but it was more intuitive and powerful than in previous versions. The data was still well structured and the model was normalized.

As we entered the 2000s, the Internet gradually became the leading source of data in terms of both variety and volume. Newer technologies, such as social media analytics, web mining, and data visualization, helped many businesses deal with massive amounts of data to better understand their customers, products, competition, and markets. The data volume grew and the data format changed faster than ever before, which forced people to search for new solutions, especially from the academic and open source areas. As a result, big data became a hot topic and a challenging field for many researchers and companies.

However, in every challenge there lies great opportunity. Hadoop was one of the open source projects that earned wide attention due to its open source license and active communities. This was one of the few times that an open source project changed technology trends before any commercial software product did. Soon after, NoSQL databases and real-time and stream computing followed and quickly became important components of big data ecosystems. Armed with these big data technologies, companies were able to review the past, evaluate the present, and also predict the future.

Around the 2010s, time to market became the key factor in making a business competitive and successful. When it came to big data analysis, people could not wait to see the reports or results. A short delay could make a great difference when making important business decisions. Decision makers wanted to see the reports or results within a few hours, minutes, or in a few cases even seconds. Real-time analytical tools, such as Impala (http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/impala.html), Presto (http://prestodb.io/), Storm (https://storm.apache.org/), and so on, make this possible in different ways.

Introducing big data

Big data is not simply a big volume of data. Here, the word "Big" refers to the big scope of data. A well-known saying in this domain describes big data with three words starting with the letter V: volume, velocity, and variety. However, the analytical and data science world has seen data vary in other dimensions in addition to the fundamental three Vs of big data, such as veracity, variability, volatility, visualization, and value. The different Vs mentioned so far are explained as follows:

Volume: This refers to the amount of data generated every second. 90 percent of the world's data today has been created in the last two years, and the amount of data in the world doubles every two years. Such big volumes of data are mainly generated by machines, networks, social media, and sensors, and include structured, semi-structured, and unstructured data.
Velocity: This refers to the speed at which the data is generated, stored, analyzed, and moved around. With the availability of Internet-connected devices, wireless or wired, machines and sensors can pass on their data as soon as it is created. This leads to real-time streaming and helps businesses to make valuable and fast decisions.
Variety: This refers to the different data formats. Data used to be stored as text, dat, and csv from sources such as filesystems, spreadsheets, and databases. This type of data, which resides in a fixed field within a record or file, is called structured data. Nowadays, data is not always in the traditional format. The newer semi-structured or unstructured forms of data come from many sources, such as e-mails, photos, audio, video, PDFs, SMSes, or even something we have no idea about. This variety of data formats creates problems for storing and analyzing data. This is one of the major challenges we need to overcome in the big data domain.
Veracity: This refers to the quality of data, such as trustworthiness, biases, noise, and abnormality in data. Corrupt data is quite normal. It can arise for a number of reasons, such as typos, missing or uncommon abbreviations, data reprocessing, system failures, and so on. However, ignoring this corrupt data could lead to inaccurate data analysis and eventually a wrong decision. Therefore, making sure the data is correct through data auditing and correction is very important for big data analysis.
Variability: This refers to the changing of data. It means that the same data could have different meanings in different contexts. This is particularly important when carrying out sentiment analysis. The analysis algorithms must be able to understand the context and discover the exact meaning and values of data in that context.
Volatility: This refers to how long the data is valid and stored. This is particularly important for real-time analysis. It requires a target scope of data to be determined so that analysts can focus on particular questions and gain good performance out of the analysis.
Visualization: This refers to the way of making data well understood. Visualization does not mean ordinary graphs or pie charts. It makes vast amounts of data comprehensible in a multidimensional view that is easy to understand. Visualization is an innovative way to show changes in data. It requires lots of interaction, conversations, and joint efforts between big data analysts and business domain experts to make the visualization meaningful.
Value: This refers to the knowledge gained from data analysis on big data. The value of big data is how organizations turn themselves into big data-driven companies and use the insight from big data analysis for their decision making.

In summary, big data is not just about lots of data. It is a practice of discovering new insight from existing data and guiding the analysis of future data. A big-data-driven business will be more agile and competitive in overcoming challenges and beating its competition.

Batch, real-time, and stream processing

Batch processing handles data in batches: it reads the input, processes it, and writes the output. Apache Hadoop is the best-known open source implementation of batch processing; it is a distributed system built on the MapReduce paradigm. The data is stored in a shared, distributed filesystem called Hadoop Distributed File System (HDFS) and divided into splits, which are the logical data divisions for MapReduce processing. To process these splits, each map task reads a split, passes all of its key/value pairs to a map function, and writes the results to intermediate files. After the map phase is completed, each reduce task reads the intermediate files and passes them to the reduce function. Finally, the reduce task writes results to the final output files. The advantages of the MapReduce model include easier distributed programming, near-linear speedup, good scalability, and fault tolerance. The disadvantage of this batch processing model is that it cannot execute recursive or iterative jobs efficiently. In addition, as an obvious consequence of batch behavior, all map output must be ready before the reduce phase starts, which makes MapReduce unsuitable for online and stream processing use cases.
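The map, shuffle, and reduce flow described above can be illustrated with a minimal single-process sketch. This is not Hadoop code: the names map_fn, reduce_fn, and run_job are hypothetical, the splits are plain Python lists, and an in-memory list stands in for the intermediate files. It only shows the shape of the paradigm, using a word count as the example job.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_fn(_key: int, line: str) -> Iterator[Tuple[str, int]]:
    """Map function: emit (word, 1) for every word in one input record."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word: str, counts: Iterable[int]) -> Tuple[str, int]:
    """Reduce function: sum all counts collected for one key."""
    return word, sum(counts)


def run_job(splits: List[List[str]]) -> Dict[str, int]:
    """Run a toy word-count job over a list of splits (hypothetical helper)."""
    # Map phase: each split would be handled by its own map task. The list
    # below stands in for the intermediate files.
    intermediate: List[Tuple[str, int]] = []
    for split in splits:
        for offset, line in enumerate(split):
            intermediate.extend(map_fn(offset, line))

    # Shuffle: group intermediate values by key. Grouping can only finish
    # once all map output exists, which is the batch barrier.
    grouped: Dict[str, List[int]] = defaultdict(list)
    for word, count in intermediate:
        grouped[word].append(count)

    # Reduce phase: one reduce call per key; the results form the output.
    return dict(reduce_fn(word, counts) for word, counts in sorted(grouped.items()))


if __name__ == "__main__":
    demo_splits = [["hive runs on hadoop"], ["hadoop stores data", "hive queries data"]]
    print(run_job(demo_splits))
```

Because the grouping step needs every intermediate pair before any reduce call can run, the sketch also makes the batch limitation visible: nothing is emitted until all input has been mapped.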

Real-time processing means processing data and getting the result almost immediately. In the area of real-time ad hoc queries over big data, this concept was first implemented in Dremel by Google. Dremel uses a novel columnar storage format for nested structures with fast indexing, together with scalable aggregation algorithms that compute query results in parallel instead of in batch sequences. These two techniques are the major characteristics of real-time processing and are used by similar implementations, such as Cloudera Impala, Facebook Presto, Apache Drill, and Hive on Tez, which is powered by the Stinger initiative and its goal of a 100x performance improvement over Apache Hive. In-memory computing offers another route to real-time processing. Memory provides very high bandwidth, more than 10 gigabytes/second compared to a hard disk's 200 megabytes/second, and much lower latency, nanoseconds versus milliseconds. With the price of RAM falling steadily, in-memory computing has become more affordable as a real-time solution. Apache Spark is a popular open source implementation of in-memory computing. Spark can be easily integrated with Hadoop, and its resilient distributed datasets (RDDs) can be generated from data sources such as HDFS and HBase for efficient caching.
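To put the bandwidth figures quoted above in perspective, the following back-of-the-envelope sketch estimates how long one sequential pass over a dataset takes at each rate. The 1 terabyte dataset size and the function name scan_seconds are assumptions chosen for illustration, decimal units are used, and latency, parallelism, and caching are ignored.

```python
def scan_seconds(data_bytes: float, bytes_per_second: float) -> float:
    """Time for one sequential pass over the data at a fixed bandwidth."""
    return data_bytes / bytes_per_second


DATASET_BYTES = 10**12          # assumed dataset size: 1 terabyte
MEMORY_BANDWIDTH = 10 * 10**9   # 10 gigabytes/second, as quoted in the text
DISK_BANDWIDTH = 200 * 10**6    # 200 megabytes/second, as quoted in the text

memory_time = scan_seconds(DATASET_BYTES, MEMORY_BANDWIDTH)
disk_time = scan_seconds(DATASET_BYTES, DISK_BANDWIDTH)

if __name__ == "__main__":
    print(f"in memory: {memory_time:.0f} s")
    print(f"on disk:   {disk_time:.0f} s")
    print(f"ratio:     {disk_time / memory_time:.0f}x")
```

Under these assumptions, a single pass takes about 100 seconds from memory versus about 5,000 seconds from one disk, a factor of 50 from bandwidth alone.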

Stream processing means continuously processing and acting on live stream data to get a result. In stream processing, there are two popular frameworks: Storm (https://storm.apache.org/) from Twitter and S4 (http://incubator.apache.org/s4/) from Yahoo!. Both frameworks run on the Java Virtual Machine (JVM) and both process keyed streams. In terms of the programming model, an S4 program is defined as a graph of Processing Elements (PEs), which are small subprograms, and S4 instantiates one PE per key. In short, Storm gives you the basic tools to build a framework, while S4 gives you a well-defined framework.
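The idea of instantiating one processing element per key can be sketched in a few lines. This is not the S4 or Storm API: the CountingPE class and the run_stream function are hypothetical names, and a Python list stands in for the live keyed stream. It only illustrates how per-key state is created lazily as events arrive.

```python
from typing import Dict, Iterable, Tuple


class CountingPE:
    """Hypothetical processing element that keeps state for exactly one key."""

    def __init__(self, key: str) -> None:
        self.key = key
        self.total = 0

    def process(self, value: int) -> None:
        # Act on each event as it arrives; no batch boundary is needed.
        self.total += value


def run_stream(events: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Route each (key, value) event to the PE that owns the key."""
    pes: Dict[str, CountingPE] = {}
    for key, value in events:
        # Instantiate a PE the first time a key is seen.
        if key not in pes:
            pes[key] = CountingPE(key)
        pes[key].process(value)
    return {key: pe.total for key, pe in pes.items()}


if __name__ == "__main__":
    clicks = [("page_a", 1), ("page_b", 1), ("page_a", 1)]
    print(run_stream(clicks))
```

In a real stream processor the loop never ends and the running totals would be emitted continuously; the function returns here only so that the sketch terminates.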

Hive overview

Hive is a standard for SQL queries over petabytes of data in Hadoop. It provides SQL-like access to data in HDFS, which allows Hadoop to be used like a data warehouse. The Hive Query Language (HQL) has semantics and functions similar to those of standard SQL in relational databases, so experienced database analysts can pick it up easily. Hive's query language can run on different computing frameworks, such as MapReduce, Tez, and Spark, for better performance.

Hive's data model provides a high-level, table-like structure on top of HDFS. It supports three data structures: tables, partitions, and buckets. Tables correspond to HDFS directories and can be divided into partitions, which in turn can be divided into buckets. Hive supports most primitive data types, such as TIMESTAMP, STRING, FLOAT, BOOLEAN, DECIMAL, DOUBLE, INT, SMALLINT, and BIGINT, as well as complex data types, such as UNION, STRUCT, MAP, and ARRAY.
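As a rough illustration of how tables, partitions, and buckets map onto HDFS, the sketch below builds the path where a row would typically land. Several details are assumptions made for the example rather than statements from this chapter: the warehouse root /user/hive/warehouse, the col=value partition directory naming, the 000000_0 style of bucket file name, and a simplified bucket rule for integer columns (non-negative hash modulo the bucket count). The employee table and the helper names are hypothetical.

```python
from typing import Dict

# Assumed warehouse root; in a real deployment this is configurable.
WAREHOUSE_DIR = "/user/hive/warehouse"


def bucket_for(value: int, num_buckets: int) -> int:
    """Simplified bucket assignment for an integer bucketing column."""
    return (value & 0x7FFFFFFF) % num_buckets


def data_path(table: str, partition: Dict[str, str], bucket: int) -> str:
    """Build table directory / partition directories / bucket file."""
    segments = [WAREHOUSE_DIR, table]
    # One nested directory per partition column, in declaration order.
    segments.extend(f"{col}={val}" for col, val in partition.items())
    # Assumed bucket file naming: zero-padded bucket number plus a suffix.
    segments.append(f"{bucket:06d}_0")
    return "/".join(segments)


if __name__ == "__main__":
    bucket = bucket_for(42, num_buckets=4)
    print(data_path("employee", {"year": "2015", "month": "02"}, bucket))
```

The point of the layout is that a query filtering on a partition column only has to read the matching directories, and a query on the bucketing column can narrow its input further to individual files.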

The following diagram shows the architecture of Hive within the Hadoop ecosystem. The Hive metadata store (also called the metastore) can use an embedded, local, or remote database. Hive servers are built on Apache Thrift Server technology. Since the Hive 0.11 release, HiveServer2 has been available. It handles multiple concurrent clients, supports Kerberos, LDAP, and custom pluggable authentication, and provides better options for JDBC and ODBC clients, especially for metadata access.

Hive architecture

Here are some highlights of Hive that we can keep in mind moving forward:

Hive provides a simpler query model with less coding than MapReduce
HQL and SQL have similar syntax
Hive provides lots of functions that lead to easier analytics usage
The response time is typically much faster than other types of queries on the same type of huge datasets
Hive supports running on different computing frameworks
Hive supports ad hoc querying data on HDFS
Hive supports user-defined functions, scripts, and a customized I/O format to extend its functionality
Hive is scalable and extensible to various types of data and bigger datasets
Mature JDBC and ODBC drivers allow many applications to pull Hive data for seamless reporting
Hive allows users to read data in arbitrary formats, using SerDes and Input/Output formats
Hive has a well-defined architecture for metadata management, authentication, and query optimizations
There is a big community of practitioners and developers working on and using Hive

Description

If you are a data analyst, developer, or simply someone who wants to use Hive to explore and analyze data in Hadoop, this is the book for you. Whether you are new to big data or an expert, with this book, you will be able to master both the basic and the advanced features of Hive. Since Hive is an SQL-like language, some previous experience with the SQL language and databases is useful to have a better understanding of this book.

Apache Hive Essentials

Feb 2015 208 pages

Ian Stirk May 01, 2015

Hi, I have written a detailed chapter-by-chapter review of this book on www DOT i-programmer DOT info, the first and last parts of this review are given here. For my review of all chapters, search i-programmer DOT info for STIRK together with the book's title.

This book aims to introduce you to a popular platform for storing and analyzing big data on Hadoop. How does it fare?

Increasing amounts of data are being created, and there’s a need to store and process this data to gain competitive advantage. Hive is a popular platform for storing and analyzing big data on Hadoop. Hive tends to be popular because it uses a SQL-like syntax, familiar to many people. With plenty of built-in functionality, big data analysis can be done in Hive without advanced coded skills.

The book is aimed at both the beginner and the more advanced audience (data analysts, developers, and users). Some previous experience of SQL and databases is advantageous.

Chapter 1 Overview of Big Data and Hive

The chapter opens with a brief overview of the history of data processing, covering: batch, online, relational databases, and the internet. The latter has led to a massive rise in the amount of data being created, requiring new approaches to processing. This big data can be described in terms of various attributes including: volume, velocity and variety.

Big data tends to be processed on relatively cheap commodity hardware, using a distributed processing. Hadoop is a popular platform for big data processing. The chapter discusses the major components of Hadoop:

Hadoop Distributed File System (HDFS): storage system
MapReduce: computing system (distributes processing and aggregates results)
Associated components (e.g. HBase, Sqoop, Flume, Impala etc)

Having described how we arrived at big data and Hadoop, the chapter proceeds with an overview of Hive. Hive allows you to issue queries against petabytes of data, using its Hive Query Language (HQL) which is similar to SQL. Hive gives a table structure to data held in HDFS. Using Hive allows simpler data processing, compared with similar code written in Java.

This chapter provides a helpful background on how we arrived at today’s big data and Hadoop platform. An overview of Hadoop and its components is given, together with a very helpful diagram of the Hadoop ecosystem (e.g. HDFS, HBase, Sqoop, Impala, etc). A useful overview of Hive is provided, highlighting its purpose and advantages.

...

Conclusion

This book provides up-to-date detail on Hive, a very popular platform for storing and analyzing big data on Hadoop.

Most topics are explained in a very readable manner, a few sections could do with more detail (e.g. transactions). Throughout, there are helpful explanations, screenshots, practical code examples, and inter-chapter references. Some links to websites are provided for further information.

This book is especially suitable for developers and data analysts starting out with Hive. Additionally, since it also contains advanced and up-to-date material, it is also suitable for more advanced developers/analysts. If you have a background in SQL the book is even easier to understand.

There are very few books dedicated to Hive, and these tend to be out of date now (especially since Hive changes regularly). If you want an up-to-date, practical, wide-ranging review of Hive’s functionality, I highly recommend this book.

Amazon Verified review

H H C X Apr 10, 2015

This is by far the most up to date book about Hive. It's been such a long time I've been waiting for a book to cover most stable and widely used Hive version. It is well written. All topics in each chapter are carefully picked and clearly presented for the all level of readers. You really do not need much programming or big data backgrounds to learn it. By reading the book, I strongly agree to the author that Hive will be the most important and popular tools of big data ecosystem for now and future. And most people can and should start the journey of big data from learning Hive. Of course, this is really a good book to start reading.

Ryan May 05, 2015

Given the hyped data science and big data framework buzzwords, the topic this book covers is definitely relevant and important to big data practitioners. The author appears to have a long and solid experience in the industry which gave him much practical knowledge on the subject. Having quickly skimmed through the book, my first impression is the book has a broad coverage of Apache Hive, ranging from the basic setup to security, data manipulation and the detailed explanation on the grammar, complemented with relatively straightforward examples. My current feeling is, as a thin book of 200 pages, it did quite a good job.

Sumit Pal Apr 03, 2015

This is a good starting to intermediate level book for readers getting to learn Hive for the 1st time. This book is quite detailed and covers all the basics needed to get you up and running and playing around and working with Hive. The book is well organized and each chapter gets one or few concepts across. I like the chapters on performance optimizations and security.

Apache Hive Essentials: Immerse yourself on a fantastic journey to discover the attributes of big data by using Hive
