Introducing big data

Big data is not simply a large volume of data; here, the word big refers to the broad scope of the data. A well-known saying in this domain describes big data with three words starting with the letter V: volume, velocity, and variety. However, the analytics and data science world has seen data vary along other dimensions beyond the fundamental three Vs, such as veracity, variability, volatility, visualization, and value. These Vs are explained as follows:

  • Volume: This refers to the amount of data generated every second. 90% of the world's data today has been created in the last two years, and the volume of data in the world now doubles roughly every two years. Such large volumes of data are mainly generated by machines, networks, social media, and sensors, and include structured, semi-structured, and unstructured data.
  • Velocity: This refers to the speed at which data is generated, stored, analyzed, and moved around. With the availability of internet-connected devices, wireless and wired machines and sensors can pass on their data as soon as it is created. This enables real-time data streaming and helps businesses make fast, valuable decisions.
  • Variety: This refers to the different formats of data. Data used to be stored in .txt, .csv, and .dat files from sources such as filesystems, spreadsheets, and databases. Data that resides in a fixed field within a record or file is called structured data. Nowadays, data is not always in this traditional structured format; newer semi-structured and unstructured forms of data are generated through channels such as email, photos, audio, video, PDFs, and SMS, as well as formats we have yet to encounter. This variety of formats creates problems for storing and analyzing data, and it is one of the major challenges we need to overcome in the big data domain (the first sketch after this list shows how Hive can lay a schema over semi-structured data).
  • Veracity: This refers to the quality of the data, such as its trustworthiness and the biases, noise, and abnormalities in it. Dirty data is quite common; it can be caused by typos, missing or nonstandard abbreviations, data reprocessing, and system failures, among other things. Ignoring such bad data can lead to inaccurate analysis and, eventually, wrong decisions. Therefore, auditing and correcting the data to ensure it is accurate is very important in big data analysis (the second sketch after this list expresses such an audit as a query).
  • Variability: This refers to the changing nature of data, meaning that the same data can have different meanings in different contexts. This is particularly important when carrying out sentiment analysis, where the analysis algorithms must be able to understand the context and discover the exact meaning and value of the data in that context.
  • Volatility: This refers to how long the data remains valid and how long it should be stored. This is particularly important for real-time analysis, which requires a target time window for the data to be determined so that analysts can focus on particular questions and get good performance from the analysis.
  • Visualization: This refers to the means of making data well understood. Visualization is not only about ordinary graphs or pie charts; it also makes vast amounts of data comprehensible in a multidimensional view that is easy to grasp. Visualization is an innovative way to show changes in data, and it requires a great deal of interaction, conversation, and joint effort between big data analysts and business-domain experts to make the visualization meaningful.
  • Value: This refers to the knowledge gained from analyzing big data. The value of big data lies in how organizations turn themselves into big-data-driven companies and use the insights from big data analysis in their decision-making.
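To make variety more concrete in Hive's terms, the following is a minimal sketch of laying a schema over semi-structured JSON data. The user_events table, its columns, and the HDFS location are hypothetical examples; the JsonSerDe class ships with Hive's hive-hcatalog-core library:

    -- Hypothetical external table over JSON files; each line in
    -- /data/events/json is assumed to be one JSON object.
    CREATE EXTERNAL TABLE user_events (
      user_id    BIGINT,
      event_type STRING,
      payload    MAP<STRING, STRING>   -- loosely structured attributes
    )
    ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
    STORED AS TEXTFILE
    LOCATION '/data/events/json';

    -- The semi-structured payload can then be queried alongside the
    -- structured columns with ordinary HiveQL.
    SELECT user_id, event_type, payload['device']
    FROM user_events
    LIMIT 10;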
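Similarly, the kind of data audit described under veracity can be expressed as a plain HiveQL query, restricted to a recent time window as suggested under volatility. The web_logs table and its columns here are again hypothetical:

    -- Count missing and malformed fields in a hypothetical web_logs table,
    -- looking only at the last seven days of data.
    SELECT
      COUNT(*)                                         AS total_rows,
      SUM(CASE WHEN user_id IS NULL THEN 1 ELSE 0 END) AS missing_user_id,
      SUM(CASE WHEN status_code RLIKE '^[1-5][0-9]{2}$'
               THEN 0 ELSE 1 END)                      AS bad_status_code
    FROM web_logs
    WHERE log_date >= date_sub(current_date, 7);  -- log_date assumed DATE-typed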

In summary, big data is not just about having lots of data; it is a practice of discovering new insights from existing data and guiding the analysis of new data. A big-data-driven business is more agile and competitive, better able to overcome challenges and outperform its competitors.
