Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Free Learning
Arrow right icon
Arrow up icon
GO TO TOP
Spark for Data Science

You're reading from   Spark for Data Science Analyze your data and delve deep into the world of machine learning with the latest Spark version, 2.0

Arrow left icon
Product type Paperback
Published in Sep 2016
Publisher Packt
ISBN-13 9781785885655
Length 344 pages
Edition 1st Edition
Languages
Tools
Arrow right icon
Authors (2):
Arrow left icon
Bikramaditya Singhal Bikramaditya Singhal
Author Profile Icon Bikramaditya Singhal
Bikramaditya Singhal
Srinivas Duvvuri Srinivas Duvvuri
Author Profile Icon Srinivas Duvvuri
Srinivas Duvvuri
Arrow right icon
View More author details
Toc

Table of Contents (12) Chapters Close

Preface 1. Big Data and Data Science – An Introduction FREE CHAPTER 2. The Spark Programming Model 3. Introduction to DataFrames 4. Unified Data Access 5. Data Analysis on Spark 6. Machine Learning 7. Extending Spark with SparkR 8. Analyzing Unstructured Data 9. Visualizing Big Data 10. Putting It All Together 11. Building Data Science Applications

Big data overview

Much has already been spoken and written about what big data is, but there is no specific standard as such to clearly define it. It is actually a relative term to some extent. Whether small or big, your data can be leveraged only if you can analyze it properly. To make some sense out of your data, the right set of analysis techniques is needed and selecting the right tools and techniques is of utmost importance in data analytics. However, when the data itself becomes a part of the problem and the computational challenges need to be addressed prior to performing data analysis, it becomes a big data problem.

A revolution took place in the World Wide Web, also referred to as Web 2.0, which changed the way people used the Internet. Static web pages became interactive websites and started collecting more and more data. Technological advancements in cloud computing, social media, and mobile computing created an explosion of data. Every digital device started emitting data and many other sources started driving the data deluge. The dataflow from every nook and corner generated varieties of voluminous data, at speed! The formation of big data in this fashion was a natural phenomenon, because this is how the World Wide Web had evolved and no explicit efforts were involved in specifics. This is about the past! If you consider the change that is happening now, and is going to happen in future, the volume and speed of data generation is beyond what one can anticipate. I am propelled to make such a statement because every device is getting smarter these days, thanks to the Internet of Things (IoT).

The IT trend was such that the technological advancements also facilitated the data explosion. Data storage had experienced a paradigm shift with the advent of cheaper clusters of online storage pools and the availability of commodity hardware with bare minimal price. Storing data from disparate sources in its native form in a single data lake was rapidly gaining over carefully designed data marts and data warehouses. Usage patterns also shifted from rigid schema-driven, RDBMS-based approaches to schema-less, continuously available NoSQL data-store-driven solutions. As a result, the rate of data creation, whether structured, semi-structured, or unstructured, started accelerating like never before.

Organizations are very much convinced that not only can specific business questions be answered by leveraging big data; it also brings in opportunities to cover the uncovered possibilities in businesses and address the uncertainties associated with this. So, apart from the natural data influx, organizations started devising strategies to generate more and more data to maintain their competitive advantages and to be future ready. Here, an example would help to understand this better. Imagine sensors are installed on the machines of a manufacturing plant which are constantly emitting data, and hence the status of the machine parts, and a company is able to predict when the machine is going to fail. It lets the company prevent a failure or damage and avoid unplanned downtime, saving a lot of money.

You have been reading a chapter from
Spark for Data Science
Published in: Sep 2016
Publisher: Packt
ISBN-13: 9781785885655
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $19.99/month. Cancel anytime
Banner background image