You're reading from Spark for Data Science: Analyze your data and delve deep into the world of machine learning with the latest Spark version, 2.0

Product type Paperback

Published in Sep 2016

Publisher Packt

ISBN-13 9781785885655

Length 344 pages

Edition 1st Edition

Languages

Scala

Tools

IPython

Concepts

Data Science

Authors (2):

Bikramaditya Singhal

Srinivas Duvvuri

Table of Contents (12) Chapters

Preface

1. Big Data and Data Science – An Introduction

2. The Spark Programming Model

3. Introduction to DataFrames

4. Unified Data Access

5. Data Analysis on Spark

6. Machine Learning

7. Extending Spark with SparkR

8. Analyzing Unstructured Data

9. Visualizing Big Data

10. Putting It All Together

11. Building Data Science Applications

Big data overview

Much has been said and written about what big data is, but no standard definition exists. It is, to some extent, a relative term. Whether small or big, your data is useful only if you can analyze it properly. Making sense of data requires the right analysis techniques, and selecting the right tools and techniques is of utmost importance in data analytics. However, when the data itself becomes part of the problem, and computational challenges must be addressed before any analysis can be performed, it becomes a big data problem.

A revolution in the World Wide Web, often referred to as Web 2.0, changed the way people used the Internet. Static web pages became interactive websites and started collecting more and more data. Technological advancements in cloud computing, social media, and mobile computing created an explosion of data. Every digital device started emitting data, and many other sources added to the deluge. Data flowing in from every nook and corner arrived in great variety, in great volume, and at speed! Big data formed in this way as a natural consequence of how the World Wide Web evolved; no deliberate effort was involved. That is the past. If you consider the change that is happening now, and is going to happen in the future, the volume and speed of data generation are beyond what anyone can anticipate. I say this because every device is getting smarter these days, thanks to the Internet of Things (IoT).

Trends in IT also facilitated the data explosion. Data storage experienced a paradigm shift with the advent of cheaper clusters of online storage pools and the availability of commodity hardware at minimal prices. Storing data from disparate sources in its native form in a single data lake was rapidly gaining ground over carefully designed data marts and data warehouses. Usage patterns also shifted from rigid, schema-driven, RDBMS-based approaches to schema-less, continuously available solutions built on NoSQL data stores. As a result, the rate of data creation, whether structured, semi-structured, or unstructured, started accelerating like never before.

Organizations are convinced that big data not only answers specific business questions but also opens up opportunities to discover untapped possibilities in the business and to address the associated uncertainties. So, apart from the natural data influx, organizations started devising strategies to generate more and more data in order to maintain their competitive advantage and to be future ready. An example helps here. Imagine that sensors installed on the machines in a manufacturing plant constantly emit data about the status of the machine parts, and that the company can use this data to predict when a machine is going to fail. This lets the company prevent failures or damage and avoid unplanned downtime, saving a lot of money.
