You're reading from Spark for Data Science Analyze your data and delve deep into the world of machine learning with the latest Spark version, 2.0

Product type Paperback

Published in Sep 2016

Publisher Packt

ISBN-13 9781785885655

Length 344 pages

Edition 1st Edition

Languages

Scala

Tools

IPython

Concepts

Data Science

Authors (2):

Bikramaditya Singhal

Srinivas Duvvuri


Table of Contents (12) Chapters

Preface

1. Big Data and Data Science – An Introduction

2. The Spark Programming Model

3. Introduction to DataFrames

4. Unified Data Access

5. Data Analysis on Spark

6. Machine Learning

7. Extending Spark with SparkR

8. Analyzing Unstructured Data

9. Visualizing Big Data

10. Putting It All Together

11. Building Data Science Applications

Challenges with big data analytics

There are broadly two formidable challenges in the analysis of big data. The first is the need for a massive computation platform; once that is in place, the second is to analyze and make sense of huge volumes of data at scale.

Computational challenges

As data volumes increased, so did the storage requirements, and data management became a cumbersome task. The latency of disk access, caused by seek time, became the major bottleneck, even though processor speed and RAM frequency were up to the mark.

Fetching structured and unstructured data from across the gamut of business applications and data silos, consolidating it, and processing it to find useful business insights was challenging. Only a few applications could address even one area, or at most a few areas, of these diverse business requirements. However, integrating those applications to address most of the business requirements in a unified way only increased the complexity.

To address these challenges, people turned to distributed computing frameworks with distributed file systems, for example, Hadoop and the Hadoop Distributed File System (HDFS). This could greatly reduce the latency due to disk I/O, as the data could be read in parallel across the cluster of machines.
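The idea of reading data in parallel can be illustrated on a single machine. The following is a minimal sketch, not how Hadoop or HDFS is implemented: it splits one local file into fixed-size byte ranges and scans them concurrently with threads. The block size, the function names, and the sample file are all assumptions made for illustration; a real cluster spreads much larger blocks across many machines and disks.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 64 * 1024  # hypothetical block size; HDFS blocks are far larger


def count_newlines_in_block(path, offset, length):
    """Read one byte range of the file and count the newline bytes in it."""
    with open(path, "rb") as handle:
        handle.seek(offset)
        return handle.read(length).count(b"\n")


def parallel_line_count(path, block_size=BLOCK_SIZE, workers=4):
    """Count lines by scanning fixed-size blocks concurrently and summing."""
    size = os.path.getsize(path)
    offsets = range(0, size, block_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        counts = pool.map(
            lambda offset: count_newlines_in_block(path, offset, block_size),
            offsets,
        )
        return sum(counts)


def make_sample_file(num_lines=50_000):
    """Write a throwaway file of numbered records and return its path."""
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as handle:
        for i in range(num_lines):
            handle.write(f"record-{i}\n")
        return handle.name


sample_path = make_sample_file()
total_lines = parallel_line_count(sample_path)
os.remove(sample_path)
print(total_lines)
```

Each block is processed independently and only the small per-block counts are combined at the end, which is the property that lets the same work be spread across separate machines.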

Distributed computing technologies had existed for decades, but gained prominence only after the industry realized the importance of big data. Technology platforms such as Hadoop with HDFS, or Amazon S3, then became the industry standard. On top of Hadoop, many other solutions such as Pig, Hive, and Sqoop were developed to address different kinds of industry requirements such as storage, Extract, Transform, and Load (ETL), and data integration, making Hadoop a unified platform.

Analytical challenges

Analyzing data to find hidden insights has always been challenging, and huge datasets bring additional intricacies. The traditional BI and OLAP solutions could not address most of the challenges that arose due to big data. As an example, if a dataset had many dimensions, say 100, it became really difficult to compare these variables with one another to draw a conclusion, because there would be 100C2 = 4,950 pairwise combinations. Such cases required statistical techniques such as correlation to find the hidden patterns.
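The pair count above can be checked with a small sketch. It builds 100 synthetic variables (random numbers, purely an assumption for illustration), computes the Pearson correlation for every pair, and reports the strongest one. Variable 1 is deliberately constructed from variable 0 so that there is one hidden pattern to find; none of the names or data come from the book.

```python
import math
import random
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


random.seed(7)
NUM_VARIABLES = 100
NUM_OBSERVATIONS = 50

# Synthetic data: each variable is a column of random observations.
variables = [
    [random.gauss(0, 1) for _ in range(NUM_OBSERVATIONS)]
    for _ in range(NUM_VARIABLES)
]
# Plant one relationship: variable 1 is almost a multiple of variable 0.
variables[1] = [2 * x + random.gauss(0, 0.01) for x in variables[0]]

num_pairs = math.comb(NUM_VARIABLES, 2)  # 100C2

# Correlate every unordered pair of variables.
correlations = {
    (i, j): pearson(variables[i], variables[j])
    for i, j in combinations(range(NUM_VARIABLES), 2)
}

strongest_pair = max(correlations, key=lambda pair: abs(correlations[pair]))
print(num_pairs, strongest_pair)
```

Even at this small scale there are thousands of coefficients to inspect, which is why a ranking or threshold on the correlation values is needed rather than a manual comparison.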

Though there were statistical solutions to many problems, it was really difficult for data scientists or analytics professionals to slice and dice the data to find intelligent insights unless they loaded the entire dataset into a DataFrame in memory. The major roadblock was that most of the general-purpose algorithms for statistical analysis and machine learning were single-threaded and written at a time when datasets were usually small enough to fit in the RAM of a single computer. Those algorithms, written in R or Python, were no longer very useful in their native form in a distributed computing environment, because they assumed that all of the data could be held in the memory of one machine.

To address this challenge, statisticians and computer scientists had to work together to rewrite most of the algorithms so that they would work well in a distributed computing environment. Consequently, a machine learning library called Mahout was developed on Hadoop for parallel processing. It included most of the algorithms commonly used in the industry. Similar initiatives were taken for other distributed computing frameworks.
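What "rewriting an algorithm for a distributed environment" means can be shown with a very small case. The sketch below, which is an illustration and not Mahout's code, computes the mean and population variance of a dataset without ever holding the whole dataset in one place: each partition is reduced to a small summary, and the summaries are then merged. The partitioning scheme and all function names are assumptions. The sum-of-squares formula is used for brevity; it is numerically naive, and production libraries generally prefer more stable update formulas.

```python
import math
import statistics
from functools import reduce


def summarize(partition):
    """Map step: reduce one partition to (count, sum, sum of squares)."""
    return (
        len(partition),
        sum(partition),
        sum(value * value for value in partition),
    )


def merge(left, right):
    """Reduce step: combine two partition summaries element-wise."""
    return tuple(a + b for a, b in zip(left, right))


def finalize(summary):
    """Turn the merged summary into (mean, population variance)."""
    count, total, total_sq = summary
    mean = total / count
    variance = total_sq / count - mean * mean
    return mean, variance


data = [float(i % 17) for i in range(1000)]

# Pretend each slice lives on a different machine in the cluster.
NUM_PARTITIONS = 4
partitions = [data[i::NUM_PARTITIONS] for i in range(NUM_PARTITIONS)]

summaries = [summarize(partition) for partition in partitions]
dist_mean, dist_variance = finalize(reduce(merge, summaries))
print(dist_mean, dist_variance)
```

Because `merge` is associative and commutative, the summaries can be combined in any order and on any machine, and only three numbers per partition ever need to cross the network.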
