Big Data science and the role of Hadoop and Spark
Data science is all about the following two aspects:
- Extracting deep meaning from the data
- Creating data products
Extracting deep meaning from data means extracting value from it using statistical algorithms. A data product is a software system whose core functionality depends on applying statistical analysis and machine learning to data. Google AdWords and Facebook's People You May Know are two examples of data products.
A fundamental shift from data analytics to data science
A fundamental shift from data analytics to data science is due to the rising need for better predictions and creating better data products.
Let's consider an example use case that explains the difference between data analytics and data science.
Problem: A large telecoms company has multiple call centers that collect caller information and store it in databases and filesystems. The company has already implemented data analytics on the call center data, which provided the following insights:
- Service availability
- The average speed of answering, average hold time, average wait time, and average call time
- The call abandon rate
- The first call resolution rate and cost per call
- Agent occupancy
Now, the telecoms company would like to reduce customer churn, improve the customer experience, improve service quality, and cross-sell and up-sell by understanding its customers in near real time.
Solution: Analyze the customer voice. The customer voice holds deeper insights than any other information. Convert all calls to text using tools such as CMU Sphinx, scaling out on the Hadoop platform. To achieve high accuracy in call-to-text conversion, create language and acoustic models tailored to the company, and retrain them frequently as conditions change. Then perform text analytics on the transcripts using machine learning and natural language processing (NLP), combining the results with the existing data analytics metrics to produce the following:
- Top reasons for customer churn
- Customer sentiment analysis
- Customer and problem segmentation
- 360-degree view of the customer
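To make the sentiment-analysis step concrete, here is a minimal pure-Python sketch of scoring call transcripts by keyword. The keyword lists and the scoring rule are illustrative assumptions, not the production approach described above (which would use trained NLP models):

```python
# Hypothetical keyword lists -- a real solution would use a trained model.
NEGATIVE = {"cancel", "slow", "frustrated", "overcharged"}
POSITIVE = {"thanks", "great", "resolved", "helpful"}

def sentiment_score(transcript):
    """Score a transcript: positive keywords add 1, negative keywords subtract 1."""
    words = transcript.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

print(sentiment_score("the agent was helpful and my issue was resolved"))  # 2
print(sentiment_score("i want to cancel because the service is slow"))     # -2
```

Scores aggregated per customer or per call reason would feed metrics such as the top reasons for churn and customer segmentation.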
Notice that the business requirement of this use case drove a fundamental shift from data analytics to data science, implementing machine learning and NLP algorithms. Implementing this solution requires new tools and techniques, as well as a new role: the data scientist.
A data scientist combines multiple skill sets: statistics, software programming, and business expertise. Data scientists create data products and extract value from data. Let's see how data scientists differ from other roles. This will help us understand the roles and tasks performed in data science and data analytics projects.
Data scientists versus software engineers
The difference between the data scientist and software engineer roles is as follows:
- Software engineers develop general-purpose software for applications based on business requirements
- Data scientists don't develop application software, but they develop software to help them solve problems
- Typically, software engineers use Java, C++, and C# programming languages
- Data scientists tend to focus more on scripting languages such as Python and R
Data scientists versus data analysts
The difference between the data scientist and data analyst roles is as follows:
- Data analysts perform descriptive and diagnostic analytics using SQL and scripting languages to create reports and dashboards.
- Data scientists perform predictive and prescriptive analytics using statistical techniques and machine learning algorithms to find answers. They typically use tools such as Python, R, SPSS, SAS, MLlib, and GraphX.
Data scientists versus business analysts
The difference between the data scientist and business analyst roles is as follows:
- Both have a business focus, so they may ask similar questions
- Data scientists have the technical skills to find answers
A typical data science project life cycle
Let's learn how to approach and execute a typical data science project.
As shown in Figure 1.4, a typical data science project life cycle is iterative, whereas a data analytics project life cycle, shown in Figure 1.3, is not. The phases of defining problems and outcomes and communicating results sit outside the iterations that improve the project's outcomes. The overall project life cycle is iterative as well, however: after production implementation, the solution needs to be improved from time to time.
Defining problems and outcomes, and the data preprocessing phase, are similar to those of a data analytics project, as explained in Figure 1.3. So, let's discuss the new steps required for data science projects.
Hypothesis and modeling
Given the problem, consider all the possible solutions that could match the desired outcome. This typically involves a hypothesis about the root cause of the problem. So, questions around the business problem arise, such as why customers are canceling the service, why support calls are increasing significantly, and why customers are abandoning shopping carts.
A hypothesis would identify the appropriate model given a deeper understanding of the data. This involves understanding the attributes of the data and their relationships and building the environment for the modeling by defining datasets for testing, training, and production. Create the appropriate model using machine learning algorithms such as logistic regression, k-means clustering, decision trees, or Naive Bayes.
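Defining separate training and test datasets, as described above, can be sketched in a few lines of pure Python. The helper name and the 70/30 split are illustrative assumptions:

```python
import random

def train_test_split(rows, test_fraction=0.3, seed=42):
    """Shuffle the rows and split them into training and test sets."""
    rng = random.Random(seed)       # fixed seed keeps the split reproducible
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

calls = list(range(10))             # stand-in for preprocessed call records
train, test = train_test_split(calls)
print(len(train), len(test))        # 7 3
```

The model is then fitted on the training set only, so that the held-out test set gives an honest measure of how the model performs on unseen data.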
Measuring the effectiveness
Execute the identified model against the datasets and measure its effectiveness by checking the results against the desired outcome. Use test data to verify the results, and compute metrics such as Mean Squared Error (MSE) to measure effectiveness.
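MSE itself is simple to compute: it is the average of the squared differences between predicted and actual values. A minimal sketch:

```python
def mse(predictions, actuals):
    """Mean Squared Error: average squared difference between predictions and actuals."""
    return sum((p - a) ** 2 for p, a in zip(predictions, actuals)) / len(actuals)

# Illustrative values only: three predictions versus their actual outcomes.
print(mse([2.5, 0.0, 2.0], [3.0, -0.5, 2.0]))
```

A lower MSE means the model's predictions lie closer to the desired outcome; comparing MSE across model versions is one way to decide which iteration to keep.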
Making improvements
Measurements will illustrate how much improvement is required. Consider what you might change. You can ask yourself the following questions:
- Was the hypothesis around the root cause correct?
- Would ingesting additional datasets provide better results?
- Would other solutions provide better results?
Once you've implemented your improvements, test them again and compare them with the previous measurements in order to refine the solution further.
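The refine-and-compare loop can be sketched as follows. The helper names and the scores are illustrative assumptions; in practice each score would come from re-measuring the model (for example, MSE on the test set):

```python
def refine(candidates, evaluate):
    """Evaluate each candidate solution and keep the best (lowest) score seen."""
    best_name, best_score = None, float("inf")
    for name, model in candidates:
        score = evaluate(model)          # e.g. MSE on the test set; lower is better
        if score < best_score:
            best_name, best_score = name, score
    return best_name, best_score

# Toy stand-ins for fitted models: indices into a table of measured scores.
result = refine(
    [("baseline", 0), ("with_extra_dataset", 1)],
    evaluate=lambda m: [0.42, 0.31][m],
)
print(result)  # ('with_extra_dataset', 0.31)
```

Here, ingesting the additional dataset lowered the measured error, so that candidate replaces the baseline in the next iteration.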
Communicating the results
Communication of the results is an important step in the data science project life cycle. The data scientist tells the story found within the data by correlating the story to business problems. Reports and dashboards are common tools to communicate the results.
The role of Hadoop and Spark
Apache Hadoop provides you with distributed storage and resource management, while Spark provides you with in-memory performance for data science applications. Hadoop and Spark have the following advantages for data science projects:
- A wide range of applications and third-party packages
- An easy-to-use library of machine learning algorithms (MLlib)
- Spark integrations with deep learning libraries such as H2O and TensorFlow
- Scala, Python, and R for interactive analytics using the shell
- A unification feature—using SQL, machine learning, and streaming together