The data analysis process

When you have a good understanding of a phenomenon, it is possible to make predictions about it. Data analysis helps make this possible by exploring the past and building predictive models.

The data analysis process is composed of the following steps:

  • The statement of the problem
  • Collecting your data
  • Cleaning the data
  • Normalizing the data
  • Transforming the data
  • Exploratory statistics
  • Exploratory visualization
  • Predictive modeling
  • Validating your model
  • Visualizing and interpreting your results
  • Deploying your solution

All of these activities can be grouped as shown in the following figure:

Figure: The data analysis process

The problem

The problem definition starts with high-level business domain questions, such as how to track differences in behavior between groups of customers, or what the gold price will be next month. Understanding the objectives and requirements from a domain perspective is key to a successful data analysis project.

Types of data analysis questions include:

  • Inferential
  • Predictive
  • Descriptive
  • Exploratory
  • Causal
  • Correlational

Data preparation

Data preparation is about obtaining, cleaning, normalizing, and transforming the data into an optimal dataset, while trying to avoid any possible data quality issues such as invalid, ambiguous, out-of-range, or missing values. This process can take up a lot of time. In Chapter 11, Working with Twitter Data, we will go into more detail about working with data, using OpenRefine to address complicated tasks. Analyzing data that has not been carefully prepared can lead to highly misleading results.
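As a minimal illustration of these steps, the following sketch uses pandas; the file name and column names are hypothetical, chosen only to show the pattern of cleaning, normalizing, and transforming a dataset:

```python
import pandas as pd

# Hypothetical input file and column names, for illustration only.
df = pd.read_csv("customers.csv")

# Cleaning: drop exact duplicates and rows with missing key fields.
df = df.drop_duplicates()
df = df.dropna(subset=["customer_id", "age"])

# Out-of-range values: keep only plausible ages.
df = df[(df["age"] >= 0) & (df["age"] <= 120)]

# Normalizing: scale a numeric column to the [0, 1] range.
income = df["income"]
df["income_norm"] = (income - income.min()) / (income.max() - income.min())

# Transforming: derive a categorical feature from a numeric one.
df["age_group"] = pd.cut(df["age"], bins=[0, 18, 35, 60, 120],
                         labels=["minor", "young", "adult", "senior"])
```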

The characteristics of good data are as follows:

  • Complete
  • Coherent
  • Unambiguous
  • Countable
  • Correct
  • Standardized
  • Non-redundant

Data exploration

Data exploration is essentially looking at the processed data in a graphical or statistical form and trying to find patterns, connections, and relations in the data. Visualization is used to provide overviews in which meaningful patterns may be found. In Chapter 3, Getting to Grips with Visualization, we will present a JavaScript visualization framework (D3.js) and implement some examples of how to use visualization as a data exploration tool.
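Before reaching the D3.js examples of Chapter 3, a quick exploratory look can also be taken with Matplotlib, which the book later uses for standalone plotting. The sketch below uses synthetic data purely for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data standing in for a processed dataset (illustration only).
rng = np.random.default_rng(42)
x = rng.normal(loc=50, scale=10, size=500)
y = 0.8 * x + rng.normal(scale=5, size=500)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# A histogram gives a quick overview of a single variable's distribution.
ax1.hist(x, bins=30, color="steelblue")
ax1.set_title("Distribution of x")

# A scatter plot reveals relationships between two variables.
ax2.scatter(x, y, s=10, alpha=0.5)
ax2.set_title("x vs. y")

plt.tight_layout()
plt.show()
```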

Predictive modeling

From the galaxy of information available, we have to extract usable hidden patterns and trends using relevant algorithms. To anticipate the future behavior of these hidden patterns, we can use predictive modeling. Predictive modeling is a statistical technique for predicting future behavior by analyzing existing information, that is, historical data. We have to use appropriate statistical models that best forecast the hidden patterns in the data.

Predictive modeling is a process used in data analysis to create or choose a statistical model that best predicts the probability of an outcome. Using predictive modeling, we can assess the future behavior of a customer; for this, we require that customer's past performance data. For example, in the retail sector, predictive analysis can play an important role in achieving better profitability. Retailers store galaxies of historical data, and after developing predictive models from this data, they can produce forecasts to improve promotional planning, optimize sales channels, optimize store areas, and enhance demand planning.

Initially, building predictive models requires expert judgment. After building relevant predictive models, we can use them automatically for forecasts. Predictive models give better forecasts when we concentrate on a careful combination of predictors; in fact, as the data size increases, we get more precise prediction results.
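As a small illustration of the idea (not one of the book's own implementations), the following sketch fits a scikit-learn Naïve Bayes classifier on synthetic "historical" data and uses it to estimate the probability of an outcome for new cases:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

# Synthetic "historical" data: two numeric predictors and a binary outcome.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

# Fit a model on past observations...
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = GaussianNB().fit(X_train, y_train)

# ...and use it to estimate the probability of an outcome for new cases.
print("Predicted probabilities:", model.predict_proba(X_test[:3]))
print("Accuracy on held-out data:", model.score(X_test, y_test))
```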

In this book we will use a variety of those models, and we can group them into three categories based on their outcomes:

Model                                  Chapter  Algorithm
Categorical outcome (Classification)   4        Naïve Bayes Classifier
                                       11       Natural Language Toolkit and Naïve Bayes Classifier
Numerical outcome (Regression)         6        Random walk
                                       8        Support vector machines
                                       8        Distance-based approach and k-nearest neighbor
                                       9        Cellular automata
Descriptive modeling (Clustering)      5        Fast Dynamic Time Warping (FDTW) + distance metrics
                                       10       Force layout and Fruchterman-Reingold layout

Another important task we need to accomplish in this step is evaluating the model we chose as optimal for the particular problem.

Model assumptions are important for the quality of the predictions. Better predictions will result from a model that satisfies its underlying assumptions. However, assumptions can never be fully met in empirical data, so evaluation should preferably focus on the validity of the predictions: the stronger the evidence of predictive validity, the more confidence we can place in the model.

The No Free Lunch theorem, proposed by Wolpert in 1996, states:

"No Free Lunch theorems have shown that learning algorithms cannot be universally good".

However, extracting valuable information from the data means the predictive model should be accurate. There are many different tests to determine whether the predictive models we create are accurate, meaningful representations that will yield valuable information.

Model evaluation helps us to ensure that our analysis is not overoptimistic or overfitted. In this book, we are going to present two different ways of validating the model (both are sketched in code after the following list):

  • Cross-validation: Here, we divide the data into subsets of equal size and test the predictive model on each subset in order to estimate how it will perform in practice. We will implement cross-validation to validate the robustness of our model, as well as to evaluate multiple models and identify the best one based on their performance.
  • Hold-out: Here, a large dataset is randomly divided into three subsets: a training set, a validation set, and a test set.
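The following sketch illustrates both approaches with scikit-learn on synthetic data; the model and the split proportions are arbitrary choices for the example, not the book's exact setup:

```python
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic data, for illustration only.
rng = np.random.default_rng(1)
X = rng.normal(size=(600, 2))
y = (X[:, 0] - X[:, 1] > 0).astype(int)

model = GaussianNB()

# Cross-validation: k subsets of equal size, each used once as a test fold.
scores = cross_val_score(model, X, y, cv=5)
print("5-fold accuracies:", scores, "mean:", scores.mean())

# Hold-out: split the data into training, validation, and test sets.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=1)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=1)

model.fit(X_train, y_train)
print("Validation accuracy:", model.score(X_val, y_val))
print("Test accuracy:", model.score(X_test, y_test))
```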

Visualization of results

This is the final step in our analysis process. When we present the model's output, visualization tools can play an important role. The visualization of results is an important piece of our technological architecture; since the database is the core of our architecture, various technologies and methods for visualizing data can be employed.

In an exploratory data analysis process, simple visualization techniques are very useful for discovering patterns, since the human eye plays an important role. Sometimes, we have to generate a three-dimensional plot to find a visual pattern, but for better visual patterns we can also use a scatter plot matrix instead of a three-dimensional plot. In practice, the hypothesis of the study, the dimensionality of the feature space, and the data all play important roles in choosing a good visualization technique.
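For instance, a scatter plot matrix can be drawn with pandas and Matplotlib; the data below is synthetic and only serves to show every pairwise relationship of a small feature space on one page:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

# Synthetic three-dimensional feature space (illustration only).
rng = np.random.default_rng(7)
df = pd.DataFrame({"a": rng.normal(size=300)})
df["b"] = 0.6 * df["a"] + rng.normal(scale=0.5, size=300)
df["c"] = rng.normal(size=300)

# Every pairwise relationship at a glance, instead of a 3D plot.
scatter_matrix(df, figsize=(6, 6), diagonal="hist", alpha=0.6)
plt.show()
```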

In this book, we will focus on univariate and multivariate graphical models, using a variety of visualization tools such as bar charts, pie charts, scatter plots, line charts, and multiple line charts, all implemented in D3.js. We will also learn how to use standalone plotting in Python with Matplotlib.
