The data analysis process

When you have a good understanding of a phenomenon, it is possible to make predictions about it. Data analysis helps make this possible by exploring the past and building predictive models.

The data analysis process is composed of the following steps:

  • The statement of the problem
  • Collecting your data
  • Cleaning the data
  • Normalizing the data
  • Transforming the data
  • Exploratory statistics
  • Exploratory visualization
  • Predictive modeling
  • Validating your model
  • Visualizing and interpreting your results
  • Deploying your solution

All of these activities can be grouped as shown in the following figure:

Figure: The data analysis process

The problem

The problem definition starts with high-level business domain questions, such as how to track differences in behavior between groups of customers, or what the gold price will be next month. Understanding the objectives and requirements from a domain perspective is key to a successful data analysis project.

Types of data analysis questions include:

  • Inferential
  • Predictive
  • Descriptive
  • Exploratory
  • Causal
  • Correlational

Data preparation

Data preparation is about obtaining, cleaning, normalizing, and transforming the data into an optimal dataset, while trying to avoid any possible data quality issues such as invalid, ambiguous, out-of-range, or missing values. This process can take up a lot of time. In Chapter 11, Working with Twitter Data, we will go into more detail about working with data, using OpenRefine to address complicated tasks. Analyzing data that has not been carefully prepared can lead to highly misleading results.
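As a minimal illustration of these steps, the following sketch uses pandas; the file name and column names are hypothetical, chosen only to show the pattern of cleaning, normalizing, and transforming a dataset:

```python
import pandas as pd

# Hypothetical input file and column names, for illustration only.
df = pd.read_csv("customers.csv")

# Cleaning: drop exact duplicates and rows with missing key fields.
df = df.drop_duplicates()
df = df.dropna(subset=["customer_id", "age"])

# Out-of-range values: keep only plausible ages.
df = df[(df["age"] >= 0) & (df["age"] <= 120)]

# Normalizing: scale a numeric column to the [0, 1] range.
income = df["income"]
df["income_norm"] = (income - income.min()) / (income.max() - income.min())

# Transforming: derive a categorical feature from a numeric one.
df["age_group"] = pd.cut(df["age"], bins=[0, 18, 35, 60, 120],
                         labels=["minor", "young", "adult", "senior"])
```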

The characteristics of good data are as follows:

  • Complete
  • Coherent
  • Unambiguous
  • Countable
  • Correct
  • Standardized
  • Non-redundant

Data exploration

Data exploration is essentially looking at the processed data in a graphical or statistical form and trying to find patterns, connections, and relations in the data. Visualization is used to provide overviews in which meaningful patterns may be found. In Chapter 3, Getting to Grips with Visualization, we will present a JavaScript visualization framework (D3.js) and implement some examples of how to use visualization as a data exploration tool.
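Before reaching the D3.js examples of Chapter 3, a quick exploratory look can also be taken with Matplotlib, which the book later uses for standalone plotting. The sketch below uses synthetic data purely for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data standing in for a processed dataset (illustration only).
rng = np.random.default_rng(42)
x = rng.normal(loc=50, scale=10, size=500)
y = 0.8 * x + rng.normal(scale=5, size=500)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# A histogram gives a quick overview of a single variable's distribution.
ax1.hist(x, bins=30, color="steelblue")
ax1.set_title("Distribution of x")

# A scatter plot reveals relationships between two variables.
ax2.scatter(x, y, s=10, alpha=0.5)
ax2.set_title("x vs. y")

plt.tight_layout()
plt.show()
```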

Predictive modeling

From the galaxy of information available, we have to extract usable hidden patterns and trends using relevant algorithms. To anticipate the future behavior of these hidden patterns, we can use predictive modeling. Predictive modeling is a statistical technique for predicting future behavior by analyzing existing information, that is, historical data. We have to use appropriate statistical models that best forecast the hidden patterns in the data.

Predictive modeling is a process used in data analysis to create or choose a statistical model that best predicts the probability of an outcome. Using predictive modeling, we can assess the future behavior of a customer; for this, we require that customer's past performance data. For example, in the retail sector, predictive analysis can play an important role in achieving better profitability. Retailers store galaxies of historical data, and after developing predictive models from this data, they can produce forecasts to improve promotional planning, optimize sales channels, optimize store areas, and enhance demand planning.

Initially, building predictive models requires expert judgment. After building relevant predictive models, we can use them automatically for forecasts. Predictive models give better forecasts when we concentrate on a careful combination of predictors; in fact, as the data size increases, we get more precise prediction results.
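As a small illustration of the idea (not one of the book's own implementations), the following sketch fits a scikit-learn Naïve Bayes classifier on synthetic "historical" data and uses it to estimate the probability of an outcome for new cases:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

# Synthetic "historical" data: two numeric predictors and a binary outcome.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

# Fit a model on past observations...
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = GaussianNB().fit(X_train, y_train)

# ...and use it to estimate the probability of an outcome for new cases.
print("Predicted probabilities:", model.predict_proba(X_test[:3]))
print("Accuracy on held-out data:", model.score(X_test, y_test))
```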

In this book we will use a variety of those models, and we can group them into three categories based on their outcomes:

Model                                  Chapter  Algorithm
Categorical outcome (Classification)   4        Naïve Bayes Classifier
                                       11       Natural Language Toolkit and Naïve Bayes Classifier
Numerical outcome (Regression)         6        Random walk
                                       8        Support vector machines
                                       8        Distance-based approach and k-nearest neighbor
                                       9        Cellular automata
Descriptive modeling (Clustering)      5        Fast Dynamic Time Warping (FDTW) + distance metrics
                                       10       Force layout and Fruchterman-Reingold layout

Another important task we need to accomplish in this step is evaluating the model we chose as optimal for the particular problem.

Model assumptions are important for the quality of the predictions. Better predictions will result from a model that satisfies its underlying assumptions. However, assumptions can never be fully met in empirical data, so evaluation should preferably focus on the validity of the predictions: the stronger the evidence of predictive validity, the more confidence we can place in the model.

The No Free Lunch theorem, proposed by Wolpert in 1996, states:

"No Free Lunch theorems have shown that learning algorithms cannot be universally good".

However, extracting valuable information from the data means the predictive model should be accurate. There are many different tests to determine whether the predictive models we create are accurate, meaningful representations that will yield valuable information.

Model evaluation helps us to ensure that our analysis is not overoptimistic or overfitted. In this book, we are going to present two different ways of validating the model (both are sketched in code after the following list):

  • Cross-validation: Here, we divide the data into subsets of equal size and test the predictive model on each subset in order to estimate how it will perform in practice. We will implement cross-validation to validate the robustness of our model, as well as to evaluate multiple models and identify the best one based on their performance.
  • Hold-out: Here, a large dataset is randomly divided into three subsets: a training set, a validation set, and a test set.
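The following sketch illustrates both approaches with scikit-learn on synthetic data; the model and the split proportions are arbitrary choices for the example, not the book's exact setup:

```python
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic data, for illustration only.
rng = np.random.default_rng(1)
X = rng.normal(size=(600, 2))
y = (X[:, 0] - X[:, 1] > 0).astype(int)

model = GaussianNB()

# Cross-validation: k subsets of equal size, each used once as a test fold.
scores = cross_val_score(model, X, y, cv=5)
print("5-fold accuracies:", scores, "mean:", scores.mean())

# Hold-out: split the data into training, validation, and test sets.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=1)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=1)

model.fit(X_train, y_train)
print("Validation accuracy:", model.score(X_val, y_val))
print("Test accuracy:", model.score(X_test, y_test))
```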

Visualization of results

This is the final step in our analysis process. When we present the model's output, visualization tools can play an important role. The visualization of results is an important piece of our technological architecture; since the database is the core of our architecture, various technologies and methods for visualizing data can be employed.

In an exploratory data analysis process, simple visualization techniques are very useful for discovering patterns, since the human eye plays an important role. Sometimes, we have to generate a three-dimensional plot to find a visual pattern, but for better visual patterns we can also use a scatter plot matrix instead of a three-dimensional plot. In practice, the hypothesis of the study, the dimensionality of the feature space, and the data all play important roles in choosing a good visualization technique.
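For instance, a scatter plot matrix can be drawn with pandas and Matplotlib; the data below is synthetic and only serves to show every pairwise relationship of a small feature space on one page:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

# Synthetic three-dimensional feature space (illustration only).
rng = np.random.default_rng(7)
df = pd.DataFrame({"a": rng.normal(size=300)})
df["b"] = 0.6 * df["a"] + rng.normal(scale=0.5, size=300)
df["c"] = rng.normal(size=300)

# Every pairwise relationship at a glance, instead of a 3D plot.
scatter_matrix(df, figsize=(6, 6), diagonal="hist", alpha=0.6)
plt.show()
```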

In this book, we will focus on univariate and multivariate graphical models, using a variety of visualization tools such as bar charts, pie charts, scatter plots, line charts, and multiple line charts, all implemented in D3.js. We will also learn how to use standalone plotting in Python with Matplotlib.
