The data analysis process
When you have a good understanding of a phenomenon, it becomes possible to make predictions about it. Data analysis helps make this possible by exploring the past and building predictive models.
The data analysis process is composed of the following steps:
- The statement of the problem
- Collecting your data
- Cleaning the data
- Normalizing the data
- Transforming the data
- Exploratory statistics
- Exploratory visualization
- Predictive modeling
- Validating your model
- Visualizing and interpreting your results
- Deploying your solution
All of these activities can be grouped as is shown in the following image:
The problem
The problem definition starts with high-level business domain questions, such as how to track differences in behavior between groups of customers, or what the gold price will be next month. Understanding the objectives and requirements from a domain perspective is the key to a successful data analysis project.
Types of data analysis questions include:
- Inferential
- Predictive
- Descriptive
- Exploratory
- Causal
- Correlational
Data preparation
Data preparation is about obtaining, cleaning, normalizing, and transforming the data into an optimal dataset, while trying to avoid data quality issues such as invalid, ambiguous, out-of-range, or missing values. This process can take up a lot of time. In Chapter 11, Working with Twitter Data, we will go into more detail about working with data, using OpenRefine to address complicated tasks. Analyzing data that has not been carefully prepared can lead you to highly misleading results.
The characteristics of good data are as follows:
- Complete
- Coherent
- Unambiguous
- Countable
- Correct
- Standardized
- Non-redundant
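As a rough illustration of what cleaning, normalizing, and removing redundancy can look like in code, here is a minimal Python sketch; the field names, records, and rules are purely hypothetical, and a real project would rely on tools such as OpenRefine (covered in Chapter 11) or dedicated libraries:

```python
# Minimal data-preparation sketch: clean, normalize, and de-duplicate a small
# list of records. The fields, values, and rules are illustrative only.

raw_records = [
    {"name": "Alice", "age": "34", "income": "52000"},
    {"name": "bob",   "age": "29", "income": "48,000"},   # non-standard formatting
    {"name": "Alice", "age": "34", "income": "52000"},    # duplicate row
    {"name": "Carol", "age": "",   "income": "61000"},    # missing age
]

def clean(record):
    """Standardize strings, coerce numeric fields, and drop incomplete rows."""
    name = record["name"].strip().title()                 # standardized
    age = record["age"].strip()
    income = record["income"].replace(",", "")
    if not age or not income:                             # completeness check
        return None                                       # discard incomplete rows
    return {"name": name, "age": int(age), "income": float(income)}

cleaned = [r for r in map(clean, raw_records) if r is not None]

# Redundancy elimination: keep one record per standardized name
unique = list({r["name"]: r for r in cleaned}.values())

# Normalization: rescale income to the [0, 1] range (min-max scaling)
incomes = [r["income"] for r in unique]
lo, hi = min(incomes), max(incomes)
for r in unique:
    r["income_norm"] = 0.0 if hi == lo else (r["income"] - lo) / (hi - lo)

print(unique)
```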
Data exploration
Data exploration is essentially looking at the processed data in a graphical or statistical form and trying to find patterns, connections, and relations in the data. Visualization is used to provide overviews in which meaningful patterns may be found. In Chapter 3, Getting to Grips with Visualization, we will present a JavaScript visualization framework (D3.js) and implement some examples of how to use visualization as a data exploration tool.
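Before building any charts, a quick statistical pass over the data can already hint at relationships worth visualizing. The following is a minimal Python sketch on made-up data; the variables and the linear relationship between them are assumptions for illustration only:

```python
# Simple statistical exploration of two hypothetical numeric variables.
import random
import statistics

random.seed(0)
x = [random.gauss(50, 10) for _ in range(200)]            # synthetic variable
y = [2.0 * xi + random.gauss(0, 5) for xi in x]           # y loosely depends on x

# Basic summary statistics
print("mean(x)   =", round(statistics.mean(x), 2))
print("stdev(x)  =", round(statistics.stdev(x), 2))
print("median(y) =", round(statistics.median(y), 2))

# Pearson correlation, computed by hand, as a first hint of a linear relation
mx, my = statistics.mean(x), statistics.mean(y)
sx, sy = statistics.pstdev(x), statistics.pstdev(y)
r = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) * sx * sy)
print("Pearson r =", round(r, 3))
```

A correlation close to 1 or -1 suggests that a scatter plot of x against y is worth drawing; values near 0 point toward exploring other relationships instead.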
Predictive modeling
From the vast amount of available information, we have to extract usable hidden patterns and trends using relevant algorithms. To project the future behavior of these hidden patterns, we can use predictive modeling. Predictive modeling is a statistical technique for predicting future behavior by analyzing existing, historical data. We have to use statistical models that best capture the hidden patterns in the data.

Predictive modeling is a process used in data analysis to create or choose a statistical model that best predicts the probability of an outcome. Using predictive modeling, we can assess the future behavior of a customer, provided we have that customer's past performance data. For example, in the retail sector, predictive analysis can play an important role in improving profitability. Retailers store vast amounts of historical data, and after developing predictive models from this data, they can use the forecasts to improve promotional planning, optimize sales channels, optimize store areas, and enhance demand planning.

Initially, building predictive models requires expert judgment. Once relevant models have been built, they can be used to produce forecasts automatically. Predictive models give better forecasts when we concentrate on a careful combination of predictors, and as the amount of data increases, the predictions generally become more precise.
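As a toy illustration of forecasting from historical data, the following Python sketch fits a simple linear trend to a short, made-up price series and projects one period ahead; it is not one of the models used later in the book, just a minimal example of the principle:

```python
# Fit a linear trend (ordinary least squares) to a hypothetical monthly price
# series and forecast the next month. The numbers are invented for illustration.

prices = [1180.0, 1195.5, 1201.3, 1210.8, 1225.4, 1231.0]
months = list(range(len(prices)))
n = len(prices)

# Ordinary least squares for y = a + b * t
mean_t = sum(months) / n
mean_y = sum(prices) / n
b = sum((t - mean_t) * (y - mean_y) for t, y in zip(months, prices)) / \
    sum((t - mean_t) ** 2 for t in months)
a = mean_y - b * mean_t

next_month = n
forecast = a + b * next_month
print(f"Forecast for month {next_month}: {forecast:.2f}")
```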
In this book, we will use a variety of these models, grouped into three categories based on their outcomes:
| Model | Chapter | Algorithm |
| --- | --- | --- |
| Categorical outcome (Classification) | 4 | Naïve Bayes Classifier |
| Categorical outcome (Classification) | 11 | Natural Language Toolkit and Naïve Bayes Classifier |
| Numerical outcome (Regression) | 6 | Random walk |
| Numerical outcome (Regression) | 8 | Support vector machines |
| Numerical outcome (Regression) | 8 | Distance-based approach and k-nearest neighbor |
| Numerical outcome (Regression) | 9 | Cellular automata |
| Descriptive modeling (Clustering) | 5 | Fast Dynamic Time Warping (FDTW) + distance metrics |
| Descriptive modeling (Clustering) | 10 | Force layout and Fruchterman-Reingold layout |
Another important task in this step is evaluating the model we chose as optimal for the particular problem.
Model assumptions are important for the quality of the predictions: better predictions result from a model that satisfies its underlying assumptions. However, assumptions can never be fully met in empirical data, so evaluation should focus primarily on the validity of the predictions themselves.
The no free lunch theorem, proposed by Wolpert in 1996, states:
"No Free Lunch theorems have shown that learning algorithms cannot be universally good."
Even so, extracting valuable information from the data requires the predictive model to be accurate. There are many different tests to determine whether the predictive models we create are accurate, meaningful representations that will provide valuable information.
Model evaluation helps us ensure that our analysis is not overoptimistic or overfitted. In this book, we are going to present two different ways of validating the model:
- Cross-validation: Here, we divide the data into subsets of equal size and test the predictive model in order to estimate how it is going to perform in practice. We will implement cross-validation in order to validate the robustness of our model as well as evaluate multiple models to identify the best model based on their performance.
- Hold-out: Here, a large dataset is arbitrarily divided into three subsets: a training set, a validation set, and a test set (a minimal sketch of both strategies follows this list).
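The following Python sketch illustrates both strategies at the level of index splitting only; the sample size, split proportions, and number of folds are arbitrary choices for illustration, and in practice a library would typically provide equivalent utilities:

```python
# Minimal sketch of hold-out and k-fold cross-validation splits over indices.
import random

random.seed(42)
indices = list(range(100))            # pretend we have 100 samples
random.shuffle(indices)

# Hold-out: an arbitrary 60/20/20 split into training, validation, and test sets
train_idx = indices[:60]
valid_idx = indices[60:80]
test_idx = indices[80:]
print(len(train_idx), len(valid_idx), len(test_idx))

# k-fold cross-validation: k folds of roughly equal size; each fold is used
# once as the test set while the remaining folds form the training set
k = 5
folds = [indices[i::k] for i in range(k)]
for i, test_fold in enumerate(folds):
    train_fold = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
    # here we would fit the model on train_fold and score it on test_fold
    print(f"fold {i}: train={len(train_fold)}, test={len(test_fold)}")
```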
Visualization of results
This is the final step in our analysis process. When we present the model's results, visualization tools play an important role, and the visualized results are an important piece of our technological architecture. Since the database is the core of that architecture, various technologies and methods for visualizing the data can be employed around it.
In an exploratory data analysis process, simple visualization techniques are very useful for discovering patterns, since the human eye plays an important role. Sometimes we have to generate a three-dimensional plot to find visual patterns, but a scatter plot matrix can often reveal such patterns more clearly than a three-dimensional plot. In practice, the hypothesis of the study, the dimensionality of the feature space, and the data itself all play important roles in choosing a good visualization technique.
In this book, we will focus on univariate and multivariate graphical models, using a variety of visualization tools such as bar charts, pie charts, scatter plots, line charts, and multiple-line charts, all implemented in D3.js. We will also learn how to do standalone plotting in Python with Matplotlib.
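As a small first taste of standalone plotting, here is a minimal Matplotlib sketch that draws a bar chart and a scatter plot from made-up values; the data and figure layout are purely illustrative:

```python
# Minimal Matplotlib sketch: a bar chart and a scatter plot of hypothetical results.
import matplotlib.pyplot as plt

categories = ["A", "B", "C", "D"]
counts = [23, 45, 12, 38]                       # made-up summary values
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.3]             # made-up paired observations

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.bar(categories, counts)
ax1.set_title("Counts per category")
ax2.scatter(x, y)
ax2.set_title("x versus y")
plt.tight_layout()
plt.show()
```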