Practical Data Analysis: Pandas, MongoDB, Apache Spark, and more, Second Edition

Chapter 1. Getting Started

Data analysis is the process in which raw data is ordered and organized so that it can be used in methods that help to evaluate and explain the past and predict the future. Data analysis is not just about numbers; it is about asking questions, developing explanations, and testing hypotheses based on logical and analytical methods. It is a multidisciplinary field that combines computer science, artificial intelligence, machine learning, statistics, mathematics, and domain knowledge, as shown in the following figure:

[Figure: Data analysis as a multidisciplinary field]

All of these skills are important for gaining a good understanding of the problem and its optimal solutions, so let's define those fields.

Computer science

Computer science creates the tools for data analysis. The vast amount of data generated has made computational analysis critical and has increased the demand for skills like programming, database administration, network administration, and high-performance computing. Some programming experience in Python (or any high-level programming language) is needed to follow the chapters in this book.

Artificial intelligence

According to Stuart Russell and Peter Norvig:

"Artificial intelligence has to do with smart programs, so let's get on and write some".

In other words, Artificial intelligence (AI) studies algorithms that can simulate intelligent behavior. In data analysis, we use AI to perform activities that require intelligence, like inference, similarity search, or unsupervised classification. Fields like deep learning rely on artificial intelligence algorithms; some of their current uses are chatbots, recommendation engines, image classification, and so on.

Machine learning

Machine learning (ML) is the study of computer algorithms that learn how to react in a certain situation or recognize patterns. According to Arthur Samuel (1959):

"Machine Learning is a field of study that gives computers the ability to learn without being explicitly programmed".

ML comprises a large number of algorithms, generally split into three groups depending on how they are trained. They are as follows:

  • Supervised learning
  • Unsupervised learning
  • Reinforcement learning

A number of these algorithms are used throughout the book, combined with practical examples that lead the reader through the process from the initial data problem to its programming solution.

Statistics

In January 2009, Google's Chief Economist Hal Varian said:

"I keep saying the sexy job in the next ten years will be statisticians. People think I'm joking, but who would've guessed that computer engineers would've been the sexy job of the 1990s?"

Statistics is the development and application of methods to collect, analyze, and interpret data. Data analysis encompasses a variety of statistical techniques such as simulation, Bayesian methods, forecasting, regression, time-series analysis, and clustering.

Mathematics

Data analysis makes use of many mathematical techniques in its algorithms, such as linear algebra (vectors and matrices, factorization, eigenvalues), numerical methods, and conditional probability. In this book, all the chapters are self-contained and include the necessary math involved.

Knowledge domain

One of the most important activities in data analysis is asking questions, and a good understanding of the knowledge domain can give you the expertise and intuition needed to ask good questions. Data analysis is used in almost every domain, including finance, administration, business, social media, government, and science.

Data, information, and knowledge

Data is the facts of the world. A datum represents a fact or statement of an event without relation to other things. Data comes in many forms, such as web pages, sensors, devices, audio, video, networks, log files, social media, transactional applications, and much more. Most of this data is generated in real time and on a very large scale. Although it is generally alphanumeric (text, numbers, and symbols), it can consist of images or sound. Data consists of raw facts and figures, and it has no meaning until it is processed. For example, financial transactions, age, temperature, and the number of steps from my house to my office are simply numbers. Information appears when we work with those numbers and can find value and meaning.

Information can be considered an aggregation of data. Information usually has some meaning and purpose, and it can make decisions easier. After processing the data within a context, we get information with its proper meaning. In computer jargon, a relational database makes information from the data stored within it.

Knowledge is information with meaning. Knowledge happens only when human experience and insight are applied to data and information. We can talk about knowledge when the data and the information turn into a set of rules that assist decisions. In fact, we can't store knowledge, because it implies the theoretical or practical understanding of a subject. The ultimate purpose of knowledge is value creation.

Inter-relationship between data, information, and knowledge

We can observe that the relationship between data, information, and knowledge is cyclical. The following diagram demonstrates the relationships between them: the transformation of data into information and vice versa, and similarly between information and knowledge. If we apply valuable information based on context and purpose, it becomes knowledge; at the same time, processed and analyzed data gives us information. When looking at the transformation of data into information and of information into knowledge, we should concentrate on the context, purpose, and relevance of the task.

[Figure: The cyclical relationship between data, information, and knowledge]

Now I would like to discuss these relationships with a real-life example:

Our students conducted a survey for their project, with the purpose of collecting data related to customer satisfaction with a product and evaluating the effect of reducing the price of that product. As it was a real project, our students got to make the final decision on how to satisfy the customers. The data collected by the survey was processed and a final report was prepared, and based on the project report, the manufacturer of that product has since reduced its price. Let's take a look at the following:

  • Data: Facts from the survey.
    • For example: The number of customers who purchased the product, satisfaction levels, competitor information, and so on.

  • Information: The project report.
    • For example: The satisfaction level related to price, based on the competitor's product.

  • Knowledge: The manufacturer learned what to do to improve customer satisfaction and increase product sales.
    • For example: The manufacturing cost of the product, transportation cost, quality of the product, and so on.

Finally, we can say that the data-information-knowledge hierarchy may seem like an abstraction; however, by using predictive analytics we can simulate intelligent behavior and provide a good approximation of it. The following image shows an example of how to turn data into knowledge:

[Figure: An example of turning data into knowledge]

The nature of data

Data is the plural of datum, so it is always treated as plural. We can find data in all situations of the world around us: structured or unstructured, continuous or discrete, in weather records, stock market logs, photo albums, music playlists, or our Twitter accounts. In fact, data can be seen as the essential raw material for any kind of human activity. According to the Oxford English Dictionary, data are

"known facts or things used as basis for inference or reckoning".

As shown in the following image, we can see data in two distinct ways, categorical and numerical:

[Figure: The two kinds of data, categorical and numerical]

Categorical data are values or observations that can be sorted into groups or categories. There are two types of categorical values, nominal and ordinal. A nominal variable has no intrinsic ordering to its categories. For example, housing is a categorical variable with two categories (own and rent). An ordinal variable has an established ordering. For example, age as a variable with three ordered categories (young, adult, and elder).

Numerical data are values or observations that can be measured. There are two kinds of numerical values, discrete and continuous. Discrete data are values or observations that can be counted and are distinct and separate, for example, the number of lines in a piece of code. Continuous data are values or observations that may take on any value within a finite or infinite interval, for example, an economic time series like historic gold prices.
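
To make these distinctions concrete, here is a minimal sketch using pandas (one of the Python libraries installed later in this book); the variables and values are invented for illustration:

    import pandas as pd

    # Nominal: categories with no intrinsic ordering
    housing = pd.Series(["own", "rent", "own"], dtype="category")

    # Ordinal: categories with an established ordering
    age_group = pd.Categorical(["young", "elder", "adult"],
                               categories=["young", "adult", "elder"],
                               ordered=True)

    # Discrete numerical: countable, distinct values
    lines_of_code = pd.Series([120, 87, 42])

    # Continuous numerical: any value within an interval
    gold_prices = pd.Series([1315.25, 1317.80, 1309.45])

    print(housing.cat.categories)  # no order implied: ['own', 'rent']
    print(age_group.min())         # order-aware: 'young'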

The kinds of datasets used in this book are the following:

  • E-mails (unstructured, discrete)
  • Digital images (unstructured, discrete)
  • Stock market logs (structured, continuous)
  • Historic gold prices (structured, continuous)
  • Credit approval records (structured, discrete)
  • Social media friends relationships (unstructured, discrete)
  • Tweets and trending topics (unstructured, continuous)
  • Sales records (structured, continuous)

For each of the projects in this book, we try to use a different kind of data, aiming to give the reader the ability to address different kinds of data problems.

The data analysis process

When you have a good understanding of a phenomenon it is possible to make predictions about it. Data analysis helps us to make this possible through exploring the past and creating predictive models.

The data analysis process is composed of the following steps:

  • The statement of the problem
  • Collecting your data
  • Cleaning the data
  • Normalizing the data
  • Transforming the data
  • Exploratory statistics
  • Exploratory visualization
  • Predictive modeling
  • Validating your model
  • Visualizing and interpreting your results
  • Deploying your solution

All of these activities can be grouped as is shown in the following image:

[Figure: The data analysis process grouped into its main activities]

The problem

The problem definition starts with high-level business domain questions, such as how to track differences in behavior between groups of customers, or what the gold price will be next month. Understanding the objectives and requirements from a domain perspective is the key to a successful data analysis project.

Types of data analysis questions include:

  • Inferential
  • Predictive
  • Descriptive
  • Exploratory
  • Causal
  • Correlational

Data preparation

Data preparation is about how to obtain, clean, normalize, and transform the data into an optimal dataset, trying to avoid any possible data quality issues such as invalid, ambiguous, out-of-range, or missing values. This process can take up a lot of time. In Chapter 11, Working with Twitter Data, we will go into more detail about working with data, using OpenRefine to address complicated tasks. Analyzing data that has not been carefully prepared can lead you to highly misleading results.
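
As a hedged sketch of what these steps look like in practice, the following pandas snippet (the column names, valid range, and imputation choice are invented for illustration) standardizes a text field, filters out-of-range values, imputes missing ones, and normalizes a column:

    import pandas as pd

    # Hypothetical raw survey data with typical quality issues
    raw = pd.DataFrame({
        "age":  [23, None, 41, 190],           # missing and out-of-range values
        "city": ["NY ", "ny", "Boston", "NY"], # inconsistent formatting
    })

    df = raw.copy()
    # Clean: standardize an ambiguous text field
    df["city"] = df["city"].str.strip().str.upper()
    # Clean: drop values outside a plausible range, keeping missing ones for now
    df = df[df["age"].between(0, 120) | df["age"].isna()]
    # Clean: impute missing values with the median
    df["age"] = df["age"].fillna(df["age"].median())
    # Normalize: rescale age to the [0, 1] interval
    df["age_norm"] = (df["age"] - df["age"].min()) / (df["age"].max() - df["age"].min())
    print(df)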

The characteristics of good data are as follows:

  • Complete
  • Coherent
  • Ambiguity elimination
  • Countable
  • Correct
  • Standardized
  • Redundancy elimination

Data exploration

Data exploration is essentially looking at the processed data in a graphical or statistical form and trying to find patterns, connections, and relations in the data. Visualization is used to provide overviews in which meaningful patterns may be found. In Chapter 3, Getting to Grips with Visualization, we will present a JavaScript visualization framework (D3.js) and implement some examples of how to use visualization as a data exploration tool.
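
For example, a quick exploratory plot can be produced in Python with Matplotlib (which we also use later for standalone plotting); the price series below is synthetic:

    import numpy as np
    import matplotlib.pyplot as plt

    # Synthetic data: a noisy upward price trend over 100 days
    days = np.arange(100)
    prices = 1200 + 0.8 * days + np.random.normal(0, 5, size=100)

    # A scatter plot is often the fastest way to spot trends,
    # connections, and outliers in processed data
    plt.scatter(days, prices, s=10)
    plt.xlabel("Day")
    plt.ylabel("Price (USD)")
    plt.title("Exploring a price series for patterns")
    plt.show()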

Predictive modeling

From the galaxy of information, we have to extract usable hidden patterns and trends using relevant algorithms, and to extrapolate the future behavior of these hidden patterns, we can use predictive modeling. Predictive modeling is a statistical technique used to predict future behavior by analyzing existing information, that is, historical data. We have to use proper statistical models that best forecast the hidden patterns of the data or information.

Predictive modeling is a process used in data analysis to create or choose a statistical model to try to best predict the probability of an outcome. Using predictive modeling, we can assess the future behavior of the customer. For this, we require past performance data of that customer. For example, in the retail sector, predictive analysis can play an important role in getting better profitability. Retailers can store galaxies of historical data. After developing different predicting models using this data, we can forecast to improve promotional planning, optimize sales channels, optimize store areas, and enhance demand planning.

Initially, building predictive models requires expert input. After building the relevant predictive models, we can use them automatically for forecasts. Predictive models give better forecasts when we concentrate on a careful combination of predictors, and in general, as the data size increases, we get more precise prediction results.
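
As a minimal, hedged illustration of the idea (the monthly sales figures here are invented), we can fit a simple linear model to historical data with NumPy and use it to forecast the next period:

    import numpy as np

    # Hypothetical historical data: 24 months of sales
    months = np.arange(24)
    sales = 100 + 5 * months + np.random.normal(0, 10, size=24)

    # Fit a simple linear model to the past...
    slope, intercept = np.polyfit(months, sales, deg=1)

    # ...and use it to predict future behavior
    next_month = 24
    forecast = slope * next_month + intercept
    print(f"Forecast for month {next_month}: {forecast:.1f}")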

In this book we will use a variety of those models, and we can group them into three categories based on their outcomes:

Model                                  Chapter   Algorithm
Categorical outcome (Classification)   4         Naïve Bayes Classifier
                                       11        Natural Language Toolkit and Naïve Bayes Classifier
Numerical outcome (Regression)         6         Random walk
                                       8         Support vector machines
                                       8         Distance-based approach and k-nearest neighbor
                                       9         Cellular automata
Descriptive modeling (Clustering)      5         Fast Dynamic Time Warping (FDTW) + distance metrics
                                       10        Force layout and Fruchterman-Reingold layout

Another important task we need to accomplish in this step is evaluating the model we chose, to confirm that it is optimal for the particular problem.

Model assumptions are important for the quality of the predictive model: better predictions will result from a model that satisfies its underlying assumptions. However, assumptions can never be fully met in empirical data, so evaluation preferably focuses on the validity of the predictions themselves, for which the evidence is usually stronger.

The no free lunch theorem proposed by Wolpert in 1996 states:

"No Free Lunch theorems have shown that learning algorithms cannot be universally good".

But extracting valuable information from the data requires the predictive model to be accurate. There are many different tests to determine whether the predictive models we create are accurate, meaningful representations that will yield valuable information.

Model evaluation helps us to ensure that our analysis is not overoptimistic or overfitted. In this book, we are going to present two different ways of validating a model:

  • Cross-validation: Here, we divide the data into subsets of equal size and test the predictive model on each subset in order to estimate how it is going to perform in practice. We will implement cross-validation in order to validate the robustness of our models, as well as to evaluate multiple models and identify the best one based on their performance (see the sketch after this list).
  • Hold-out: Here, a large dataset is arbitrarily divided into three subsets: a training set, a validation set, and a test set.
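
The following is a minimal sketch of k-fold cross-validation using only NumPy; the dataset is random and the model calls are left as placeholder comments, since the concrete models appear in later chapters:

    import numpy as np

    def k_fold_indices(n_samples, k, seed=0):
        """Shuffle the sample indices and split them into k folds."""
        rng = np.random.default_rng(seed)
        return np.array_split(rng.permutation(n_samples), k)

    # Hypothetical dataset: 100 samples with 3 features, binary labels
    X = np.random.rand(100, 3)
    y = np.random.randint(0, 2, size=100)

    folds = k_fold_indices(len(X), k=5)
    for i, test_idx in enumerate(folds):
        # Every fold takes one turn as the test set; the rest is training data
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        X_train, y_train = X[train_idx], y[train_idx]
        X_test, y_test = X[test_idx], y[test_idx]
        # model.fit(X_train, y_train)
        # scores.append(model.score(X_test, y_test))
        print(f"Fold {i}: {len(train_idx)} training, {len(test_idx)} test samples")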

Visualization of results

This is the final step in our analysis process. When we present the model's output results, visualization tools can play an important role. The visualized results are an important piece of our technological architecture, and since the database is the core of that architecture, various technologies and methods for the visualization of data can be employed.

In an exploratory data analysis process, simple visualization techniques are very useful for discovering patterns, since the human eye plays an important role. Sometimes, we have to generate a three-dimensional plot to find a visual pattern; but for getting better visual patterns, we can also use a scatter plot matrix instead of a three-dimensional plot. In practice, the hypothesis of the study, the dimensionality of the feature space, and the data itself all play important roles in choosing a good visualization technique.

In this book, we will focus on univariate and multivariate graphical models, using a variety of visualization tools like bar charts, pie charts, scatter plots, line charts, and multiple line charts, all implemented in D3.js. We will also learn how to use standalone plotting in Python with Matplotlib.

Quantitative versus qualitative data analysis

Quantitative data are numerical measurements expressed in terms of numbers.

Qualitative data are categorical measurements expressed in terms of natural language descriptions.

As is shown in the following image, we can observe the differences between quantitative and qualitative analysis:

[Figure: The differences between quantitative and qualitative analysis]

Quantitative analytics involves analysis of numerical data. The type of the analysis will depend on the level of measurement. There are four kinds of measurements:

  • Nominal data has no logical order and is used as classification data.
  • Ordinal data has a logical order and differences between values are not constant.
  • Interval data is continuous and has a logical order, with standardized differences between values, but no natural zero.
  • Ratio data is continuous, with a logical order and regular intervals between values, and may include a natural zero.

Qualitative analysis can explore the complexity and meaning of social phenomena. Data for qualitative study may include written texts (for example, documents or e-mail) and/or audible and visual data (digital images or sounds). In Chapter 11, Working with Twitter Data, we will present a sentiment analysis from Twitter data as an example of qualitative analysis.

Importance of data visualization

The goal of data visualization is to expose something new about the underlying patterns and relationships contained within the data. A visualization needs to be not only beautiful but also meaningful, in order to help organizations make better decisions. Visualization is an easy way to jump into a complex dataset (small or big) to describe and explore the data efficiently. Many kinds of data visualization are available, such as bar charts, histograms, line charts, pie charts, heat maps, frequency Wordles (as shown in the following image), and so on, for one, two, or many variables, in one, two, or even three dimensions:

[Figure: Examples of data visualization, including a frequency Wordle]

Data visualization is an important part of our data analysis process, because it is a fast and easy way to perform exploratory data analysis by summarizing the data's main characteristics with a visual graph.

The goals of exploratory data analysis are as follows:

  • Detection of data errors
  • Checking of assumptions
  • Finding hidden patterns (such as trends)
  • Preliminary selection of appropriate models
  • Determining relationships between the variables

We will go into more detail about data visualization and exploratory data analysis in Chapter 3, Getting to Grips with Visualization.

What about big data?

Big data is a term used when the data exceeds the processing capacity of a typical database. The integration of computer technology into science and daily life has enabled the collection of massive volumes of data, such as climate data, website transaction logs, customer data, and credit card records. However, such big datasets cannot practically be managed on a single commodity computer, because their sizes are too large to fit in memory or they take too much time to process. To overcome this obstacle, one may have to resort to parallel and distributed architectures, with multicore and cloud computing platforms providing access to hundreds or thousands of processors. For the storage and manipulation of big data, parallel and distributed architectures offer new capabilities.

Today, big data is a reality: the variety, volume, and velocity of data coming from the Web, sensors, devices, audio, video, networks, log files, social media, and transactional applications reach exceptional levels. This phenomenal growth, which has hit the business, government, and science sectors, means that we must not only understand big data in order to interpret the information that truly counts, but also understand the possibilities of big data analytics.

There are three main features of big data:

  • Volume: Large amounts of data
  • Variety: Different types of structured, unstructured, and multistructured data
  • Velocity: Needs to be analyzed quickly

As is shown in the following image, we can see the interaction between these three Vs:

[Figure: The interaction between the three Vs of big data]

We need big data analytics when data grows fast and we need to uncover hidden patterns, unknown correlations, and other useful information that can be used to make better decisions. With big data analytics, data scientists and others can analyze huge volumes of data that conventional analytics and business intelligence solutions cannot handle, in order to transform business decisions for the future. Big data analytics is a workflow that distills terabytes of low-value data down to a much smaller amount of high-value data.

Big data is an opportunity for any company to take advantage of data aggregation, data exhaust, and metadata. This makes big data a useful business analytics tool, although there is a common misunderstanding about what big data actually is.

The most common architecture for big data processing is MapReduce, a programming model for processing large datasets in parallel using a distributed cluster.

Apache Hadoop is the most popular implementation of MapReduce, and it is used to solve large-scale distributed data storage, analysis, and retrieval tasks. However, MapReduce is just one of three classes of technologies that store and manage big data. The other two classes are NoSQL and Massively Parallel Processing (MPP) data stores. In this book we will implement MapReduce functions and NoSQL storage through MongoDB in Chapter 12, Data Processing and Aggregation with MongoDB, and Chapter 13, Working with MapReduce.
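
To make the programming model concrete, here is a hedged, single-machine sketch of MapReduce in plain Python, counting words in a tiny invented dataset (a real cluster would distribute the map, shuffle, and reduce phases across many machines):

    from collections import defaultdict

    documents = ["big data is big", "data needs analysis"]

    # Map phase: emit (key, value) pairs from each input record
    mapped = [(word, 1) for doc in documents for word in doc.split()]

    # Shuffle phase: group the emitted values by key
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)

    # Reduce phase: combine the values for each key
    counts = {word: sum(values) for word, values in groups.items()}
    print(counts)  # {'big': 2, 'data': 2, 'is': 1, 'needs': 1, 'analysis': 1}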

MongoDB provides us with document-oriented storage, high availability, and flexible map/reduce aggregation for data processing.

A paper published by IEEE in 2009, The Unreasonable Effectiveness of Data, says the following:

"But invariably, simple models and a lot of data trump over more elaborate models based on less data."

This is a fundamental idea in big data (you can find the full paper at http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/35179.pdf). The trouble with real-world data is that the probability of finding false correlations is high, and it gets higher as the datasets grow. That's why, in this book, we will focus on meaningful data instead of big data.

One of the main challenges of big data is how to store, protect, back up, organize, and catalog data at a petabyte scale. Another main challenge is the concept of data ubiquity: with the proliferation of smart devices carrying several sensors and cameras, the amount of data available for each person increases every minute, and big data systems must be able to process all of it in real time:

[Figure: Big data and data ubiquity]

Quantified self

Quantified self is self-knowledge through self-tracking with technology. In this approach, one can collect data on one's own daily activities in terms of inputs, states, and performance. For example, inputs mean food consumption or the quality of the surrounding air, states mean mood or blood pressure, and performance means mental or physical condition. To collect these data, we can use wearable sensors and lifelogging. The quantified self process allows individuals to quantify biometrics that they never knew existed, as well as making data collection cheaper and more convenient: one can track one's insulin and cortisol levels and even sequence one's DNA. Using quantified self data, one can keep watch over one's overall health, diet, and level of physical activity.

These days, the use of wearable self-tracking gadgets is increasing rapidly. If we pool the quantified self data of a specific group of people, we can apply predictive algorithms to this data to diagnose patients in that location. This means quantified self data can be very useful in certain medical contexts.

In the following screenshot, we can see some electronic gadgets that gather quantitative data:

[Figure: Electronic gadgets that gather quantified self data]

Sensors and cameras

Interaction with the outside world is highly important in data analysis. Using sensors like Radio-Frequency Identification (RFID) tags, or a smartphone to scan a Quick Response (QR) code, are easy ways of interacting directly with the customer, making recommendations, and analyzing consumer trends.

On the other hand, people are using their smartphones all the time, using their cameras as a tool. In Chapter 5, Similarity-Based Image Retrieval, we will use these digital images to perform a search by image. This can be used, for example, in face recognition or for finding recommendations of a restaurant just by taking a picture of the front door.

This interaction with the real world can give you a competitive advantage and a real-time data source directly from the customer.

Social network analysis

Nowadays, the Internet brings people together in many ways through social media; for example, Facebook, Twitter, LinkedIn, and so on. Using these social networks, users work, play, and socialize online, demonstrating new forms of collaboration and more. Social networks play a crucial role in reshaping business models and opening up numerous possibilities for studying human interaction and collective behavior.

In fact, if we intend to identify key individuals in social systems, we can generate models using analytical techniques on social network data and extract the information mentioned previously. This process is called Social Network Analysis (SNA).

Formally, SNA is the analysis of social relationships in terms of network theory, with nodes representing individuals and ties representing relationships between the individuals. Social networks create groups of related individuals (friendships) based on different aspects of their interaction, from which we can find out important information such as hobbies (for product recommendation) or who holds the most influential opinion in a group (centrality). In Chapter 10, Working with Social Graphs, we will present a project, Who is your closest friend?, and show a solution for Twitter clustering.
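
As a tiny, hedged illustration of nodes, ties, and centrality (the friendship graph here is invented), the following plain-Python sketch computes degree centrality to find the most connected individual:

    # A toy social graph: each node maps to the set of nodes it is tied to
    graph = {
        "Alice": {"Bob", "Carol", "Dave"},
        "Bob":   {"Alice", "Carol"},
        "Carol": {"Alice", "Bob"},
        "Dave":  {"Alice"},
    }

    # Degree centrality: a node's number of ties divided by the
    # maximum possible number of ties (n - 1)
    n = len(graph)
    centrality = {node: len(ties) / (n - 1) for node, ties in graph.items()}

    most_influential = max(centrality, key=centrality.get)
    print(centrality)        # Alice scores 1.0: she is tied to everyone
    print(most_influential)  # 'Alice'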

Social networks are strongly connected, and these connections are often asymmetric. This makes SNA computationally expensive, so it needs to be addressed with high-performance solutions that are less statistical and more algorithmic. The visualization of a social network can help us gain good insight into how people are connected. The exploration of a graph is done by displaying nodes and ties in various colors, sizes, and distributions. D3.js has animation capabilities that enable us to visualize a social graph with interactive animations, which help us to simulate behaviors like information diffusion or the distance between nodes.

Facebook processes more than 500 TB of data daily (images, text, video, likes, and relationships), and this amount of data needs non-conventional treatment like NoSQL databases and MapReduce frameworks. In this book, we will work with MongoDB, a document-based NoSQL database, which also has great functions for aggregations and MapReduce processing.

Tools and toys for this book

The main goal of this book is to provide the reader with self-contained projects that are ready to deploy, and in order to do this, as you go through the book we will use and implement tools such as Python, D3, and MongoDB. These tools will help you to program and deploy the projects. You can also download all the code from the author's GitHub repository:

https://github.com/hmcuesta

You can see a detailed installation and setup process of all the tools in Appendix, Setting Up the Infrastructure.

Why Python?

Python is a "scripting language" - an interpreted language with its own built-in memory management and good facilities for calling and co-operating with other programs. There are two popular versions, 2.7 or 3.x, and in this book we will be focusing on the 3.x version, because this is under active development and has already seen over two years of stable releases.

Python is multi-platform: it runs on Windows, Linux/Unix, and Mac OS X, and has been ported to the Java and .NET virtual machines. Python has a powerful standard library and a wealth of third-party packages for numerical computation and machine learning, such as NumPy, SciPy, pandas, SciKit, mlpy, and so on.

Python is excellent for beginners, yet great for experts, is highly scalable, and is also suitable for large projects as well as small ones. It is also easily extensible and object-oriented.

Python is widely used by organizations like Google, Yahoo! Maps, NASA, Red Hat, Raspberry Pi, IBM, and many more.

http://wiki.python.org/moin/OrganizationsUsingPython

Python has excellent documentation and examples:

http://docs.python.org/3/

The latest Python software is available for free, even for commercial products, and can be downloaded from here:

http://python.org/

Why mlpy?

mlpy (Machine Learning Python) is a module built on top of NumPy, SciPy, and the GNU Scientific Library. It is open source and supports Python 3.x. mlpy has a large number of machine learning algorithms for supervised and unsupervised problems.

Some of the features of mlpy that will be used in this book are as follows:

  • Regression: Support Vector Machines (SVM)
  • Classification: SVM, k-nearest-neighbor (k-NN), classification tree
  • Clustering: k-means, multidimensional scaling
  • Dimensionality Reduction: Principal Component Analysis (PCA)
  • Misc: Dynamic Time Warping (DTW) distance

We can download the latest version of mlpy from: http://mlpy.sourceforge.net/

Reference: D. Albanese, R. Visintainer, S. Merler, S. Riccadonna, G. Jurman, C. Furlanello. mlpy: Machine Learning Python, 2012: http://arxiv.org/abs/1202.6548.

Why D3.js?

D3.js (Data-Driven Documents) was developed by Mike Bostock. D3 is a JavaScript library for visualizing data and manipulating the Document Object Model (DOM), and it runs in a browser without a plugin. In D3.js you can manipulate all the elements of the DOM, and it is as flexible as the client-side web technology stack (HTML, CSS, and SVG).

D3.js supports large datasets and includes animation capabilities that make it a really good choice for web visualization.

D3 has excellent documentation, examples, and an active community.

We can download the latest version of D3.js from:

https://d3js.org/

Why MongoDB?

NoSQL is a term that covers different types of data storage technologies, used when you can't fit your business model into a classical relational data model. NoSQL is mainly used in Web 2.0 and social media applications.

MongoDB is a document-based database. This means that MongoDB stores and organizes data as a collection of documents, which gives you the possibility to store your view models almost exactly as you model them in the application. You can also perform complex searches for data and elementary data mining with MapReduce.

MongoDB is highly scalable, robust, and works perfectly with JavaScript-based web applications because you can store your data in a JSON document and implement a flexible schema, which makes it perfect for unstructured data.
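
As a hedged sketch of this document model with the pymongo driver (it assumes a MongoDB server running locally; the database, collection, and field names are invented):

    from pymongo import MongoClient

    # Connect to a locally running MongoDB instance
    client = MongoClient("localhost", 27017)
    tweets = client["test_db"]["tweets"]

    # Documents are flexible, JSON-like dictionaries; no fixed schema is needed
    tweets.insert_one({"user": "alice", "text": "practical data analysis", "likes": 3})

    # Query by field...
    print(tweets.find_one({"user": "alice"})["text"])

    # ...or run an aggregation, for example summing likes per user
    pipeline = [{"$group": {"_id": "$user", "total_likes": {"$sum": "$likes"}}}]
    for row in tweets.aggregate(pipeline):
        print(row)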

MongoDB is used by well-known corporations like Foursquare, Craigslist, Firebase, SAP, and Forbes; we can see a detailed list of users at:

https://www.mongodb.com/industries

MongoDB has a big and active community, as well as well-written documentation:

http://docs.mongodb.org/manual/

MongoDB is easy to learn and it's free. We can download MongoDB from here:

http://www.mongodb.org/downloads

Summary

In this chapter, we presented an overview of the data analysis ecosystem, explaining the basic concepts of the data analysis process and tools, as well as giving some insight into the practical applications of data analysis. We also provided an overview of the different kinds of data, numerical and categorical, and got into the nature of data: structured (databases, logs, and reports) and unstructured (image collections, social networks, and text mining). Then, we introduced the importance of data visualization and how a fine visualization can help us with exploratory data analysis. Finally, we explored some of the concepts of big data, the quantified self, and social network analysis.

In the next chapter we will look at the cleaning, processing, and transforming of data using Python and OpenRefine.

Key benefits

  • Learn to use various data analysis tools and algorithms to classify, cluster, visualize, simulate, and forecast your data
  • Apply Machine Learning algorithms to different kinds of data such as social networks, time series, and images
  • A hands-on guide to understanding the nature of data and how to turn it into insight

Description

Beyond buzzwords like Big Data or Data Science, there are great opportunities to innovate in many businesses by using data analysis to create data-driven products. Data analysis involves asking many questions about data in order to discover insights and generate value for a product or a service. This book explains the basic data algorithms without the theoretical jargon, and you'll get hands-on experience turning data into insights using machine learning techniques. We will perform data-driven innovation processing for several types of data, such as text, images, social network graphs, documents, and time series, showing you how to implement large data processing with MongoDB and Apache Spark.

Who is this book for?

This book is for developers who want to implement data analysis and data-driven algorithms in a practical way. It is also suitable for those without a background in data analysis or data processing. Basic knowledge of Python programming, statistics, and linear algebra is assumed.

What you will learn

  • Acquire, format, and visualize your data
  • Build an image-similarity search engine
  • Generate meaningful visualizations anyone can understand
  • Get started with analyzing social network graphs
  • Find out how to implement sentiment text analysis
  • Install data analysis tools such as Pandas, MongoDB, and Apache Spark
  • Get to grips with Apache Spark
  • Implement machine learning algorithms such as classification or forecasting

Product Details

Publication date: Sep 30, 2016
Length: 338 pages
Edition: 2nd
Language: English
ISBN-13: 9781785286667

Table of Contents

15 Chapters
1. Getting Started
2. Preprocessing Data
3. Getting to Grips with Visualization
4. Text Classification
5. Similarity-Based Image Retrieval
6. Simulation of Stock Prices
7. Predicting Gold Prices
8. Working with Support Vector Machines
9. Modeling Infectious Diseases with Cellular Automata
10. Working with Social Graphs
11. Working with Twitter Data
12. Data Processing and Aggregation with MongoDB
13. Working with MapReduce
14. Online Data Analysis with Jupyter and Wakari
15. Understanding Data Processing using Apache Spark
