Practical Data Analysis: Pandas, MongoDB, Apache Spark, and more, Second Edition

Authors: Dr. Sampath Kumar, Cuesta
Paperback Sep 2016 338 pages 2nd Edition

Chapter 1. Getting Started

Data analysis is the process in which raw data is ordered and organized so that it can be used in methods that help to evaluate and explain the past and predict the future. Data analysis is not about the numbers; it is about asking questions, developing explanations, and testing hypotheses based on logical and analytical methods. Data analysis is a multidisciplinary field that combines computer science, artificial intelligence, machine learning, statistics, mathematics, and business domain knowledge, as shown in the following figure:

[Figure: Getting Started]

All of these skills are important for gaining a good understanding of the problem and its optimal solutions, so let's define those fields.

Computer science

Computer science creates the tools for data analysis. The vast amount of data generated has made computational analysis critical and has increased the demand for skills like programming, database administration, network administration, and high-performance computing. Some programming experience in Python (or any high-level programming language) is needed to follow the chapters in this book.

Artificial intelligence

According to Stuart Russell and Peter Norvig:

"Artificial intelligence has to do with smart programs, so let's get on and write some".

In other words, artificial intelligence (AI) studies algorithms that can simulate intelligent behavior. In data analysis, we use AI to perform activities that require intelligence, such as inference, similarity search, or unsupervised classification. Fields like deep learning rely on AI algorithms; current uses include chatbots, recommendation engines, and image classification.

Machine learning

Machine learning (ML) is the study of computer algorithms that learn how to react in a certain situation or how to recognize patterns. According to Arthur Samuel (1959):

"Machine Learning is a field of study that gives computers the ability to learn without being explicitly programmed".

ML has a large number of algorithms, generally split into three groups depending on how the algorithms are trained. They are as follows:

  • Supervised learning
  • Unsupervised learning
  • Reinforcement learning

A number of relevant algorithms are used throughout the book, combined with practical examples that lead the reader through the process from the initial data problem to its programming solution.

Statistics

In January 2009, Google's Chief Economist Hal Varian said:

"I keep saying the sexy job in the next ten years will be statisticians. People think I'm joking, but who would've guessed that computer engineers would've been the sexy job of the 1990s?"

Statistics is the development and application of methods to collect, analyze, and interpret data. Data analysis encompasses a variety of statistical techniques such as simulation, Bayesian methods, forecasting, regression, time-series analysis, and clustering.

Mathematics

Data analysis algorithms make use of many mathematical techniques, such as linear algebra (vectors and matrices, factorization, eigenvalues), numerical methods, and conditional probability. In this book, all the chapters are self-contained and include the necessary math involved.

Knowledge domain

One of the most important activities in data analysis is asking questions, and a good understanding of the knowledge domain can give you the expertise and intuition needed to ask good questions. Data analysis is used in almost every domain, including finance, administration, business, social media, government, and science.

Data, information, and knowledge

Data are facts about the world. A piece of data represents a fact or a statement of an event without relation to other things. Data comes in many forms, such as web pages, sensors, devices, audio, video, networks, log files, social media, transactional applications, and much more. Most of this data is generated in real time and on a very large scale. Although data is generally alphanumeric (text, numbers, and symbols), it can also consist of images or sound. Data consists of raw facts and figures, and it has no meaning until it is processed. For example, financial transactions, age, temperature, and the number of steps from my house to my office are simply numbers. Information appears when we work with those numbers and find value and meaning in them.

Information can be considered an aggregation of data. Information usually has some meaning and purpose, and it can help us make decisions more easily. After processing the data, we place the resulting information within a context in order to give it proper meaning. In computer jargon, a relational database makes information from the data stored within it.

Knowledge is information with meaning. Knowledge arises only when human experience and insight are applied to data and information. We can talk about knowledge when the data and the information turn into a set of rules that assist decisions. In fact, we can't store knowledge, because it implies the theoretical or practical understanding of a subject. The ultimate purpose of knowledge is value creation.

Inter-relationship between data, information, and knowledge

The relationship between data, information, and knowledge is cyclical, as the following diagram shows. The diagram also illustrates the transformation of data into information and back, and likewise between information and knowledge. Processed and analyzed data gives us information, and information applied in light of context and purpose reflects knowledge. When looking at the transformation of data into information and of information into knowledge, we should concentrate on the context, purpose, and relevance of the task.

[Figure: Inter-relationship between data, information, and knowledge]

Now I would like to discuss these relationships with a real-life example:

Our students conducted a survey for their project in order to collect data on customer satisfaction with a product and to assess the case for reducing its price. As it was a real project, our students got to make the final decision to satisfy the customers. The data collected by the survey was processed and a final report was prepared. Based on the project report, the manufacturer of that product has since reduced the cost. Let's take a look at the following:

  • Data: Facts from the survey.
    • For example: Number of customers who purchased the product, satisfaction levels, competitor information, and so on.

  • Information: Project report.
    • For example: Satisfaction level related to price based on the competitor product.

  • Knowledge: The manufacturer learned what to do to improve customer satisfaction and increase product sales.
    • For example: The manufacturing cost of the product, transportation cost, quality of the product, and so on.
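The survey example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the field names (`price_opinion`, `satisfaction`), the helper functions, and the threshold are all hypothetical and do not come from the students' project. Raw records play the role of data, the per-group average is the information, and the decision rule stands in for knowledge.

```python
from statistics import mean

# Data: raw survey facts (hypothetical values, for illustration only).
survey = [
    {"customer": "c1", "price_opinion": "too high", "satisfaction": 2},
    {"customer": "c2", "price_opinion": "too high", "satisfaction": 3},
    {"customer": "c3", "price_opinion": "fair", "satisfaction": 4},
    {"customer": "c4", "price_opinion": "fair", "satisfaction": 5},
]


def summarize(records):
    """Information: average satisfaction for each price opinion."""
    groups = {}
    for record in records:
        groups.setdefault(record["price_opinion"], []).append(record["satisfaction"])
    return {opinion: mean(scores) for opinion, scores in groups.items()}


def recommend(report, threshold=3.0):
    """Knowledge: a decision rule applied to the information (assumed rule)."""
    if report.get("too high", threshold) < threshold:
        return "consider reducing the price"
    return "keep the current price"


report = summarize(survey)
decision = recommend(report)
print(report)
print(decision)
```

The rule inside `recommend` is where human experience enters: the code can apply a rule, but someone with domain insight has to choose it.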

Finally, we can say that the data-information-knowledge hierarchy is a useful way of thinking about analysis. Moreover, by using predictive analytics we can simulate intelligent behavior and provide a good approximation of knowledge. The following image shows an example of how to turn data into knowledge:

[Figure: Inter-relationship between data, information, and knowledge]

The nature of data

The word data is the plural of datum, so it is formally treated as plural. We can find data in every situation in the world around us, structured or unstructured, continuous or discrete: in weather records, stock market logs, photo albums, music playlists, or our Twitter accounts. In fact, data can be seen as the essential raw material of any kind of human activity. According to the Oxford English Dictionary, data are

"known facts or things used as basis for inference or reckoning".

As shown in the following image, we can view data in two distinct ways, categorical and numerical:

[Figure: The nature of data]

Categorical data are values or observations that can be sorted into groups or categories. There are two types of categorical values, nominal and ordinal. A nominal variable has no intrinsic ordering to its categories. For example, housing is a categorical variable with two categories (own and rent). An ordinal variable has an established ordering. For example, age can be a variable with three ordered categories (young, adult, and elder).

Numerical data are values or observations that can be measured. There are two kinds of numerical values, discrete and continuous. Discrete data are values or observations that can be counted and are distinct and separate, for example, the number of lines in a program. Continuous data are values or observations that may take on any value within a finite or infinite interval, for example, an economic time series like historic gold prices.
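As a rough sketch of how these four kinds of values might look in Python, the following example uses only the standard library. The variable names and sample values are hypothetical; the `AgeGroup` enumeration is one possible way to encode an ordinal variable, not a convention taken from this book.

```python
from enum import IntEnum


class AgeGroup(IntEnum):
    """Ordinal categories: the integer values encode the ordering."""
    YOUNG = 1
    ADULT = 2
    ELDER = 3


# Nominal: categories with no intrinsic order; counting is meaningful.
housing = ["own", "rent", "own"]
housing_counts = {category: housing.count(category) for category in set(housing)}

# Ordinal: categories that can be sorted but not averaged meaningfully.
ages = [AgeGroup.ADULT, AgeGroup.YOUNG, AgeGroup.ELDER]
sorted_ages = sorted(ages)

# Discrete: countable, separate values.
lines_of_code = [120, 45, 300]
total_lines = sum(lines_of_code)

# Continuous: any value within an interval (hypothetical prices).
gold_prices = [1250.75, 1248.10, 1253.40]
average_price = sum(gold_prices) / len(gold_prices)

print(housing_counts, [age.name for age in sorted_ages], total_lines, average_price)
```

The kind of data determines which operations make sense: counting for nominal values, sorting for ordinal values, and arithmetic for numerical values.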

The kinds of datasets used in this book are the following:

  • E-mails (unstructured, discrete)
  • Digital images (unstructured, discrete)
  • Stock market logs (structured, continuous)
  • Historic gold prices (structured, continuous)
  • Credit approval records (structured, discrete)
  • Social media friends relationships (unstructured, discrete)
  • Tweets and trending topics (unstructured, continuous)
  • Sales records (structured, continuous)

For each of the projects in this book, we try to use a different kind of data. The aim is to give the reader the ability to address different kinds of data problems.

The data analysis process

When you have a good understanding of a phenomenon, it is possible to make predictions about it. Data analysis helps us to make this possible by exploring the past and creating predictive models.

The data analysis process is composed of the following steps:

  • The statement of problem
  • Collecting your data
  • Cleaning the data
  • Normalizing the data
  • Transforming the data
  • Exploratory statistics
  • Exploratory visualization
  • Predictive modeling
  • Validating your model
  • Visualizing and interpreting your results
  • Deploying your solution

All of these activities can be grouped as shown in the following image:

[Figure: The data analysis process]

The problem

The problem definition starts with high-level business domain questions, such as how to track differences in behavior between groups of customers or what the gold price will be next month. Understanding the objectives and requirements from a domain perspective is the key to a successful data analysis project.

Types of data analysis questions include:

  • Inferential
  • Predictive
  • Descriptive
  • Exploratory
  • Causal
  • Correlational

Data preparation

Data preparation is about how to obtain, clean, normalize, and transform the data into an optimal dataset, trying to avoid any possible data quality issues such as invalid, ambiguous, out-of-range, or missing values. This process can take up a lot of time. In Chapter 11, Working with Twitter Data, we will go into more detail about working with data, using OpenRefine to address complicated tasks. Analyzing data that has not been carefully prepared can lead you to highly misleading results.

The characteristics of good data are as follows:

  • Complete
  • Coherent
  • Unambiguous
  • Countable
  • Correct
  • Standardized
  • Non-redundant
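A minimal sketch of the cleaning and normalizing steps might look like the following. It assumes a hypothetical dataset of records with an `age` field and an assumed valid range of 0 to 120; the function names are illustrative. Records with missing, invalid, or out-of-range values are dropped, and the remaining values are rescaled to the interval [0, 1] with min-max normalization, which is one common choice among several.

```python
def clean(records, low=0.0, high=120.0):
    """Keep only records whose 'age' is present, numeric, and within range."""
    cleaned = []
    for record in records:
        try:
            age = float(record.get("age"))
        except (TypeError, ValueError):
            continue  # missing (None) or non-numeric value
        if low <= age <= high:
            cleaned.append({**record, "age": age})
    return cleaned


def min_max_normalize(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        return [0.0 for _ in values]
    return [(value - lowest) / (highest - lowest) for value in values]


# Hypothetical raw records illustrating typical data quality issues.
raw = [
    {"id": 1, "age": "20"},
    {"id": 2, "age": None},   # missing
    {"id": 3, "age": "abc"},  # invalid
    {"id": 4, "age": "250"},  # out of range
    {"id": 5, "age": "60"},
    {"id": 6, "age": "40"},
]

prepared = clean(raw)
scaled = min_max_normalize([record["age"] for record in prepared])
print([record["id"] for record in prepared], scaled)
```

Dropping bad records is the simplest policy; depending on the problem, imputing missing values or correcting them by hand may be preferable.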

Data exploration

Data exploration is essentially looking at the processed data in a graphical or statistical form and trying to find patterns, connections, and relations in the data. Visualization is used to provide overviews in which meaningful patterns may be found. In Chapter 3, Getting to Grips with Visualization, we will present a JavaScript visualization framework (D3.js) and implement some examples of how to use visualization as a data exploration tool.

Predictive modeling

From the galaxy of available information, we have to extract usable hidden patterns and trends using relevant algorithms. To anticipate the future behavior of these patterns, we can use predictive modeling. Predictive modeling is a statistical technique for predicting future behavior by analyzing existing information, that is, historical data. We have to use statistical models that best forecast the hidden patterns in the data.

Predictive modeling is a process used in data analysis to create or choose a statistical model that best predicts the probability of an outcome. Using predictive modeling, we can assess the future behavior of a customer, for which we require past performance data for that customer. For example, in the retail sector, predictive analysis can play an important role in improving profitability. Retailers can store galaxies of historical data. After developing different predictive models using this data, we can produce forecasts that improve promotional planning, optimize sales channels, optimize store areas, and enhance demand planning.
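To make the idea of forecasting from historical data concrete, here is a small sketch that fits a straight line to past sales by ordinary least squares and extrapolates one month ahead. The sales figures are hypothetical and deliberately linear, and a straight line is only one of the simplest possible predictive models; it is not the method used in the later chapters.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical historical data: monthly sales for five months.
months = [1, 2, 3, 4, 5]
sales = [12.0, 14.0, 16.0, 18.0, 20.0]

slope, intercept = fit_line(months, sales)
forecast = slope * 6 + intercept  # prediction for month 6
print(slope, intercept, forecast)
```

Real sales data is noisy, so the fitted line would not pass through every point, and the quality of the forecast would need to be validated before it is trusted.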

Initially, building predictive models requires expert input. Once relevant predictive models are built, we can use them to produce forecasts automatically. Predictive models give better forecasts when we concentrate on a careful combination of predictors. In general, as the amount of data increases, the predictions tend to become more precise.

In this book we will use a variety of those models, and we can group them into three categories based on their outcomes:

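Model                                 Chapter  Algorithm
------------------------------------  -------  ---------------------------------------------------
Categorical outcome (Classification)  4        Naïve Bayes Classifier
Categorical outcome (Classification)  11       Natural Language Toolkit and Naïve Bayes Classifier
Numerical outcome (Regression)        6        Random walk
Numerical outcome (Regression)        8        Support vector machines
Numerical outcome (Regression)        8        Distance-based approach and k-nearest neighbor
Numerical outcome (Regression)        9        Cellular automata
Descriptive modeling (Clustering)     5        Fast Dynamic Time Warping (FDTW) + distance metrics
Descriptive modeling (Clustering)     10       Force layout and Fruchterman-Reingold layout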

Another important task in this step is evaluating the model we chose as optimal for the particular problem.

Model assumptions are important for the quality of the prediction model. Better predictions will result from a model that satisfies its underlying assumptions. However, assumptions can never be fully met in empirical data, and evaluation preferably focuses on the validity of the predictions. The strength of the evidence for validity is usually considered to be stronger.

The no free lunch theorem proposed by Wolpert in 1996 said:

"No Free Lunch theorems have shown that learning algorithms cannot be universally good".

But extracting valuable information from the data means the predictive model should be accurate. There are many different tests to determine whether the predictive models we create are accurate, meaningful representations that will provide valuable information.

Model evaluation helps us to ensure that our analysis is not overoptimistic or overfitted. In this book we are going to present two different ways of validating the model:

  • Cross-validation: Here, we divide the data into subsets of equal size and test the predictive model in order to estimate how it is going to perform in practice. We will implement cross-validation in order to validate the robustness of our model as well as evaluate multiple models to identify the best model based on their performance.
  • Hold-out: Here, a large dataset is arbitrarily divided into three subsets: training set, validation set, and test set.
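The two validation schemes above can be sketched with index splits in plain Python. This is a minimal illustration, not the implementation used later in the book: the function names, the fixed random seed, and the 60/20/20 proportions for the hold-out split are assumptions made for the example.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation.

    The n indices are shuffled and dealt into k folds whose sizes differ
    by at most one; each fold serves as the test set exactly once.
    """
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


def hold_out_split(n, train_frac=0.6, val_frac=0.2, seed=0):
    """Split n indices into training, validation, and test subsets."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    train = indices[:n_train]
    validation = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, validation, test


folds = list(k_fold_indices(10, 5))
train_idx, val_idx, test_idx = hold_out_split(10)
print(len(folds), len(train_idx), len(val_idx), len(test_idx))
```

In cross-validation, a model would be trained on each `train` list and scored on the matching `test` list, and the scores averaged. In hold-out, the validation set is typically used to choose between models and the test set is reserved for the final estimate.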

Visualization of results

This is the final step in our analysis process. When we present model output results, visualization tools can play an important role. The visualization results are an important piece of our technological architecture. As the database is the core of our architecture, various technologies and methods for the visualization of data can be employed.

In an exploratory data analysis process, simple visualization techniques are very useful for discovering patterns, since the human eye plays an important role. Sometimes, we have to generate a three-dimensional plot to find the visual pattern. However, to reveal visual patterns more clearly, we can also use a scatter plot matrix instead of a three-dimensional plot. In practice, the hypothesis of the study, the dimensionality of the feature space, and the data itself all play important roles in choosing a good visualization technique.

In this book, we will focus on univariate and multivariate graphical models, using a variety of visualization tools like bar charts, pie charts, scatterplots, line charts, and multiple line charts, all implemented in D3.js. We will also learn how to use standalone plotting in Python with Matplotlib.

Quantitative versus qualitative data analysis

Quantitative data are numerical measurements expressed in terms of numbers.

Qualitative data are categorical measurements expressed in terms of natural language descriptions.

As shown in the following image, we can observe the differences between quantitative and qualitative analysis:

[Figure: Quantitative versus qualitative data analysis]

Quantitative analytics involves analysis of numerical data. The type of the analysis will depend on the level of measurement. There are four kinds of measurements:

  • Nominal data has no logical order and is used as classification data.
  • Ordinal data has a logical order, but the differences between values are not constant.
  • Interval data is continuous and has a logical order. The data has standardized differences between values, but does not have a true zero.
  • Ratio data is continuous, with a logical order and regular differences between values, and has a true zero.

Qualitative analysis can explore the complexity and meaning of social phenomena. Data for qualitative study may include written texts (for example, documents or e-mail) and/or audible and visual data (digital images or sounds). In Chapter 11, Working with Twitter Data, we will present a sentiment analysis from Twitter data as an example of qualitative analysis.

Importance of data visualization

The goal of data visualization is to expose something new about the underlying patterns and relationships contained within the data. The visualization needs to be not only beautiful but also meaningful in order to help organizations make better decisions. Visualization is an easy way to jump into a complex dataset (small or big) to describe and explore the data efficiently. Many kinds of data visualization are available, such as bar charts, histograms, line charts, pie charts, heat maps, and frequency Wordles (as shown in the following image), for one variable, two variables, or many variables, in one, two, or even three dimensions:

[Figure: Importance of data visualization]

Data visualization is an important part of our data analysis process because it is a fast and easy way to perform exploratory data analysis by summarizing the main characteristics of the data in a visual graph.

The goals of exploratory data analysis are as follows:

  • Detection of data errors
  • Checking of assumptions
  • Finding hidden patterns (like tendency)
  • Preliminary selection of appropriate models
  • Determining relationships between the variables
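Before drawing any chart, some of these goals can be approached with simple summary statistics. The following sketch uses Python's standard `statistics` module on a hypothetical list of temperature readings; the two-standard-deviation rule for flagging suspicious values is a common rule of thumb assumed here for illustration, not a recommendation from this book.

```python
from statistics import mean, median, stdev


def describe(values):
    """Summarize the main characteristics of a numeric sample."""
    return {
        "n": len(values),
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "median": median(values),
        "stdev": stdev(values),
    }


def flag_outliers(values, k=2.0):
    """Return the values lying more than k standard deviations from the mean."""
    center, spread = mean(values), stdev(values)
    return [value for value in values if abs(value - center) > k * spread]


# Hypothetical readings; the last one looks like a data entry error.
temperatures = [21.0, 22.5, 21.8, 22.1, 21.4, 22.0, 21.6, 22.3, 21.9, 95.0]

summary = describe(temperatures)
suspicious = flag_outliers(temperatures)
print(summary)
print(suspicious)
```

A large gap between the mean and the median, as in this sample, is often the first hint that a few extreme values deserve a closer look.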

We will go into more detail about data visualization and exploratory data analysis in Chapter 3, Getting to Grips with Visualization.

What about big data?

Big data is a term used when the data exceeds the processing capacity of a typical database. The integration of computer technology into science and daily life has enabled the collection of massive volumes of data, such as climate data, website transaction logs, customer data, and credit card records. However, such big datasets cannot be practically managed on a single commodity computer because they are too large to fit in memory or take too long to process. To overcome this obstacle, one may have to resort to parallel and distributed architectures, with multicore and cloud computing platforms providing access to hundreds or thousands of processors. These architectures offer new capabilities for storing and manipulating big data.

Big data is now a reality: the variety, volume, and velocity of data coming from the Web, sensors, devices, audio, video, networks, log files, social media, and transactional applications reach exceptional levels. Big data has also reached the business, government, and science sectors. This phenomenal growth means that we must understand not only big data, in order to interpret the information that truly counts, but also the possibilities of big data analytics.

There are three main features of big data:

  • Volume: Large amounts of data
  • Variety: Different types of structured, unstructured, and multistructured data
  • Velocity: Needs to be analyzed quickly

As shown in the following image, we can see the interaction between these three Vs:

[Figure: What about big data?]

We need big data analytics when data grows fast and we need to uncover hidden patterns, unknown correlations, and other useful information that can be used to make better decisions. With big data analytics, data scientists and others can analyze huge volumes of data that conventional analytics and business intelligence solutions cannot, in order to transform business decisions for the future. Big data analytics is a workflow that distils terabytes of low-value data.

Big data is an opportunity for any company to take advantage of data aggregation, data exhaustion, and metadata. This makes big data a useful business analytics tool, but there is a common misunderstanding of what big data actually is.

The most common architecture for big data processing is MapReduce, a programming model for processing large datasets in parallel using a distributed cluster.

Apache Hadoop is the most popular implementation of MapReduce, and it is used to solve large-scale distributed data storage, analysis, and retrieval tasks. However, MapReduce is just one of three classes of technologies that store and manage big data. The other two classes are NoSQL and Massively Parallel Processing (MPP) data stores. In this book we will implement MapReduce functions and NoSQL storage through MongoDB in Chapter 12, Data Processing and Aggregation with MongoDB, and Chapter 13, Working with MapReduce.

MongoDB provides us with document-oriented storage, high availability, and map/reduce flexible aggregation for data processing.
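The MapReduce programming model can be illustrated on a single machine with the classic word count. The sketch below is plain Python and runs sequentially; it is not Hadoop's or MongoDB's API, and the function names are chosen only for this example. In a real cluster, the map and reduce calls would run in parallel on different nodes, and the framework would handle the shuffle step.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run the three phases in sequence over a list of documents."""
    pairs = (pair for document in documents for pair in map_phase(document))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data big", "data analysis"])
print(counts)
```

Because each map call looks at one document and each reduce call looks at one key, both phases can be distributed without the workers needing to communicate with each other.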

A paper published by IEEE in 2009, The Unreasonable Effectiveness of Data, says the following:

"But invariably, simple models and a lot of data trump over more elaborate models based on less data."

This is a fundamental idea in big data (you can find the full paper at http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/35179.pdf). The trouble with real-world data is that the probability of finding false correlations is high and gets higher as datasets grow. That's why, in this book, we will focus on meaningful data instead of big data.

One of the main challenges for big data is how to store, protect, back up, organize, and catalog data at petabyte scale. Another main challenge is the concept of data ubiquity. With the proliferation of smart devices with several sensors and cameras, the amount of data available for each person increases every minute. Big data systems must be able to process all that data in real time:

[Figure: What about big data?]

Quantified self

Quantified self is self-knowledge through self-tracking with technology. Here, people collect data on their own daily activities in terms of inputs, states, and performance. For example, inputs include food consumption or the quality of the surrounding air, states include mood or blood pressure, and performance includes mental or physical condition. To collect this data, we can use wearable sensors and life logging. The quantified self process allows individuals to quantify biometrics that they never knew existed, and makes data collection cheaper and more convenient. One can track insulin and cortisol levels and sequence DNA. Using quantified self data, people can keep watch over their overall health, diet, and level of physical activity.

The use of wearable self-tracking gadgets is rapidly increasing. If we pool the quantified self data of a specific group of people, we can apply predictive algorithms to this data to diagnose patients in that location. That means quantified self data is very useful in certain medical contexts.

In the following screenshot, we can see some electronic gadgets that gather quantitative data:

[Figure: Quantified self]

Sensors and cameras

Interaction with the outside world is highly important in data analysis. Using sensors like Radio-Frequency Identification (RFID) or a smartphone to scan a Quick Response (QR) code is an easy way of interacting directly with the customer, making recommendations, and analyzing consumer trends.

On the other hand, people are using their smartphones all the time, using their cameras as a tool. In Chapter 5, Similarity-Based Image Retrieval, we will use these digital images to perform a search by image. This can be used, for example, in face recognition or for finding recommendations of a restaurant just by taking a picture of the front door.

This interaction with the real world can give you a competitive advantage and a real-time data source directly from the customer.

Social network analysis

Nowadays, the Internet brings people together in many ways, most visibly through social media such as Facebook, Twitter, and LinkedIn. Using these social networks, users work, play, socialize online, and demonstrate new forms of collaboration. Social networks play a crucial role in reshaping business models and opening up numerous possibilities for studying human interaction and collective behavior.

In fact, if we want to understand how to identify key individuals in social systems, we can generate models using analytical techniques on social network data and extract that information. This process is called Social Network Analysis (SNA).

Formally, SNA analyzes social relationships in terms of network theory, with nodes representing individuals and ties representing relationships between the individuals. Social networks create groups of related individuals (friendships) based on different aspects of their interaction. We can find out important information such as hobbies (for product recommendation) or who has the most influential opinion in a group (centrality). In Chapter 10, Working with Social Graphs, we will present a project, Who is your closest friend?, and show a solution for Twitter clustering.
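As a small illustration of nodes, ties, and centrality, the following sketch computes degree centrality, one of the simplest centrality measures, for a hypothetical friendship network. It assumes that ties are undirected (mutual friendships), which is a simplification, and it is not the approach developed in Chapter 10; the names and edges are invented for the example.

```python
def degree_centrality(edges):
    """Fraction of the other nodes that each node is directly tied to.

    Ties are treated as undirected, and repeated ties are counted once.
    """
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    n = len(neighbors)
    if n < 2:
        return {node: 0.0 for node in neighbors}
    return {node: len(adjacent) / (n - 1) for node, adjacent in neighbors.items()}


# Hypothetical friendship ties between four people.
friendships = [("ana", "bob"), ("ana", "cara"), ("ana", "dan"), ("bob", "cara")]

centrality = degree_centrality(friendships)
most_central = max(centrality, key=centrality.get)
print(centrality)
print(most_central)
```

Degree centrality only counts direct ties. Other measures, such as betweenness or closeness, also take into account a node's position in the wider network.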

Social networks are strongly connected, and these connections are often asymmetric. This makes SNA computationally expensive, and so it needs to be addressed with high-performance solutions that are less statistical and more algorithmic. The visualization of a social network can help us gain a good insight into how people are connected. The exploration of a graph is done through displaying nodes and ties in various colors, sizes, and distributions. D3.js has animation capabilities that enable us to visualize a social graph with interactive animations. These help us to simulate behaviors like information diffusion or the distance between nodes.

Facebook processes more than 500 TB of data daily (images, text, video, likes, and relationships), and this amount of data needs non-conventional treatment like NoSQL databases and MapReduce frameworks. In this book, we will work with MongoDB, a document-based NoSQL database, which also has great functions for aggregations and MapReduce processing.

Tools and toys for this book

The main goal of this book is to provide the reader with self-contained projects that are ready to deploy. To do this, we will use and implement tools such as Python, D3, and MongoDB as you go through the book. These tools will help you to program and deploy the projects. You can also download all the code from the author's GitHub repository:

https://github.com/hmcuesta

You can see a detailed installation and setup process of all the tools in Appendix, Setting Up the Infrastructure.

Why Python?

Python is a "scripting language" - an interpreted language with its own built-in memory management and good facilities for calling and co-operating with other programs. There are two popular versions, 2.7 and 3.x, and in this book we will focus on the 3.x version, because it is under active development and has already seen over two years of stable releases.

Python is multi-platform, runs on Windows, Linux/Unix, and Mac OS X, and has been ported to Java and .NET virtual machines. Python has powerful standard libs and a wealth of third-party packages for numerical computation and machine learning, such as NumPy, SciPy, pandas, SciKit, mlpy, and so on.

Python is excellent for beginners yet great for experts, is highly scalable, and is suitable for large projects as well as small ones. It is also easily extensible and object-oriented.

Python is widely used by organizations like Google, Yahoo Maps, NASA, Red Hat, Raspberry Pi, IBM, and many more.

http://wiki.python.org/moin/OrganizationsUsingPython

Python has excellent documentation and examples:

http://docs.python.org/3/

The latest Python software is available for free, even for commercial products, and can be downloaded from here:

http://python.org/

Why mlpy?

mlpy (Machine Learning Python) is a module built on top of NumPy, SciPy, and the GNU scientific libraries. It is open source and supports Python 3.x. mlpy has a large number of machine learning algorithms for supervised and unsupervised problems.

Some of the features of mlpy that will be used in this book are as follows:

  • Regression: Support Vector Machines (SVM)
  • Classification: SVM, k-nearest-neighbor (k-NN), classification tree
  • Clustering: k-means, multidimensional scaling
  • Dimensionality Reduction: Principal Component Analysis (PCA)
  • Misc: Dynamic Time Warping (DTW) distance
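To give an idea of what the last item computes, here is a minimal pure-Python sketch of the textbook Dynamic Time Warping distance between two numeric sequences. It is not mlpy's implementation or API; the function name, the absolute-difference local cost, and the sample sequences are assumptions made for illustration.

```python
def dtw_distance(a, b):
    """Dynamic Time Warping distance between two numeric sequences.

    cost[i][j] holds the cheapest cumulative cost of aligning the first
    i items of a with the first j items of b, using |x - y| as local cost.
    """
    n, m = len(a), len(b)
    infinity = float("inf")
    cost = [[infinity] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a[i-1] also aligns with an earlier item of b
                cost[i][j - 1],      # b[j-1] also aligns with an earlier item of a
                cost[i - 1][j - 1],  # both sequences advance together
            )
    return cost[n][m]


# The second series repeats one value; DTW absorbs the time shift.
same_shape = dtw_distance([1, 2, 3], [1, 2, 2, 3])
different = dtw_distance([0, 0], [1, 1])
print(same_shape, different)
```

Unlike a point-by-point distance, DTW can compare sequences of different lengths, which is why it is useful for time series that have the same shape but different speeds.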

We can download the latest version of mlpy from here: http://mlpy.sourceforge.net/

Reference: D. Albanese, R. Visintainer, S. Merler, S. Riccadonna, G. Jurman, C. Furlanello. mlpy: Machine Learning Python, 2012: http://arxiv.org/abs/1202.6548.

Why D3.js?

D3.js (Data-Driven Documents) was developed by Mike Bostock. D3 is a JavaScript library for visualizing data and manipulating the Document Object Model (DOM), and it runs in a browser without a plugin. With D3.js you can manipulate all the elements of the DOM, and it is as flexible as the client-side web technology stack (HTML, CSS, and SVG).

D3.js supports large datasets and includes animation capabilities that make it a really good choice for web visualization.

D3 has excellent documentation, examples, and an active community.

We can download the latest version of D3.js from:

https://d3js.org/

Why MongoDB?

NoSQL is a term that covers different types of data storage technology that are used when you can't fit your business model into a classical relational data model. NoSQL is mainly used in Web 2.0 and social media applications.

MongoDB is a document-based database. This means that MongoDB stores and organizes data as collections of documents, which lets you store view models almost exactly as you model them in the application. You can also perform complex searches for data and elementary data mining with MapReduce.

MongoDB is highly scalable, robust, and works perfectly with JavaScript-based web applications because you can store your data in a JSON document and implement a flexible schema, which makes it perfect for unstructured data.
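The document model and the MapReduce idea can be sketched in plain Python without a running database. This is only an illustration under stated assumptions: the collection is modeled as a list of dictionaries, and the field names, sample values, and the helpers `map_tags`, `reduce_counts`, and `count_tags` are hypothetical, not MongoDB's API.

```python
import json
from collections import defaultdict

# A "collection" modeled as a plain list of documents (dicts).
# Documents do not need to share the same fields (flexible schema).
tweets = [
    {"user": "ana", "text": "loving data analysis", "tags": ["data", "python"]},
    {"user": "bob", "text": "documents everywhere", "tags": ["mongodb"], "retweets": 3},
    {"user": "ana", "text": "d3 charts", "tags": ["d3", "data"],
     "geo": {"lat": 19.4, "lon": -99.1}},
]


def map_tags(doc):
    """Map step: emit one (tag, 1) pair for every tag in a document."""
    for tag in doc.get("tags", []):
        yield tag, 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted values for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def count_tags(collection):
    """Run the map step over every document, then reduce the results."""
    pairs = (pair for doc in collection for pair in map_tags(doc))
    return reduce_counts(pairs)


tag_counts = count_tags(tweets)

# Each document serializes directly to JSON, nested fields included.
print(json.dumps(tweets[2]))
print(tag_counts)
```

Note that the second and third documents carry extra fields (`retweets`, `geo`) that the first one lacks. A relational table would need a schema change for that, while a document collection accepts them as they are.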

MongoDB is used by well-known corporations like Foursquare, Craigslist, Firebase, SAP, and Forbes; we can see a detailed list of users at:

https://www.mongodb.com/industries

MongoDB has a big and active community, as well as well-written documentation:

http://docs.mongodb.org/manual/

MongoDB is easy to learn and it's free. We can download MongoDB from here:

http://www.mongodb.org/downloads

Summary

In this chapter, we presented an overview of the data analysis ecosystem and explained the basic concepts of the data analysis process and its tools, along with some insight into the practical applications of data analysis. We also provided an overview of the different kinds of data, both numerical and categorical, and looked at the nature of data: structured (databases, logs, and reports) and unstructured (image collections, social networks, and text mining). Then, we introduced the importance of data visualization and how a good visualization can help us with exploratory data analysis. Finally, we explored some of the concepts of big data, the quantified self, and social network analytics.

In the next chapter we will look at the cleaning, processing, and transforming of data using Python and OpenRefine.

Key benefits

  • Learn to use various data analysis tools and algorithms to classify, cluster, visualize, simulate, and forecast your data
  • Apply Machine Learning algorithms to different kinds of data such as social networks, time series, and images
  • A hands-on guide to understanding the nature of data and how to turn it into insight

Description

Beyond buzzwords like Big Data or Data Science, there are great opportunities to innovate in many businesses by using data analysis to build data-driven products. Data analysis involves asking many questions about data in order to discover insights and generate value for a product or a service. This book explains the basic data algorithms without the theoretical jargon, and you’ll get hands-on experience turning data into insights using machine learning techniques. We will perform data-driven innovation processing for several types of data such as text, images, social network graphs, documents, and time series, showing you how to implement large-scale data processing with MongoDB and Apache Spark.

Who is this book for?

This book is for developers who want to implement data analysis and data-driven algorithms in a practical way. It is also suitable for those without a background in data analysis or data processing. Basic knowledge of Python programming, statistics, and linear algebra is assumed.

What you will learn

  • Acquire, format, and visualize your data
  • Build an image-similarity search engine
  • Generate meaningful visualizations anyone can understand
  • Get started with analyzing social network graphs
  • Find out how to implement sentiment text analysis
  • Install data analysis tools such as Pandas, MongoDB, and Apache Spark
  • Get to grips with Apache Spark
  • Implement machine learning algorithms such as classification or forecasting

Product Details

Publication date: Sep 30, 2016
Length: 338 pages
Edition: 2nd
Language: English
ISBN-13: 9781785289712

Table of Contents

15 Chapters
1. Getting Started
2. Preprocessing Data
3. Getting to Grips with Visualization
4. Text Classification
5. Similarity-Based Image Retrieval
6. Simulation of Stock Prices
7. Predicting Gold Prices
8. Working with Support Vector Machines
9. Modeling Infectious Diseases with Cellular Automata
10. Working with Social Graphs
11. Working with Twitter Data
12. Data Processing and Aggregation with MongoDB
13. Working with MapReduce
14. Online Data Analysis with Jupyter and Wakari
15. Understanding Data Processing using Apache Spark

Customer reviews

Rating distribution
3.5 (2 Ratings)
5 star 50%
4 star 0%
3 star 0%
2 star 50%
1 star 0%

Jose Arturo Mora Soto, Feb 10, 2018
Rating: 5
Becoming a data scientist is not trivial, definitely one of the first steps is to learn how to manipulate data to obtain initial insights, I found this book a great source to start handling data with python, I really recommend this book but be aware that in order to have a better understanding you should need previous experience with python.
Amazon Verified review

Amazon Customer, Oct 21, 2017
Rating: 2
The authors may be experts in data analysis but they are not doing a good job of explaining it. If you are new to data analysis, this book will get you totally confused.
Amazon Verified review

FAQs

What is the delivery time and cost of print books?

Shipping Details

USA:

Economy: Delivery to most addresses in the US within 10-15 business days

Premium: Trackable Delivery to most addresses in the US within 3-8 business days

UK:

Economy: Delivery to most addresses in the U.K. within 7-9 business days.
Shipments are not trackable

Premium: Trackable delivery to most addresses in the U.K. within 3-4 business days!
Add one extra business day for deliveries to Northern Ireland and Scottish Highlands and islands

EU:

Premium: Trackable delivery to most EU destinations within 4-9 business days.

Australia:

Economy: Can deliver to P. O. Boxes and private residences.
Trackable service with delivery to addresses in Australia only.
Delivery time ranges from 7-9 business days for VIC and 8-10 business days for Interstate metro
Delivery time is up to 15 business days for remote areas of WA, NT & QLD.

Premium: Delivery to addresses in Australia only
Trackable delivery to most P. O. Boxes and private residences in Australia within 4-5 days based on the distance to a destination following dispatch.

India:

Premium: Delivery to most Indian addresses within 5-6 business days

Rest of the World:

Premium: Countries in the American continent: Trackable delivery to most countries within 4-7 business days

Asia:

Premium: Delivery to most Asian addresses within 5-9 business days

Disclaimer:
All orders received before 5 PM U.K. time start printing on the next business day, so the estimated delivery times start from the next day as well. Orders received after 5 PM U.K. time (in our internal systems) on a business day, or at any time on the weekend, begin printing on the second business day after. For example, an order placed at 11 AM today will begin printing tomorrow, whereas an order placed at 9 PM tonight will begin printing the day after tomorrow.


Unfortunately, due to several restrictions, we are unable to ship to the following countries:

  1. Afghanistan
  2. American Samoa
  3. Belarus
  4. Brunei Darussalam
  5. Central African Republic
  6. The Democratic Republic of Congo
  7. Eritrea
  8. Guinea-Bissau
  9. Iran
  10. Lebanon
  11. Libyan Arab Jamahiriya
  12. Somalia
  13. Sudan
  14. Russian Federation
  15. Syrian Arab Republic
  16. Ukraine
  17. Venezuela
What is custom duty/charge?

Customs duties are charges levied on goods when they cross international borders. They are a tax imposed on imported goods. These duties are charged by special authorities and bodies created by local governments and are meant to protect local industries, economies, and businesses.

Do I have to pay customs charges for the print book order?

Orders shipped to the countries listed under EU27 will not bear customs charges. They are paid by Packt as part of the order.

List of EU27 countries: www.gov.uk/eu-eea

For shipments to countries outside the EU27, a customs duty or localized taxes may be applied by the recipient country. These must be paid by the customer and are not included in the shipping charges on the order.

How do I know my custom duty charges?

The amount of duty payable varies greatly depending on the imported goods, the country of origin, and several other factors such as the total invoice amount, dimensions such as weight, and other criteria applicable in your country.

For example:

  • If you live in Mexico, and the declared value of your ordered items is over $ 50, for you to receive a package, you will have to pay additional import tax of 19% which will be $ 9.50 to the courier service.
  • Whereas if you live in Turkey, and the declared value of your ordered items is over € 22, for you to receive a package, you will have to pay additional import tax of 18% which will be € 3.96 to the courier service.
How can I cancel my order?

Cancellation Policy for Published Printed Books:

You can cancel any order within 1 hour of placing the order. Simply contact customercare@packt.com with your order details or payment transaction ID. If your order has already started the shipment process, we will do our best to stop it. However, if it is already on the way to you, then once you receive it you can contact us at customercare@packt.com using the returns and refund process.

Please understand that Packt Publishing cannot provide refunds or cancel any order except in the cases described in our Return Policy (for example, Packt Publishing agrees to replace your printed book because it arrives damaged or has a material defect). Otherwise, Packt Publishing will not accept returns.

What is your returns and refunds policy?

Return Policy:

We want you to be happy with your purchase from Packtpub.com. We will not hassle you with returning print books to us. If the print book you receive from us is incorrect, damaged, doesn't work or is unacceptably late, please contact Customer Relations Team on customercare@packt.com with the order number and issue details as explained below:

  1. If you ordered (eBook, Video or Print Book) incorrectly or accidentally, please contact Customer Relations Team on customercare@packt.com within one hour of placing the order and we will replace/refund you the item cost.
  2. Sadly, if your eBook or Video file is faulty or a fault occurs during the eBook or Video being made available to you, i.e. during download then you should contact Customer Relations Team within 14 days of purchase on customercare@packt.com who will be able to resolve this issue for you.
  3. You will have a choice of replacement or refund of the problem items (damaged, defective, or incorrect).
  4. Once Customer Care Team confirms that you will be refunded, you should receive the refund within 10 to 12 working days.
  5. If you are only requesting a refund of one book from a multiple order, then we will refund you the appropriate single item.
  6. Where the items were shipped under a free shipping offer, there will be no shipping costs to refund.

On the off chance your printed book arrives damaged or with a material defect, contact our Customer Relations Team at customercare@packt.com within 14 days of receipt of the book with appropriate evidence of the damage, and we will work with you to secure a replacement copy, if necessary. Please note that each printed book you order from us is individually made by Packt's professional book-printing partner on a print-on-demand basis.

What tax is charged?

Currently, no tax is charged on the purchase of any print book (subject to change based on the laws and regulations). A localized VAT fee is charged only to our European and UK customers on eBooks, Video and subscriptions that they buy. GST is charged to Indian customers for eBooks and video purchases.

What payment methods can I use?

You can pay with the following card types:

  1. Visa Debit
  2. Visa Credit
  3. MasterCard
  4. PayPal