Mastering Predictive Analytics with Python

Product type: Book
Published: Aug 2016
Publisher: Packt Publishing
ISBN-13: 9781785882715
Pages: 334
Edition: 1st
Author: Joseph Babcock
Table of Contents

Mastering Predictive Analytics with Python
Credits
About the Author
About the Reviewer
www.PacktPub.com
Preface
1. From Data to Decisions – Getting Started with Analytic Applications
2. Exploratory Data Analysis and Visualization in Python
3. Finding Patterns in the Noise – Clustering and Unsupervised Learning
4. Connecting the Dots with Models – Regression Methods
5. Putting Data in its Place – Classification Methods and Analysis
6. Words and Pixels – Working with Unstructured Data
7. Learning from the Bottom Up – Deep Networks and Unsupervised Features
8. Sharing Models with Prediction Services
9. Reporting and Testing – Iterating on Analytic Systems
Index

Chapter 1. From Data to Decisions – Getting Started with Analytic Applications

From quarterly financial projections to customer surveys, analytics help businesses make decisions and plan for the future. While spreadsheet visualizations such as pie charts and trend lines have been used for decades, recent years have seen growth both in the volume and diversity of data sources available to the business analyst and in the sophistication of the tools used to interpret this information.

The rapid growth of the Internet, through e-commerce and social media platforms, has generated a wealth of data, which is available faster than ever before for analysis. Photographs, search queries, and online forum posts are all examples of unstructured data that can't be easily examined in a traditional spreadsheet program. With the proper tools, these kinds of data offer new insights, in conjunction with or beyond traditional data sources.

Traditionally, data such as historical customer records appear in a structured, tabular form that is stored in an electronic data warehouse and easily imported into a spreadsheet program. Even in the case of such tabular data, the volume of records and the rate at which they are available are increasing in many industries. While the analyst might have historically transformed raw data through interactive manipulation, robust analytics increasingly requires automated processing that can scale with the volume and velocity of data being received by a business.

Along with the data itself, the methods used to examine it have become more powerful and complex. Beyond summarizing historical patterns or projecting future events using trend lines derived from a few key input variables, advanced analytics emphasizes the use of sophisticated predictive modeling (see the tip The goals of predictive analytics later in this chapter) to understand the present and forecast near- and long-term outcomes.

Diverse methods for generating such predictions typically require the following common elements:

  • An outcome or target that we are trying to predict, such as a purchase or a click-through rate (CTR) on a search result.

  • A set of columns that comprise features, also known as predictors (for example, a customer's demographic information, past transactions on a sales account, or click behavior on a type of ad) describing individual properties of each record in our dataset (for example, an account or ad).

  • A procedure that finds the model or set of models which best maps these features to the outcome of interest on a given sample of data.

  • A way to evaluate the performance of the model on new data.
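These four elements can be put together in a few lines; the following is a minimal sketch using scikit-learn with synthetic data standing in for real customer records (the feature interpretations in the comments are illustrative, not from the book):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                 # features, e.g. age, income, past clicks
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # outcome, e.g. purchase (1) or not (0)

# Hold out a portion of the data to stand in for "new" records
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The procedure that maps features to the outcome on a sample of data
model = LogisticRegression().fit(X_train, y_train)

# Evaluate performance on data the model has not seen
acc = accuracy_score(y_test, model.predict(X_test))
print(f"held-out accuracy: {acc:.2f}")
```

Each subsequent chapter elaborates on one or more of these pieces: richer features, more powerful models, and more careful evaluation.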

While predictive modeling techniques can be used in powerful analytic applications to discover complex relationships between seemingly unrelated inputs, they also present a new set of challenges to the business analyst:

  • Which method is best suited to a particular problem?

  • How does one correctly evaluate the performance of these techniques on historical and new data?

  • What are the preferred strategies for tuning the performance of a given method?

  • How does one robustly scale these techniques for both one-off analysis and ongoing insight?

In this book, we will show you how to address these challenges by developing analytic solutions that transform data into powerful insights for you and your business. The main tasks involved in building these applications are:

  • Transforming raw data into a sanitized form that can be used for modeling. This may involve both cleaning anomalous data and converting unstructured data into a structured format.

  • Feature engineering, by transforming these sanitized inputs into the format that is used to develop a predictive model.

  • Calibrating a predictive model on a subset of this data and assessing its performance.

  • Scoring new data while evaluating the ongoing performance of the model.

  • Automating the transformation and modeling steps for regular updates.

  • Exposing the output of the model to other systems and users, usually through a web application.

  • Generating reports for the analyst and business user that distill the data and model into regular and robust insights.
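The first few of these tasks — sanitizing raw data, engineering features, calibrating a model, and scoring new records — can be sketched with pandas and scikit-learn. The column names, cleaning rule, and tiny dataset below are invented purely for illustration:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

raw = pd.DataFrame({
    "age": [25, 40, None, 33, 51, 29],
    "country": ["US", "DE", "US", "FR", "DE", "US"],
    "purchased": [0, 1, 0, 1, 1, 0],
})

# Sanitize: fill anomalous/missing values
clean = raw.assign(age=raw["age"].fillna(raw["age"].median()))

# Feature engineering: encode categorical columns numerically
features = pd.get_dummies(clean.drop(columns="purchased"))

# Calibrate a model on a subset of the data ...
X_train, X_test, y_train, y_test = train_test_split(
    features, clean["purchased"], test_size=0.33,
    random_state=0, stratify=clean["purchased"])
model = LogisticRegression().fit(X_train, y_train)

# ... and score the held-out records as if they were new data
scores = model.predict_proba(X_test)[:, 1]   # probability of purchase per record
```

The remaining tasks — automation, deployment, and reporting — wrap code like this in scheduling, web services, and reporting tools, as later chapters show.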

Throughout this volume, we will use open-source tools written in the Python programming language to build these sorts of applications. Why Python? The Python language strikes an attractive balance between robust compiled languages such as Java, C++, and Scala, and pure statistical packages such as R, SAS, or MATLAB. We can work interactively with Python from the command line (or, as we will in subsequent chapters, from browser-based notebook environments) to plot data and prototype commands. Python also provides extensive libraries that allow us to turn this exploratory work into web applications (using libraries such as Flask, CherryPy, and Celery, as we will see in Chapter 8, Sharing Models with Prediction Services) or scale it to large datasets (using PySpark, as we will explore in later chapters). Thus we can both analyze data and develop software applications within the same language.
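As a small taste of the Chapter 8 material, here is a hypothetical Flask service that exposes a stand-in model over HTTP; the /predict route name and the fixed scoring rule are our own illustrative choices, not the book's:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def model_score(features):
    # Stand-in for a trained model's scoring function
    return 1.0 if features.get("past_purchases", 0) > 2 else 0.0

@app.route("/predict", methods=["POST"])
def predict():
    # Accept a JSON record of features and return a JSON score
    return jsonify({"score": model_score(request.get_json())})

# Exercise the endpoint with Flask's built-in test client
client = app.test_client()
response = client.post("/predict", json={"past_purchases": 5})
print(response.get_json())
```

In a real deployment the service would be run behind a production web server and `model_score` replaced by a calibrated model loaded from disk.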

Before diving into the technical details of these tools, let's take a high-level look at the concepts behind these applications and how they are structured. In this chapter, we will:

  • Define the elements of an analytic pipeline: data transformation, sanity checking, preprocessing, model development, scoring, automation, deployment, and reporting.

  • Explain the differences between batch-oriented and stream processing and their implications at each step of the pipeline.

  • Examine how batch and stream processing can be jointly accommodated within the Lambda Architecture for data processing.

  • Explore an example of a stream-processing pipeline to perform sentiment analysis of social media feeds.

  • Explore an example of a batch-processing pipeline to generate targeted e-mail marketing campaigns.

Tip

The goals of predictive analytics

The term predictive analytics, along with others such as data mining and machine learning, is often used to describe the techniques used in this book to build analytic solutions. However, it is important to keep in mind that there are two distinct goals these methods can address. Inference involves building models in order to evaluate the significance of a parameter on an outcome and emphasizes interpretation and transparency over predictive performance. For example, the coefficients of a regression model (Chapter 4, Connecting the Dots with Models – Regression Methods) can be used to estimate the effect of variation in a particular model input (for example, customer age or income) on an output variable (for example, sales). The predictions from a model developed for inference may be less accurate than other techniques, but provide valuable conceptual insights that may guide business decisions. Conversely, prediction emphasizes the accuracy of the estimated outcome, even if the model itself is a black box where the connection between an input and the resulting output is not always clear. For example, Deep Learning (Chapter 7, Learning from the Bottom Up – Deep Networks and Unsupervised Features) can produce state-of-the-art models and extremely accurate predictions from complex sets of inputs, but the connection between the input parameters and the prediction may be hard to interpret.
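The inference goal can be illustrated with a linear regression on synthetic data, where inspecting the fitted coefficients recovers the known effect of each input on the outcome; the variable names and true effect sizes (2.0 and 0.5) are invented for this example:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
age = rng.uniform(20, 60, size=500)
income = rng.uniform(30, 120, size=500)                  # in thousands
sales = 2.0 * age + 0.5 * income + rng.normal(0, 1.0, size=500)

model = LinearRegression().fit(np.column_stack([age, income]), sales)
# Each coefficient estimates the change in sales per unit change in that
# input, holding the other fixed; here they recover the true 2.0 and 0.5.
print(model.coef_)
```

A black-box predictor might score each record more accurately, but it would not yield this kind of per-input interpretation.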
