What this book covers
Chapter 1, From Data to Decisions – Getting Started with Analytic Applications, teaches you to describe the core components of an analytic pipeline and the ways in which they interact. We also examine the differences between batch and streaming processes, along with use cases for which each type of application is well suited. We walk through examples of basic applications in both paradigms and the design decisions needed at each step.
Chapter 2, Exploratory Data Analysis and Visualization in Python, examines many of the tasks needed to start building analytical applications. Using the IPython notebook, we'll cover how to load data from a file into a pandas DataFrame, rename columns in the dataset, filter unwanted rows, convert types, and create new columns. In addition, we'll join data from different sources and perform some basic statistical analyses using aggregations and pivots.
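As a taste of that workflow, here is a minimal pandas sketch stringing those steps together; the filenames and column names are hypothetical stand-ins, not datasets from the book:

```python
import pandas as pd

# Hypothetical input file and columns, for illustration only.
df = pd.read_csv('sales.csv')  # assumed columns: store_id, amt, dt

# Rename columns to something more readable.
df = df.rename(columns={'amt': 'amount', 'dt': 'date'})

# Filter unwanted rows and convert types.
df = df[df['amount'] > 0]
df['date'] = pd.to_datetime(df['date'])

# Create a new column derived from an existing one.
df['month'] = df['date'].dt.month

# Join against a second (hypothetical) data source.
regions = pd.read_csv('regions.csv')  # assumed columns: store_id, region
df = df.merge(regions, on='store_id', how='left')

# A basic aggregation and pivot.
summary = df.pivot_table(index='region', columns='month',
                         values='amount', aggfunc='sum')
print(summary)
```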
Chapter 3, Finding Patterns in the Noise – Clustering and Unsupervised Learning, shows you how to identify groups of similar items in a dataset, a kind of exploratory analysis frequently used as a first step in deciphering new datasets. We explore different ways of calculating the similarity between data points and describe the kinds of data each metric is best suited to. We examine both divisive clustering algorithms, which split the data into smaller components starting from a single group, and agglomerative methods, in which every data point starts as its own cluster. Using a number of datasets, we show examples in which these algorithms perform better or worse, and some ways to optimize them. We also see our first (small) data pipeline: a clustering application in PySpark using streaming data.
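The following minimal sketch previews the agglomerative approach on synthetic data; it assumes scikit-learn as the library, which is one of the tools used throughout the book:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Generate a small synthetic dataset with three underlying groups.
X, _ = make_blobs(n_samples=150, centers=3, random_state=42)

# Agglomerative clustering: every point starts as its own cluster,
# and the closest pairs are merged until three clusters remain.
model = AgglomerativeClustering(n_clusters=3, linkage='average')
labels = model.fit_predict(X)

print(np.bincount(labels))  # size of each resulting cluster
```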
Chapter 4, Connecting the Dots with Models – Regression Methods, examines how to fit several regression models, including transforming input variables to the correct scale and correctly accounting for categorical features. We fit and evaluate a linear regression, as well as regularized regression models. We also examine the use of tree-based regression models and how to optimize parameter choices when fitting them. Finally, we look at an example of random forest modeling in PySpark, which can be applied to larger datasets.
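As a preview, this sketch scales the inputs and fits a regularized (ridge) regression on synthetic data; the dataset and parameter values are illustrative assumptions, not examples from the chapter:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the chapter's examples.
X, y = make_regression(n_samples=500, n_features=10, noise=10.0,
                       random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale inputs to a common range, then fit an L2-regularized regression.
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
model.fit(X_train, y_train)

print('R^2 on held-out data:', model.score(X_test, y_test))
```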
Chapter 5, Putting Data in its Place – Classification Methods and Analysis, explains how to use classification models and some of the strategies for improving model performance. In addition to transforming categorical features, we look at interpreting the accuracy of a logistic regression using the ROC curve. In an attempt to improve model performance, we demonstrate the use of SVMs. Finally, we achieve good performance on the test set using Gradient-Boosted Decision Trees.
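The sketch below previews two of those pieces, a logistic regression baseline and a gradient-boosted model, each scored by the area under the ROC curve; it uses synthetic data and omits the SVM step for brevity:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Baseline: logistic regression, evaluated by area under the ROC curve.
logit = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print('Logistic AUC:',
      roc_auc_score(y_test, logit.predict_proba(X_test)[:, 1]))

# Gradient-boosted trees often improve on the linear baseline.
gbdt = GradientBoostingClassifier().fit(X_train, y_train)
print('GBDT AUC:',
      roc_auc_score(y_test, gbdt.predict_proba(X_test)[:, 1]))
```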
Chapter 6, Words and Pixels – Working with Unstructured Data, examines complex, unstructured data. We cover dimensionality reduction techniques such as the HashingVectorizer; matrix decompositions such as PCA, CUR, and NMF; and probabilistic models such as LDA. We also examine image data, including normalization and thresholding operations, and see how we can use dimensionality reduction techniques to find common patterns among images.
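As a small preview of the probabilistic side, here is a sketch that fits an LDA topic model to a toy corpus; the documents are invented for illustration, and a CountVectorizer stands in for the text-featurization step:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# A toy corpus standing in for the chapter's text data.
docs = ['the cat sat on the mat',
        'dogs and cats are pets',
        'stock markets rose sharply today',
        'investors bought shares in the market']

# Convert raw text to token counts, then fit a two-topic LDA model.
counts = CountVectorizer(stop_words='english').fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
topics = lda.fit_transform(counts)

print(topics.round(2))  # per-document topic weights
```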
Chapter 7, Learning from the Bottom Up – Deep Networks and Unsupervised Features, introduces deep neural networks as a way to generate models for complex data types where features are difficult to engineer. We'll examine how neural networks are trained through back-propagation, and why additional layers make this optimization harder, as with the vanishing gradient problem.
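To make back-propagation concrete, here is a minimal NumPy sketch of a two-layer network trained on the XOR problem; the architecture, learning rate, and iteration count are illustrative choices, not the chapter's:

```python
import numpy as np

# A two-layer network trained by back-propagation on XOR.
rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1, b1 = rng.normal(size=(2, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for step in range(5000):
    # Forward pass through both layers.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Backward pass: the chain rule propagates the error layer by layer.
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= 0.5 * h.T @ d_out
    b2 -= 0.5 * d_out.sum(axis=0)
    W1 -= 0.5 * X.T @ d_h
    b1 -= 0.5 * d_h.sum(axis=0)

print(out.round(2))  # typically approaches [0, 1, 1, 0]
```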
Chapter 8, Sharing Models with Prediction Services, describes the three components of a basic prediction service and discusses how this design allows us to share the results of predictive modeling with other users or software systems.
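As one illustration of the idea, the sketch below exposes a trained model behind an HTTP endpoint using Flask; the route, payload format, and in-memory model are assumptions for this example rather than the chapter's exact design:

```python
from flask import Flask, jsonify, request
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

app = Flask(__name__)

# Train a stand-in model at startup; a real service would load a
# persisted model artifact instead.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

@app.route('/predict', methods=['POST'])
def predict():
    # Expects a JSON body such as {"features": [0.1, 0.2, 0.3, 0.4]}.
    features = request.get_json()['features']
    score = model.predict_proba([features])[0, 1]
    return jsonify({'score': float(score)})

if __name__ == '__main__':
    app.run(port=5000)
```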
Chapter 9, Reporting and Testing – Iterating on Analytic Systems, teaches several strategies for monitoring the performance of predictive models following their initial design, and examines a number of scenarios in which the performance or the components of a model change over time.
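One common way to catch such change is to compare the distribution of incoming data against what the model saw at training time; the sketch below does this with a two-sample Kolmogorov-Smirnov test on synthetic values, as one possible approach rather than the chapter's specific method:

```python
import numpy as np
from scipy.stats import ks_2samp

# Synthetic stand-ins: feature values seen at training time versus
# values arriving in production after the data has shifted.
rng = np.random.default_rng(0)
train_values = rng.normal(loc=0.0, scale=1.0, size=1000)
live_values = rng.normal(loc=0.5, scale=1.0, size=1000)

# A two-sample Kolmogorov-Smirnov test flags distribution drift.
stat, p_value = ks_2samp(train_values, live_values)
if p_value < 0.01:
    print(f'Drift detected (KS statistic {stat:.3f}); consider retraining.')
```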