
You're reading from Mastering Predictive Analytics with Python

Product type Book

Published in Aug 2016

Publisher

ISBN-13 9781785882715

Pages 334 pages

Edition 1st Edition

Languages

Python

Concepts

Predictive Analytics

Author (1):

Joseph Babcock

Table of Contents (16) Chapters

Mastering Predictive Analytics with Python

Credits

About the Author

About the Reviewer

www.PacktPub.com

Preface

1. From Data to Decisions – Getting Started with Analytic Applications

2. Exploratory Data Analysis and Visualization in Python

3. Finding Patterns in the Noise – Clustering and Unsupervised Learning

4. Connecting the Dots with Models – Regression Methods

5. Putting Data in its Place – Classification Methods and Analysis

6. Words and Pixels – Working with Unstructured Data

7. Learning from the Bottom Up – Deep Networks and Unsupervised Features

8. Sharing Models with Prediction Services

9. Reporting and Testing – Iterating on Analytic Systems

Index

A

A/B testing
- models, iterating / Iterating on models through A/B testing
- experimental allocation / Experimental allocation – assigning customers to experiments
- sample size, deciding / Deciding a sample size
- multiple hypothesis testing / Multiple hypothesis testing
adjacency matrix / Where agglomerative clustering fails
affinity propagation
- cluster numbers, selecting automatically / Affinity propagation – automatically choosing cluster numbers
agglomerative clustering
- about / Agglomerative clustering
- failures / Where agglomerative clustering fails
Alternating Least Squares (ALS) / Case Study: Training a Recommender System in PySpark
Amazon Web Services (AWS) / Working in the cloud
analytic pipeline
- data splitting / Modeling layer
- parameter tuning / Modeling layer
- model performance / Modeling layer
- model persistence / Modeling layer
analytic solution, advanced
- designing / Designing an advanced analytic solution
- data layer / Data layer: warehouses, lakes, and streams
- modeling layer / Modeling layer
- deployment layer / Deployment layer
- reporting layer / Reporting layer
application layer / Deployment layer
Area Under Curve (AUC) / Evaluating changes in model performance
area under curve (AUC)
- about / Evaluating classification models
auto-regressive moving average (ARMA) / Time series data

B

back-propagation
- about / Parameter fitting with back-propagation
boosting
- about / Fitting and SVM to the census data, Boosting – combining small models to improve accuracy
broker / Persisting information with database systems

C

categorical data
- similarity metrics / Similarity metrics for categorical data
- normalizing / Similarity metrics for categorical data
Celery library
- URL / The web application
Classification and Regression Trees (CART) algorithm / Decision trees
classification models
- evaluating / Evaluating classification models
- improving / Strategies for improving classification models
client layer / Deployment layer
client requests
- handling / Clients and making requests
- GET requests, implementing / The GET requests
- POST request, implementing / The POST request
- HEAD request, implementing / The HEAD request
- PUT request, implementing / The PUT request
- DELETE request, implementing / The DELETE request
communication
- guidelines / Guidelines for communication
- terms, translating to business values / Translate terms to business values
- results, visualizing / Visualizing results
convexity
- about / Jointly optimizing all parameters with second-order methods
convolutional network
- about / Convolutional networks and rectified units
- input layer / Convolutional networks and rectified units
- convolutional layer / Convolutional networks and rectified units
- rectifying layer / Convolutional networks and rectified units
- downsampling layer / Convolutional networks and rectified units
- fully connected layer / Convolutional networks and rectified units
correlation similarity metrics
- about / Correlation similarity metrics and time series
covariance / Correlation similarity metrics and time series
curl command
- about / The architecture of a prediction service
- URL / The architecture of a prediction service

D

database systems
- using / Persisting information with database systems
data layer / Designing an advanced analytic solution
decision trees
- about / Decision trees
dendrograms / Agglomerative clustering
deployment layer / Deployment layer
digit recognition / The TensorFlow library and digit recognition
distance metrics
- about / Similarity and distance metrics
- numerical distance metrics / Numerical distance metrics
- time series / Correlation similarity metrics and time series
- blending / Similarity metrics for categorical data
Dow Jones Industrial Average (DJIA) / Correlation similarity metrics and time series
Driver / Creating the SparkContext
Dynamic Time Warping (DTW) / Correlation similarity metrics and time series

E

e-mail campaigns, case study
- about / Case study: targeted e-mail campaigns
- data input and transformation / Data input and transformation
- sanity checking / Sanity checking
- model development / Model development
- scoring / Scoring
- visualization and reporting / Visualization and reporting
Executors / Creating the SparkContext

F

false positive rate (FPR)
- about / Evaluating classification models
familywise error rate (FWER) / Multiple hypothesis testing
Flask
- URL / Application – the engine of the predictive services

G

Gaussian kernel
- about / Fitting and SVM to the census data
Gauss Markov Theorem / Linear regression
generalized linear models
- about / Generalized linear models
Generalized Linear Models (GLMs) / Logistic regression
Generalized Estimating Equations (GEE)
- about / Generalize estimating equations
geospatial data
- about / Working with geospatial data
- loading / Loading geospatial data
- cloud, working in / Working in the cloud
gradient boosted decision trees
- about / Gradient boosted decision trees
- versus, support vector machines and logistic regression / Comparing classification methods
gradient boosted machine (GBM) / Evaluating changes in model performance
graphical user interface (GUI) / Cleaning textual data
graphics processing unit (GPU) / The TensorFlow library and digit recognition

H

H2O
- URL / Joining signals and correlation
Hadoop distributed file system (HDFS) / Creating an RDD
hierarchical clustering / Agglomerative clustering
hinge loss
- about / Separating Nonlinear boundaries with Support vector machines
horizontal scaling / Server – the web traffic controller
HTTP Status Codes / The GET requests
hypertext transfer protocol (HTTP)
- about / The architecture of a prediction service

I

images
- about / Images
- image data, cleaning / Cleaning image data
- thresholding, for highlighting objects / Thresholding images to highlight objects
- dimensionality reduction, for image analysis / Dimensionality reduction for image analysis
Indicator Function / Extracting features from textual data
Internet Movie Database
- URL / Exploring categorical and numerical data in IPython
IPython notebook
- about / Exploring categorical and numerical data in IPython
- installing / Installing IPython notebook
- interface / The notebook interface
- data, loading / Loading and inspecting data
- data, inspecting / Loading and inspecting data
- basic manipulations / Basic manipulations – grouping, filtering, mapping, and pivoting
- Matplotlib, charting with / Charting with Matplotlib
iteratively reweighted least squares (IRLS)
- about / Jointly optimizing all parameters with second-order methods

K

K-means ++ / K-means clustering
K-means clustering
- about / K-means clustering
k-medoids
- about / k-medoids
kernel function
- about / Separating Nonlinear boundaries with Support vector machines

L

Labeled RDD / Streaming clustering in Spark
Latent Dirichlet Allocation (LDA)
- about / Latent Dirichlet Allocation
Latent Semantic Indexing (LSI) / Principal component analysis
linear regression
- about / Linear regression
- data, preparing / Data preparation
- evaluation / Model fitting and evaluation
- model, fitting / Model fitting and evaluation
- statistical significance / Statistical significance of regression outputs
- Generalized Estimating Equations (GEE) / Generalize estimating equations
- mixed effects models / Mixed effects models
- time series data / Time series data
- generalized linear models / Generalized linear models
- regularization, applying to linear models / Applying regularization to linear models
linkage metric / Where agglomerative clustering fails
link functions
- Logit / Generalized linear models
- Poisson / Generalized linear models
- Exponential / Generalized linear models
logistic regression
- about / Logistic regression
- multiclass logistic classifiers / Multiclass logistic classifiers: multinomial regression
- dataset, formatting for classification problems / Formatting a dataset for classification problems
- stochastic gradient descent (SGD) / Learning pointwise updates with stochastic gradient descent
- parameters, optimizing with second-order methods / Jointly optimizing all parameters with second-order methods
- model, fitting / Fitting the model
- versus, support vector machines and gradient boosted decision trees / Comparing classification methods
logistic regression service
- as case study / Case study – logistic regression service
- database, setting up / Setting up the database
- web server, setting up / The web server
- web application, setting up / The web application
- model, training / The flow of a prediction service – training a model
- on-demand and bulk prediction, obtaining / On-demand and bulk prediction
Long Short Term Memory Networks (LSTM) / Optimizing the learning rate

M

Matplotlib
- charting with / Charting with Matplotlib
message passing / Affinity propagation – automatically choosing cluster numbers
Modified National Institute of Standards and Technology (MNIST) database / The MNIST data
modeling layer / Modeling layer
model performance
- checking, with diagnostic / Checking the health of models with diagnostics
- changes, evaluating / Evaluating changes in model performance
- changes in feature importance, evaluating / Changes in feature importance
- unsupervised model performance, changes / Changes in unsupervised model performance
models
- iterating, through A/B testing / Iterating on models through A/B testing
multiclass logistic classifiers
- about / Multiclass logistic classifiers: multinomial regression
multidimensional scaling (MDS) / Numerical distance metrics
multinomial regression / Multiclass logistic classifiers: multinomial regression

N

natural language toolkit (NLTK) library / Cleaning textual data
neural networks
- patterns, learning with / Learning patterns with neural networks
- perceptron / A network of one – the perceptron
- perceptrons, combining / Combining perceptrons – a single-layer neural network
- single-layer neural network / Combining perceptrons – a single-layer neural network
- parameter fitting, with back-propagation / Parameter fitting with back-propagation
- discriminative, versus generative models / Discriminative versus generative models
- gradients, vanishing / Vanishing gradients and explaining away
- belief networks, pretraining / Pretraining belief networks
- regularizing, dropout used / Using dropout to regularize networks
- convolutional networks / Convolutional networks and rectified units
- rectified units / Convolutional networks and rectified units
- data compressing, with autoencoder networks / Compressing Data with autoencoder networks
- learning rate, optimizing / Optimizing the learning rate
neurons / Combining perceptrons – a single-layer neural network
Newton methods
- about / Jointly optimizing all parameters with second-order methods
non-relational database / Persisting information with database systems
numerical distance metrics
- about / Numerical distance metrics

O

Ordinary Least Squares (OLS) / Linear regression

P

prediction service
- architecture / The architecture of a prediction service
- server, using / Server – the web traffic controller
- application, setting up / Application – the engine of the predictive services
- information, persisting with database systems / Persisting information with database systems
Principal Component Analysis (PCA)
- about / Principal component analysis
- Latent Dirichlet Allocation (LDA) / Latent Dirichlet Allocation
- dimensionality reduction, using in predictive modeling / Using dimensionality reduction in predictive modeling
pseudo-residuals / Gradient boosted decision trees
pyspark
- classifier models, implementing / Case study: fitting classifier models in pyspark
PySpark
- URL / Joining signals and correlation, Introduction to PySpark
- about / Introduction to PySpark, Scaling out with PySpark – predicting year of song release
- SparkContext, creating / Creating the SparkContext
- RDD, creating / Creating an RDD
- Spark DataFrame, creating / Creating a Spark DataFrame
- example / Scaling out with PySpark – predicting year of song release
Python requests library
- URL / The GET requests

R

RabbitMQ
- URL / The web application
random forest
- about / Random forest
RDD
- creating / Creating an RDD
Receiver-Operator-Characteristic (ROC) / Evaluating changes in model performance
receiver operator characteristic (ROC) / Logistic regression
Receiver Operator Characteristic (ROC) curve
- about / Evaluating classification models
recommender system training, in PySpark
- case study / Case Study: Training a Recommender System in PySpark
Rectified Linear Unit (ReLU) / Convolutional networks and rectified units
Recurrent Neural Networks (RNNs) / Optimizing the learning rate
Redis
- URL / Setting up the database
relational database / Persisting information with database systems
reporting layer / Reporting layer
reporting service
- about / Case Study: building a reporting service
- report server, setting up / The report server
- report application, setting up / The report application
- visualization layer, using / The visualization layer
Resilient Distributed Dataset (RDD) / Streaming clustering in Spark
Resilient Distributed Datasets (RDDs) / Introduction to PySpark

S

second-order methods
- about / Formatting a dataset for classification problems
- parameters, optimizing / Jointly optimizing all parameters with second-order methods
server
- used, for communicating with external systems / Server – the web traffic controller
similarity metrics
- about / Similarity and distance metrics
- correlation similarity metrics / Correlation similarity metrics and time series
- for categorical data / Similarity metrics for categorical data
Singular Value Decomposition (SVD) / Numerical distance metrics, Principal component analysis
social media feeds, case study
- about / Case study: sentiment analysis of social media feeds
- data input and transformation / Data input and transformation
- sanity checking / Sanity checking
- model development / Model development
- scoring / Scoring
- visualization and reporting / Visualization and reporting
soft-margin formulation / Separating Nonlinear boundaries with Support vector machines
Spark
- streaming clustering / Streaming clustering in Spark
SparkContext
- creating / Creating the SparkContext
Spark DataFrame
- creating / Creating a Spark DataFrame
spectral clustering / Where agglomerative clustering fails
statsmodels
- URL / Model fitting and evaluation
stochastic gradient descent
- about / Learning pointwise updates with stochastic gradient descent
stochastic gradient descent (SGD)
- about / Formatting a dataset for classification problems
streaming clustering
- about / Streaming clustering in Spark
support-vector networks
- about / Separating Nonlinear boundaries with Support vector machines
support vector machine (SVM)
- nonlinear boundaries, separating / Separating Nonlinear boundaries with Support vector machines
- implementing, to census data / Fitting and SVM to the census data
- boosting / Boosting – combining small models to improve accuracy
- versus, logistic regression and gradient boosted decision trees / Comparing classification methods

T

TensorFlow library
- about / The TensorFlow library and digit recognition
- MNIST data / The MNIST data
- network, constructing / Constructing the network
term-frequency-inverse document frequency (tf-idf) / Extracting features from textual data
textual data
- working with / Working with textual data
- cleaning / Cleaning textual data
- features, extracting from / Extracting features from textual data
- dimensionality reduction, used for simplifying datasets / Using dimensionality reduction to simplify datasets
time series
- about / Correlation similarity metrics and time series
time series analysis
- about / Time series analysis
- cleaning and converting / Cleaning and converting
- time series diagnostics / Time series diagnostics
- signals and correlation, joining / Joining signals and correlation
transformations and operations
- URL / Creating an RDD
tree methods
- about / Tree methods
- decision trees / Decision trees
- random forest / Random forest
true positive rate (TPR)
- about / Evaluating classification models

U

units / Combining perceptrons – a single-layer neural network
Unweighted Pair Group Method with Arithmetic Mean (UPGMA) / Agglomerative clustering

V

vertical scaling / Server – the web traffic controller

W

Web Server Gateway Interface (WSGI)
- about / The architecture of a prediction service

X

XGBoost
- URL / Joining signals and correlation
