Mastering Predictive Analytics with Python

Product type: Book
Published: Aug 2016
ISBN-13: 9781785882715
Pages: 334
Edition: 1st
Author: Joseph Babcock

Table of Contents (16) Chapters

Mastering Predictive Analytics with Python
Credits
About the Author
About the Reviewer
www.PacktPub.com
Preface
1. From Data to Decisions – Getting Started with Analytic Applications
2. Exploratory Data Analysis and Visualization in Python
3. Finding Patterns in the Noise – Clustering and Unsupervised Learning
4. Connecting the Dots with Models – Regression Methods
5. Putting Data in its Place – Classification Methods and Analysis
6. Words and Pixels – Working with Unstructured Data
7. Learning from the Bottom Up – Deep Networks and Unsupervised Features
8. Sharing Models with Prediction Services
9. Reporting and Testing – Iterating on Analytic Systems
Index

Index

A

  • A/B testing
    • models, iterating / Iterating on models through A/B testing
    • experimental allocation / Experimental allocation – assigning customers to experiments
    • sample size, deciding / Deciding a sample size
    • multiple hypothesis testing / Multiple hypothesis testing
  • adjacency matrix / Where agglomerative clustering fails
  • affinity propagation
    • cluster numbers, selecting automatically / Affinity propagation – automatically choosing cluster numbers
  • agglomerative clustering
    • about / Agglomerative clustering
    • failures / Where agglomerative clustering fails
  • Alternating Least Squares (ALS) / Case Study: Training a Recommender System in PySpark
  • Amazon Web Services (AWS) / Working in the cloud
  • analytic pipeline
    • data splitting / Modeling layer
    • parameter tuning / Modeling layer
    • model performance / Modeling layer
    • model persistence / Modeling layer
  • analytic solution, advanced
    • designing / Designing an advanced analytic solution
    • data layer / Data layer: warehouses, lakes, and streams
    • modeling layer / Modeling layer
    • deployment layer / Deployment layer
    • reporting layer / Reporting layer
  • application layer / Deployment layer
  • Area Under Curve (AUC) / Evaluating changes in model performance
  • area under curve (AUC)
    • about / Evaluating classification models
  • auto-regressive moving average (ARMA) / Time series data

B

  • back-propagation
    • about / Parameter fitting with back-propagation
  • boosting
    • about / Fitting an SVM to the census data, Boosting – combining small models to improve accuracy
  • broker / Persisting information with database systems

C

  • categorical data
    • similarity metrics / Similarity metrics for categorical data
    • normalizing / Similarity metrics for categorical data
  • Celery library
    • URL / The web application
  • Classification and Regression Trees (CART) algorithm / Decision trees
  • classification models
    • evaluating / Evaluating classification models
    • improving / Strategies for improving classification models
  • client layer / Deployment layer
  • client requests
    • handling / Clients and making requests
    • GET requests, implementing / The GET requests
    • POST request, implementing / The POST request
    • HEAD request, implementing / The HEAD request
    • PUT request, implementing / The PUT request
    • DELETE request, implementing / The DELETE request
  • communication
    • guidelines / Guidelines for communication
    • terms, translating to business values / Translate terms to business values
    • results, visualizing / Visualizing results
  • convexity
    • about / Jointly optimizing all parameters with second-order methods
  • convolutional network
    • about / Convolutional networks and rectified units
    • input layer / Convolutional networks and rectified units
    • convolutional layer / Convolutional networks and rectified units
    • rectifying layer / Convolutional networks and rectified units
    • downsampling layer / Convolutional networks and rectified units
    • fully connected layer / Convolutional networks and rectified units
  • correlation similarity metrics
    • about / Correlation similarity metrics and time series
  • covariance / Correlation similarity metrics and time series
  • curl command
    • about / The architecture of a prediction service
    • URL / The architecture of a prediction service

D

  • database systems
    • using / Persisting information with database systems
  • data layer / Designing an advanced analytic solution
  • decision trees
    • about / Decision trees
  • dendrograms / Agglomerative clustering
  • deployment layer / Deployment layer
  • digit recognition / The TensorFlow library and digit recognition
  • distance metrics
    • about / Similarity and distance metrics
    • numerical distance metrics / Numerical distance metrics
    • time series / Correlation similarity metrics and time series
    • blending / Similarity metrics for categorical data
  • Dow Jones Industrial Average (DJIA) / Correlation similarity metrics and time series
  • Driver / Creating the SparkContext
  • Dynamic Time Warping (DTW) / Correlation similarity metrics and time series

E

  • e-mail campaigns, case study
    • about / Case study: targeted e-mail campaigns
    • data input and transformation / Data input and transformation
    • sanity checking / Sanity checking
    • model development / Model development
    • scoring / Scoring
    • visualization and reporting / Visualization and reporting
  • Executors / Creating the SparkContext

F

  • false positive rate (FPR)
    • about / Evaluating classification models
  • familywise error rate (FWER) / Multiple hypothesis testing
  • Flask
    • URL / Application – the engine of the predictive services

G

  • Gaussian kernel
    • about / Fitting an SVM to the census data
  • Gauss Markov Theorem / Linear regression
  • generalized linear models
    • about / Generalized linear models
  • Generalized Linear Models (GLMs) / Logistic regression
  • Generalized Estimating Equations (GEE)
    • about / Generalized estimating equations
  • geospatial data
    • about / Working with geospatial data
    • loading / Loading geospatial data
    • cloud, working in / Working in the cloud
  • gradient boosted decision trees
    • about / Gradient boosted decision trees
    • versus, support vector machines and logistic regression / Comparing classification methods
  • gradient boosted machine (GBM) / Evaluating changes in model performance
  • graphical user interface (GUI) / Cleaning textual data
  • graphics processing unit (GPU) / The TensorFlow library and digit recognition

H

  • H2O
    • URL / Joining signals and correlation
  • Hadoop distributed file system (HDFS) / Creating an RDD
  • hierarchical clustering / Agglomerative clustering
  • hinge loss
    • about / Separating Nonlinear boundaries with Support vector machines
  • horizontal scaling / Server – the web traffic controller
  • HTTP Status Codes / The GET requests
  • hypertext transfer protocol (HTTP)
    • about / The architecture of a prediction service

I

  • images
    • about / Images
    • image data, cleaning / Cleaning image data
    • thresholding, for highlighting objects / Thresholding images to highlight objects
    • dimensionality reduction, for image analysis / Dimensionality reduction for image analysis
  • Indicator Function / Extracting features from textual data
  • Internet Movie Database
    • URL / Exploring categorical and numerical data in IPython
  • IPython notebook
    • about / Exploring categorical and numerical data in IPython
    • installing / Installing IPython notebook
    • interface / The notebook interface
    • data, loading / Loading and inspecting data
    • data, inspecting / Loading and inspecting data
    • basic manipulations / Basic manipulations – grouping, filtering, mapping, and pivoting
    • Matplotlib, charting with / Charting with Matplotlib
  • iteratively reweighted least squares (IRLS)
    • about / Jointly optimizing all parameters with second-order methods

K

  • K-means ++ / K-means clustering
  • K-means clustering
    • about / K-means clustering
  • k-medoids
    • about / k-medoids
  • kernel function
    • about / Separating Nonlinear boundaries with Support vector machines

L

  • Labeled RDD / Streaming clustering in Spark
  • Latent Dirichlet Allocation (LDA)
    • about / Latent Dirichlet Allocation
  • Latent Semantic Indexing (LSI) / Principal component analysis
  • linear regression
    • about / Linear regression
    • data, preparing / Data preparation
    • evaluation / Model fitting and evaluation
    • model, fitting / Model fitting and evaluation
    • statistical significance / Statistical significance of regression outputs
    • Generalized Estimating Equations (GEE) / Generalized estimating equations
    • mixed effects models / Mixed effects models
    • time series data / Time series data
    • generalized linear models / Generalized linear models
    • regularization, applying to linear models / Applying regularization to linear models
  • linkage metric / Where agglomerative clustering fails
  • link functions
    • Logit / Generalized linear models
    • Poisson / Generalized linear models
    • Exponential / Generalized linear models
  • logistic regression
    • about / Logistic regression
    • multiclass logistic classifiers / Multiclass logistic classifiers: multinomial regression
    • dataset, formatting for classification problems / Formatting a dataset for classification problems
    • stochastic gradient descent (SGD) / Learning pointwise updates with stochastic gradient descent
    • parameters, optimizing with second-order methods / Jointly optimizing all parameters with second-order methods
    • model, fitting / Fitting the model
    • versus, support vector machines and gradient boosted decision trees / Comparing classification methods
  • logistic regression service
    • as case study / Case study – logistic regression service
    • database, setting up / Setting up the database
    • web server, setting up / The web server
    • web application, setting up / The web application
    • model, training / The flow of a prediction service – training a model
    • on-demand and bulk prediction, obtaining / On-demand and bulk prediction
  • Long Short Term Memory Networks (LSTM) / Optimizing the learning rate

M

  • Matplotlib
    • charting with / Charting with Matplotlib
  • message passing / Affinity propagation – automatically choosing cluster numbers
  • Modified National Institute of Standards and Technology (MNIST) database / The MNIST data
  • modeling layer / Modeling layer
  • model performance
    • checking, with diagnostics / Checking the health of models with diagnostics
    • changes, evaluating / Evaluating changes in model performance
    • changes in feature importance, evaluating / Changes in feature importance
    • unsupervised model performance, changes / Changes in unsupervised model performance
  • models
    • iterating, through A/B testing / Iterating on models through A/B testing
  • multiclass logistic classifiers
    • about / Multiclass logistic classifiers: multinomial regression
  • multidimensional scaling (MDS) / Numerical distance metrics
  • multinomial regression / Multiclass logistic classifiers: multinomial regression

N

  • natural language toolkit (NLTK) library / Cleaning textual data
  • neural networks
    • patterns, learning with / Learning patterns with neural networks
    • perceptron / A network of one – the perceptron
    • perceptrons, combining / Combining perceptrons – a single-layer neural network
    • single-layer neural network / Combining perceptrons – a single-layer neural network
    • parameter fitting, with back-propagation / Parameter fitting with back-propagation
    • discriminative, versus generative models / Discriminative versus generative models
    • gradients, vanishing / Vanishing gradients and explaining away
    • belief networks, pretraining / Pretraining belief networks
    • regularizing, dropout used / Using dropout to regularize networks
    • convolutional networks / Convolutional networks and rectified units
    • rectified units / Convolutional networks and rectified units
    • data compressing, with autoencoder networks / Compressing Data with autoencoder networks
    • learning rate, optimizing / Optimizing the learning rate
  • neurons / Combining perceptrons – a single-layer neural network
  • Newton methods
    • about / Jointly optimizing all parameters with second-order methods
  • non-relational database / Persisting information with database systems
  • numerical distance metrics
    • about / Numerical distance metrics

O

  • Ordinary Least Squares (OLS) / Linear regression

P

  • prediction service
    • architecture / The architecture of a prediction service
    • server, using / Server – the web traffic controller
    • application, setting up / Application – the engine of the predictive services
    • information, persisting with database systems / Persisting information with database systems
  • Principal Component Analysis (PCA)
    • about / Principal component analysis
    • Latent Dirichlet Allocation (LDA) / Latent Dirichlet Allocation
    • dimensionality reduction, using in predictive modeling / Using dimensionality reduction in predictive modeling
  • pseudo-residuals / Gradient boosted decision trees
  • pyspark
    • classifier models, implementing / Case study: fitting classifier models in pyspark
  • PySpark
    • URL / Joining signals and correlation, Introduction to PySpark
    • about / Introduction to PySpark, Scaling out with PySpark – predicting year of song release
    • SparkContext, creating / Creating the SparkContext
    • RDD, creating / Creating an RDD
    • Spark DataFrame, creating / Creating a Spark DataFrame
    • example / Scaling out with PySpark – predicting year of song release
  • Python requests library
    • URL / The GET requests

R

  • RabbitMQ
    • URL / The web application
  • random forest
    • about / Random forest
  • RDD
    • creating / Creating an RDD
  • Receiver-Operator-Characteristic (ROC) / Evaluating changes in model performance
  • receiver operator characteristic (ROC) / Logistic regression
  • Receiver Operator Characteristic (ROC) curve
    • about / Evaluating classification models
  • recommender system training, in PySpark
    • case study / Case Study: Training a Recommender System in PySpark
  • Rectified Linear Unit (ReLU) / Convolutional networks and rectified units
  • Recurrent Neural Networks (RNNs) / Optimizing the learning rate
  • Redis
    • URL / Setting up the database
  • relational database / Persisting information with database systems
  • reporting layer / Reporting layer
  • reporting service
    • about / Case Study: building a reporting service
    • report server, setting up / The report server
    • report application, setting up / The report application
    • visualization layer, using / The visualization layer
  • Resilient Distributed Dataset (RDD) / Streaming clustering in Spark
  • Resilient Distributed Datasets (RDDs) / Introduction to PySpark

S

  • second-order methods
    • about / Formatting a dataset for classification problems
    • parameters, optimizing / Jointly optimizing all parameters with second-order methods
  • server
    • used, for communicating with external systems / Server – the web traffic controller
  • similarity metrics
    • about / Similarity and distance metrics
    • correlation similarity metrics / Correlation similarity metrics and time series
    • for categorical data / Similarity metrics for categorical data
  • Singular Value Decomposition (SVD) / Numerical distance metrics, Principal component analysis
  • social media feeds, case study
    • about / Case study: sentiment analysis of social media feeds
    • data input and transformation / Data input and transformation
    • sanity checking / Sanity checking
    • model development / Model development
    • scoring / Scoring
    • visualization and reporting / Visualization and reporting
  • soft-margin formulation / Separating Nonlinear boundaries with Support vector machines
  • Spark
    • streaming clustering / Streaming clustering in Spark
  • SparkContext
    • creating / Creating the SparkContext
  • Spark DataFrame
    • creating / Creating a Spark DataFrame
  • spectral clustering / Where agglomerative clustering fails
  • statsmodels
    • URL / Model fitting and evaluation
  • stochastic gradient descent
    • about / Learning pointwise updates with stochastic gradient descent
  • stochastic gradient descent (SGD)
    • about / Formatting a dataset for classification problems
  • streaming clustering
    • about / Streaming clustering in Spark
  • support-vector networks
    • about / Separating Nonlinear boundaries with Support vector machines
  • support vector machine (SVM)
    • nonlinear boundaries, separating / Separating Nonlinear boundaries with Support vector machines
    • implementing, to census data / Fitting an SVM to the census data
    • boosting / Boosting – combining small models to improve accuracy
    • versus, logistic regression and gradient boosted decision trees / Comparing classification methods

T

  • TensorFlow library
    • about / The TensorFlow library and digit recognition
    • MNIST data / The MNIST data
    • network, constructing / Constructing the network
  • term-frequency-inverse document frequency (tf-idf) / Extracting features from textual data
  • textual data
    • working with / Working with textual data
    • cleaning / Cleaning textual data
    • features, extracting from / Extracting features from textual data
    • dimensionality reduction, used for simplifying datasets / Using dimensionality reduction to simplify datasets
  • time series
    • about / Correlation similarity metrics and time series
  • time series analysis
    • about / Time series analysis
    • cleaning and converting / Cleaning and converting
    • time series diagnostics / Time series diagnostics
    • signals and correlation, joining / Joining signals and correlation
  • transformations and operations
    • URL / Creating an RDD
  • tree methods
    • about / Tree methods
    • decision trees / Decision trees
    • random forest / Random forest
  • true positive rate (TPR)
    • about / Evaluating classification models

U

  • units / Combining perceptrons – a single-layer neural network
  • Unweighted Pair Group Method with Arithmetic Mean (UPGMA) / Agglomerative clustering

V

  • vertical scaling / Server – the web traffic controller

W

  • Web Server Gateway Interface (WSGI)
    • about / The architecture of a prediction service

X

  • XGBoost
    • URL / Joining signals and correlation