You're reading from Learning Predictive Analytics with Python

Product type Book

Published in Feb 2016

Publisher

ISBN-13 9781783983261

Pages 354 pages

Edition 1st Edition

Languages

Python

Concepts

Predictive Analytics

Authors (2):

Ashish Kumar

Gary Dougan

Table of Contents (19) Chapters

Learning Predictive Analytics with Python

Credits

Foreword

About the Author

Acknowledgments

About the Reviewer

www.PacktPub.com

Preface

1. Getting Started with Predictive Modelling

2. Data Cleaning

3. Data Wrangling

4. Statistical Concepts for Predictive Modelling

5. Linear Regression with Python

6. Logistic Regression with Python

7. Clustering with Python

8. Trees and Random Forests with Python

9. Best Practices for Predictive Modelling

A List of Links

Index

A

algorithms
- best practices / Best practices for algorithms
Anaconda
- about / Anaconda
ANOVA / Best practices for statistics
applications and examples, predictive modelling
- about / Applications and examples of predictive modelling
- People also viewed feature, LinkedIn / LinkedIn's "People also viewed" feature, What it does?
- online ads, correct targeting / Correct targeting of online ads, How is it done?
- Santa Cruz predictive policing / Santa Cruz predictive policing
- smartphone user activity, determining / Determining the activity of a smartphone user using accelerometer data
- sport and fantasy leagues / Sport and fantasy leagues

B

Bagging
- about / Understanding and implementing random forests
Bell Curve
- about / Cumulative density function
best practices
- for coding / Best practices for coding
- for data handling / Best practices for data handling
- for algorithms / Best practices for algorithms
- for statistics / Best practices for statistics
- for business context / Best practices for business contexts
best practices, for coding
- about / Best practices for coding
- codes, commenting / Commenting the codes
- functions, defining for substantial individual tasks / Defining functions for substantial individual tasks
- examples, of functions / Defining functions for substantial individual tasks, Example 3
- hard-coding of variables, avoiding / Avoid hard-coding of variables as much as possible
- version control / Version control
- standard libraries / Using standard libraries, methods, and formulas
- methods / Using standard libraries, methods, and formulas
- formulas / Using standard libraries, methods, and formulas
boxplots
- about / Boxplots
- plotting / Boxplots
business context
- best practices / Best practices for business contexts

C

chi-square test / Best practices for statistics
- about / Chi-square tests, Chi-square test
- usage / Chi-square tests
clustering
- about / What is clustering?
- using / How is clustering used?
- cases / Why do we do clustering?
clustering, fine-tuning
- about / Fine-tuning the clustering
- elbow method / The elbow method
- Silhouette Coefficient / Silhouette Coefficient
clustering, implementing with Python
- about / Implementing clustering using Python
- dataset, importing / Importing and exploring the dataset
- dataset, exploring / Importing and exploring the dataset
- values in dataset, normalizing / Normalizing the values in the dataset
- hierarchical clustering, using scikit-learn / Hierarchical clustering using scikit-learn
- k-Means clustering, using scikit-learn / K-Means clustering using scikit-learn
- cluster, interpreting / Interpreting the cluster
coding
- best practices / Best practices for coding
contingency table
- about / Contingency tables
- creating / Contingency tables
correlation
- about / Correlation
correlation coefficient
- about / Correlation
Correlation Matrix
- about / Correlation
Cumulative Density Function
- about / Cumulative density function
Customer Churn Model
- using / Method 1 – using the Customer Churn Model

D

data
- versus oil / Introducing predictive modelling
- reading / Reading the data – variations and examples
- summary / Basics – summary, dimensions, and structure
- structure / Basics – summary, dimensions, and structure
- dimensions / Basics – summary, dimensions, and structure
- concatenating / Concatenating and appending data
- appending / Concatenating and appending data
data collection
- about / How missing values are generated and propagated
data extraction
- about / How missing values are generated and propagated
Data frame
- about / Data frames
data grouping
- about / Grouping the data – aggregation, filtering, and transformation
- illustration / Grouping the data – aggregation, filtering, and transformation
- aggregation / Aggregation
- filtering / Filtering
- transformation / Transformation
- miscellaneous operations / Miscellaneous operations
data handling
- best practices / Best practices for data handling
data importing, in Python
- about / Various methods of importing data in Python
- dataset, reading with read_csv method / Case 1 – reading a dataset using the read_csv method
- dataset, reading with open method / Case 2 – reading a dataset using the open method of Python
- dataset, reading from URL / Case 3 – reading data from a URL
- miscellaneous cases / Case 4 – miscellaneous cases
dataset
- visualizing, by basic plotting / Visualizing a dataset by basic plotting
- subsetting / Subsetting a dataset
- columns, selecting / Selecting columns
- rows, selecting / Selecting rows
- combination of rows and columns, selecting / Selecting a combination of rows and columns
- new columns, creating / Creating new columns
- merging/joining / Merging/joining datasets
dataset, reading with open method
- about / Case 2 – reading a dataset using the open method of Python
- reading line by line / Reading a dataset line by line
- delimiter, changing / Changing the delimiter of a dataset
decision tree
- about / Introducing decision trees, A decision tree
- using / A decision tree
- mathematics / Understanding the mathematics behind decision trees
decision tree, implementing with scikit-learn
- about / Implementing a decision tree with scikit-learn
- tree, visualizing / Visualizing the tree
- decision tree, cross-validating / Cross-validating and pruning the decision tree
- decision tree, pruning / Cross-validating and pruning the decision tree
delimiter
- about / Delimiters
distance matrix
- about / The distance matrix
distances, between two observations
- Euclidean distance / Euclidean distance
- Manhattan distance / Manhattan distance
- Minkowski distance / Minkowski distance
dummy data frame
- generating / Generating a dummy data frame
dummy variables
- creating / Creating dummy variables

E

elbow method / The elbow method
Euclidean distance
- about / Euclidean distance

F

F-statistics
- about / F-statistics
- significance / F-statistics

G

guidelines, for selecting predictor variables
- R² / Summary of models
- p-values / Summary of models
- F-statistic / Summary of models
- RSE / Summary of models
- VIF / Summary of models

H

Harvard Business Review (HBR)
- about / Introducing predictive modelling
heteroscedasticity / Other considerations and assumptions for linear regression
hierarchical clustering
- about / Hierarchical clustering
histograms
- about / Histograms
- plotting / Histograms
hypothesis testing
- about / Hypothesis testing
- null hypothesis, versus alternate hypothesis / Null versus alternate hypothesis
- Z-statistic / Z-statistic and t-statistic
- t-statistic / Z-statistic and t-statistic
- confidence intervals / Confidence intervals, significance levels, and p-values
- significance levels / Confidence intervals, significance levels, and p-values
- p-values / Confidence intervals, significance levels, and p-values
- types / Different kinds of hypothesis test
- step-by-step guide / A step-by-step guide to do a hypothesis test
- example / An example of a hypothesis test
hypothesis tests
- left-tailed / Different kinds of hypothesis test
- right-tailed / Different kinds of hypothesis test
- two-tailed / Different kinds of hypothesis test

I

IDEs, for Python
- about / IDEs for Python
- IDLE / IDEs for Python
- IPython Notebook / IDEs for Python
- Spyder / IDEs for Python
IDLE
- about / IDEs for Python
- features / IDEs for Python
Inner Join
- characteristics / Inner Join
- about / Inner Join
- example / An example of the Inner Join
Inter Quartile Range (IQR) / Handling outliers
intra-cluster distance / The elbow method
IPython
- about / Python and its packages for predictive modelling
- URL / Python and its packages for predictive modelling
IPython Notebook
- about / IDEs for Python
- features / IDEs for Python
issues handling, in linear regression
- about / Handling other issues in linear regression
- categorical variables, handling / Handling categorical variables
- variable, transforming to fit non-linear relations / Transforming a variable to fit non-linear relations
- outliers, handling / Handling outliers

J

joins
- summarizing / Summary of Joins in terms of their length

K

k-Means clustering
- about / K-means clustering
knowledge matrix, predictive modelling
- about / Knowledge matrix for predictive modelling

L

left-tailed test
- about / Different kinds of hypothesis test
Left Join
- characteristics / Left Join
- about / Left Join
- example / An example of the Left Join
Likelihood Ratio Test statistic
- about / Likelihood Ratio Test statistic
linear regression
- issues, handling / Handling other issues in linear regression
- considerations / Other considerations and assumptions for linear regression
- assumptions / Other considerations and assumptions for linear regression
- versus logistic regression / Linear regression versus logistic regression
linear regression, implementing with Python
- about / Implementing linear regression with Python
- statsmodels library, using / Linear regression using the statsmodel library
- multiple linear regression / Multiple linear regression
- multi-collinearity / Multi-collinearity
- Variance Inflation Factor (VIF) / Variance Inflation Factor
linkage methods
- about / Linkage methods
- single linkage / Single linkage
- complete linkage / Complete linkage
- average linkage / Average linkage
- centroid linkage / Centroid linkage
- Ward's method / Ward's method
logistic regression
- scenarios / Linear regression versus logistic regression
- math / Understanding the math behind logistic regression
logistic regression, with Python
- implementing / Implementing logistic regression with Python
- data, processing / Processing the data
- data exploration / Data exploration
- data visualization / Data visualization
- dummy variables, creating for categorical variables / Creating dummy variables for categorical variables
- feature selection / Feature selection
- model, implementing / Implementing the model
logistic regression model
- validation / Model validation and evaluation
- evaluation / Model validation and evaluation
- cross validation / Cross validation
logistic regression parameters
- about / Making sense of logistic regression parameters
- Wald test / Wald test
- Likelihood Ratio Test statistic / Likelihood Ratio Test statistic
- chi-square test / Chi-square test

M

Manhattan distance
- about / Manhattan distance
math, behind logistic regression
- about / Understanding the math behind logistic regression
- contingency tables / Contingency tables
- conditional probability / Conditional probability
- odds ratio / Odds ratio
- moving to logistic regression / Moving on to logistic regression from linear regression
- estimation, using Maximum Likelihood Method / Estimation using the Maximum Likelihood Method, Log likelihood function
- logistic regression model, building from scratch / Building the logistic regression model from scratch
mathematics, behind clustering
- about / Mathematics behind clustering
- distances, between two observations / Distances between two observations
- distance matrix / The distance matrix
- distances, normalizing / Normalizing the distances
- linkage methods / Linkage methods
- hierarchical clustering / Hierarchical clustering
- k-Means clustering / K-means clustering
mathematics, decision tree
- homogeneity / Homogeneity
- entropy / Entropy
- information gain / Information gain
- ID3 algorithm / ID3 algorithm to create a decision tree
- Gini index / Gini index
- Reduction in Variance / Reduction in Variance
- tree, pruning / Pruning a tree
- continuous numerical variable, handling / Handling a continuous numerical variable
- missing value of attribute, handling / Handling a missing value of an attribute
maths, behind linear regression
- about / Understanding the maths behind linear regression
- simulated data, using / Linear regression using simulated data
- linear regression model, fitting / Fitting a linear regression model and checking its efficacy
- linear regression model efficacy, checking / Fitting a linear regression model and checking its efficacy
- optimum value of variable coefficients, finding / Finding the optimum value of variable coefficients
matplotlib
- about / Python and its packages for predictive modelling
- URL / Python and its packages for predictive modelling
miles per gallon (mpg) / Transforming a variable to fit non-linear relations
Minkowski distance
- about / Minkowski distance
miscellaneous cases, data reading
- reading, from .xls or .xlsx file / Reading from an .xls or .xlsx file
- CSV or Excel file, writing to / Writing to a CSV or Excel file
missing values
- handling / Handling missing values
- checking for / Checking for missing values
- about / What constitutes missing data?
- generating / How missing values are generated and propagated
- propagating / How missing values are generated and propagated
- treating / Treating missing values
- deletion / Deletion
- imputation / Imputation
model validation
- about / Model validation, Model validation
- data split, training / Training and testing data split
- data split, testing / Training and testing data split
- models, summarizing / Summary of models
- guidelines, for selecting variables / Summary of models
- linear regression with scikit-learn / Linear regression with scikit-learn
- feature selection, with scikit-learn / Feature selection with scikit-learn
Monte-Carlo simulation
- for finding value of pi / Using the Monte-Carlo simulation to find the value of pi
multi-collinearity
- about / Multi-collinearity

N

normal distribution
- about / Normal distribution
null hypothesis
- versus alternate hypothesis / Null versus alternate hypothesis
NumPy
- about / Python and its packages for predictive modelling
- URL / Python and its packages for predictive modelling

O

outliers
- about / Handling outliers
- handling / Handling outliers

P

p-values
- about / p-values
pandas
- about / Python and its packages for predictive modelling
- URL / Python and its packages for predictive modelling
parameters, random forest
- node size / Important parameters for random forests
- number of trees / Important parameters for random forests
- number of predictors sampled / Important parameters for random forests
pip
- installing / Installing pip
predictive analytics
- about / Introducing predictive modelling
predictive modelling
- about / Introducing predictive modelling
- scope / Scope of predictive modelling
- statistical algorithms / Ensemble of statistical algorithms
- statistical tools / Statistical tools
- historical data / Historical data
- mathematical function / Mathematical function
- business context / Business context
- knowledge matrix / Knowledge matrix for predictive modelling
- task matrix / Task matrix for predictive modelling
- applications and examples / Applications and examples of predictive modelling
predictor variables
- about / Multiple linear regression
- forward selection approach / Multiple linear regression
- backward selection approach / Multiple linear regression
Probability Density Function
- about / Probability density function
probability distributions
- about / Generating random numbers following probability distributions
- Probability Density Function / Probability density function
- Cumulative Density Function / Cumulative density function
Python packages
- about / Python and its packages – download and installation
- Anaconda / Anaconda
- Standalone Python / Standalone Python
- installing / Installing a Python package
- installing, with pip / Installing Python packages with pip
Python packages, for predictive modelling
- about / Python and its packages for predictive modelling
- pandas / Python and its packages for predictive modelling
- NumPy / Python and its packages for predictive modelling
- matplotlib / Python and its packages for predictive modelling
- IPython / Python and its packages for predictive modelling
- scikit-learn / Python and its packages for predictive modelling

R

random forest
- implementing, using Python / Implementing a random forest using Python
- features / Why do random forests work?
- parameters / Important parameters for random forests
random forest algorithm
- about / The random forest algorithm
random forests
- about / Understanding and implementing random forests
random numbers
- about / Generating random numbers and their usage
- generating / Generating random numbers and their usage
- usage / Generating random numbers and their usage
- methods, for generating / Various methods for generating random numbers
- seeding / Seeding a random number
- generating, following probability distributions / Generating random numbers following probability distributions
random sampling
- about / Random sampling – splitting a dataset in training and testing datasets
- dataset, testing / Random sampling – splitting a dataset in training and testing datasets
- dataset, splitting / Random sampling – splitting a dataset in training and testing datasets
- Customer Churn Model, using / Method 1 – using the Customer Churn Model
- sklearn, using / Method 2 – using sklearn
- shuffle function, using / Method 3 – using the shuffle function
- and central limit theorem / Random sampling and the central limit theorem
read_csv method
- about / Case 1 – reading a dataset using the read_csv method, The read_csv method
- filepath / The read_csv method
- sep / The read_csv method
- dtype / The read_csv method
- header / The read_csv method
- names / The read_csv method
- skiprows / The read_csv method
- index_col / The read_csv method
- skip_blank_lines / The read_csv method
- na_filter / The read_csv method
- use cases / Use cases of the read_csv method
Receiver Operating Characteristic (ROC) curve
- about / Model validation
Recursive Feature Elimination (RFE) / Feature selection with scikit-learn
regression tree algorithm
- about / Regression tree algorithm
regression trees
- about / Understanding and implementing regression trees
- advantages / Regression tree algorithm
- implementing, with Python / Implementing a regression tree using Python
Residual Standard Error (RSE)
- about / Residual Standard Error
result parameters
- about / Making sense of result parameters
- p-values / p-values
- F-statistics / F-statistics
- Residual Standard Error (RSE) / Residual Standard Error
retrospective analytics
- about / Introducing predictive modelling
right-tailed test
- about / Different kinds of hypothesis test
Right Join
- about / Right Join
- characteristics / Right Join
- example / An example of the Right Join
ROC curve
- about / The ROC curve
- confusion matrix / Confusion matrix

S

scatter plot
- about / Scatter plots
- plotting / Scatter plots
scikit-learn
- about / Python and its packages for predictive modelling
- features / Python and its packages for predictive modelling
- URL / Python and its packages for predictive modelling
Sensitivity (True Positive Rate) / The ROC curve
shuffle function
- using / Method 3 – using the shuffle function
Silhouette Coefficient / Silhouette Coefficient
sklearn
- using / Method 2 – using sklearn
Specificity (True Negative Rate) / The ROC curve
Spyder
- about / IDEs for Python
- features / IDEs for Python
Standalone Python
- about / Standalone Python
statistical algorithms, predictive modelling
- about / Ensemble of statistical algorithms
- supervised algorithms / Ensemble of statistical algorithms
- unsupervised algorithms / Ensemble of statistical algorithms
statistics
- best practices / Best practices for statistics

T

t-statistic
- about / Z-statistic and t-statistic
t-test / Best practices for statistics
t-test (Student-t distribution)
- about / Z-statistic and t-statistic
task matrix, predictive modelling
- about / Task matrix for predictive modelling
two-tailed test
- about / Different kinds of hypothesis test

U

uniform distribution
- about / Uniform distribution
use cases, read_csv method
- about / Use cases of the read_csv method
- directory address and filename, passing as variables / Passing the directory address and filename as variables
- .txt dataset, reading with comma delimiter / Reading a .txt dataset with a comma delimiter
- dataset column names, specifying from list / Specifying the column names of a dataset from a list