What this book covers
Chapter 1, Getting Started with Predictive Modelling, introduces the aspects, scope, and applications of predictive modelling. It also discusses the Python packages commonly used in data science, Python IDEs, and how to install them on your system.
Chapter 2, Data Cleaning, describes the process of reading a dataset, getting a bird's-eye view of it, handling its missing values, and exploring it with basic plots using the pandas and matplotlib packages in Python. Data cleaning and wrangling together constitute around 80% of the modelling time.
Chapter 3, Data Wrangling, describes the methods to subset a dataset, concatenate or merge two or more datasets, group the dataset by categorical variables, split the dataset into training and testing sets, generate dummy datasets using random numbers, and create simulations using random numbers.
Chapter 4, Statistical Concepts for Predictive Modelling, explains the basic statistics needed to make sense of the model parameters resulting from predictive models. This chapter deals with concepts such as hypothesis testing, z-tests, t-tests, chi-square tests, and p-values, followed by a discussion of correlation.
Chapter 5, Linear Regression with Python, starts with a discussion of the mathematics behind linear regression and validates it using a simulated dataset. This is followed by a summary of the implications and interpretations of the various model parameters. The chapter also describes methods to implement linear regression using the statsmodels.api and scikit-learn packages and to handle related contingencies, such as multiple regression, multi-collinearity, categorical variables, non-linear relationships between predictor and target variables, and outliers.
Chapter 6, Logistic Regression with Python, explains concepts such as the odds ratio, conditional probability, and contingency tables, leading to a detailed discussion of the mathematics behind the logistic regression model (using code that implements the entire model from scratch) and the various tests used to check the efficiency of the model. The chapter also describes methods to implement logistic regression in Python and to draw and interpret an ROC curve.
Chapter 7, Clustering with Python, discusses concepts such as distances, the distance matrix, and linkage methods to explain the mathematics and logic behind both hierarchical and k-means clustering. The chapter also describes methods to implement both types of clustering in Python and to fine-tune the number of clusters.
Chapter 8, Trees and Random Forests with Python, starts with a discussion of topics such as entropy, information gain, and the Gini index to illustrate the mathematics behind creating a decision tree. It then discusses methods to handle variations, such as a continuous numerical variable as a predictor and missing values. This is followed by methods to implement a decision tree in Python. The chapter also gives a glimpse into understanding and implementing regression trees and random forests.
Chapter 9, Best Practices for Predictive Modelling, describes the best practices to follow in coding, data handling, algorithms, statistics, and business context to get good results in predictive modelling.
Appendix, A List of Links, contains a list of sources that have been directly or indirectly consulted or used in the book. It also contains the link to the folder that holds the datasets used in the book.