Introducing predictive modelling
The breakneck speed at which social media and the Internet of Things have grown is reflected in the huge silos of data humans generate: data about where we live, where we come from, what we like, what we buy, how much money we spend, where we travel, and so on. Whenever we interact with a social media or Internet of Things website, we leave a trail, which these websites gleefully log as their data. Every time you buy a book on Amazon, receive a payment through PayPal, write a review on Yelp, post a photo on Instagram, or check in on Facebook, apart from generating business for these websites, you are creating data for them.
Scope of predictive modelling
Predictive modelling is an ensemble of statistical algorithms coded in a statistical tool, which, when applied to historical data, outputs a mathematical function (or equation). This function can, in turn, be used to predict outcomes based on some inputs (on which the model operates) from the future, to drive a goal in a business context or enable better decision making in general.
To understand what predictive modelling entails, let us focus on the phrases highlighted previously.
Ensemble of statistical algorithms
Statistics is important for understanding data. It tells volumes about the data. How is the data distributed? Is it centered with little variance, or does it vary widely? Are two of the variables dependent on or independent of each other? Statistics helps us answer these questions. This book expects a basic understanding of elementary statistical terms, such as mean, variance, covariance, and correlation. Advanced terms, such as hypothesis testing, Chi-square tests, p-values, and so on, will be explained as and when required. Statistics is the cog in the wheel called a model.
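As a quick illustration, the following minimal sketch computes these basic quantities with pandas on a small made-up dataset (the column names and numbers are invented for this example):

```python
import pandas as pd

# Hypothetical data: monthly ad spend and sales for a few months
df = pd.DataFrame({
    'ad_spend': [10, 12, 9, 15, 11, 14],
    'sales':    [100, 115, 92, 140, 108, 131]
})

print(df['sales'].mean())                 # mean: where the data is centered
print(df['sales'].var())                  # variance: how widely it varies
print(df['ad_spend'].cov(df['sales']))    # covariance between two variables
print(df['ad_spend'].corr(df['sales']))   # correlation: dependence on a -1 to 1 scale
```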
Algorithms, on the other hand, are the blueprints of a model. They are responsible for creating mathematical equations from the historical data. They analyze the data, quantify the relationships between the variables, and convert them into a mathematical equation. There is a variety of them: Linear Regression, Logistic Regression, Clustering, Decision Trees, Time-Series Modelling, Naïve Bayes classifiers, Natural Language Processing, and so on. These algorithms can be classified into two classes:
- Supervised algorithms: These are the algorithms wherein the historical data has an output variable in addition to the input variables. The model makes use of the output variable from the historical data, apart from the input variables. Examples of such algorithms include Linear Regression, Logistic Regression, Decision Trees, and so on.
- Unsupervised algorithms: These algorithms work without an output variable in the historical data. An example of such an algorithm is clustering. The sketch after this list contrasts the two classes.
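Here is a minimal sketch of the contrast, using scikit-learn on small made-up arrays:

```python
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans

X = [[1], [2], [3], [4], [5]]      # input variable
y = [2.1, 3.9, 6.2, 8.1, 9.8]      # output variable (available historically)

# Supervised: the model learns from both the inputs and the output
supervised = LinearRegression().fit(X, y)
print(supervised.predict([[6]]))   # predict the output for a new input

# Unsupervised: only the inputs are used; no output variable exists
unsupervised = KMeans(n_clusters=2, n_init=10).fit(X)
print(unsupervised.labels_)        # group membership discovered from the data
```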
The selection of a particular algorithm for a model depends majorly on the kind of data available. The focus of this book is to explain methods of handling various kinds of data and to illustrate the implementation of some of these models.
There are many statistical tools available today that come with inbuilt methods to run basic statistical chores. The arrival of robust open-source tools such as R and Python has made them extremely popular in industry and academia alike. Apart from that, Python's packages are well documented; hence, debugging is easier.
Python has a number of libraries, especially for running statistical, cleaning, and modelling chores. It has emerged as the first among equals when it comes to choosing a tool for implementing predictive modelling. As the title suggests, Python will be the choice for this book as well.
Our machinery (model) is built on and operated with this oil called data. In general, a model is built on historical data and works on future data. Additionally, a predictive model can be used to fill missing values in historical data by interpolating the model over sparse historical data. In many cases, future data is not available during the modelling stages. Hence, it is a common practice to divide the historical data into a training set (to act as historical data) and a testing set (to act as future data) through sampling, as shown in the sketch below.
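A minimal sketch of this split, using scikit-learn on a made-up dataset (the column names and values are invented for illustration):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.DataFrame({
    'income':  [40, 55, 32, 70, 48, 61, 38, 52],
    'age':     [25, 34, 22, 45, 30, 41, 27, 36],
    'outcome': [0, 1, 0, 1, 0, 1, 0, 1]
})
X = data.drop('outcome', axis=1)   # input variables
y = data['outcome']                # output variable

# 75% of the rows act as "historical" (training) data and 25% as
# stand-in "future" (testing) data, sampled at random
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)
print(len(X_train), len(X_test))
```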
As discussed earlier, the data might or might not have an output variable. However, one thing it promises to be is messy. It needs to undergo a lot of cleaning and manipulation before it can become of any use to the modelling process.
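The following minimal sketch shows a few typical cleaning and manipulation chores with pandas; the dataset and the cleaning rules are made up for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'age':  [25, np.nan, 34, 210, 29, 34],
    'city': [' delhi', 'Mumbai ', 'mumbai', 'Delhi', None, 'mumbai']
})

df = df.drop_duplicates()                          # remove repeated records
df['age'] = df['age'].fillna(df['age'].median())   # impute missing values
df = df[df['age'].between(0, 120)]                 # drop implausible entries
df['city'] = df['city'].str.strip().str.title()    # normalize text fields
print(df)
```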
Most data science algorithms have underlying mathematics behind them. In many algorithms, such as regression, a mathematical equation (of a certain type) is assumed, and the parameters of the equation are derived by fitting the data to the equation.
For example, the goal of linear regression is to fit a linear model to a dataset and find the parameters of the following equation:

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n$$

The purpose of modelling is to find the best values for the coefficients $\beta_0, \beta_1, \ldots, \beta_n$. Once these values are known, the previous equation is good to predict the output. The equation above, which can also be thought of as a linear function of the $X_i$'s (the input variables), is the linear regression model.
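A minimal sketch of this fitting in practice, using scikit-learn on made-up data with two input variables (the true coefficients below are invented so that the estimates can be checked):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 2))                           # X1, X2
y = 3.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(0, 0.5, 100)

model = LinearRegression().fit(X, y)   # derive the parameters from the data
print(model.intercept_)                # estimate of beta_0 (close to 3.0)
print(model.coef_)                     # estimates of beta_1 and beta_2
```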
Another example is logistic regression. There, too, we have a mathematical equation, or a function of the input variables, with some differences. The defining equation for logistic regression is as follows:

$$P = \frac{e^{a + bX}}{1 + e^{a + bX}}$$

Here, the goal is to estimate the values of $a$ and $b$ by fitting the data to this equation. Any supervised algorithm will have an equation or function similar to those of the models above. For unsupervised algorithms, an underlying mathematical function or criterion (which can be formulated as a function or equation) serves the purpose. The mathematical equation or function is the backbone of a model.
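As a minimal sketch, the following fits this equation with scikit-learn on made-up binary data (the true values of $a$ and $b$ are invented so that the estimates can be checked):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(200, 1))
p = 1 / (1 + np.exp(-(0.5 + 2.0 * X[:, 0])))   # true a = 0.5, b = 2.0
y = rng.binomial(1, p)                          # 0/1 outcomes drawn from p

model = LogisticRegression().fit(X, y)
print(model.intercept_)                         # estimate of a
print(model.coef_)                              # estimate of b
print(model.predict_proba([[1.0]])[:, 1])       # predicted probability P for X = 1
```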
All the effort that goes into predictive analytics, and all the worth that accrues to data, exists because it solves a business problem. A business problem can be anything, as will become more evident in the following examples:
- Enticing the users of a product/service to buy more from you by increasing the click-through rates of online ads
- Predicting probable crime scenes in order to prevent crimes
- Aggregating an invincible lineup for a sports league
- Predicting the failure rates and associated costs of machinery components
- Managing the churn rate of the customers
Predictive analytics is being used in an array of industries to solve business problems. Some of these industries are as follows:
- Banking
- Social media
- Retail
- Transport
- Healthcare
- Policing
- Education
- Travel and logistics
- E-commerce
- Human resources
By what quantum the proposed solution made life better for the business is all that matters. That is the reason why predictive analytics is becoming an indispensable practice for management consulting.
In short, predictive analytics sits at the sweet spot where statistics, algorithms, technology, and business sense intersect. Think of it as a mathematician, a programmer, and a business person rolled into one.
Knowledge matrix for predictive modelling
As discussed earlier, predictive modelling is an interdisciplinary field that sits at the interface of, and requires knowledge of, four disciplines: Statistics, Algorithms, Tools and Techniques, and Business Sense. Each of these disciplines is equally indispensable for performing a successful task of predictive modelling.
These four disciplines of predictive modelling carry equal weight and can be better represented as a knowledge matrix: a symmetric 2 x 2 matrix containing four equal-sized squares, each representing a discipline.
Task matrix for predictive modelling
The tasks involved in predictive modelling follow the Pareto principle: around 80% of the effort in the modelling process goes towards data cleaning and wrangling, while only 20% of the time and effort goes into implementing the model and getting the predictions. However, the meaty part of the modelling, rich with almost 80% of the results and insights, is undoubtedly the implementation of the model. This information can be better represented as a matrix, called a task matrix, which will look similar to the following figure:
Many of the data cleaning and exploration chores can be automated because they are similar most of the time, irrespective of the data. The part that needs a lot of human thinking is the implementation of the model, which makes up the bulk of this book.