With the amount of data in the world growing exponentially, especially over the last decade, the number of related technologies and terms has also grown rapidly. Suddenly, people in industry, media, and academia started talking (sometimes maybe too much) about big data, data mining, analytics, machine learning, data science, data engineering, statistical learning, artificial intelligence, and many other related terms. One of those terms, of course, is predictive analytics, the subject of this book.
There is still a lot of confusion about these terms and exactly what they mean, because they are relatively new and there is some overlap between them. For our purposes, instead of attempting to define all of them, I will give a working definition of predictive analytics that we can keep in mind as we work through the content of this book:
Predictive analytics is an applied field that uses a variety of quantitative methods that make use of data to make predictions.
Let's break down and analyze this definition:
- Is an applied field: There is no such thing as Theoretical Predictive Analytics; predictive analytics is always used to solve problems, and it is being applied in virtually every industry and domain: finance, telecommunications, advertising, insurance, healthcare, education, entertainment, banking, and so on. Keep in mind that you will always be using predictive analytics to solve problems within a particular domain, which is why understanding the context of the problem and having domain knowledge are key aspects of doing predictive analytics. We will discuss this further in the next chapter.
- Uses a variety of quantitative methods: When doing predictive analytics, you will be a user of the techniques, theorems, best practices, empirical findings, and theoretical results of mathematics, statistics, and computer science, and of their many sub-fields: optimization, probability theory, linear algebra, artificial intelligence, machine learning, deep learning, algorithms, data structures, statistical inference, visualization, and Bayesian inference, among others. I would like to stress that you will be a user of these sub-fields: they will give you the analytical tools you will use to solve problems. You won't be producing any theoretical results when doing predictive analytics, but your results and conclusions must be consistent with the established theoretical results. This means that you must be able to use the tools properly, and for that you need the proper conceptual foundation: you need to feel comfortable with the basics of some of the fields mentioned to be able to do predictive analytics correctly and rigorously. In the following chapter, we will discuss many of these fundamental topics at a high and intuitive level, and we will point you to suitable sources if you need to go deeper into any of them.
- That makes use of data: If quantitative methods are the tools of predictive analytics, then data is the raw material out of which you will (literally) build the models. A key aspect of predictive analytics is the use of data to extract useful information. Using data has proven highly valuable for guiding decision-making: all over the world, organizations of all types are adopting a data-driven approach to making decisions at all levels; rather than relying on intuition or gut feeling, organizations increasingly rely on data. Predictive analytics is another application of data: in this case, the data is used to make predictions that can then be used to solve problems, which can have a measurable impact.
Since the operations and manipulations that need to be done in predictive analytics (or any other type of advanced analytics) usually go well beyond what a spreadsheet allows us to do, we need a programming language to carry out predictive analytics properly. Python and R have become popular choices (although people do use other languages, such as Julia).
In addition, you may need to work directly with data storage systems such as relational or non-relational databases or any of the big data storage solutions, which is why you may need to be familiar with things such as SQL and Hadoop. However, since what is done with those technologies is outside the scope of this book, we won’t discuss them any further. We will start all the examples in the book assuming that we are given the data from a storage system, and we won't be concerned with how the data was extracted. Starting from raw data, we will see some of the manipulations and transformations that are commonly done within the predictive analytics process. We will do everything using Python and related tools, and we'll delve deeper into these manipulations in the coming sections and chapters.
- To make predictions: The last part of the definition seems straightforward; however, one clarification is needed here: in the context of predictive analytics, a prediction is a statement about an unknown event, not necessarily about the future as is understood in the colloquial sense. For instance, we can build a predictive model that is able to "predict" whether a patient has disease X using their clinical data. When we gather the patient's data, disease X is either already present or not, so we are not "predicting" whether the patient will have disease X in the future; the model is giving an assessment (an educated guess) about the unknown event "the patient has disease X". Sometimes, of course, the prediction will actually be about the future, but keep in mind that this won't necessarily be the case.
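To make the last point concrete, here is a minimal sketch that uses only the Python standard library and entirely synthetic data. Everything in it is an assumption made for illustration: the single numeric `marker` feature, the way the hypothetical disease X is generated, and the helper names `make_patients`, `fit_logistic`, and `predict_proba`. It is not a real clinical model; it only shows that the "prediction" is an estimate about something that is already true or false but unknown to us.

```python
import math
import random


def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def make_patients(n, rng):
    """Generate synthetic (marker, has_disease) records.

    Assumption for illustration: patients with disease X tend to have
    a higher marker value. The disease status is fixed at the moment
    the record is created; it is not a future event.
    """
    patients = []
    for _ in range(n):
        has_disease = rng.random() < 0.5
        marker = rng.gauss(2.0 if has_disease else 0.0, 1.0)
        patients.append((marker, int(has_disease)))
    return patients


def fit_logistic(data, lr=0.5, epochs=300):
    """Fit a one-feature logistic regression with batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(data)
    for _ in range(epochs):
        grad_w, grad_b = 0.0, 0.0
        for x, y in data:
            error = sigmoid(w * x + b) - y
            grad_w += error * x
            grad_b += error
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


def predict_proba(model, marker):
    """Estimated probability of the unknown event 'the patient has disease X'."""
    w, b = model
    return sigmoid(w * marker + b)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
train = make_patients(400, rng)
test = make_patients(200, rng)
model = fit_logistic(train)

# Two new patients whose disease status already exists but is unknown to us.
p_low = predict_proba(model, -0.5)
p_high = predict_proba(model, 3.0)
print(f"P(disease X | marker=-0.5) = {p_low:.2f}")
print(f"P(disease X | marker= 3.0) = {p_high:.2f}")

# Compare the educated guesses with the (hidden) true status on held-out data.
correct = sum(int(predict_proba(model, x) >= 0.5) == y for x, y in test)
accuracy = correct / len(test)
print(f"Held-out accuracy: {accuracy:.2f}")
```

Note that nothing in this sketch refers to time: the model simply assigns a probability to an event whose outcome we cannot observe directly. The same code would serve equally well if the label described something that has not happened yet.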
Let's take a look at some of the most important concepts in the field; we need a firm grasp of them before moving forward.