Statistics as a form of modeling
Statistics is about collecting, organizing, analyzing, and interpreting data, and hence statistical knowledge is essential for data analysis. Another useful skill when analyzing data is knowing how to write code in a programming language such as Python. Manipulating data is usually necessary given that we live in a messy world with even messier data, and coding helps to get things done. Even if your data is clean and tidy, programming will still be very useful since modern Bayesian statistics is mostly computational statistics.
Most introductory statistical courses, at least for non-statisticians, are taught as a collection of recipes that more or less go like this: go to the statistical pantry, pick one can and open it, add data to taste, and stir until obtaining a consistent p-value, preferably under 0.05 (if you don't know what a p-value is, don't worry; we will not use them in this book). The main goal in this type of course is to teach you how to pick the proper can. We will take a different approach: we will also learn some recipes, but they will be home-made rather than canned food; we will learn how to mix fresh ingredients that will suit different gastronomic occasions. But before we can cook, we must learn some statistical vocabulary and some concepts.
Exploratory data analysis
Data is an essential ingredient of statistics. Data comes from several sources, such as experiments, computer simulations, surveys, field observations, and so on. If we are the ones who will be generating or gathering the data, it is always a good idea to first think carefully about the questions we want to answer and which methods we will use, and only then proceed to get the data. In fact, there is a whole branch of statistics dealing with data collection known as experimental design. In the era of data deluge, we can sometimes forget that gathering data is not always cheap. For example, while it is true that the Large Hadron Collider (LHC) produces hundreds of terabytes a day, its construction took years of manual and intellectual effort. In this book we will assume that we have already collected the data and also that the data is clean and tidy, something rarely true in the real world. We will make these assumptions in order to focus on the subject of this book. If you want to learn how to use Python for cleaning and manipulating data, along with a primer on machine learning, you should probably read the book Python Data Science Handbook by Jake VanderPlas.
OK, so let's assume we have our dataset; usually, a good idea is to explore and visualize it in order to get some intuition about what we have in our hands. This can be achieved through what is known as Exploratory Data Analysis (EDA), which basically consists of the following:
- Descriptive statistics
- Data visualization
The first one, descriptive statistics, is about how to use some measures (or statistics) to summarize or characterize the data in a quantitative manner. You probably already know that you can describe data using the mean, mode, standard deviation, interquartile range, and so forth. The second one, data visualization, is about visually inspecting the data; you are probably familiar with representations such as histograms, scatter plots, and others. While EDA was originally thought of as something you apply to data before doing any complex analysis, or even as an alternative to complex model-based analysis, throughout the book we will learn that EDA is also applicable to understanding, interpreting, checking, summarizing, and communicating the results of Bayesian analysis.
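As a small illustration of both ingredients of EDA, the following sketch computes a few descriptive statistics and prints a crude text histogram. It uses only the Python standard library; the sample in data is made up for this example, and the quartiles depend on the default method of statistics.quantiles, so other tools may report slightly different values.

```python
import statistics
from collections import Counter

# Hypothetical sample, made up for illustration only.
data = [2, 4, 4, 5, 7, 9, 11]

# Descriptive statistics: summarize the data with a few numbers.
mean = statistics.mean(data)
mode = statistics.mode(data)
std = statistics.stdev(data)  # sample standard deviation (n - 1 denominator)
q1, median, q3 = statistics.quantiles(data, n=4)  # quartile cut points
iqr = q3 - q1  # interquartile range

print(f"mean={mean}, mode={mode}, std={std:.2f}, median={median}, IQR={iqr}")

# Data visualization: a crude text histogram, one '#' per observation.
counts = Counter(data)
for value in range(min(data), max(data) + 1):
    print(f"{value:>2} | {'#' * counts[value]}")
```

In practice you would most likely use a plotting library for the visual part; the text histogram is only meant to keep the example self-contained.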
Inferential statistics
Sometimes, plotting our data and computing simple numbers, such as the average of our data, is all we need. Other times, we want to make a generalization based on our data. We may want to understand the underlying mechanism that could have generated the data, or maybe we want to make predictions for future (yet unobserved) data points, or we need to choose among several competing explanations for the same observations. That's the job of inferential statistics. To do inferential statistics we will rely on probabilistic models. There are many types of models, and most of science (and, I will add, all of our understanding of the real world) works through models. The brain is just a machine that models reality (whatever reality might be); see this TED talk about the machine that builds reality: http://www.tedxriodelaplata.org/videos/m%C3%A1quina-construye-realidad.
What are models? Models are simplified descriptions of a given system (or process). Those descriptions are purposely designed to capture only the most relevant aspects of the system, and hence most models do not pretend to explain everything; on the contrary, if we have a simple and a complex model and both explain the data more or less equally well, we will generally prefer the simpler one. This preference for simpler models is known as Occam's razor, and we will discuss how it is related to Bayesian analysis in Chapter 6, Model Comparison.
Model building, no matter which type of model you are building, is an iterative process following more or less the same basic rules. We can summarize the Bayesian modeling process using three steps:
- Given some data and some assumptions on how this data could have been generated, we will build models. Most of the time, models will be crude approximations, but that is often all we need.
- Then we will use Bayes' theorem to add data to our models and derive the logical consequences of mixing the data and our assumptions. We say we are conditioning the model on our data.
- Lastly, we will check that the model makes sense according to different criteria, including our data and our expertise on the subject we are studying.
In general, we will find ourselves performing these three steps in a non-linear iterative fashion. We may need to retrace our steps at any given point: maybe we made a silly programming mistake, maybe we found a way to change the model and improve it, maybe we need to add more data.
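To make the three steps concrete, here is a minimal sketch for a coin-tossing problem, written in plain Python. Everything specific in it is an assumption made for illustration: the data (6 heads in 9 tosses), the uniform prior, and the use of a simple grid approximation to apply Bayes' theorem. It is not the method used in the rest of the book, just one easy way to see conditioning at work.

```python
# Step 1: build a model. Assumption: each toss is an independent
# Bernoulli trial with unknown bias theta, and every value of theta
# in [0, 1] is equally plausible beforehand (uniform prior).
heads, tosses = 6, 9  # hypothetical data
n_grid = 1001
grid = [i / (n_grid - 1) for i in range(n_grid)]  # candidate theta values
prior = [1.0 for _ in grid]

# Step 2: condition the model on the data using Bayes' theorem:
# posterior is proportional to likelihood times prior.
likelihood = [t ** heads * (1 - t) ** (tosses - heads) for t in grid]
unnormalized = [lk * pr for lk, pr in zip(likelihood, prior)]
total = sum(unnormalized)
posterior = [u / total for u in unnormalized]  # sums to 1 over the grid

# Step 3: check the model. A very crude check: is the posterior mean
# of theta reasonably close to the observed fraction of heads?
posterior_mean = sum(t * p for t, p in zip(grid, posterior))
observed_fraction = heads / tosses

print(f"posterior mean of theta: {posterior_mean:.3f}")
print(f"observed fraction of heads: {observed_fraction:.3f}")
```

If the check in step 3 looked unreasonable, we would go back to step 1 and revise the assumptions, for example by choosing a different prior or by collecting more tosses.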
Bayesian models are also known as probabilistic models because they are built using probabilities. Why probabilities? Because probabilities are the correct mathematical tool to model the uncertainty in our data, so let's take a walk through the garden of forking paths.