8.1 Linear models and non-linear data
In Chapter 4 and Chapter 6 we learned how to build models of the general form:

θ = ψ(φ(X)β)

Here, θ is a parameter for some probability distribution, for example, the mean of a Gaussian, the p parameter of the binomial, the rate of a Poisson, and so on. We call ψ the inverse link function, and φ is some other function we use to potentially transform the data, like a square root, a polynomial function, or something else.
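To make the notation concrete, here is a minimal sketch in NumPy. The choices here are illustrative, not fixed by the model: φ is taken to be a quadratic polynomial feature map, ψ the logistic function, and the weights β are arbitrary made-up values.

```python
import numpy as np

def phi(X):
    # Feature map: send each input x to [1, x, x**2]
    return np.column_stack([np.ones_like(X), X, X**2])

def psi(eta):
    # Logistic inverse link: squashes the linear predictor into (0, 1)
    return 1 / (1 + np.exp(-eta))

X = np.linspace(-3, 3, 10)
beta = np.array([0.5, -1.0, 0.25])  # illustrative weights
theta = psi(phi(X) @ beta)          # e.g., one binomial p per observation
```

With a different ψ (say, the identity or the exponential) and a different φ, the same template gives linear regression, Poisson regression, and so on.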
Fitting, or learning, a Bayesian model can be seen as finding the posterior distribution of the weights β; this is known as the weight view of approximating functions. As we already saw with polynomial and spline regression, by letting φ be a non-linear function we can map the inputs onto a feature space. We also saw that a polynomial of high enough degree can fit the training data perfectly. But unless we apply some form of regularization, for example by using prior distributions, this leads to models that memorize the data, or in other words, models with very poor generalization properties (the sketch below illustrates this). We also mentioned that splines can be as flexible as polynomials but with better statistical properties.
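As a quick illustration of that memorization effect, here is a minimal NumPy sketch with made-up data (all values are illustrative): a degree-7 polynomial has 8 weights, enough to pass exactly through 8 training points, while a lower-degree fit leaves in-sample error but tends to generalize better between and beyond the points.

```python
import numpy as np

rng = np.random.default_rng(42)
x = np.linspace(0, 1, 8)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, size=x.size)

# Degree-7 polynomial: 8 weights for 8 points, so it memorizes the data
w_high = np.polyfit(x, y, deg=7)
print(np.max(np.abs(np.polyval(w_high, x) - y)))  # ~0: perfect fit in-sample

# Degree-3 polynomial: cannot memorize, leaves in-sample error
w_low = np.polyfit(x, y, deg=3)
print(np.max(np.abs(np.polyval(w_low, x) - y)))   # larger, but less overfit
```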
We will now discuss Gaussian processes, which provide a principled solution for modeling arbitrary functions by effectively letting the data decide on the complexity of the function, while avoiding, or at least minimizing, the chance of overfitting.

The following sections discuss Gaussian processes from a very practical point of view; we avoid almost all of the surrounding mathematics. For a more formal treatment, you may want to read Gaussian Processes for Machine Learning by Rasmussen and Williams [2005].