Predicting the output value (that is, regression) or label (that is, classification) on future unseen data is a common final step in data mining projects.
Before reading the rest of this chapter, please be sure to digest the prerequisite concepts introduced in the Basic data terminology and Basic summary statistics sections in Chapter 2, Basic Terminology and Our End-to-End Example. In particular, the content on data types, variable types, and prediction metrics will be assumed as having been pre-learned throughout the entirety of the chapter.
The main strategy is to collect a training set and build a mapping function (that is, fit a model) from the input variables (X) to the output variable (y). Let's collect our assumptions before moving on:
- (Assumption) There is a relationship between X and y, namely that X are independent variables and...