Splitting the data
In the earlier discussion, we saw that partitioning the dataset can greatly reduce the noise in the data. The question is how to begin. The explanatory variables can be discrete or continuous. We will begin with the continuous variables (numeric objects in R).
For a continuous variable, the task is a bit simpler. First, identify the distinct values of the numeric object. Let us say, for example, that the distinct values of a numeric object, say height in cm, are 160, 165, 170, 175, and 180. The data partitions are then obtained as follows:
data[data$Height <= 160, ]
data[data$Height > 160, ]
data[data$Height <= 165, ]
data[data$Height > 165, ]
data[data$Height <= 170, ]
data[data$Height > 170, ]
data[data$Height <= 175, ]
data[data$Height > 175, ]
The reader should try to understand the rationale behind the code, which is only indicative: each distinct value except the largest serves as a candidate split point, dividing the data into two parts.
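The partitions listed above need not be typed out by hand. The following is a minimal sketch of how they might be generated for any numeric column. The toy data frame, its Weight column, and the object names cuts and partitions are assumptions made for illustration and are not part of the original example.

```r
# Hypothetical toy data; the values and the Weight column are illustrative only
data <- data.frame(
  Height = c(160, 165, 165, 170, 175, 180, 180),
  Weight = c(55, 60, 62, 68, 74, 80, 83)
)

# Candidate split points: every distinct value except the largest,
# because Height > max(Height) would leave an empty partition
cuts <- head(sort(unique(data$Height)), -1)

# For each candidate split point, keep both halves of the partition
partitions <- lapply(cuts, function(threshold) {
  list(
    left  = data[data$Height <= threshold, ],
    right = data[data$Height >  threshold, ]
  )
})
names(partitions) <- paste0("Height<=", cuts)

# Number of rows in each half, one column per candidate split
sapply(partitions, function(p) c(left = nrow(p$left), right = nrow(p$right)))
```

Every candidate split keeps all rows of the data, so the left and right counts always add up to nrow(data). Only the point at which the data is divided changes.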
Now, we consider the discrete variables. Here, we have two types of variables, namely categorical and ordinal. In the...