Data is undoubtedly the most important component of machine learning; if there were no data, we would have no common purpose. In most cases, the purpose for which the data is collected defines the problem itself. Since variables can be of several types, the way the data is stored and organized is also very important.
Lee and Elder (1997) considered a series of datasets and introduced the need for ensemble models. We will begin by looking at the details of the datasets considered in their paper, and we will then refer to other important datasets later on in the book.
The hypothyroid dataset Hypothyroid.csv is available in the book's code bundle, located at /…/Chapter01/Data. While the dataset has 26 variables, we will only be using seven of them, and the number of observations is n = 3163. The dataset is downloaded from http://archive.ics.uci.edu/ml/datasets/thyroid+disease and the filename is hypothyroid.data (http://archive.ics.uci.edu/ml/machine-learning-databases/thyroid-disease/hypothyroid.data). After some tweaks, such as relabeling certain values, the CSV file is made available in the book's code bundle. The purpose of the study is to classify a patient with a thyroid problem based on the information provided by the other variables. There are multiple variants of the dataset, and the reader can delve into the details at the following web page: http://archive.ics.uci.edu/ml/machine-learning-databases/thyroid-disease/HELLO. Here, the column representing the variable of interest is named Hypothyroid, which shows that we have 151 patients with thyroid problems; the remaining 3012 tested negative. Clearly, this dataset is an example of imbalanced data, in which one of the two classes vastly outnumbers the other: for each thyroid case, we have about 20 negative cases. Such problems need to be handled differently, and we need to get into the subtleties of the algorithms to build meaningful models. The additional variables, or covariates, that we will use while building the predictive models are Age, Gender, TSH, T3, TT4, T4U, and FTI. The data is first imported into an R session and subset according to the variables of interest as follows:
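The import and subsetting step can be sketched as follows. This is a minimal sketch rather than the book's exact listing: the file path and the object names HT and HT2 are assumptions chosen for illustration.

```r
# Import the hypothyroid data; the relative path is an assumed location
HT <- read.csv("Hypothyroid.csv", header = TRUE, stringsAsFactors = TRUE)
# Retain only the variable of interest and the seven covariates
HT2 <- HT[, c("Hypothyroid", "Age", "Gender", "TSH", "T3", "TT4", "T4U", "FTI")]
dim(HT2)                 # expect 3163 rows and 8 columns
table(HT2$Hypothyroid)   # class counts of the imbalanced target
```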
The first line of code imports the data from the Hypothyroid.csv file using the read.csv function. The dataset has a lot of missing values across its variables, as seen here:
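One way to count the missing values per variable, assuming the subset data frame from the previous step is named HT2, is the following sketch:

```r
# Count the missing values (NAs) in each retained variable
sapply(HT2, function(x) sum(is.na(x)))
```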
Consequently, we remove all the rows that have a missing value, and then split the data into training and testing datasets. We will also create a formula for the classification problem:
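A minimal sketch of this step, using base R; the seed value, the 70/30 split ratio, and the object names are assumptions for illustration:

```r
HT2 <- na.omit(HT2)    # drop every row containing a missing value
set.seed(12345)        # an arbitrary seed, for reproducibility
# A 70/30 split of the complete cases into training and testing parts
Train_Index <- sample(seq_len(nrow(HT2)), size = floor(0.7 * nrow(HT2)))
HT2_Train <- HT2[Train_Index, ]
HT2_Test  <- HT2[-Train_Index, ]
# Formula for the classification problem: target against all covariates
HT2_Formula <- as.formula("Hypothyroid ~ .")
```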
The set.seed function ensures that the results are reproducible each time we run the program. After removing the missing observations with the na.omit function, we split the hypothyroid data into training and testing parts. The former is used to build the model and the latter to validate it, using data that was not used to build the model. Quinlan, the inventor of the popular C4.5 tree algorithm, used this dataset extensively.
This dataset is an example of a simulation study. Here, we have twenty-one input or independent variables, and a class variable referred to as classes. The data is generated using the mlbench.waveform function from the mlbench R package. For more details, refer to the following link: ftp://ftp.ics.uci.edu/pub/machine-learning-databases. We will simulate 5,000 observations for this dataset. As mentioned earlier, the set.seed function guarantees reproducibility. Since we are solving binary classification problems, we will reduce the three classes generated by the waveform function to two, and then partition the data into training and testing parts for model building and testing purposes:
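The simulation and the class reduction can be sketched as follows. The seed value, the column names, and the choice to merge classes 1 and 2 into a single class are illustrative assumptions:

```r
library(mlbench)
set.seed(123)                       # arbitrary seed for reproducibility
Waveform <- mlbench.waveform(5000)  # simulate 5,000 observations
# Collapse the three waveform classes into two for binary classification;
# here, classes 1 and 2 are merged, which is an illustrative choice
classes <- ifelse(as.numeric(Waveform$classes) != 3, 1, 2)
# Bind the 21-column matrix x and the class vector, then coerce to a data frame
Waveform_DF <- data.frame(cbind(Waveform$x, classes))
names(Waveform_DF) <- c(paste0("X", 1:21), "Classes")
Waveform_DF$Classes <- as.factor(Waveform_DF$Classes)
```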
The R function mlbench.waveform creates a new object of the mlbench class. Since it consists of two sub-parts, x and classes, we will convert it into a data.frame after some further manipulation. The cbind function binds the two objects x (a matrix) and classes (a numeric vector) into a single matrix, and the data.frame function converts the matrix object into a data frame, which is the class required for the rest of the program.
After partitioning the data, we will create the required formula for the waveform dataset:
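A sketch of the partition and formula creation, assuming the data frame Waveform_DF from the previous step; the seed and the 70/30 split ratio are assumptions:

```r
set.seed(12345)
# Hold out 30% of the simulated observations for testing
Train_Index <- sample(seq_len(nrow(Waveform_DF)),
                      size = floor(0.7 * nrow(Waveform_DF)))
Waveform_Train <- Waveform_DF[Train_Index, ]
Waveform_Test  <- Waveform_DF[-Train_Index, ]
Waveform_Formula <- as.formula("Classes ~ .")
```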
Loans are not always repaid in full; there are defaulters. In such cases, it becomes important for the bank to identify potential defaulters based on the available information. Here, we adapt the GC dataset from the RSADBE package to properly reflect the labels of the factor variables. The transformed dataset is available as GC2.RData in the data folder. The GC dataset itself is mainly an adaptation of the version available at https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data). Here, we have 1,000 observations and 20 covariate/independent variables, such as the status of the existing checking account, duration, and so on. The final status of whether the loan was repaid in full is available in the good_bad column. We will partition the data into training and testing parts, and create the formula too:
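A minimal sketch of this step; the path to GC2.RData, the assumption that it contains a data frame named GC2, and the seed and split ratio are all illustrative:

```r
load("GC2.RData")   # assumed to load a data frame named GC2
set.seed(12345)
Train_Index <- sample(seq_len(nrow(GC2)), size = floor(0.7 * nrow(GC2)))
GC2_Train <- GC2[Train_Index, ]
GC2_Test  <- GC2[-Train_Index, ]
# good_bad holds the repayment status of each loan
GC2_Formula <- as.formula("good_bad ~ .")
```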
Iris is probably the most famous classification dataset. The great statistician Sir R. A. Fisher popularized the dataset, using it to classify three types of iris plants based on length and width measurements of their petals and sepals. Fisher used this dataset to pioneer the statistical classifier known as linear discriminant analysis. Since there are three species of iris, we convert this into a binary classification problem, partition the dataset, and create a formula, as seen here:
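One way to binarize and partition the data; collapsing versicolor and virginica into a single class is an illustrative choice, as are the seed, split ratio, and object names:

```r
data(iris)
ir2 <- iris
# Collapse versicolor and virginica into a single "others" class
ir2$Species <- factor(ifelse(ir2$Species == "setosa", "setosa", "others"))
set.seed(12345)
Train_Index <- sample(seq_len(nrow(ir2)), size = floor(0.7 * nrow(ir2)))
ir2_Train <- ir2[Train_Index, ]
ir2_Test  <- ir2[-Train_Index, ]
ir2_Formula <- as.formula("Species ~ .")
```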
Diabetes is a health hazard that is mostly incurable, and patients diagnosed with it have to adjust their lifestyles to cater to the condition. Based on variables such as pregnant, glucose, pressure, triceps, insulin, mass, pedigree, and age, the problem here is to classify a person as diabetic or not. Here, we have 768 observations. This dataset is drawn from the mlbench package:
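A sketch of loading and partitioning the Pima Indians diabetes data from mlbench; the seed, split ratio, and object names are assumptions:

```r
library(mlbench)
data(PimaIndiansDiabetes)   # 768 observations, target column: diabetes
set.seed(12345)
Train_Index <- sample(seq_len(nrow(PimaIndiansDiabetes)),
                      size = floor(0.7 * nrow(PimaIndiansDiabetes)))
PID_Train <- PimaIndiansDiabetes[Train_Index, ]
PID_Test  <- PimaIndiansDiabetes[-Train_Index, ]
PID_Formula <- as.formula("diabetes ~ .")
```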
The five datasets described up to this point are classification problems. Next, we look at one example each of regression, time series, survival, clustering, and outlier detection problems.
A study of the crime rate per million of the population among the 47 states of the US is undertaken here, and an attempt is made to find its dependency on 13 variables, including age distribution, an indicator of southern states, the average number of schooling years, and so on. As with the earlier datasets, we will partition this one using the following chunk of R code:
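A sketch of the import and partition, assuming the crime data is available as a CSV file; the filename usc.csv, the response column name R (the crime rate), and the object names are all assumptions:

```r
# Filename and column names below are assumptions for illustration
USCrime <- read.csv("usc.csv", header = TRUE)
set.seed(12345)
Train_Index <- sample(seq_len(nrow(USCrime)), size = floor(0.7 * nrow(USCrime)))
USCrime_Train <- USCrime[Train_Index, ]
USCrime_Test  <- USCrime[-Train_Index, ]
# Regression formula: crime rate against the 13 explanatory variables
USCrime_Formula <- as.formula("R ~ .")
```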
In each example discussed in this section thus far, we had reason to believe that the observations are independent of each other. This assumption simply means that the regressands and regressors of one observation have no relationship with those of any other observation. This is a simple and reasonable assumption. We have another class of observations/datasets where such an assumption is not practical. For example, the maximum temperature of a day is not completely independent of the previous day's temperature. If that were the case, we could have a scorchingly hot day, followed by winter, followed by another hot day, which in turn is followed by a very heavy rainy day. However, weather does not behave in this way; on successive days, the weather depends on previous days. In the next example, we consider the number of overseas visitors to New Zealand.
The New Zealand overseas visitors dataset is dealt with in detail in Chapter 10 of Tattar, et al. (2017). Here, the number of overseas visitors is captured on a monthly basis from January 1977 to December 1995, giving us 228 months of data. The osvisit.dat file is available at multiple web links, including https://www.stat.auckland.ac.nz/~ihaka/courses/726-/osvisit.dat and https://github.com/AtefOuni/ts/blob/master/Data/osvisit.dat. It is also available in the book's code bundle. We will import the data into R, convert it into a time series object, and visualize it:
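A minimal sketch of the import and visualization, assuming osvisit.dat (a single column of monthly counts) sits in the working directory:

```r
# Read the single-column file of monthly visitor counts
osvisit <- read.table("osvisit.dat", header = FALSE)
# Convert to a monthly time series starting in January 1977
osv <- ts(osvisit$V1, start = c(1977, 1), frequency = 12)
plot.ts(osv, xlab = "Year", ylab = "Overseas visitors")
```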
Here, the dataset is not partitioned! Time series data can't be arbitrarily partitioned into training and testing parts. The reason is quite simple: if we have five observations in time-sequential order y1, y2, y3, y4, y5, and we believe that the order of impact is y1→y2→y3→y4→y5, then an arbitrary subset such as y1, y2, y5 will behave differently; it won't carry the same information as three consecutive observations. Consequently, time series partitioning has to preserve the dependency structure, so we keep the most recent part of the series as the test data. For the five-observation example, we would use y1, y2, y3 as the training data and y4, y5 as the test data. The partitioning is simple, and we will cover it in Chapter 11, Ensembling Time Series Models.
Life-testing experiments rarely yield complete observations. In reliability analysis, as well as in survival analysis/clinical trials, units/patients are observed up to a predefined time, and a note is made of whether a specific event, usually failure or death, occurs. A considerable fraction of observations will not have failed by the pre-decided time, and the analysis cannot wait for all units to fail. One reason to curtail the study is that the time by which all units would have failed may be very long, and it would be expensive to continue the study until such a time. Consequently, we are left with incomplete observations; we only know that the lifetimes of these units exceed the predefined time at which the study was called off, and that the event of interest may occur sometime in the future. Thus, some observations are censored, and the data is referred to as censored data. Special statistical methods are required for the analysis of such datasets. We will give an example of this type of dataset next, and analyze it later, in Chapter 10, Ensembling Survival Models.
Primary Biliary Cirrhosis
The pbc dataset from the survival package is a benchmark dataset in the domain of clinical trials. The Mayo Clinic collected the data, which is concerned with primary biliary cirrhosis (PBC) of the liver. The study was conducted between 1974 and 1984. More details can be found by running library(survival), followed by ?pbc, at the R terminal. Here, the main time to the event of interest is the number of days between registration and either death, transplantation, or the study analysis in July 1986, and this is captured in the time variable. As in a typical survival study, the events may be censored, and the censoring indicator is in the status column. The time to event needs to be modeled, factoring in variables such as trt, age, sex, ascites, hepato, spiders, edema, bili, chol, albumin, copper, alk.phos, ast, trig, platelet, protime, and stage.
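A brief sketch of loading and inspecting the data, together with one possible survival formula; treating status == 2 as the death event follows the coding described on the pbc help page, stated here as an assumption:

```r
library(survival)
data(pbc)
str(pbc)   # inspect the variables listed above
# Survival formula: time-to-event against the covariates of interest;
# status == 2 is assumed to denote death (see ?pbc for the coding)
pbc_Formula <- Surv(time, status == 2) ~ trt + age + sex + ascites + hepato +
  spiders + edema + bili + chol + albumin + copper + alk.phos + ast + trig +
  platelet + protime + stage
```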
The eight datasets discussed up until this point have a target variable, or a regressand/dependent variable, and are examples of supervised learning problems. On the other hand, there are practical cases in which we simply attempt to understand the data and find useful patterns and groups/clusters in it. Of course, it is important to note that the purpose of clustering is to find homogeneous groups and give them sensible labels. For instance, if we are trying to group cars based on characteristics such as length, width, horsepower, engine cubic capacity, and so on, we may find groups labeled as hatchback, sedan, and saloon classes, while another clustering solution might result in the labels of basic, premium, and sports variant groups. The two main problems posed in clustering are the choice of the number of groups and the formation of robust clusters. We consider a simple dataset from the factoextra R package.
The multishapes dataset from the factoextra package consists of three variables: x, y, and shape. It consists of different shapes, with each shape forming a cluster. Here, we have two concentric circles, two parallel rectangles/beds, and one cluster of points at the bottom-right. Outliers are also scattered across the plot. Some brief R code gives a useful display:
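A minimal plotting sketch; coloring the points by the shape column is one simple way to reveal the clusters:

```r
library(factoextra)
data(multishapes)
# Color and shape the points by the cluster indicator column
plot(multishapes$x, multishapes$y,
     col = multishapes$shape, pch = multishapes$shape,
     xlab = "x", ylab = "y")
```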
This dataset includes a column named shape, as it is a hypothetical dataset. In true clustering problems, we will have neither a cluster group indicator nor the visualization luxury of only two variables. Later in this book, we will see how ensemble clustering techniques help overcome the problems of deciding the number of clusters and the consistency of cluster membership.
Although it doesn't happen that often, frustration can arise when fine-tuning different parameters, fitting different models, and other tricks all fail to produce a useful working model. The culprit is often the outlier. A single outlier is known to wreak havoc on an otherwise potentially useful model, and its detection is of paramount importance. Historically, parametric and nonparametric outlier detection has been a matter of deep expertise, and in complex scenarios, identification can be an insurmountable task. A consensus on whether an observation is an outlier can be achieved using the ensemble outlier framework. To illustrate this, we will use the board stiffness dataset; we will see how an outlier is pinned down in the conclusion of this book.
The board stiffness dataset is available in the ACSWR package through the data.frame stiff. The dataset consists of four measures of stiffness for 30 boards. The first measure of stiffness is obtained by sending a shock wave down the board, the second is obtained by vibrating the board, and the remaining two are obtained from static tests. A quick method of identifying outliers in a multivariate dataset is the Mahalanobis distance: the further an observation is from the center, the more likely it is to be an outlier:
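This idea can be sketched with base R's mahalanobis function; the object name md is chosen for illustration:

```r
library(ACSWR)
data(stiff)
# Squared Mahalanobis distance of each board from the multivariate mean
md <- mahalanobis(stiff, center = colMeans(stiff), cov = cov(stiff))
# Sort from farthest to nearest; the largest distances are outlier candidates
sort(md, decreasing = TRUE)
```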