What this book covers
Chapter 1, Introduction to Ensemble Techniques, will set out the need for ensemble learning and introduce the important datasets, essential statistical and machine learning models, and key statistical tests used throughout the book. This chapter conveys the spirit of the book.
Chapter 2, Bootstrapping, will introduce the two important concepts of the jackknife and the bootstrap. The chapter will help you carry out statistical inference for complex, unknown parameters. Bootstrapping of essential statistical models, such as linear regression, survival, and time series models, is illustrated through R programs. More importantly, it lays the basis for the resampling techniques that form the core of ensemble methods.
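To give a flavor of what the percentile bootstrap looks like in practice, here is a minimal R sketch; the boot package's aircondit data and the mean as the statistic are illustrative choices, not necessarily the ones used in the chapter:

    library(boot)                      # standard bootstrap machinery
    # the statistic must accept the data and an index vector for the resample
    mean_fun <- function(x, idx) mean(x[idx])
    set.seed(123)
    boot_out <- boot(aircondit$hours, statistic = mean_fun, R = 999)
    boot.ci(boot_out, type = "perc")   # percentile bootstrap confidence interval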
Chapter 3, Bagging, will introduce the first ensemble method, which uses a decision tree as the base model. The name bagging is a contraction of bootstrap aggregation. Pruning of decision trees is illustrated, and the chapter lays down the foundation required for later chapters. Bagging of decision trees and k-NN classifiers is illustrated in this chapter.
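As a rough preview, bagging a classification tree can be sketched in a few lines of R; the ipred package and the iris data are illustrative choices here:

    library(ipred)     # bagging() fits trees on bootstrap resamples
    set.seed(123)
    # 25 bootstrapped trees aggregated by majority vote; coob = TRUE requests
    # an out-of-bag estimate of the misclassification error
    bag_fit <- bagging(Species ~ ., data = iris, nbagg = 25, coob = TRUE)
    bag_fit$err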
Chapter 4, Random Forests, will discuss the important ensemble extension of decision trees. Variable importance and proximity plots are two important components of random forests, and the related computations are carried out for both. The nuances of random forests are explained in depth. Comparison with the bagging method, missing data imputation, and clustering with random forests are also dealt with in this chapter.
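A minimal random forest fit in R, using the randomForest package on the illustrative iris data, hints at the two components just mentioned:

    library(randomForest)
    set.seed(123)
    rf_fit <- randomForest(Species ~ ., data = iris, ntree = 500,
                           importance = TRUE, proximity = TRUE)
    importance(rf_fit)              # variable importance measures
    MDSplot(rf_fit, iris$Species)   # proximity plot via multidimensional scaling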
Chapter 5, The Bare-Bones Boosting Algorithms, will first state the boosting algorithm. Using toy data, the chapter will then explain the detailed computations of the adaptive boosting algorithm. The gradient boosting algorithm is then illustrated for the regression problem. The use of the gbm and adabag packages shows implementations of other boosting algorithms. The chapter concludes with a comparison of the bagging, random forest, and boosting methods.
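The two packages can be previewed with a short sketch; the iris and mtcars datasets stand in for the chapter's own examples:

    library(adabag)    # adaptive boosting (AdaBoost) for classification
    set.seed(123)
    ada_fit <- boosting(Species ~ ., data = iris, mfinal = 50)
    head(ada_fit$weights)          # weights of the individual trees

    library(gbm)       # gradient boosting, here for a regression problem
    gbm_fit <- gbm(mpg ~ ., data = mtcars, distribution = "gaussian",
                   n.trees = 100, interaction.depth = 2)
    summary(gbm_fit)               # relative influence of the predictors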
Chapter 6, Boosting Refinements, will begin with an explanation of the working of the boosting technique. The gradient boosting algorithm is then extended to count and survival datasets. The details of the extreme gradient boosting implementation of the popular gradient boosting algorithm are exhibited through clear programs. The chapter concludes with an outline of the important h2o package.
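A bare-bones taste of the extreme gradient boosting interface, on illustrative data rather than the chapter's:

    library(xgboost)
    set.seed(123)
    X <- as.matrix(mtcars[, -1]); y <- mtcars$mpg
    xgb_fit <- xgboost(data = X, label = y, nrounds = 50,
                       objective = "reg:squarederror", verbose = 0)
    head(predict(xgb_fit, X))      # fitted values from the boosted ensemble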
Chapter 7, The General Ensemble Technique, will study the probabilistic reasons for the success of the ensemble technique. The success of the ensemble is explained for classification and regression problems.
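The classical argument can be previewed with a one-line computation: if eleven classifiers are each correct with probability 0.6, independently of one another, a majority vote of them is correct noticeably more often (the numbers here illustrate the reasoning and are not the book's example):

    # probability that at least 6 of 11 independent voters, each correct
    # with probability 0.6, form a correct majority
    sum(dbinom(6:11, size = 11, prob = 0.6))   # about 0.75, up from 0.6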
Chapter 8, Ensemble Diagnostics, will examine the conditions for the diversity of an ensemble. Pairwise comparisons of classifiers and overall interrater agreement measures are illustrated here.
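One such pairwise measure, Cohen's kappa between the predicted labels of two classifiers, can be computed from first principles; the two prediction vectors below are made up purely for illustration:

    pred1 <- c("a", "a", "b", "b", "a", "b", "a", "a")   # hypothetical predictions
    pred2 <- c("a", "b", "b", "b", "a", "a", "a", "a")
    tab <- table(pred1, pred2)
    po <- sum(diag(tab)) / sum(tab)                      # observed agreement
    pe <- sum(rowSums(tab) * colSums(tab)) / sum(tab)^2  # chance agreement
    (po - pe) / (1 - pe)                                 # Cohen's kappa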
Chapter 9, Ensembling Regression Models, will discuss in detail the use of ensemble methods in regression problems. A complex housing dataset from Kaggle is used here. The regression data is modeled with multiple base learners, and bagging, random forests, boosting, and stacking are all illustrated for it.
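Stacking, the one method in that list not met earlier, can be sketched by hand: fit two base learners, then regress the response on their predictions. The sketch below skips the cross-validated (out-of-fold) predictions a careful stack would use, and mtcars stands in for the housing data:

    library(rpart)
    set.seed(123)
    idx <- sample(nrow(mtcars), 22)
    train <- mtcars[idx, ]; test <- mtcars[-idx, ]
    m1 <- lm(mpg ~ ., data = train)        # base learner 1: linear model
    m2 <- rpart(mpg ~ ., data = train)     # base learner 2: regression tree
    # meta-learner: regress the response on the base learners' predictions
    meta <- lm(y ~ p1 + p2,
               data = data.frame(y = train$mpg,
                                 p1 = predict(m1, train),
                                 p2 = predict(m2, train)))
    predict(meta, data.frame(p1 = predict(m1, test),
                             p2 = predict(m2, test)))   # stacked predictions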
Chapter 10, Ensembling Survival Models, is where survival data is taken up. Survival analysis concepts are developed in considerable detail, and the traditional techniques are illustrated. The machine learning method of a survival tree is introduced, and then we build the ensemble method of random survival forests for this data structure.
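A minimal random survival forest fit, using the randomForestSRC package and its bundled veteran lung cancer data as an illustrative stand-in:

    library(survival)          # supplies Surv() for the survival response
    library(randomForestSRC)   # random survival forests
    data(veteran, package = "randomForestSRC")
    set.seed(123)
    rsf_fit <- rfsrc(Surv(time, status) ~ ., data = veteran, ntree = 500)
    print(rsf_fit)             # out-of-bag error based on the concordance index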
Chapter 11, Ensembling Time Series Models, deals with another specialized data structure in which observations are dependent on each other. The core concepts of time series and the essential related models are developed. Bagging of a specialized time series model is presented, and we conclude the chapter with an ensemble of heterogeneous time series models.
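A sketch of bagging a time series model, assuming the forecast package's baggedETS() and the built-in WWWusage series as illustrations:

    library(forecast)   # baggedETS(): ETS models on bootstrapped series
    set.seed(123)
    # bootstrap replicates of the series are generated, an exponential
    # smoothing model is fitted to each, and the forecasts are averaged
    fit <- baggedETS(WWWusage)
    fc <- forecast(fit, h = 10)
    plot(fc)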
Chapter 12, What's Next?, will discuss some of the unresolved topics in ensemble learning and the scope for future work.