What this book covers
Chapter 1, Data Characteristics, introduces the different types of data through a questionnaire and a dataset. The need for statistical models is elaborated in some interesting contexts. This is followed by a brief explanation of how to install R and Python and their related packages. Discrete and continuous random variables are discussed through introductory programs. The programs are available in both languages; they are expository in nature and need not be followed in detail.
Chapter 2, Import/Export Data, begins with a concise development of R basics. Data frames, vectors, matrices, and lists are discussed with clear and simple examples. Importing data from external files in CSV, XLS, and other formats is elaborated next. Writing data and objects from R for use in other languages is considered, and the R part of the chapter concludes with a discussion of R session management. Python basics, mathematical operations, and other essential operations are then explained. Reading data from external files in different formats is also illustrated, along with the session management required.
Chapter 3, Data Visualization, discusses efficient graphics separately for categorical and numeric datasets. This translates into techniques for the bar chart, dot chart, spine plot, mosaic plot, and fourfold plot for categorical data, and the histogram, box plot, and scatter plot for continuous/numeric data. A very brief introduction to ggplot2 is also provided here. The chapter shows how to generate similar plots in both R and Python.
Chapter 4, Exploratory Analysis, encompasses highly intuitive techniques for the preliminary analysis of data. The visualization techniques of EDA, such as stem-and-leaf plots and letter values, and the modeling techniques of the resistant line, data smoothing, and median polish provide rich insight as a preliminary analysis step. This chapter is worked mainly in R.
Chapter 5, Statistical Inference, begins with an emphasis on the likelihood function and computing the maximum likelihood estimate. Confidence intervals for parameters of interest are developed using functions defined for specific problems. The chapter also considers important statistical tests: the z-test and t-test for comparison of means, and the chi-square test and F-test for comparison of variances. The reader will learn how to create new R and Python functions.
Chapter 6, Linear Regression Analysis, builds a linear relationship between an output and a set of explanatory variables. The linear regression model has many underlying assumptions, and these are verified using validation techniques. A model may be affected by a single observation, a single output value, or an explanatory variable. Statistical metrics that help remove one or more types of anomalies are discussed in depth. Given a large number of covariates, an efficient model is developed using model selection techniques. While the core stats package suffices in R, the statsmodels package is very useful in Python.
Chapter 7, The Logistic Regression Model, is useful as a classification model when the output is a binary variable. Diagnostics and model validation through residuals are used, leading to an improved model. ROC curves are discussed next; they help in identifying a better classification model. The R packages pscl and ROCR are useful, while pysal and sklearn are useful in Python.
Chapter 8, Regression Models with Regularization, discusses the problem of overfitting, which arises from the use of models developed in the previous two chapters. Ridge regression significantly reduces the probability of an overfitted model, and the development of natural spline models also lays the basis for the models considered in the next chapter. Regularization in R is achieved using the packages ridge and MASS, while sklearn and statsmodels help in Python.
Chapter 9, Classification and Regression Trees, provides a tree-based regression model. The trees are initially built using raw R functions, and the final trees are also reproduced using rudimentary code, leading to a clear understanding of the CART mechanism. The pruning procedure is illustrated in one of the languages, and the reader is encouraged to work out the equivalent in the other.
Chapter 10, CART and Beyond, considers two enhancements to CART: bagging and random forests. A consolidation of all the models from Chapter 6, Linear Regression Analysis, to Chapter 10, CART and Beyond, is also provided through a dataset. Ensemble methods are fast emerging as a very effective and popular machine learning technique, and working through them in both languages will improve the user's confidence.