Earlier, we mentioned R packages, which give us access to collections of functions for solving a specific problem. In this section, we present some packages that contain valuable resources for regression analysis. These packages will be analyzed in detail in the following chapters, where we will provide practical applications.
R packages for regression
The R stats package
R stats is a package that contains many useful functions for statistical calculations and random number generation. In the following table you will see some of the information on this package:
Package | stats
Date | October 3, 2017
Version | 3.5.0
Title | The R stats package
Author | R core team and contributors worldwide
The package contains a large number of functions, so we will only mention the ones most closely related to regression analysis:
- lm: This function is used to fit linear models. It can be used to carry out regression, single stratum analysis of variance, and analysis of co-variance.
- summary.lm: This function returns a summary for linear model fits.
- coef: This function extracts coefficients from objects returned by modeling functions. coefficients is an alias for it.
- fitted: This function extracts fitted values from objects returned by modeling functions. fitted.values is an alias for it.
- formula: This function provides a way of extracting formulae which have been included in other objects.
- predict: This function predicts values based on linear model objects.
- residuals: This function extracts model residuals from objects returned by modeling functions.
- confint: This function computes confidence intervals for one or more parameters in a fitted model. The stats package provides a default method and a method for objects inheriting from the lm class.
- deviance: This function returns the deviance of a fitted model object.
- influence.measures: This suite of functions can be used to compute some of the regression (leave-one-out deletion) diagnostics for linear and generalized linear models (GLM).
- lm.influence: This function provides the basic quantities used when forming a wide variety of diagnostics for checking the quality of regression fits.
- ls.diag: This function computes basic statistics, including standard errors, t-values, and p-values for the regression coefficients.
- glm: This function is used to fit GLMs, specified by giving a symbolic description of the linear predictor and a description of the error distribution.
- loess: This function fits a polynomial surface determined by one or more numerical predictors, using local fitting.
- loess.control: This function sets control parameters for loess fits.
- predict.loess: This function extracts predictions from a loess fit, optionally with standard errors.
- scatter.smooth: This function plots and adds a smooth curve computed by loess to a scatter plot.
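As a minimal sketch of how several of the functions listed above fit together, the following lines fit a simple linear model and then query it. The built-in mtcars dataset and the choice of mpg and wt as variables are assumptions made only for illustration:

```r
# Fit a simple linear model on the built-in mtcars dataset
# (fuel consumption as a function of car weight).
fit <- lm(mpg ~ wt, data = mtcars)

summary(fit)                 # coefficient table, R-squared, F-statistic
formula(fit)                 # the model formula stored in the object
coef(fit)                    # estimated intercept and slope
confint(fit, level = 0.95)   # confidence intervals for the coefficients
head(fitted(fit))            # fitted values for the first observations
head(residuals(fit))         # corresponding residuals
deviance(fit)                # for lm objects, the residual sum of squares

# Predict the response for two hypothetical new cars
new_cars <- data.frame(wt = c(2.5, 3.5))
predict(fit, newdata = new_cars, interval = "prediction")
```

For an lm object, deviance returns the residual sum of squares, so it agrees with summing the squared output of residuals.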
What we have analyzed are just some of the many functions contained in the stats package. As we can see, with the resources offered by this package we can build linear regression models (simple, multiple, and polynomial regression), as well as GLMs (such as logistic regression). We can also run model diagnostics to verify the plausibility of the classical assumptions underlying the regression model, and we can address local regression with a non-parametric approach that fits multiple regressions in local neighborhoods.
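To make the distinction between a GLM and a local regression concrete, here is a short sketch. It again assumes the built-in mtcars dataset; the variables and the span value are illustrative choices, not recommendations:

```r
# Logistic regression (a GLM with a binomial error distribution):
# probability of a manual transmission (am = 1) as a function of weight.
logit_fit <- glm(am ~ wt, data = mtcars, family = binomial)
coef(logit_fit)
deviance(logit_fit)

# Local regression: a local quadratic fit of mpg on wt.
lo_fit <- loess(mpg ~ wt, data = mtcars, span = 0.75, degree = 2)
lo_pred <- predict(lo_fit, newdata = data.frame(wt = 3), se = TRUE)
lo_pred$fit      # predicted value at wt = 3
lo_pred$se.fit   # its standard error

# Scatter plot with a loess curve drawn on top
scatter.smooth(mtcars$wt, mtcars$mpg, span = 0.75,
               xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
```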
The car package
This package includes many functions for: ANOVA analysis, matrix and vector transformations, printing readable tables of coefficients from several regression models, creating residual plots, tests for the autocorrelation of error terms, and many other general interest statistical and graphing functions.
In the following table you will see some of the information on this package:
Package | car
Date | June 25, 2017
Version | 2.1-5
Title | Companion to Applied Regression
Author | John Fox, Sanford Weisberg, and many others
The following are the most useful functions used in regression analysis contained in this package:
- Anova: This function returns ANOVA tables for linear models and GLMs
- linearHypothesis (linear.hypothesis in older releases): This function tests a linear hypothesis; methods are provided for linear models, GLMs, multivariate linear models, and linear and generalized linear mixed-effects models
- cookd: This function returns Cook's distances for linear models and GLMs; in recent releases of car it has been superseded by cooks.distance from the stats package
- outlierTest (outlier.test in older releases): This function reports the Bonferroni p-values for studentized residuals in linear models and GLMs, based on a t-test for linear models and a normal-distribution test for GLMs
- durbinWatsonTest (durbin.watson in older releases): This function computes residual autocorrelations and generalized Durbin-Watson statistics and their bootstrapped p-values
- leveneTest (levene.test in older releases): This function computes Levene's test for the homogeneity of variance across groups
- ncvTest (ncv.test in older releases): This function computes a score test of the hypothesis of constant error variance against the alternative that the error variance changes with the level of the response (fitted values), or with a linear combination of predictors
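The following sketch shows how these diagnostics might be applied to a fitted model. It assumes that the car package is installed and that a recent release is used, in which the dotted function names have been replaced by camel-case names (linearHypothesis, outlierTest, durbinWatsonTest, leveneTest, ncvTest); the model itself is only an illustration on mtcars:

```r
library(car)

# An illustrative multiple regression on the built-in mtcars dataset
fit <- lm(mpg ~ wt + hp, data = mtcars)

Anova(fit)                        # Type II ANOVA table
linearHypothesis(fit, "wt = 0")   # test a single linear restriction
outlierTest(fit)                  # Bonferroni p-value for the largest studentized residual
durbinWatsonTest(fit)             # autocorrelation of the error terms
ncvTest(fit)                      # score test for non-constant error variance

# Levene's test compares the variance of mpg across cylinder groups
leveneTest(mpg ~ factor(cyl), data = mtcars)
```

Because durbinWatsonTest bootstraps its p-value, the reported value can change slightly from run to run unless a random seed is set.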
What we have listed are just some of the many functions contained in the car package. This package also offers many functions for drawing explanatory graphs from the information extracted from regression models, as well as a series of functions for transforming variables.
The MASS package
This package includes many useful functions and example datasets, including functions for estimating linear models through generalized least squares (GLS), fitting negative binomial GLMs, the robust fitting of linear models, and Kruskal's non-metric multidimensional scaling.
In the following table you will see some of the information on this package:
Package | MASS
Date | October 2, 2017
Version | 7.3-47
Title | Support Functions and Datasets for Venables and Ripley's MASS
Author | Brian Ripley, Bill Venables, and many others
The following are the most useful functions used in regression analysis contained in this package:
- lm.gls: This function fits linear models by GLS
- lm.ridge: This function fits a linear model by ridge regression
- glm.nb: This function is a modification of the system function glm() that includes estimation of the additional parameter, theta, to give a negative binomial GLM
- polr: This function fits a logistic or probit regression model to an ordered factor response
- lqs: This function fits a regression to the good points in the dataset, thereby achieving a regression estimator with a high breakdown point
- rlm: This function fits a linear model by robust regression using an M-estimator
- glmmPQL: This function fits a generalized linear mixed model (GLMM) with multivariate normal random effects, using penalized quasi-likelihood (PQL)
- boxcox: This function computes and optionally plots profile log-likelihoods for the parameter of the Box-Cox power transformation for linear models
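A brief sketch of a few of these functions follows. It assumes the MASS package is installed (it ships with standard R distributions); the datasets (stackloss, mtcars, and quine from MASS) and the lambda grid are illustrative choices:

```r
library(MASS)

# Robust regression with an M-estimator on the stackloss data
rob_fit <- rlm(stack.loss ~ ., data = stackloss)
coef(rob_fit)

# Ridge regression over a grid of penalty values
ridge_fit <- lm.ridge(mpg ~ wt + hp + disp, data = mtcars,
                      lambda = seq(0, 10, by = 0.5))
MASS::select(ridge_fit)   # prints the HKB, L-W, and GCV choices of lambda

# Box-Cox profile log-likelihood, without drawing the plot
bc <- boxcox(mpg ~ wt, data = mtcars, plotit = FALSE)
bc$x[which.max(bc$y)]     # lambda with the highest profile log-likelihood

# Negative binomial GLM for a count response (days absent from school)
nb_fit <- glm.nb(Days ~ Sex + Age + Eth + Lrn, data = quine)
nb_fit$theta              # estimated dispersion parameter theta
```

The call to select is written as MASS::select because other commonly loaded packages export a function with the same name.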
As we have seen, this package contains many useful features for regression analysis; in addition, it provides numerous datasets that we can use for the examples in the following chapters.
The caret package
This package contains many functions to streamline the model training process for complex regression and classification problems. The package utilizes a number of R packages.
In the following table you will see listed some of the information on this package:
Package | caret
Date | September 7, 2017
Version | 6.0-77
Title | Classification and Regression Training
Author | Max Kuhn and many others
The most useful functions used in regression analysis in this package are as follows:
- train: Predictive models over different tuning parameters are fitted by this function. It fits each model, sets up a grid of tuning parameters for a number of classification and regression routines, and calculates a resampling-based performance measure.
- trainControl: This function sets the control parameters for train, such as the resampling method used to estimate performance (for example, cross-validation) and the number of folds or repeats.
- varImp: This function calculates variable importance for the objects produced by train; method-specific methods are also provided.
- defaultSummary: This function calculates performance across resamples. Given two numeric vectors of data, the root mean squared error and R-squared are calculated. For two factors, the overall agreement rate and Kappa are determined.
- knnreg: This function performs K-Nearest Neighbor (KNN) regression that can return the average value for the neighbors.
- plotObsVsPred: This function plots observed versus predicted results in regression and classification models.
- predict.knnreg: This function extracts predictions from the KNN regression model.
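The functions above are typically combined as in the following sketch. It assumes the caret package is installed; the dataset (mtcars), the 5-fold cross-validation, and the value k = 3 are illustrative assumptions:

```r
library(caret)

set.seed(123)   # resampling is random, so fix the seed for reproducibility

# 5-fold cross-validation as the resampling scheme
ctrl <- trainControl(method = "cv", number = 5)

# Linear regression evaluated by cross-validation
cv_fit <- train(mpg ~ wt + hp, data = mtcars,
                method = "lm", trControl = ctrl)
cv_fit$results   # resampled performance measures
varImp(cv_fit)   # variable importance for the fitted model

# Performance summary from observed and predicted values
perf <- defaultSummary(data.frame(obs  = mtcars$mpg,
                                  pred = predict(cv_fit, newdata = mtcars)))
perf

# KNN regression: the prediction is the average response of the 3 neighbors
knn_fit <- knnreg(mpg ~ wt + hp, data = mtcars, k = 3)
predict(knn_fit, newdata = data.frame(wt = 3, hp = 120))
```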
The caret package provides a unified interface to hundreds of machine learning algorithms (including regression algorithms), and offers useful and convenient methods for data visualization, data resampling, model tuning, and model comparison, among other features.
The glmnet package
This package contains many extremely efficient procedures in order to fit the entire Lasso or ElasticNet regularization path for linear regression, logistic and multinomial regression models, Poisson regression, and the Cox model. Multiple response Gaussian and grouped multinomial regression are the two recent additions.
In the following table you will see listed some of the information on this package:
Package | glmnet
Date | September 21, 2017
Version | 2.0-13
Title | Lasso and Elastic-Net Regularized Generalized Linear Models
Author | Jerome Friedman, Trevor Hastie, Noah Simon, Junyang Qian, and Rob Tibshirani
The following are the most useful functions used in regression analysis contained in this package:
- glmnet: A GLM is fit by this function via penalized maximum likelihood. The regularization path is computed for the Lasso or ElasticNet penalty at a grid of values for the regularization parameter lambda. This function can also deal with all shapes of data, including very large sparse data matrices. Finally, it fits linear, logistic and multinomial, Poisson, and Cox regression models.
- glmnet.control: This function views and/or changes the factory default parameters in glmnet.
- predict.glmnet: This function predicts fitted values, logits, coefficients, and more from a fitted glmnet object.
- print.glmnet: This function prints a summary of the glmnet path at each step along the path.
- plot.glmnet: This function produces a coefficient profile plot of the coefficient paths for a fitted glmnet object.
- deviance.glmnet: This function computes the deviance sequence from the glmnet object.
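As a rough illustration of the regularization path, the following sketch fits a Lasso model and inspects it. It assumes the glmnet package is installed; the predictors taken from mtcars and the value s = 0.5 for lambda are arbitrary choices for the example:

```r
library(glmnet)

# glmnet expects a numeric predictor matrix and a response vector
x <- as.matrix(mtcars[, c("wt", "hp", "disp", "drat", "qsec")])
y <- mtcars$mpg

# alpha = 1 gives the Lasso penalty; 0 < alpha < 1 gives ElasticNet
lasso_fit <- glmnet(x, y, family = "gaussian", alpha = 1)

print(lasso_fit)                               # summary at each step of the path
plot(lasso_fit, xvar = "lambda", label = TRUE) # coefficient profile plot
coef(lasso_fit, s = 0.5)                       # coefficients at lambda = 0.5
predict(lasso_fit, newx = x[1:3, ], s = 0.5)   # fitted values for three cars
deviance(lasso_fit)                            # deviance along the path
```

In practice, the value of lambda is usually chosen by cross-validation rather than fixed by hand as it is here.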
As we have mentioned, this package fits Lasso and ElasticNet model paths for linear, logistic, and multinomial regression using coordinate descent. The algorithm is extremely fast, and exploits sparsity in the input matrix where it exists. A variety of predictions can be made from the fitted models.
The sgd package
This package contains a fast and flexible set of tools for large scale estimation. It features many stochastic gradient methods, built-in models, visualization tools, automated hyperparameter tuning, model checking, interval estimation, and convergence diagnostics.
In the following table you will see listed some of the information on this package:
Package | sgd
Date | January 5, 2016
Version | 1.1
Title | Stochastic Gradient Descent for Scalable Estimation
Author | Dustin Tran, Panos Toulis, Tian Lian, Ye Kuang, and Edoardo Airoldi
The following are the most useful functions used in regression analysis contained in this package:
- sgd: This function runs Stochastic Gradient Descent (SGD) in order to optimize the induced loss function given a model and data
- print.sgd: This function prints objects of the sgd class
- predict.sgd: This function forms predictions using the estimated model parameters from SGD
- plot.sgd: This function plots objects of the sgd class
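A minimal sketch of the sgd function on simulated data is shown below. It assumes the sgd package is installed; the sample size, the number of predictors, and the true coefficient value of 2 are all made up for the illustration:

```r
library(sgd)

set.seed(42)

# Simulate a linear model with d predictors and all true coefficients equal to 2
N <- 1000
d <- 5
X <- matrix(rnorm(N * d), ncol = d)
theta <- rep(2, d + 1)                      # intercept plus d slopes
y <- cbind(1, X) %*% theta + rnorm(N)
dat <- data.frame(y = y, x = X)

# Estimate the linear model by stochastic gradient descent
sgd_fit <- sgd(y ~ ., data = dat, model = "lm")
sgd_fit$coefficients   # estimates, to be compared with the true value 2
```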
The BLR package
This package performs a particular form of linear regression named Bayesian linear regression, in which the statistical analysis is undertaken within the context of Bayesian inference.
In the following table you will see listed some of the information on this package:
Package | BLR
Date | December 3, 2014
Version | 1.4
Title | Bayesian Linear Regression
Author | Gustavo de los Campos, Paulino Perez Rodriguez
The following are the most useful objects for regression analysis contained in this package:
- BLR: This function was designed to fit parametric regression models using different types of shrinkage methods.
- sets: This is a data object rather than a function: a vector (599x1) that assigns observations to ten disjoint sets; the assignment was generated at random. It is used later to conduct a 10-fold cross-validation (CV).
The lars package
This package contains efficient procedures for fitting an entire Lasso sequence with the cost of a single least squares fit. Least angle regression and infinitesimal forward stagewise regression are related to the Lasso.
In the following table you will see listed some of the information on this package:
Package | lars
Date | April 23, 2013
Version | 1.2
Title | Least Angle Regression, Lasso and Forward Stagewise
Author | Trevor Hastie and Brad Efron
The following are the most useful functions used in regression analysis contained in this package:
- lars: This function fits least angle regression and Lasso and infinitesimal forward stagewise regression models.
- summary.lars: This function produces an ANOVA-type summary for a lars object.
- plot.lars: This function produces a plot of a lars fit. The default is a complete coefficient path.
- predict.lars: This function makes predictions or extracts coefficients from a fitted lars model.
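The four functions above can be tried together as in the following sketch. It assumes the lars package is installed; the predictors taken from mtcars and the fraction s = 0.5 are illustrative choices:

```r
library(lars)

# lars expects a numeric predictor matrix and a response vector
x <- as.matrix(mtcars[, c("wt", "hp", "disp", "drat", "qsec")])
y <- mtcars$mpg

# Fit the entire Lasso path
lasso_path <- lars(x, y, type = "lasso")

summary(lasso_path)   # ANOVA-type summary for each step
plot(lasso_path)      # complete coefficient path

# Predictions and coefficients halfway along the path,
# measured as a fraction of the final L1 norm of the coefficients
predict(lasso_path, newx = x[1:3, ], s = 0.5,
        type = "fit", mode = "fraction")$fit
coef(lasso_path, s = 0.5, mode = "fraction")
```

Setting type to "lar" or "forward.stagewise" in the call to lars fits the related least angle regression and forward stagewise models instead.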