Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Newsletter Hub
Free Learning
Arrow right icon
timer SALE ENDS IN
0 Days
:
00 Hours
:
00 Minutes
:
00 Seconds
Regression Analysis with R
Regression Analysis with R

Regression Analysis with R: Design and develop statistical nodes to identify unique relationships within data at scale

eBook
$9.99 $35.99
Paperback
$43.99
Subscription
Free Trial
Renews at $19.99p/m

What do you get with eBook?

Product feature icon Instant access to your Digital eBook purchase
Product feature icon Download this book in EPUB and PDF formats
Product feature icon Access this title in our online reader with advanced features
Product feature icon DRM FREE - Read whenever, wherever and however you want
OR
Modal Close icon
Payment Processing...
tick Completed

Billing Address

Table of content icon View table of contents Preview book icon Preview Book

Regression Analysis with R

Getting Started with Regression

Regression analysis is the starting point in data science. This is because regression models represent the most well-understood models in numerical simulation. Once we experience the workings of regression models, we will be able to understand all other machine learning algorithms. Regression models are easily interpretable as they are based on solid mathematical bases (such as matrix algebra for example). We will see in the following sections that linear regression allows us to derive a mathematical formula representative of the corresponding model. Perhaps this is why such techniques are extremely easy to understand.

Regression analysis is a statistical process done to study the relationship between a set of independent variables (explanatory variables) and the dependent variable (response variable). Through this technique, it will be possible to understand how the value of the response variable changes when the explanatory variable is varied.

Consider some data that is collected about a group of students, on: number of study hours per day, attendance at school, and scores on the final exam obtained. Through regression techniques, we can quantify the average increase in the final exam score when we add one more hour of study. Lower attendance in school (decreasing the student's experience) lowers the scores in the final exam.

A regression analysis can have two objectives:

  • Explanatory analysis: To understand and weigh the effects of the independent variable on the dependent variable according to a particular theoretical model
  • Predictive analysis: To locate a linear combination of the independent variable to predict the value assumed by the dependent variable optimally

In this chapter, we will be introduced to the basic concepts of regression analysis, and then we'll take a tour of the different types of statistical processes. In addition to this, we will also introduce the R language and cover the basics of the R programming environment. Finally we will explore the essential tools that R provides for understanding the amazing world of regression.

We will cover the following topics:

  • The origin of regression
  • Types of algorithms
  • How to quickly set up R for data science
  • R packages used throughout the book

At the end of this chapter, we will provide you with a working environment that is able to run all the examples contained in the following chapters. You will also get a clear idea about why regression analysis is not just an underrated technique taken from statistics, but a powerful and effective data science algorithm.

Going back to the origin of regression

The word regression sounds like go back in some ways. If you've been trying to quit smoking but then yield to the desire to have a cigarette again, you are experiencing an episode of regression. The term regression has appeared in early writings since the late 1300s, and is derived from the Latin term regressus, which means a return. It is used in different fields with different meanings, but in any case, it is always referred to as the action of regressing.

In philosophy, regression is used to indicate the inverse logical procedure with respect to that of the normal apodictic (apodixis), in which we proceed from the general to the particular. Whereas in reality, in regression we go back from the particular to the general, from the effect to the cause, from the conditioned to the condition. In this way, we can draw completely general conclusions from a particular case. This is the first form of generalization that, as we will see later, represents a crucial part of a statistical analysis.

For example, there is marine regression, which is a geological process that occurs as a result of areas of submerged seafloor being exposed more than the sea level. And there is the opposite event, marine transgression; it occurs when flooding from the sea submerges land that was previously exposed.

The following image shows the results of marine regression on the Aral lake.  This lake lies between Kazakhstan and Uzbekistan and has been steadily shrinking since the 1960s, after the rivers that fed it were diverted by Soviet irrigation projects. By 1997, it had declined to ten percent of its original size:

However, in statistics, regression is related to the study of the relationship between the explanatory variables and the response variable. Don't worry! This is the topic we will discuss in this book.

In statistics, the term regression has an ancient origin and a very special meaning. The man who coined this term was a certain Francis Galton, geneticist, who in 1889 published an article in which he demonstrated how every characteristic of an individual is inherited by their offspring, but on an average to a lesser degree. For example, children with tall parents are also tall, but on an average their height will be comparatively less than that of their parents. This phenomenon, also graphically described, is called regression. Since then, this term has lived on to define statistical techniques that analyze relationships between two or more variables.

The following table shows a short summary of the data used by Galton for his study:

Family Father Mother Gender Height Kids
1 78.5 67 M 73.2 4
1 78.5 67 F 69.2 4
1 78.5 67 F 69 4
1 78.5 67 F 69 4
2 75.5 66.5 M 73.5 4
2 75.5 66.5 M 72.5 4
2 75.5 66.5 F 65.5 4
2 75.5 66.5 F 65.5 4
3 75 64 M 71 2
3 75 64 F 68 2
4 75 64 M 70.5 5
4 75 64 M 68.5 5
4 75 64 F 67 5
4 75 64 F 64.5 5
4 75 64 F 63 5
5 75 58.5 M 72 6
5 75 58.5 M 69 6
5 75 58.5 M 68 6
5 75 58.5 F 66.5 6
5 75 58.5 F 62.5 6
5 75 58.5 F 62.5 6

 

Galton depicted in a chart the average height of the parents and that of the children, noting what he called regression in the data.

The following figure shows this data and a regression line that confirms his thesis:

The figure shows a scatter plot of data with two lines. The darker straight line represents the equality line, that is, the average height of the parents is equal to that of the children. The straighter dotted line, on the other hand, represents the linear regression line (defined by Galton) showing a regression in the height of the children relative to that of the parents. In fact, that line has a slope less than that of the continuous line.

Indeed, the children with tall parents are also tall, but on an average they are not as tall as their parents (the regression line is less than the equality line to the right of the figure). On the contrary, children with short parents are also short, but on an average, they are taller than their parents (the regression line is greater than the equality line to the left of the figure). Galton noted that, in both cases, the height of the children approached the average of the group. In Galton's words, this corresponded to the concept of regression towards mediocrity, hence the name of the statistical analysis technique.

The Galton universal regression law was confirmed by Karl Pearson (Galton's pupil), who collected more than a thousand heights of members of family groups. He found that the average height of the children from a group of short parents was greater than the average height of their fathers' group, and that the average height of the children from a group of tall fathers was lower than the average height of their parents' group. That is, the heights of the tall and short children were regressing toward the average height of all people. This is exactly what we have learned from the observation of the previous graph.

Regression in the real world

In general, statistics—and more specifically, regressionis a math discipline. Its purpose is to obtain information from data about knowledge, decisions, control, and the forecasting of events and phenomena. Unfortunately, statistical culture, and in particular statistical reasoning, are scarce and uncommon. This is due to the institutions that have included the study of this discipline in their programs and study plans inadequately. Often, inadequate learning methods are adopted since this is a rather complex and not very popular topic (as is the case with mathematics in general). 

The difficulties faced by students are often due to outdated teaching methods that are not in tune with our modern needs. In this book, we will learn how to deal with such topics with a modern approach, based on practical examples. In this way, all the topics will seem simple and within our reach.

Yet regression, given its cross-disciplinary characteristics, has numerous and varied areas of application, from psychology to agrarianism, and from economics to medicine and business management, just to name a few.

The purpose of regression as a statistical tool are of two types, synthesize and generalize, as shown in the following figure:

synthesize means predisposing collected data into a form (tables, graphs, or numerical summaries), which allows you to better understand the phenomena on which the detection was performed. The synthesis is met by the need to simplify, which in turn results from the limited ability of the human mind to handle articulated, complex, or multidimensional information. In this way, we can use techniques that allow for a global study of a large number of quantitative and qualitative information to highlight features, ties, differences, or associations between detected variables.

The second purpose (generalize) is to extend the result of an analysis performed on data of a limited group of statistical units (sample) to the entire population group (population).

The contribution of regression is not limited to the data analysis phase. It's true that added value is expressed in the formulation of research hypotheses, argumentation of theses, adoption of appropriate solutions and methodologies, choices of methods of detection, formulation of the sample, and the procedure of extending the results to the reference universes.

Keeping these phases under control means producing reliable and economically useful results, and mastering descriptive statistics and data analysis as well as inferential ones. In this regard, we recall that the descriptive statistics are concerned with describing the experimental data with few significant numbers or graphs. Therefore, they photographs a given situation and summarizes its salient characteristics. The inferential statistics use statistical data, also appropriately summarized by the descriptive statistics, to make probabilistic forecasts on future or otherwise uncertain situations.

People, families, businesses, public administrations, mayors, ministers, and researchers constantly make decisions. For most of them, the outcome is uncertain, in the sense that it is not known exactly what will result, although the expectation is that they will achieve the (positive) effects they are hoping for. Decisions would be better and the effects expected closer to those desired if they were made on the basis of relevant data in a decision-making context. Here are some applications of regression in the real world:

  • A student who graduates this year must choose the faculty and university degree course on which he/she will enroll. Perhaps he/she has already gained a vocation for his future profession, or studies have confirmed his/her predisposition for a particular discipline. Maybe a well-established family tradition advises him/her to follow the parent's profession. In these cases, the uncertainty of choice will be greatly reduced. However, if the student does not have genuine vocations or is not geared particularly to specific choices, he or she may want to know something about the professional outcomes of the graduates. In this regard, some statistical study on graduate data from previous years may help him/her make the decision.
  • A distribution company, such as a supermarket chain, wants to open a new sales outlet in a big city and must choose the best location. It will use and analyze numerous statistical data on the density of the population in different neighborhoods, the presence of young families, the presence of children under the age of six (if it is interested in selling to this category of consumers), and the presence of schools, offices, other supermarkets, and retail outlets.
  • Another company wants to invest its profits. It must make a portfolio choice. It has to decide whether to invest in government bonds, national shares, foreign securities, funds, or real estate. To make this choice, it will first conduct an analysis of the returns and risks of different investment alternatives based on statistical data. 
  • National governments are often called upon to make choices and decisions. To do this, they have statistical production equipment. They have population data and forecasts about population evolution over the coming years, which will calibrate their interventions. A strong decline in birth rates will, for example, recommend school consolidation policies; the emergence of children from the non-community component will signal the need for reviewing multi-ethnic programs and, more generally, school integration policies. On the other hand, statistical data on the presence of national products in foreign markets will suggest the need to export support actions or interventions to promote innovation and business competitiveness.

In the examples we have seen so far, the usefulness of statistical techniques, and particularly of regression in the most diverse working situations, is clear. It is therefore clear how much more information and data companies are required to have to ensure the rationality of decisions and economic behaviors by those who direct them.

Understanding regression concepts

Regression is an inductive learning task that has been widely studied and is widely used in practical applications. Unlike classification processes, where you are trying to predict discrete class labels, regression models predict numeric values.

From a set of data, we can find a model that describes it by the use of the regression algorithms. For example, we can identify a correspondence between input variables and output variables of a given system. One way to do this is to postulate the existence of some kind of mechanism for the parametric generation of data; this, however, does not contain the exact values of the parameters. This process typically makes reference to statistical techniques.

The extraction of general laws from a set of observed data is called induction, as opposed to deduction in which we start from general laws and try to predict the value of a set of variables. Induction is the fundamental mechanism underlying the scientific method in which we want to derive general laws (typically described in mathematical terms) starting from the observation of phenomena. In the following figure, we can see Peirce's triangle, which represents a scheme of relationships between reasoning patterns:

The observation of the phenomena includes the measurement of a set of variables, and therefore the acquisition of data that describes the observed phenomena. Then, the resulting model can be used to make predictions on additional data. The overall process in which, starting from a set of observations, we aim to make predictions on new situations, is called inference.

Therefore, inductive learning starts with observations arising from the surrounding environment that, hopefully, are also valid for not-yet-observed cases.

We have already anticipated the stages of the inference process; now let's analyze them in detail through the workflow setting. When developing an application that uses regression algorithms, we will follow a procedure characterized by the following steps:

  1. Collect the data: Everything starts with the data—no doubt about itbut one might wonder where so much data comes from. In practice, data is collected through lengthy procedures that may, for example, be derived from measurement campaigns or face-to-face interviews. In all cases, data is collected in a database so that it can then be analyzed to obtain knowledge.

If we do not have specific requirements, and to save time and effort, we can use publicly available data. In this regard, a large collection of data is available at the UCI Machine Learning Repository, at the following link: http://archive.ics.uci.edu/ml.

The following figure shows the regression process workflow:

  1. Preparing the data: We have collected the data; now we have to prepare it for the next step. Once you have this data, you must make sure it is in a format usable by the algorithm you want to use. To do this, you may need to do some formatting. Recall that some algorithms need data in an integer format, whereas some require it in the form of strings, and finally others need it to be in a special format. We will get to this later, but the specific formatting is usually simple compared to data collection.
  1. Exploring the data: At this point, we can look at the data to verify that it is actually working and we do not have a bunch of empty values. In this step, through the use of plots, we can recognize any patterns or check whether there are some data points that are vastly different from the rest of the set. Plotting data in one, two, or three dimensions can also help.
  2. Training the algorithm: At this stage, it starts to get serious. The regression algorithm begins to work with the definition of the model and the next training step. The model starts to extract knowledge from the large amounts of data that we have available.
  3. Testing the algorithm: In this step, we use the information learned in the previous step to see whether the model actually works. The evaluation of an algorithm is for seeing how well the model approximates the real system. In the case of regression techniques, we have some known values that we can use to evaluate the algorithm. So, if we are not satisfied, we can return to the previous steps, change some things, and retry the test.
  4. Evaluating the algorithm: We have reached the point where we can apply what has been done so far. We can assess the approximation ability of the model by applying it to real data. The model, preventively trained and tested, is then valued in this phase.
  5. Improving algorithm performance: Finally, we can focus on finishing the work. We have verified that the model works, we have evaluated the performance, and now we are ready to analyze it completely to identify possible room for improvement.

The generalization ability of the regression model is crucial for all other machine learning algorithms as well. Regression algorithms must not only detect the relationships between the target function and attribute values ​​in the training set, but also generalize them so that they may be used to predict new data.

It should be emphasized that the learning process must be able to capture the underlying regimes from the training set and not the specific details. Once the learning process is completed through training, the effectiveness of the model is tested further on a dataset named testset.

Regression versus correlation

At the beginning of the chapter, we said that regression analysis is a statistical process for studying the relationship between variables. When considering two or more variables, one can examine the type and intensity of the relationships that exist between them. In the case where two quantitative variables are simultaneously detected for each individual, it is possible to check whether they vary simultaneously and what mathematical relationship exists between them.

The study of such relationships can be conducted through two types of analysis: regression analysis and correlation analysis. Let's understand the differences between these two types of analysis.

Regression analysis develops a statistical model that can be used to predict the values of a variable, called dependent, or more rarely predict and identify the effect based on the values of the other variable, called independent or explanatory, identified as the cause.

With correlation analysis, we can measure the intensity of the association between two quantitative variables that are usually not directly linked to the cause-effect and easily mediated by at least one third variable, but that vary jointly. In the following figure, the meanings of two analyses are shown:

In many cases, there are variables that tend to covariate, but what more can be said about this? If there is a direct dependency relation between two variables—such as whether a variable (dependent) can be determined as a function of a second variable (independent)then a regression analysis can be used to understand the degree of the relation. For example, blood pressure depends on the subject's age. The existing data used in regression and mathematical equations is defined. Using these equations, the value of one variable can be predicted for the value of one or more variables. This equation can therefore also be used to extract knowledge from the existing data and to predict the outcomes of observations that have not been seen or tested before.

Conversely, if there is no direct dependency relation between variablessuch as none of the two variables causing direct variations in the otherthe covariate tendency is measured in correlation. For example, there is no direct relationship between the length and weight of an organism, in the sense that the weight of an organism can increase independently of its length. Correlation measures the association between two variables and it quantifies the strength of the relationship. To do this, it evaluates only the existing data.

To better understand the differences, we will analyze in detail the two examples just suggested. For the first example, we list the blood pressure of subjects of different ages, as shown in the following table:

Age

Blood pressure

18

117

22

120

27

122

32

123

37

124

42

125

47

127

52

131

55

133

62

134

68

135

A suitable graphical representation called a scatter plot can be used to depict the relationships between quantitative variables detected on the same units. In the following chapters, we will see in practice how to plot a scatter plot, so let's just make sense of the meaning.

We display this data on a scatter plot, as shown in the following figure:

By analyzing the scatter plot, we can answer the following:

  • Is there a relationship that can be described by a straight line?
  • Is there a relationship that is not linear?
  • If the scatter plot of the variables looks similar to a cloud, it indicates that no relationship exists between both variables and one would stop at this point

The scatter plot shows a strong, positive, and linear association between Age and Blood pressure. In fact, as age increases, blood pressure increases.

Let's now look at what happens in the other example. As we have anticipated, there is no direct relationship between the length and weight of an organism, in the sense that the weight of an organism can increase independently of its length.

We can see an example of an organism's population here:

Weight is not related to length; this means that there is no correlation between the two variables, and we conclude that length is responsible for none of the changes in weight.

Correlation provides a numerical measure of the relationship between two continuous variables. The resulting correlation coefficient does not depend on the units of the variables and can therefore be used to compare any two variables regardless of their units.

Discovering different types of regression

As mentioned before, regression analysis is a statistical process for studying the relationship between a set of independent variables (explanatory variables) and the dependent variable (response variable). Through this technique, it will be possible to  understand how the value of the response variable changes when the explanatory variable is varied.

The power of regression techniques is due to the quality of their algorithms, which have been improved and updated over the years. These are divided into several main types, depending on the nature of the dependent and independent variables used or the shape of the regression line.

The reason for such a wide range of regression techniques is the variety of cases to be analyzed. Each case is based on data with specific characteristics, and each analysis is characterized by specific objectives. These specifications require the use of different types of regression techniques to obtain the best results.

How do we distinguish between different types of regression techniques? Previously, we said that a first distinction can be made based on the form of the regression line. Based on this feature, the regression analysis is divided into linear regression and nonlinear regression, as shown in the following figure (linear regression to the left and nonlinear quadratic regression to the right):

It's clear that the shape of the regression line is dependent on the distribution of data. There are cases where a straight line is the regression line that best approximates the data, while in other cases, you need to fall into a curve to get the best approximation. That said, it is easy to understand that a visual analysis of the distribution of data we are going to analyze is a good practice to be done in advance. By summarizing the shape of distribution, we can distinguish the type of regression between the following:

  • Linear regression
  • Nonlinear regression

Let us now analyze the nature of the variables involved. In this regard, a question arises spontaneously: can the number of explanatory variables affect the choice of regression technique? The answer to this question is surely positive. For example, in the case of linear regression, if there is only one input variable, then we will do simple linear regression. If, instead, the input variables are two or more, we will need to perform multiple linear regression.

By summarizing, a simple linear regression shows the relationship between a dependent variable Y and an independent variable X. A multiple regression model shows the relationship between a dependent variable Y and multiple independent variables X. In the following figure, the types of regression imposed from the Number of the explanatory variables are shown:

What if we have multiple response variables rather than explanatory variables? In that case, we move from univariate models to multivariate models. As suggested by the name itself, multivariate regression is a technique with the help of which a single regression model can be estimated with more than one response variable. When there is more than one explanatory variable in a multivariate regression model, the model is a multivariate multiple regression.

Finally, let's see what happens when we analyze the type of variables. Usually, regression analysis is used when you want to predict a continuous response variable from a number of explanatory variables, also continuous. But this is not a limitation of regression, in the sense that such analysis is also applicable when categorical variables are at stake.

In the case of a dichotomous explanatory variable (which takes a value of zero or one), the solution is immediate. There are already two numbers (zero and one) associated to this variable, so the regression is immediately applicable. Categorical explanatory variables with more than two values can also be used in regression analyses; however, before they can be used, they need to be converted into variables that have only two levels (such as zero and one). This is called dummy coding or indicator variables.

Logistic regression should be used if the response variable is dichotomous.

The R environment

After introducing the main topic of the book, it is time to discover the programming environment we will use for our regression analysis. As specified by the title of the book, we will perform our examples in the R environment.

R is an interpreted programming language that allows the use of common facilities for controlling the flow of information, and modular programming using a large number of functions. Most of the available features are written in R. It is also possible for the user to interface with procedures written in C, C ++, or FORTRAN.

R is a GNU project that draws its inspiration from the S language; in fact R is similar to S and can be considered as a different implementation of S. R was developed by John Chambers and his colleagues at Bell Laboratories. There are some significant differences between R and S, but a large amount of the code written for S runs unaltered under R too. R is available as free software under the terms of the Free Software Foundation's (FSF) GNU General Public License (GPL) in source code form.

Let's specify the definition we just introduced to present R. In fact, R represents a programming environment originally developed for statistical computation and for producing quality graphs. It consists of a language and runtime environment with a graphical interface, a debugger, and access to some system features. It provides the ability to run programs stored in script files. In the following figure, the R version 3.4.1 interface is shown:

R is an integrated computing system whose resources allow you to:

  • Specify a set of commands and require these to run
  • View results in text format
  • View the charts in an auxiliary window
  • Access external archives, even the web-based ones, to capture documents, data, and charts
  • Permanently store results and/or graphics

What makes R so useful and explains its quick appreciation from users? The reason lies in the fact that statisticians, engineers, and scientists who have used the software over time have developed a huge collection of scripts grouped in packages. Packages written in R are able to add advanced algorithms, textured color graphics, and data mining techniques to better analyze the information contained in a database.

Its popularity among developers has resulted in its rapid rise among the most widely used programming languages, so over the last 1 year, the TIOBE index has been well ahead by seven positions. The TIOBE programming community index is a way to measure how popular a programming language is, and is updated once a month. 

The TIOBE programming community index is available at the following link: https://www.tiobe.com/tiobe-index/.

These ratings are derived on the basis of the number of skilled engineers worldwide, courses, and third-party vendors, and are calculated using popular search engines such as Google, Bing, Yahoo!, Wikipedia, Amazon, YouTube, and Baidu. In the following figure, the first fifteen positions of the TIOBE index for January, 2018 are shown:

R is an open source program, and its popularity reflects a change in the type of software used within the company. We remind you that open source software is free from any constraints not only on its use, but more importantly, on its development. Large IT companies such as IBM, Hewlett-Packard, and Dell are able to earn billions of dollars a year by selling servers running the Linux open source operating system, which is Microsoft Windows' competitor.

The widespread use of open source software is now a consolidated reality. Most websites are managed through an open source web application called Apache, and companies are increasingly relying on the MySQL open source database to store information. Finally, many people see the end results of this technology using the open source Firefox web browser.

R is in many respects similar to other programming languages, such as C, Java, and Perl. Because of its features, it is extremely simple and fast to do a wide range of computing tasks, as it provides quick access through various commands. For statisticians, however, R is particularly useful because it contains a number of integrated mechanisms (built-in functions) to organize data, to perform calculations on information, and to create graphic representations of databases.

Indeed, R is highly extensible and provides a wide variety of statistical (linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, and so on) and graphical techniques. There is also a wide range of features that provide a flexible graphical environment for creating various types of data presentations. Additional forms of (add-on) packages are available for a variety of specific purposes. These are topics that we will look at in the following sections.

Installing R

After that detailed description of the programming environment R, it is time to install it on our machine. To do this, we will have to get the installation package first.

The packages we will need to install are available on the official website of the language, Comprehensive R Archive Network (CRAN), at the following URL: https://www.r-project.org/.

CRAN is a network of File Transfer Protocol (FTP) and web servers located around the world that stores identical source and documentation versions of R. CRAN is directly accessible from R's site, and on this site you can also find information about R, some technical manuals, the R magazine, and details about R-developed packages that are stored in CRAN repositories.

Of course, before you download the software versions, we will have to inform you of the type of machine you need and the operating system that must be installed on it. Remember, however, that R is practically available for all operating systems in circulation. In the following screenshot, the CRAN web page is shown:

In the drafting period of this book, the current version of the environment R is 3.4.1, which represents the stable one, and that is why, in the examples that will accompany us in the subsequent sections, we will refer to that version.

The following list shows the OSs supported:

  • Windows
  • macOS
  • Unix

In computer science, installation is the procedure whereby the software is copied and configured on the machine. Generally, the software is distributed as a compressed file package, which includes an interface that facilitates and automates the installation (installer).

The installation creates folders on the disk, where all the files used for the program configuration are contained, and the links to make it easier to execute and write the necessary configuration parameters. In the following screenshot, we can see CRAN with all the tools needed for proper software installation:

There are essentially two ways to install R:

  • Using existing distributions in the form of binaries
  • Using source code

Using precompiled binary distribution

Binary distribution is the simplest choice; it works on most machines and will be the one we will use to make the job as simple as possible. This is a compiled version of R which can be downloaded and installed directly on our system.

Installing on Windows

For the Windows operating system, this version looks like a single EXE file (downloadable from the CRAN site), which can be easily installed with a double-click on it and by following the few steps of the installation. These are the automated installation procedures, the so-called installers, through which the installation phase of the software is reduced by the user to the need to have clicked on the buttons a number of times. Once the process is completed, you can start using R via the icon that will appear on the desktop or through the link available in the list of programs that can be used in our system.

Installing on macOS

Similarly, for macOS, R is available with a unique installation file with a PKG extension; it can be downloaded and installed on our system. The following screenshot shows the directory containing binaries for a base distribution and packages to run on macOS X (release 10.6 and later) extracted from the CRAN website:

Installing on Linux

For a Linux system, there are several versions of the installation file. In the download section, you must select the appropriate version of R, according to the Linux distribution installed on your machine. Installation packages are available in two main formats, .rpm file for Fedora, SUSE, and Mandriva, and .deb extensions for Ubuntu, Debian, and Linux Mint.

Installation from source code

R's installation from source code is available for all supported platforms, though it is not as easy to perform compared to the binary distribution we've just seen. It is especially hard on Windows, since the installation tools are not part of the system.

Detailed information on installation procedures from source code for Windows, and necessary tools, are available on the CRAN website, at https://cran.r-project.org/doc/manuals/r-release/R-admin.html.

On Unix-like systems, the process, on the other hand, is much simpler; the installation must be done following the usual procedure, which uses the following commands:

./configure
make
make install

These commands, assuming that compilers and support libraries are available, lead to the proper installation of the R environment on our system.

RStudio

To program with R, we can use any text editor and a simple command-line interface. Both of these tools are already present on any operating system, so if you don't want to install anything more, you will be able to ignore this step.

Some find it more convenient to use an Integrated Development Environment (IDE); in this case, as there are several available, both free and paid, you'll be spoilt for choice. Having to make a choice, I prefer the RStudio environment.

RStudio is a set of integrated tools designed to help you be more productive with R. It includes a console, syntax-highlighting editor that supports direct code execution, and a variety of robust tools for plotting, viewing history, debugging, and managing your workspace.

RStudio is available at the following URL: https://www.rstudio.com/.

In the following screenshot you will see the main page of the RStudio website:

This is a popular IDE available in the open source, free, and commercial versions, which works on most operating systems. RStudio is probably the only development environment developed specifically for R. It is available for all major platforms (Windows, Linux, and macOS X) and can run on a local machine such as our computer or even on the web using RStudio Server. With RStudio Server, you can provide a browser-based interface (the so-called IDE) to an R version running on a remote Linux server. It integrates several features that are really useful, especially if you use R for more complex projects or if you want to have more than one developer on a project.

The environment is made up of four different areas:

  • Scripting area: In this area we can open, create, and write our scripts
  • Console area: This zone is the actual R console where commands are executed
  • Workspace History area: In this area you can find a list of all the objects created in the workspace where we are working
  • Visualization area: In this area we can easily load packages and open R help files, but more importantly, we can view charts

The following screenshot shows the RStudio environment:

With the use of RStudio, our work will be considerably simplified, displaying all the resources needed in one window.

R packages for regression

Previously, we have mentioned the R packages, which allow us to access a series of features to solve a specific problem. In this section, we will present some packages that contain valuable resources for regression analysis. These packages will be analyzed in detail in the following chapters, where we will provide practical applications.

The R stats package

R stats is a package that contains many useful functions for statistical calculations and random number generation. In the following table you will see some of the information on this package:

Package

stats

Date

October 3, 2017

Version

3.5.0

Title                    

 

The R stats package

 

Author

 

R core team and contributors worldwide

 

 

There are so many functions in the package; we will only mention the ones that are closest to regression analysis. These are the most useful functions used in regression analysis:

  • lm: This function is used to fit linear models. It can be used to carry out regression, single stratum analysis of variance, and analysis of co-variance.
  • summary.lm: This function returns a summary for linear model fits.
  • coef: With the help of this function, coefficients from objects returned by modeling functions can be extracted. Coefficients is an alias for it.
  • fitted: Fitted values are extracted by this function from objects returned by modeling functions fitted. Values are an alias for it.
  • formula: This function provides a way of extracting formulae which have been included in other objects.
  • predict: This function predicts values based on linear model objects.
  • residuals: This function extracts model residuals from objects returned by modeling functions.
  • confint: This function computes confidence intervals for one or more parameters in a fitted model. Base has a method for objects inheriting from the lm class.
  • deviance: This function returns the deviance of a fitted model object.
  • influence.measures: This suite of functions can be used to compute some of the regression (leave-one-out deletion) diagnostics for linear and generalized linear models (GLM).
  • lm.influence: This function provides the basic quantities used when forming a wide variety of diagnostics for checking the quality of regression fits.
  • ls.diag: This function computes basic statistics, including standard errors, t-values, and p-values for the regression coefficients.
  • glm: This function is used to fit GLMs, specified by giving a symbolic description of the linear predictor and a description of the error distribution.
  • loess: This function fits a polynomial surface determined by one or more numerical predictors, using local fitting.
  • loess.control: This function sets control parameters for loess fits.
  • predict.loess: This function extracts predictions from a loess fit, optionally with standard errors.
  • scatter.smooth: This function plots and adds a smooth curve computed by loess to a scatter plot.

What we have analyzed are just some of the many functions contained in the stats package. As we can see, with the resources offered by this package we can build a linear regression model, as well as GLMs (such as multiple linear regression, polynomial regression, and logistic regression). We will also be able to make model diagnosis in order to verify the plausibility of the classic hypotheses underlying the regression model, but we can also address local regression models with a non-parametric approach that suits multiple regressions in the local neighborhood.

The car package

This package includes many functions for: ANOVA analysis, matrix and vector transformations, printing readable tables of coefficients from several regression models, creating residual plots, tests for the autocorrelation of error terms, and many other general interest statistical and graphing functions.

In the following table you will see some of the information on this package:

Package

car

Date

June 25, 2017

Version

2.1-5

Title                    

Companion to Applied Regression

Author

John Fox, Sanford Weisberg, and many others

 

The following are the most useful functions used in regression analysis contained in this package:

  • Anova: This function returns ANOVA tables for linear and GLMs
  • linear.hypothesis: This function is used for testing a linear hypothesis and methods for linear models, GLMs, multivariate linear models, and linear and generalized linear mixed-effects models
  • cookd: This function returns Cook's distances for linear and GLMs
  • outlier.test: This function reports the Bonferroni p-values for studentized residuals in linear and GLMs, based on a t-test for linear models and a normal-distribution test for GLMs
  • durbin.watson: This function computes residual autocorrelations and generalized Durbin-Watson statistics and their bootstrapped p-values
  • levene.test: This function computes Levene's test for the homogeneity of variance across groups
  • ncv.test: This function computes a score test of the hypothesis of constant error variance against the alternative that the error variance changes with the level of the response (fitted values), or with a linear combination of predictors

What we have listed are just some of the many functions contained in the stats package. In this package, there are also many functions that allow us to draw explanatory graphs from information extracted from regression models as well as a series of functions that allow us to make variables transformations.

The MASS package

This package includes many useful functions and data examples, including functions for estimating linear models through generalized least squares (GLS), fitting negative binomial linear models, the robust fitting of linear models, and Kruskal's non-metric multidimensional scaling.

In the following table you will see some of the information on this package:

Package

MASS

Date

October 2, 2017

Version

7.3-47

Title                    

Support Functions and Datasets for Venables and Ripley's MASS

Author

Brian Ripley, Bill Venables, and many others

 

The following are the most useful functions used in regression analysis contained in this package:

  • lm.gls: This function fits linear models by GLS
  • lm.ridge: This function fist a linear model by Ridge regression
  • glm.nb: This function contains a modification of the system function 
  • glm(): It includes an estimation of the additional parameter, theta, to give a negative binomial GLM
  • polr: A logistic or probit regression model to an ordered factor response is fitted by this function
  • lqs: This function fits a regression to the good points in the dataset, thereby achieving a regression estimator with a high breakdown point
  • rlm: This function fits a linear model by robust regression using an M-estimator
  • glmmPQL: This function fits a GLMM model with multivariate normal random effects, using penalized quasi-likelihood (PQL
  • boxcox: This function computes and optionally plots profile log-likelihoods for the parameter of the Box-Cox power transformation for linear models

As we have seen, this package contains many useful features in regression analysis; in addition there are numerous datasets that we can use for our examples that we will encounter in the following chapters.

The caret package

This package contains many functions to streamline the model training process for complex regression and classification problems. The package utilizes a number of R packages.

In the following table you will see listed some of the information on this package:

Package

caret

Date

September 7, 2017

Version

6.0-77

Title                    

Classification and Regression Training

Author

Max Kuhn and many others

 

The most useful functions used in regression analysis in this package are as follows:

  • train: Predictive models over different tuning parameters are fitted by this function. It fits each model, sets up a grid of tuning parameters for a number of classification and regression routines, and calculates a resampling-based performance measure.
  • trainControl: This function permits the estimation of parameter coefficients with the help of resampling methods like cross-validation.
  • varImp: This function calculates variable importance for the objects produced by train and method-specific methods.
  • defaultSummary: This function calculates performance across resamples. Given two numeric vectors of data, the mean squared error and R-squared error are calculated. For two factors, the overall agreement rate and Kappa are determined.
  • knnreg: This function performs K-Nearest Neighbor (KNN) regression that can return the average value for the neighbors.
  • plotObsVsPred: This function plots observed versus predicted results in regression and classification models.
  • predict.knnreg: This function extracts predictions from the KNN regression model.

The caret package contains hundreds of machine learning algorithms (also for regression), and renders useful and convenient methods for data visualization, data resampling, model tuning, and model comparison, among other features.

The glmnet package

This package contains many extremely efficient procedures in order to fit the entire Lasso or ElasticNet regularization path for linear regression, logistic and multinomial regression models, Poisson regression, and the Cox model. Multiple response Gaussian and grouped multinomial regression are the two recent additions.

In the following table you will see listed some of the information on this package:

Package

glmnet

Date

September 21, 2017

Version

2.0-13

Title                    

Lasso and Elastic-Net Regularized Generalized Linear Models

Author

Jerome Friedman, Trevor Hastie, Noah Simon, Junyang Qian, and Rob Tibshirani

 

The following are the most useful functions used in regression analysis contained in this package:

  • glmnet:  A GLM is fit by this function via penalized maximum likelihood. The regularization path is computed for the Lasso or ElasticNet penalty at a grid of values for the regularization parameter lambda. This function can also deal with all shapes of data, including very large sparse data matrices. Finally, it fits linear, logistic and multinomial, Poisson, and Cox regression models.
  • glmnet.control: This function views and/or changes the factory default parameters in glmnet.
  • predict.glmnet: This function predicts fitted values, logits, coefficients, and more from a fitted glmnet object.
  • print.glmnet: This function prints a summary of the glmnet path at each step along the path.
  • plot.glmnet: This function produces a coefficient profile plot of the coefficient paths for a fitted glmnet object.
  • deviance.glmnet: This function computes the deviance sequence from the glmnet object.

As we have mentioned, this package fits Lasso and ElasticNet model paths for regression, logistic, and multinomial regression using coordinate descent. The algorithm is extremely fast, and exploits sparsity in the input matrix where it exists. A variety of predictions can be made from the fitted models.

The sgd package

This package contains a fast and flexible set of tools for large scale estimation. It features many stochastic gradient methods, built-in models, visualization tools, automated hyperparameter tuning, model checking, interval estimation, and convergence diagnostics.

In the following table you will see listed some of the information on this package:

Package

sgd

Date

January 5, 2016

Version

1.1

Title                    

Stochastic Gradient Descent for Scalable Estimation

Author

Dustin Tran, Panos Toulis, Tian Lian, Ye Kuang, and Edoardo Airoldi

 

The following are the most useful functions used in regression analysis contained in this package:

  • sgd: This function runs Stochastic Gradient Descent (SGD) in order to optimize the induced loss function given a model and data
  • print.sgd: This function prints objects of the sgd class
  • predict.sgd: This function forms predictions using the estimated model parameters from SGD
  • plot.sgd: This function plots objects of the sgd class

The BLR package

This package performs a special case of linear regression named Bayesian linear regression. In Bayesian linear regression, the statistical analysis is undertaken within the context of a Bayesian inference.

In the following table you will see listed some of the information on this package:

Package

BLR

Date

December 3, 2014

Version

1.4

Title                    

Bayesian Linear Regression

Author

Gustavo de los Campos, Paulino Perez Rodriguez

The following are the most useful functions used in regression analysis contained in this package:

  • BLR: This function was designed to fit parametric regression models using different types of shrinkage methods.
  • sets: This is a vector (599x1) that assigns observations to ten disjointed sets; the assignment was generated at random. This is used later to conduct a 10-fold CV.

The Lars package

This package contains efficient procedures for fitting an entire Lasso sequence with the cost of a single least squares fit. Least angle regression and infinitesimal forward stagewise regression are related to the Lasso.

In the following table you will see listed some of the information on this package:

Package

Lars

Date

April 23, 2013

Version

1.2

Title                    

Least Angle Regression, Lasso and Forward Stagewise

Author

Trevor Hastie and Brad Efron

The following are the most useful functions used in regression analysis contained in this package:

  • lars: This function fits least angle regression and Lasso and infinitesimal forward stagewise regression models.
  • summary.lars: This function produces an ANOVA-type summary for a lars object.
  • plot.lars: This function produce a plot of a lars fit. The default is a complete coefficient path.
  • predict.lars: This function make predictions or extracts coefficients from a fitted lars model.

Summary

In this chapter, we were introduced to the basic concepts of regression analysis, and then we discovered different types of statistical processes. Starting from the origin of the term regression, we explored the meaning of this type of analysis. We have therefore gone through analyzing the real cases in which it is possible to extract knowledge from the data at our disposal.

Then, we analyzed how to build regression models step by step. Each step of the workflow was analyzed to understand the meaning and the operations to be performed. Particular emphasis was devoted to the generalization ability of the regression model, which is crucial for all other machine learning algorithms. 

We explored the fundamentals of regression analysis and correlation analysis, which allows us to study the relationships between variables. We understood the differences between these two types of analysis.  

In addition, an introduction, background information, and basic knowledge of the R environment were covered. Finally, we explored a number of essential packages that R provides for understanding the amazing world of regression.

In the next chapter, we will approach regression with the simplest algorithm: simple linear regression. First, we will describe a regression problem in terms of where to fit a regressor and then we will provide some intuitions underneath the math formulation. Then, the reader will learn how to tune the model for high performance and deeply understand every parameter of it. Finally, some tricks will be described to lower the complexity and scale the approach.

Left arrow icon Right arrow icon
Download code icon Download Code

Key benefits

  • Implement different regression analysis techniques to solve common problems in data science - from data exploration to dealing with missing values
  • From Simple Linear Regression to Logistic Regression - this book covers all regression techniques and their implementation in R
  • A complete guide to building effective regression models in R and interpreting results from them to make valuable predictions

Description

Regression analysis is a statistical process which enables prediction of relationships between variables. The predictions are based on the casual effect of one variable upon another. Regression techniques for modeling and analyzing are employed on large set of data in order to reveal hidden relationship among the variables. This book will give you a rundown explaining what regression analysis is, explaining you the process from scratch. The first few chapters give an understanding of what the different types of learning are – supervised and unsupervised, how these learnings differ from each other. We then move to covering the supervised learning in details covering the various aspects of regression analysis. The outline of chapters are arranged in a way that gives a feel of all the steps covered in a data science process – loading the training dataset, handling missing values, EDA on the dataset, transformations and feature engineering, model building, assessing the model fitting and performance, and finally making predictions on unseen datasets. Each chapter starts with explaining the theoretical concepts and once the reader gets comfortable with the theory, we move to the practical examples to support the understanding. The practical examples are illustrated using R code including the different packages in R such as R Stats, Caret and so on. Each chapter is a mix of theory and practical examples. By the end of this book you will know all the concepts and pain-points related to regression analysis, and you will be able to implement your learning in your projects.

Who is this book for?

This book is intended for budding data scientists and data analysts who want to implement regression analysis techniques using R. If you are interested in statistics, data science, machine learning and wants to get an easy introduction to the topic, then this book is what you need! Basic understanding of statistics and math will help you to get the most out of the book. Some programming experience with R will also be helpful

What you will learn

  • Get started with the journey of data science using Simple linear regression
  • Deal with interaction, collinearity and other problems using multiple linear regression
  • Understand diagnostics and what to do if the assumptions fail with proper analysis
  • Load your dataset, treat missing values, and plot relationships with exploratory data analysis
  • Develop a perfect model keeping overfitting, under-fitting, and cross-validation into consideration
  • Deal with classification problems by applying Logistic regression
  • Explore other regression techniques – Decision trees, Bagging, and Boosting techniques
  • Learn by getting it all in action with the help of a real world case study.

Product Details

Country selected
Publication date, Length, Edition, Language, ISBN-13
Publication date : Jan 31, 2018
Length: 422 pages
Edition : 1st
Language : English
ISBN-13 : 9781788622707
Category :
Languages :

What do you get with eBook?

Product feature icon Instant access to your Digital eBook purchase
Product feature icon Download this book in EPUB and PDF formats
Product feature icon Access this title in our online reader with advanced features
Product feature icon DRM FREE - Read whenever, wherever and however you want
OR
Modal Close icon
Payment Processing...
tick Completed

Billing Address

Product Details

Publication date : Jan 31, 2018
Length: 422 pages
Edition : 1st
Language : English
ISBN-13 : 9781788622707
Category :
Languages :

Packt Subscriptions

See our plans and pricing
Modal Close icon
$19.99 billed monthly
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Simple pricing, no contract
$199.99 billed annually
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Choose a DRM-free eBook or Video every month to keep
Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just $5 each
Feature tick icon Exclusive print discounts
$279.99 billed in 18 months
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Choose a DRM-free eBook or Video every month to keep
Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just $5 each
Feature tick icon Exclusive print discounts

Frequently bought together


Stars icon
Total $ 136.97
Regression Analysis with R
$43.99
R Data Mining
$48.99
Data Analysis with R, Second Edition
$43.99
Total $ 136.97 Stars icon
Banner background image

Table of Contents

10 Chapters
Getting Started with Regression Chevron down icon Chevron up icon
Basic Concepts – Simple Linear Regression Chevron down icon Chevron up icon
More Than Just One Predictor – MLR Chevron down icon Chevron up icon
When the Response Falls into Two Categories – Logistic Regression Chevron down icon Chevron up icon
Data Preparation Using R Tools Chevron down icon Chevron up icon
Avoiding Overfitting Problems - Achieving Generalization Chevron down icon Chevron up icon
Going Further with Regression Models Chevron down icon Chevron up icon
Beyond Linearity – When Curving Is Much Better Chevron down icon Chevron up icon
Regression Analysis in Practice Chevron down icon Chevron up icon
Other Books You May Enjoy Chevron down icon Chevron up icon

Customer reviews

Rating distribution
Full star icon Full star icon Full star icon Full star icon Half star icon 4.7
(3 Ratings)
5 star 66.7%
4 star 33.3%
3 star 0%
2 star 0%
1 star 0%
Vincenzo Ferraro Mar 13, 2018
Full star icon Full star icon Full star icon Full star icon Full star icon 5
This book has a lot to offer both in terms of theory and code. It is a simple book that does not require a great deal of knowledge of R. The most widespread statistical techniques are treated in a simple way. There are also many examples available to immediately apply the concepts learned.
Amazon Verified review Amazon
TR Mar 13, 2018
Full star icon Full star icon Full star icon Full star icon Full star icon 5
I ordered this book to prepare for a university exams and I found it really interesting and very clear. For those who want to have a base I highly recommend it. In addition to explaining the concepts well is full of examples that seriously help to understand the operation of various algorithms.
Amazon Verified review Amazon
Hitomi Cranford Oct 27, 2018
Full star icon Full star icon Full star icon Full star icon Empty star icon 4
I was hoping for more in terms of multicollinearity. Discussion of the tidyverse library to build amazing multiple linear regressions would have been awesome but that doesn't take away from the fact that this book is a really good start!
Amazon Verified review Amazon
Get free access to Packt library with over 7500+ books and video courses for 7 days!
Start Free Trial

FAQs

How do I buy and download an eBook? Chevron down icon Chevron up icon

Where there is an eBook version of a title available, you can buy it from the book details for that title. Add either the standalone eBook or the eBook and print book bundle to your shopping cart. Your eBook will show in your cart as a product on its own. After completing checkout and payment in the normal way, you will receive your receipt on the screen containing a link to a personalised PDF download file. This link will remain active for 30 days. You can download backup copies of the file by logging in to your account at any time.

If you already have Adobe reader installed, then clicking on the link will download and open the PDF file directly. If you don't, then save the PDF file on your machine and download the Reader to view it.

Please Note: Packt eBooks are non-returnable and non-refundable.

Packt eBook and Licensing When you buy an eBook from Packt Publishing, completing your purchase means you accept the terms of our licence agreement. Please read the full text of the agreement. In it we have tried to balance the need for the ebook to be usable for you the reader with our needs to protect the rights of us as Publishers and of our authors. In summary, the agreement says:

  • You may make copies of your eBook for your own use onto any machine
  • You may not pass copies of the eBook on to anyone else
How can I make a purchase on your website? Chevron down icon Chevron up icon

If you want to purchase a video course, eBook or Bundle (Print+eBook) please follow below steps:

  1. Register on our website using your email address and the password.
  2. Search for the title by name or ISBN using the search option.
  3. Select the title you want to purchase.
  4. Choose the format you wish to purchase the title in; if you order the Print Book, you get a free eBook copy of the same title. 
  5. Proceed with the checkout process (payment to be made using Credit Card, Debit Cart, or PayPal)
Where can I access support around an eBook? Chevron down icon Chevron up icon
  • If you experience a problem with using or installing Adobe Reader, the contact Adobe directly.
  • To view the errata for the book, see www.packtpub.com/support and view the pages for the title you have.
  • To view your account details or to download a new copy of the book go to www.packtpub.com/account
  • To contact us directly if a problem is not resolved, use www.packtpub.com/contact-us
What eBook formats do Packt support? Chevron down icon Chevron up icon

Our eBooks are currently available in a variety of formats such as PDF and ePubs. In the future, this may well change with trends and development in technology, but please note that our PDFs are not Adobe eBook Reader format, which has greater restrictions on security.

You will need to use Adobe Reader v9 or later in order to read Packt's PDF eBooks.

What are the benefits of eBooks? Chevron down icon Chevron up icon
  • You can get the information you need immediately
  • You can easily take them with you on a laptop
  • You can download them an unlimited number of times
  • You can print them out
  • They are copy-paste enabled
  • They are searchable
  • There is no password protection
  • They are lower price than print
  • They save resources and space
What is an eBook? Chevron down icon Chevron up icon

Packt eBooks are a complete electronic version of the print edition, available in PDF and ePub formats. Every piece of content down to the page numbering is the same. Because we save the costs of printing and shipping the book to you, we are able to offer eBooks at a lower cost than print editions.

When you have purchased an eBook, simply login to your account and click on the link in Your Download Area. We recommend you saving the file to your hard drive before opening it.

For optimal viewing of our eBooks, we recommend you download and install the free Adobe Reader version 9.