The Statistics and Machine Learning with R Workshop

The Statistics and Machine Learning with R Workshop: Unlock the power of efficient data science modeling with this hands-on guide


Getting Started with R

In this chapter, we will cover the basics of R, the most widely used open source language for statistical analysis and modeling. We will start with an introduction to RStudio, how to perform simple calculations, the common data structures and control logic, and how to write functions in R.

By the end of the chapter, you will be able to do basic computations in R using common data structures such as vectors, lists and data frames in the RStudio integrated development environment (IDE). You will also be able to wrap these calculations in functions using different methods.

In this chapter, we will cover the following:

  • Introducing R
  • Covering the R and RStudio basics
  • Common data structures in R
  • Control logic in R
  • Exploring functions in R

Technical requirements

To complete the exercises in this chapter, you will need to have the following:

  • The latest version of R, which is 4.1.2 at the time of writing
  • The latest version of RStudio Desktop, which is 2021.09.2+382

All the code for this chapter is available at https://github.com/PacktPublishing/The-Statistics-and-Machine-Learning-with-R-Workshop/blob/main/Chapter_1/Chapter_1.R.

Introducing R

R is a popular open source language that supports statistical analysis and modeling, and it is most widely used by statisticians developing statistical models and performing data analysis. One question commonly asked by learners is how to choose between Python and R. For those new to both who need a simple model for a modestly sized dataset, R would be a better choice. It has rich resources to support modeling and plotting tasks, many of which trace back to the S language that statisticians developed long before Python was born. Besides its many off-the-shelf graphing and statistical modeling offerings, the R community is also catching up in advanced machine learning such as deep learning, which the Python community currently dominates.

There are many differences between the two languages, although recent years have witnessed increasing convergence in many aspects. This book aims to equip you with the essential knowledge to understand and use statistics and calculus via R. We hope that at some point, you will be able to abstract away from the inner workings of the language itself and think at the methodological level when performing an analysis. Once you have built the essential skills from the fundamentals, the specific language in use becomes a matter of personal preference. To this end, R provides dedicated utility functions to automatically “convert” Python code to be used within the R context, which gives us another reason not to worry about choosing a specific language.

Covering the R and RStudio basics

It is easy to confuse R with RStudio if you are a first-time user. In a nutshell, R is the engine that supports all sorts of backend computations, and RStudio is a convenient tool for navigating and managing related coding and reference resources. Specifically, RStudio is an IDE where the user writes R code, performs analysis, and develops models without worrying much about the backend logistics required by the R engine. The interface provided by RStudio makes the development work much more convenient and user-friendly than the vanilla R interface.

First, we need to install R on our computer, as RStudio does not ship with the R engine and relies on a separate R installation for its computation horsepower. We can choose the corresponding version of R at https://cloud.r-project.org/, depending on the specific type of operating system we use. RStudio can then be downloaded at https://www.rstudio.com/products/rstudio/download/ and installed accordingly. When launching the RStudio application after installing both programs, the R engine will be automatically detected and used. Let’s go through an exercise to get familiar with the interface.

Exercise 1.01 – exploring RStudio

RStudio provides a comprehensive environment for working with R scripts and exploring the data simultaneously. In this exercise, we will look at a basic example of how to write a simple script to store a string and perform a simple calculation using RStudio.

Perform the following steps to complete this exercise:

  1. Launch the RStudio application and observe the three panes:
    • The Console pane is used to execute R commands and display the immediate result.
    • The Environment pane stores all the global variables in the current session.
    • The Files pane lists all the files within the current working directory along with other tabs, as shown in Figure 1.1.

      Note that the R version is printed as a message in the console (highlighted in the dashed box):

Figure 1.1 – A screenshot of the RStudio upon the first launch


We can also type R.version in the console to retrieve more detailed information on the version of the R engine in use, as shown in Figure 1.2. It is essential to check the R version, as different versions may produce different results when running the same code.

Figure 1.2 – Typing a command in the console to check the R version


  2. Build a new R script by clicking on the plus sign in the upper-left corner or via File | New File | R Script. An R script allows us to write longer R code that involves functions and chunks of code executed in sequence. We will build an R script and name it test.R upon saving the file. See the following figure for an illustration:
Figure 1.3 – Creating a new R script


  3. Running the script can be achieved by placing the cursor at the current line and pressing Cmd + Enter for macOS or Ctrl + Enter for Windows; alternatively, click on the Run button at the top of the R script pane, as shown in the following figure:
Figure 1.4 – Executing the script by clicking on the Run button


  4. Type the following commands in the script editing pane and observe the output in the console as well as the changes in the other panes. First, we create a variable named test by assigning "I am a string". A variable can be used to store an object, which could take the form of a string, number, data frame, or even function (more on this later). Strings consist of characters, a common data type in R. The test variable created in the script is also reflected in the Environment pane, which is a convenient check as we can also observe the content in the variable. See Figure 1.5 for an illustration:
    # String assignment
    test = "I am a string"
    print(test)
Figure 1.5 – Creating a string-type variable


We also assign a simple addition operation to test2 and print it out in the console. These commands are also annotated via the # sign, where the contents after the sign are not executed and are only used to provide an explanation of the following code. See Figure 1.6 for an illustration:

# Simple calculation
test2 = 1 + 2
print(test2)
Figure 1.6 – Assigning a string and performing basic computation


  5. We can also check the contents of the environment workspace via the ls() function:
    >>> ls()
    "test"  "test2"

In addition, note that the newly created R script is also reflected in the Files pane. RStudio is an excellent one-stop IDE for working with R and will be the programming interface for this book. We will introduce more features of RStudio in a more specific context along the way.
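The checks from this exercise can also be run directly from the console. The following is a minimal sketch, assuming a fresh session. R.version.string, getRversion(), exists(), and rm() are base R functions that the exercise itself does not use, and the 4.1.0 threshold is chosen purely for illustration.

```r
# Print the R version as a single readable string
print(R.version.string)

# Compare the running version against a minimum version
# (the threshold 4.1.0 is an arbitrary value chosen for illustration)
is_recent = getRversion() >= "4.1.0"
print(is_recent)

# Recreate the two variables from the exercise
test = "I am a string"
test2 = 1 + 2

# Check whether a variable exists, then remove it from the workspace
print(exists("test2"))
rm(test2)
print(exists("test2"))
```

Removing variables that are no longer needed keeps the Environment pane short, which can help when a script creates many intermediate objects.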

Note

The canonical way of assigning a value to a variable is via the <- operator instead of the = sign used in the example. However, the author chose to use the = sign as it is faster to type and has the same effect as the <- operator in the majority of cases.

In addition, note that the output message in the Console pane has a preceding [1] sign, which gives the index of the first element printed on that line; R treats even a single value as a vector of length one. We will omit this sign from the output shown in this book unless otherwise specified.
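Both points in this note can be checked with a short experiment. The sketch below uses made-up variable names (m1, m2, y, and long_vec); it shows one case where = and <- behave differently, namely inside a function call, and then prints a vector long enough to wrap so that the bracketed prefix appears on more than one line.

```r
# Inside a function call, = only names an argument,
# whereas <- creates a variable and then passes its value along
m1 = median(x = c(3, 1, 2))    # x is the argument name; no variable x is created
m2 = median(y <- c(3, 1, 2))   # y is created in the workspace, then passed in
print(m1)
print(m2)
print(y)

# The bracketed prefix is the position of the first element shown on each line,
# which becomes visible once the printed vector wraps over several lines
long_vec = 101:130
print(long_vec)
```

How many elements fit on one line depends on the console width, so the prefixes on the wrapped lines will vary between machines.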

The exercise in the previous section provides an addition example, which is an essential operation in R. As with other modern programming languages, R also ships with many other standard arithmetic operators, including subtraction (-), multiplication (*), division (/), exponentiation (^), and modulo (%%). The modulo operator returns the remainder left after dividing the first operand by the second.

Let’s look at an exercise to go through some common arithmetic operations.

Exercise 1.02 – common arithmetic operations in R

This exercise will perform different arithmetic operations (addition, subtraction, multiplication, division, exponentiation, and modulo) between two numbers: 5 and 2.

Type the commands under the EXERCISE 1.02 comment section in the R Script pane and observe the output message in the console shown in Figure 1.7. Note that we removed the print() function, as directly executing the command will also print out the result as highlighted in the console:

Figure 1.7 – Performing common arithmetic operations in R
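For readers following along without the figure, the operations in this exercise can be typed as shown next. This is a sketch reconstructed from the exercise description, not a copy of the script in the figure; the results are collected in a named vector (results, a hypothetical name) only to keep the output compact.

```r
# The six arithmetic operations from this exercise, applied to 5 and 2
results = c(
  addition       = 5 + 2,    # 7
  subtraction    = 5 - 2,    # 3
  multiplication = 5 * 2,    # 10
  division       = 5 / 2,    # 2.5
  exponentiation = 5 ^ 2,    # 25
  modulo         = 5 %% 2    # 1, the remainder of 5 divided by 2
)
print(results)
```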


Note that these elementary arithmetic operations can jointly form complex operations. When evaluating a complex operation that consists of multiple operators, the general rule of thumb is to use parentheses to enforce the execution of a specific component according to the desired sequence. The same practice applies to numeric computation in most programming languages.
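To see why parentheses matter, compare the following expressions. This is a small illustrative sketch with hypothetical variable names; it relies on R's standard precedence, in which exponentiation is evaluated before unary minus, multiplication, and addition.

```r
# Without parentheses, ^ is evaluated first, then *, then +
result_default = 1 + 2 * 3 ^ 2       # 1 + 2 * 9 = 19

# Parentheses force the addition and multiplication to happen first
result_grouped = ((1 + 2) * 3) ^ 2   # 9 ^ 2 = 81

# ^ also binds more tightly than the unary minus
neg_default = -2 ^ 2                 # -(2 ^ 2) = -4
neg_grouped = (-2) ^ 2               # 4

print(c(result_default, result_grouped, neg_default, neg_grouped))
```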

But, what forms can we expect the data to take in R?

Common data types in R

There are five basic data types in R: numeric, integer, character, logical, and factor. Any complex R object can be decomposed into individual elements that each fall into one of these five data types, so an object can contain one or more data types. These five data types are defined as follows:

  • Numeric is the default data type in R and represents a decimal value, such as 1.23. A variable is treated as a numeric even if we assign a whole number such as 1 to it in the first place.
  • Integer is a whole number and so a subset of the numeric data type.
  • Character is the data type used to store a sequence of characters (including letters, symbols, or even numbers) to form a string or a piece of text, surrounded by double or single quotes.
  • Logical is a Boolean data type that only takes one of two values: TRUE or FALSE. It is often used in a conditional statement to determine whether specific codes after the condition should be executed.
  • Factor is a special data type used to store categorical variables that contain a limited number of categories (or levels), ordered or unordered. For example, a list of student heights classified as low, medium, and high can be represented as a factor type to encode the inherent ordering, which would not be available when represented as a character type. On the other hand, unordered lists such as male and female can also be represented as factor types.
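Two of these definitions are easier to grasp with a quick experiment: the difference between numeric and integer, and the difference between ordered and unordered factors. The sketch below uses hypothetical variable names; the L suffix and the levels and ordered arguments of factor() are standard base R features that the text above does not mention.

```r
# A whole number is stored as numeric unless we add the L suffix
x_num = 1
x_int = 1L
print(class(x_num))       # "numeric"
print(class(x_int))       # "integer"
print(is.numeric(x_int))  # TRUE: integers still count as numeric values

# An ordered factor encodes the ordering low < medium < high
heights = factor(c("low", "high", "medium", "low"),
                 levels = c("low", "medium", "high"),
                 ordered = TRUE)
print(heights)
print(heights[1] < heights[2])   # TRUE, since low comes before high

# An unordered factor only records the categories (sorted alphabetically)
gender = factor(c("male", "female", "male"))
print(levels(gender))            # "female" "male"
```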

Let’s go through an example to understand these different data types.

Exercise 1.03 – understanding data types in R

R has strict rules on the data types when performing arithmetic operations. In general, the data types of all variables should be the same when evaluating a particular statement (a piece of code). Performing an arithmetic operation on different data types may give an error. In this exercise, we will look at how to check the data type to ensure the type consistency and different ways to convert the data type from one into another:

  1. We start by creating five variables and checking the data type of each using the class() function. Note that we can use the semicolon to separate different actions:
    >>> a = 1.0; b = 1; c = "test"; d = TRUE; e = factor("test")
    >>> class(a); class(b); class(c); class(d); class(e)
    "numeric"
    "numeric"
    "character"
    "logical"
    "factor"

    As expected, the b variable is stored as numeric even though it is assigned a whole number in the first place.

  2. Perform addition on the variables. Let’s start with the a and b variables:
    >>> a + b
    2
    >>> class(a + b)
    "numeric"

    Note that the decimal point is ignored when displaying the result of the addition, which is still numeric as verified via the class() function.

    Now, let’s look at the addition between a and c:

    >>> a + c
    Error in a + c : non-numeric argument to binary operator

    This time, we received an error message due to a mismatch in data types when evaluating an addition operation. This is because the + addition operator in R is a binary operator designed to take in two values (operands) and produce another, all of which need to be numeric (including integer, of course). The error pops up when any of the two input arguments are non-numeric.

  3. Let’s try adding a and d:
    >>> a + d
    2
    >>> class(a + d)
    "numeric"

    Surprisingly, the result is the same as a + b, suggesting that the Boolean d variable taking a TRUE value is converted into a value of one under the hood. Correspondingly, a Boolean value of FALSE, obtained here by adding an exclamation mark (the negation operator) before the variable, is treated as zero when performing an arithmetic operation with a numeric:

    >>> a + !d
    1

    Note that the implicit Boolean conversion occurs in settings when such conversion is necessary to proceed in a specific statement. For example, d is converted into a numeric value of one when evaluating whether a equals d:

    >>> a == d
    TRUE
  4. Convert the data types using the as.(datatype) family of functions in R.

    For example, the as.numeric() function converts the input parameter into a numeric, as.integer() returns the integer part of the input decimal, as.character() converts all inputs (including numeric and Boolean) into strings, and as.logical() converts any non-zero numeric into TRUE and zero into FALSE. Let’s look at a few examples:

    >>> class(as.numeric(b))
    "numeric"

    This confirms that the result of converting the b variable is numeric. Note that type conversion is a standard data processing operation in R, and type incompatibility is a common source of errors that may be difficult to trace. Next, let’s convert a decimal into an integer:

    >>> as.integer(1.8)
    1
    >>> round(1.8)
    2

    Since as.integer() only returns the integer part of the input, the fractional part is discarded, so a positive input is always “floored” to the nearest lower integer. We could use the round() function to round to the nearest integer instead. Next, let’s convert a numeric and a Boolean into strings:

    >>> as.character(a)
    "1"
    >>> as.character(d)
    "TRUE"

    The as.character() function converts all input parameters, including numeric and Boolean, into strings, as indicated by the double quotes. The converted value no longer keeps its original arithmetic property. For example, a numeric converted into a character can no longer be used in an addition operation. Likewise, a Boolean converted into a character is no longer evaluated as a logical value and is treated as plain text. Finally, let’s convert values into factors:

    >>> as.factor(a)
    1
    Levels: 1
    >>> as.factor(c)
    test
    Levels: test

    Since there is only one element in the input, the resulting factor has a single level, which is the input value itself.
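A few edge cases of the conversion functions are worth trying out. The sketch below is illustrative only; the variable names are hypothetical, and suppressWarnings() is used so that the failed conversion does not clutter the console.

```r
# as.integer() drops the fractional part, so negative inputs move toward zero
print(as.integer(1.8))    # 1
print(as.integer(-1.8))   # -1
print(floor(-1.8))        # -2, a true floor

# round() goes to the nearest whole number; an exact half goes to the even neighbor
print(round(1.8))         # 2
print(round(2.5))         # 2

# Text that does not look like a number becomes NA (with a warning)
not_a_number = suppressWarnings(as.numeric("abc"))
print(not_a_number)

# Converting a factor to numeric returns the internal level codes,
# so convert to character first to recover the original values
f = factor(c("10", "20"))
codes = as.numeric(f)                 # 1 2
values = as.numeric(as.character(f))  # 10 20
print(codes)
print(values)

# Any non-zero numeric becomes TRUE
print(as.logical(c(0, 2, -1)))        # FALSE TRUE TRUE
```

The factor case is a frequent source of silent bugs, since the conversion succeeds without any warning but returns the level codes.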

Note

A categorical variable is called a nominal variable when there is no natural ordering among the categories, and an ordinal variable if there is natural ordering. For example, the temperature variable valued as either high, medium, or low has an inherent ordering in nature, while a gender variable valued as either male or female has no order.

Common data structures in R

Data structures provide an organized way to store various data points that follow either the same or different types. This section will look at the typical data structures used in R, including the vector, matrix, data frame, and list.

Vector

A vector is a one-dimensional array that can hold a series of elements of any consistent data type, including numeric, integer, character, logical, or factor. We can create a vector by filling in comma-separated elements in the input argument of the combine function, c(). The arithmetic operations between two vectors are similar to the single-element example earlier, provided that their lengths are equal. There should be a one-to-one correspondence between the elements of the two vectors; if not, the calculation may produce a warning or an unintended result. Let’s look at an exercise.

Exercise 1.04 – working with vectors

We will create two vectors of the same length in this exercise and add them up. As an extension, we will also attempt the same addition using a vector of a different length. We will also perform a pairwise comparison between the two vectors:

  1. Create two vectors named vec_a and vec_b and extract simple summary statistics such as mean and sum:
    >>> vec_a = c(1,2,3)
    >>> vec_b = c(1,1,1)
    >>> sum(vec_a)
    6
    >>> mean(vec_a)
    2

    The sum and mean of a vector can be generated using the sum() and mean() function, respectively. We will cover more ways to summarize a vector later.

  2. Add up vec_a and vec_b:
    >>> vec_a + vec_b
    2 3 4

    The addition between two vectors is performed element-wise. The result can also be saved into another variable for further processing. How about adding a single element to a vector?

  3. Add vec_a and 1:
    >>> vec_a + 1
    2 3 4

    Under the hood, the single element 1 is broadcast into the vector c(1,1,1), whose length is decided by vec_a. Broadcasting (known as recycling in R) is a mechanism that replicates the elements of the shorter vector to the required length, and it works cleanly as long as the length of the longer vector is a multiple of the shorter vector’s length. The same trick may not apply when it is not a multiple.

  4. Add vec_a and c(1,1):
    >>> vec_a + c(1,1)
    2 3 4
    Warning message:
    In vec_a + c(1, 1) :
    longer object length is not a multiple of shorter object length

    We still get the same result, except for a warning message saying that the longer vector’s length of three is not a multiple of the shorter vector length of two. Pay attention to this warning message. It is not recommended to follow such practice as the warning may become an explicit error or become the implicit cause of an underlying bug in an extensive program.

  5. Next, we will perform a pairwise comparison between the two vectors:
    >>> vec_a > vec_b
    FALSE  TRUE  TRUE
    >>> vec_a == vec_b
    TRUE FALSE FALSE

    Here, we have used evaluation operators such as > (greater than) and == (equal to), returning logical results (TRUE or FALSE) for each pair.

    Note, there are multiple logical comparison operators in R. The common ones include the following:

    • < for less than
    • <= for less than or equal to
    • > for greater than
    • >= for greater than or equal to
    • == for equal to
    • != for not equal to
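The recycling behavior and the logical results from this exercise can be explored a little further. In the sketch below, the four-element vector and the names recycled and is_greater are made up for illustration; sum(), any(), all(), and which() are base R helpers that the exercise does not use.

```r
vec_a = c(1, 2, 3)
vec_b = c(1, 1, 1)

# When the longer length is a multiple of the shorter one,
# the shorter vector is repeated without any warning
recycled = c(1, 2, 3, 4) + c(10, 20)
print(recycled)           # 11 22 13 24

# A comparison returns a logical vector that can be summarized
is_greater = vec_a > vec_b
print(sum(is_greater))    # 2, since TRUE counts as one
print(any(is_greater))    # TRUE: at least one pair satisfies the condition
print(all(is_greater))    # FALSE: the first pair does not
print(which(is_greater))  # 2 3, the positions where the condition holds
```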

Besides the common arithmetic operations, we may also be interested in selecting part of a vector. We can use square brackets to select specific elements of a vector, which is also how we select elements in other data structures such as a matrix or a data frame. In between the square brackets are indices indicating what elements to select. For example, we can use vec_a[1] to select the first element of vec_a. Let’s go through an exercise to look at different ways to subset a vector.

Exercise 1.05 – subsetting a vector

We can pass in an index (starting from 1) to select the corresponding element in the vector. To select multiple elements, we can wrap the indices in the c() combine function and pass the result into the square brackets. Selecting multiple sequential indices can also be achieved via a shorthand notation by writing the first and last index with a colon in between. Let’s run through different ways of subsetting a vector:

  1. Select the first element in vec_a:
    >>> vec_a[1]
    1
  2. Select the first and third elements in vec_a:
    >>> vec_a[c(1,3)]
    1 3
  3. Select all three elements in vec_a:
    >>> vec_a[c(1,2,3)]
    1 2 3

    Selecting multiple elements in this way is not very convenient since we need to type every index. When the indices are sequential, a nice shorthand trick is to use the starting and end index separated by a colon. For example, 1:3 would be the same as c(1,2,3):

    >>> vec_a[1:3]
    1 2 3

    We can also perform more complex subsetting by adding a conditional statement within the square brackets as the selection criteria. For example, the logical evaluation introduced earlier returns either TRUE or FALSE. An element whose position is marked as TRUE within the square brackets will be selected. Let’s see an example.

  4. Select elements in vec_a that are bigger than the corresponding elements in vec_b:
    >>> vec_a[vec_a > vec_b]
    2 3

    The result contains the last two elements since only the second and third positions evaluate to TRUE.
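A few more indexing patterns build on the same square-bracket syntax. The following sketch is not part of the original exercise; it shows negative indices, a combined condition, and what happens when an index goes past the end of the vector.

```r
vec_a = c(1, 2, 3)

# A negative index drops the element at that position
print(vec_a[-1])                     # 2 3
print(vec_a[-c(1, 3)])               # 2

# Conditions can be combined with & (and) or | (or)
print(vec_a[vec_a > 1 & vec_a < 3])  # 2

# A logical vector of the same length selects the positions marked TRUE
print(vec_a[c(TRUE, FALSE, TRUE)])   # 1 3

# Indexing past the end returns NA instead of raising an error
print(vec_a[5])                      # NA
```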

Matrix

A matrix is a two-dimensional array that, like a vector, consists of a collection of elements of the same data type, arranged in a fixed number of rows and columns. It is often faster to work with a data structure exclusively containing the same data type since the program does not need to differentiate between different types of data. This makes the matrix a popular data structure in scientific computing, especially in an optimization procedure that involves intensive computation. Let’s get familiar with the matrix, including different ways to create, index, subset, and enlarge a matrix.

Exercise 1.06 – creating a matrix

The standard way to create a matrix in R is to call the matrix() function, where we need to supply three input arguments:

  • The elements to be filled in the matrix
  • The number of rows in the matrix
  • The filling direction (either by row or by column)

We will also rename the rows and columns of the matrix:

  1. Use vec_a and vec_b to create a matrix called mtx_a:
    >>> mtx_a = matrix(c(vec_a,vec_b), nrow=2, byrow=TRUE)
    >>> mtx_a
         [,1] [,2] [,3]
    [1,]    1    2    3
    [2,]    1    1    1

    First, the input vectors, vec_a and vec_b, are combined via the c() function to form a long vector, which then gets sequentially arranged into two rows (nrow=2) row-wise (byrow=TRUE). Feel free to try out different dimension configurations, such as setting three rows and two columns when creating the matrix.

    Pay attention to the row and column labels in the output. The first index in the square brackets refers to the row, while the second refers to the column. We can also rename the rows and columns as follows.

  2. Rename the rows and columns of mtx_a via the rownames() and colnames() functions:
    >>> rownames(mtx_a) = c("r1", "r2")
    >>> colnames(mtx_a) = c("c1", "c2", "c3")
    >>> mtx_a
       c1 c2 c3
    r1  1  2  3
    r2  1  1  1
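To see the effect of the filling direction and the dimension arguments, the same six elements can be arranged in different ways. This is an illustrative sketch; the names mtx_row, mtx_col, and mtx_tall are hypothetical, and ncol and t() are standard base R features not used in the exercise.

```r
vec_a = c(1, 2, 3)
vec_b = c(1, 1, 1)

# Filling row by row, as in the exercise
mtx_row = matrix(c(vec_a, vec_b), nrow = 2, byrow = TRUE)
print(mtx_row)

# Filling column by column, which is the default when byrow is omitted
mtx_col = matrix(c(vec_a, vec_b), nrow = 2)
print(mtx_col)

# Three rows and two columns: each input vector becomes one column
mtx_tall = matrix(c(vec_a, vec_b), nrow = 3, ncol = 2)
print(dim(mtx_tall))   # 3 2

# The tall layout holds the same values as the transpose of mtx_row
print(all(mtx_tall == t(mtx_row)))
```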

Let’s look at how to select elements from the matrix.

Exercise 1.07 – subsetting a matrix

We can still use the square brackets to select one or more matrix elements. The colon shorthand trick also applies to matrix subsetting:

  1. Select the element at the first row and second column of the mtx_a matrix:
    >>> mtx_a[1,2]
    2
  2. Select all elements of the last two columns across all rows in the mtx_a matrix:
    >>> mtx_a[1:2,c(2,3)]
       c2 c3
    r1  2  3
    r2  1  1
  3. Select all elements of the second row of the mtx_a matrix:
    >>> mtx_a[2,]
    c1 c2 c3
     1  1  1

    In this example, we have used the fact that the second (column-level) index indicates that all columns are selected when left blank. The same applies to the first (row-level) index as well.

    We can also select the second row using the row name:

    >>> mtx_a[rownames(mtx_a)=="r2",]
    c1 c2 c3
     1  1  1

    Selecting elements by matching the row name using a conditional evaluation statement offers a more precise way of subsetting the matrix, especially when counting the exact index becomes troublesome. Name-based indexing also applies to columns.

  4. Select the third column of the mtx_a matrix:
    >>> mtx_a[,3]
    r1 r2
     3  1
    >>> mtx_a[,colnames(mtx_a)=="c3"]
    r1 r2
     3  1

    Therefore, we have multiple ways to select the specific elements of interest from a matrix.
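Names can also be passed to the square brackets directly, without a conditional statement. The sketch below rebuilds mtx_a using the dimnames argument of matrix() so that it runs on its own; it also shows the drop argument, which is not covered in the exercise and controls whether a single row stays a matrix.

```r
# Rebuild the matrix from the exercise with row and column names
mtx_a = matrix(c(1, 2, 3, 1, 1, 1), nrow = 2, byrow = TRUE,
               dimnames = list(c("r1", "r2"), c("c1", "c2", "c3")))

# Select a single element by row and column name
print(mtx_a["r2", "c3"])

# Selecting one row returns a plain named vector by default
row_vec = mtx_a["r2", ]
print(dim(row_vec))   # NULL, since a vector has no dimensions

# drop = FALSE keeps the result as a one-row matrix
row_mtx = mtx_a["r2", , drop = FALSE]
print(dim(row_mtx))   # 1 3
```

Keeping the matrix shape matters when the result is passed to code that expects two dimensions, such as the row and column summary functions.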

A matrix supports arithmetic operations similar to those of a vector. In the next exercise, we will look at summarizing a matrix both row-wise and column-wise and performing basic operations such as addition and multiplication.

Exercise 1.08 – arithmetic operations with a matrix

Let’s start by making a new matrix:

  1. Create another matrix named mtx_b whose elements are double those in mtx_a:
    >>> mtx_b = mtx_a * 2
    >>> mtx_b
       c1 c2 c3
    r1  2  4  6
    r2  2  2  2

    Besides multiplication, all standard arithmetic operators (such as +, -, and /) apply in a similar element-wise fashion to a matrix, backed by the same broadcasting mechanism. Operations between two matrices of the same size are also performed element-wise.

  2. Divide mtx_a by mtx_b:
    >>> mtx_a / mtx_b
        c1  c2  c3
    r1 0.5 0.5 0.5
    r2 0.5 0.5 0.5
  3. Calculate the row-wise and column-wise sum and mean of mtx_a using rowSums(), colSums(), rowMeans(), and colMeans() respectively:
    >>> rowSums(mtx_a)
    r1 r2
     6  3
    >>> colSums(mtx_a)
    c1 c2 c3
     2  3  4
    >>> rowMeans(mtx_a)
    r1 r2
     2  1
    >>> colMeans(mtx_a)
    c1  c2  c3
    1.0 1.5 2.0
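Note that the * operator in this exercise multiplies element by element, which differs from the matrix product in linear algebra. The following sketch contrasts the two using the %*% operator and the t() transpose function, and then uses apply() for a summary that has no dedicated helper; none of these three appear in the exercise, and the variable names are hypothetical.

```r
mtx_a = matrix(c(1, 2, 3, 1, 1, 1), nrow = 2, byrow = TRUE)

# Element-wise product: each entry is multiplied by itself
elementwise = mtx_a * mtx_a
print(elementwise)

# Matrix product of a 2 x 3 matrix with its 3 x 2 transpose gives a 2 x 2 matrix
mat_prod = mtx_a %*% t(mtx_a)
print(mat_prod)

# apply() runs a function over rows (1) or columns (2)
row_max = apply(mtx_a, 1, max)   # 3 1
col_max = apply(mtx_a, 2, max)   # 1 2 3
print(row_max)
print(col_max)
```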

When running an optimizing procedure, we often need to save some intermediate metrics, such as model loss and accuracy, for diagnosis. These metrics can be saved in a matrix form by gradually appending new data to the current matrix. Let’s look at how to expand a matrix both row-wise and column-wise.

Exercise 1.09 – expanding a matrix

Adding a column or multiple columns to a matrix can be achieved via the cbind() function, which merges a new matrix or vector column-wise. Similarly, an additional matrix or vector can be concatenated row-wise via the rbind() function:

  1. Append mtx_b to mtx_a column-wise:
    >>> cbind(mtx_a, mtx_b)
       c1 c2 c3 c1 c2 c3
    r1  1  2  3  2  4  6
    r2  1  1  1  2  2  2

    We may need to rename the columns since some of them overlap. This also applies to the row-wise concatenation as follows.

  2. Append mtx_b to mtx_a row-wise:
    >>> rbind(mtx_a, mtx_b)
       c1 c2 c3
    r1  1  2  3
    r2  1  1  1
    r1  2  4  6
    r2  2  2  2
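The use case mentioned before this exercise, saving intermediate metrics during an iterative procedure, can be sketched with rbind() in a loop. The loss and accuracy values below are made-up placeholders instead of the output of a real model, and the metrics name is hypothetical.

```r
# Start with an empty container; rbind() ignores NULL
metrics = NULL

for (i in 1:3) {
  # Placeholder values standing in for real training metrics
  loss = 1 / i
  accuracy = 1 - loss / 2

  # Append one named row per iteration; the names become the column names
  metrics = rbind(metrics, c(loss = loss, accuracy = accuracy))
}

print(metrics)
print(dim(metrics))   # 3 2
```

Growing a matrix one row at a time is convenient for a small number of iterations. For long loops, creating the full matrix up front and filling it by index is usually faster.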

So, we’ve seen the matrix in operation. How about data frames next?

Data frame

A data frame is a standard data structure where variables are stored as columns and observations as rows in an object. It extends the matrix in that different columns can hold different data types.

The R engine comes with several default datasets stored as data frames. In the next exercise, we will look at different ways to examine and understand the structure of a data frame.

Exercise 1.10 – understanding data frames

The data frame is a widely used data structure representing rectangular data, similar to a spreadsheet in Excel. Let’s examine a default dataset in R as an example:

  1. Load the iris dataset:
    >>> data("iris")
    >>> dim(iris)
    150   5

    Checking the dimension using the dim() function suggests that the iris dataset contains 150 rows and five columns. We can initially understand its contents by looking at the first and last few observations (rows) in the dataset.

  2. Examine the first and last six rows using head() and tail():
    >>> head(iris)
      Sepal.Length Sepal.Width Petal.Length Petal.Width Species
    1          5.1         3.5          1.4         0.2  setosa
    2          4.9         3.0          1.4         0.2  setosa
    3          4.7         3.2          1.3         0.2  setosa
    4          4.6         3.1          1.5         0.2  setosa
    5          5.0         3.6          1.4         0.2  setosa
    6          5.4         3.9          1.7         0.4  setosa
    >>> tail(iris)
        Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
    145          6.7         3.3          5.7         2.5 virginica
    146          6.7         3.0          5.2         2.3 virginica
    147          6.3         2.5          5.0         1.9 virginica
    148          6.5         3.0          5.2         2.0 virginica
    149          6.2         3.4          5.4         2.3 virginica
    150          5.9         3.0          5.1         1.8 virginica

    Note that the row names are sequentially indexed by integers starting from one by default. The first four columns are numeric, and the last is a character (or factor). We can look at the structure of the data frame more systematically.

  3. Examine the structure of the iris dataset using str():
    >>> str(iris)
    'data.frame':    150 obs. of  5 variables:
     $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
     $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
     $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
     $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
     $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

    The str() function summarizes the data frame structure, including the total number of observations and variables, the complete list of variable names, data type, and the first few observations. The number of categories (levels) is also shown if the column is a factor.

    We can also create a data frame by passing in vectors as columns of the same length to the data.frame() function.

  4. Create a data frame called df_a with two columns that correspond to vec_a and vec_b respectively:
    >>> df_a = data.frame("a"=vec_a, "b"=vec_b)
    >>> df_a
      a b
    1 1 1
    2 2 1
    3 3 1
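The columns of df_a are both numeric, so the next sketch builds a small data frame with mixed column types to show what sets a data frame apart from a matrix. The column names and values are invented for illustration, and stringsAsFactors is set explicitly so that the result does not depend on the R version.

```r
# Each column can have its own data type
df_mixed = data.frame(
  name = c("a", "b", "c"),
  score = c(90.5, 80, 70.25),
  passed = c(TRUE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Check the data type of each column
col_types = sapply(df_mixed, class)
print(col_types)

# Converting to a matrix forces a single type, here character
as_mtx = as.matrix(df_mixed)
print(is.character(as_mtx))   # TRUE
```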

Selecting the elements of a data frame can be done in a similar fashion to matrix selection. Other functions such as subset() make the selection more flexible. Let’s go through an example.

Exercise 1.11 – selecting elements in a data frame

In this exercise, we will first look at different ways to select a particular set of elements and then introduce the subset() function to perform customized conditional selection:

  1. Select the second column of the df_a data frame:
    >>> df_a[,2]
    1 1 1

    The row-level indexing is left blank to indicate that all rows will be selected. We can also make it explicit by referencing all row-level indices:

    >>> df_a[1:3,2]
    1 1 1

    We can also select by using the name of the second column as follows:

    >>> df_a[,"b"]
    1 1 1

    Alternatively, we can use the shortcut $ sign to reference the column name directly:

    >>> df_a$b
    1 1 1

    The subset() function provides an easy and structured way to perform row-level filtering and column-level selection. Let’s see how it works in practice.

  2. Select the rows of df_a where column a is greater than two:
    >>> subset(df_a, a>2)
      a b
    3 3 1

    Note that row index three is also shown as part of the output.

    We can reference column a directly within the subset() function, which saves us from writing df_a$a. We can also select a column by passing the column name to the select argument.

  3. Select column b where column a is greater than two in df_a:
    >>> subset(df_a, a>2, select="b")
      b
    3 1
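
The subset() calls in this exercise can also be expressed with plain logical indexing. The following is a minimal sketch, assuming df_a holds the values printed above (it is rebuilt here so that the block runs on its own); the res_subset and res_index names are introduced only for illustration:

```r
# Rebuild df_a with the values shown in the exercise output (an assumption)
df_a <- data.frame(a = c(1, 2, 3), b = c(1, 1, 1))

# subset(): column a is looked up inside the data frame, so no $ is needed
res_subset <- subset(df_a, a > 2, select = "b")

# Base indexing: a logical row filter plus a column name.
# drop = FALSE keeps the result as a data frame rather than a bare vector.
res_index <- df_a[df_a$a > 2, "b", drop = FALSE]

print(res_subset)
print(res_index)
```

Both calls keep only the row where a is greater than two and only column b. One difference worth knowing: subset() drops rows for which the condition is NA, whereas logical indexing keeps them as NA rows.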

Another typical operation in data analysis is sorting one or more variables of a data frame. Let’s see how it works in R.

Exercise 1.12 – sorting vectors and data frames

The order() function returns the positions (indices) of the elements of the input vector in sorted order, which can then be used to sort the elements via indexing:

  1. Create the c(5,1,10) vector in vec_c and sort it in ascending order:
    >>> vec_c = c(5,1,10)
    >>> order(vec_c)
    2 1 3
    >>> vec_c[order(vec_c)]
    1  5 10

    The output of order() lists the positions of the elements from smallest to largest: the smallest element, 1, sits at position 2; the next element, 5, sits at position 1; and the largest, 10, sits at position 3. Indexing vec_c with these positions returns its elements in sorted order, the same as how we would select elements via positional indexing.

    The order() function sorts in ascending order by default. What if we want to sort in descending order? For numeric input, we could simply add a minus sign to the input vector.

  2. Sort the df_a data frame by column a in descending order:
    >>> df_a[order(-df_a$a),]
      a b
    3 3 1
    2 2 1
    1 1 1
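
The difference between sorted positions and ranks is easiest to see on a vector where the two differ. The sketch below uses a hypothetical vector, vec_d, that is not part of the exercise, together with the base R functions order() and rank():

```r
# Hypothetical vector chosen so that order() and rank() give different answers
vec_d <- c(30, 10, 20)

pos <- order(vec_d)   # positions of the elements from smallest to largest: 2 3 1
rnk <- rank(vec_d)    # rank of each element in place: 3 1 2

# Indexing by the positions sorts the vector
sorted_asc <- vec_d[pos]

# decreasing = TRUE is an alternative to negating the input;
# it also works for character vectors, where a minus sign does not
sorted_desc <- vec_d[order(vec_d, decreasing = TRUE)]

print(pos)
print(rnk)
print(sorted_asc)
print(sorted_desc)
```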

Data frames will be the primary structures we will work with in this book. Let’s look at the last and most complex data structure: list.

List

A list is a flexible data structure that can hold different data types (numeric, integer, character, logical, factor, or even another list), each possibly having a different length. It is the most complex structure we have introduced so far, gathering various objects in a structured way. To recap, let’s compare the four data structures in terms of contents, data type, and length in Figure 1.8. In general, all four structures can store elements of any data type. Vectors (one-dimensional arrays) and matrices (two-dimensional arrays) require their contents to be of a homogeneous data type. A data frame contains one or more vectors whose data types could differ, and a list could contain entries of different data types. Matrices and data frames follow a rectangular shape and so require the same length for each column. However, the entries in a list could be of arbitrary lengths (subject to memory constraints) that differ from each other.

Figure 1.8 – Comparing four different data structures in terms of contents, data type, and length

Let’s look at how to create a list.

Exercise 1.13 – creating a list

In this exercise, we will go through different ways to manipulate a list, including creating and renaming a list, and accessing, adding, and removing elements in a list:

  1. Create a list using the previous a, vec_a, and df_a variables:
    >>> ls_a = list(a, vec_a, df_a)
    >>> ls_a
    [[1]]
    [1] 1
    [[2]]
    [1] 1 2 3
    [[3]]
      a b
    1 1 1
    2 2 1
    3 3 1

    The output shows that the list elements are indexed by double square brackets, which can be used to access the entries in the list.

  2. Access the second entry in the list, ls_a:
    >>> ls_a[[2]]
    1 2 3

    The default indices can also be renamed to enable entry selection by name.

  3. Rename the list based on the original names and access the vec_a variable:
    >>> names(ls_a) <- c("a", "vec_a", "df_a")
    >>> ls_a
    $a
    [1] 1
    $vec_a
    [1] 1 2 3
    $df_a
      a b
    1 1 1
    2 2 1
    3 3 1
    >>> ls_a[['vec_a']]
    1 2 3
    >>> ls_a$vec_a
    1 2 3

    We can access a specific entry in the list by using its name either in double square brackets or via the $ sign.

  4. Add a new entry named new_entry with the content "test" in the ls_a list:
    >>> ls_a[['new_entry']] = "test"
    >>> ls_a
    $a
    [1] 1
    $vec_a
    [1] 1 2 3
    $df_a
      a b
    1 1 1
    2 2 1
    3 3 1
    $new_entry
    [1] "test"

    The result shows that "test" is now added as the last entry of ls_a. We can also remove a specific entry by assigning NULL to it.

  5. Remove the entry named df_a in ls_a:
    >>> ls_a[['df_a']] = NULL
    >>> ls_a
    $a
    [1] 1
    $vec_a
    [1] 1 2 3
    $new_entry
    [1] "test"

    The entry named df_a is now successfully removed from the list. We can also update an existing entry in the list.

  6. Update the entry named vec_a to be c(1,2):
    >>> ls_a[['vec_a']] = c(1,2)
    >>> ls_a
    $a
    [1] 1
    $vec_a
    [1] 1 2
    $new_entry
    [1] "test"

    The entry named vec_a is now successfully updated.
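
One detail that the steps above rely on is the difference between double and single square brackets. The sketch below uses a hypothetical list, ls_b, built only for illustration: double brackets return the entry itself, while single brackets return a shorter list that still wraps the entry.

```r
# Hypothetical list used only for this illustration
ls_b <- list(a = 1, vec_a = c(1, 2, 3))

entry <- ls_b[["vec_a"]]    # the numeric vector stored under the name vec_a
sub_list <- ls_b["vec_a"]   # a list of length one that contains that vector

print(class(entry))
print(class(sub_list))

# Arithmetic works on the extracted entry, not on the wrapping list
total <- sum(entry)
print(total)
```

This is why the exercise uses ls_a[['vec_a']] or ls_a$vec_a whenever the content of an entry is needed.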

The flexibility and scalability of the list structure make it a popular choice for storing heterogeneous data elements, similar to the dictionary in Python. In the next section, we will extend our knowledge base by going over the control logic in R, which gives us more flexibility and precision when writing long programs.

Control logic in R

Relational and logical operators help compare statements as we add logic to the program. We can also add to the complexity by evaluating multiple conditional statements via loops that repeatedly iterate over a sequence of actions. This section will cover the essential relational and logical operators that form the building blocks of conditional statements.

Relational operators

We briefly covered a few relational operators such as >= and == earlier. This section will provide a detailed walkthrough on the use of standard relational operators. Let’s look at a few examples.

Exercise 1.14 – practicing with standard relational operators

Relational operators allow us to compare two quantities and obtain the single result of the comparison. We will go over the following steps to learn how to express and use standard relational operators in R:

  1. Execute the following evaluations using the equality operator (==) and observe the output:
    >>> 1 == 2
    FALSE
    >>> "statistics" == "calculus"
    FALSE
    >>> TRUE == TRUE
    TRUE
    >>> TRUE == FALSE
    FALSE

    The equality operator strictly compares the two operands on either side (including logical data) and returns TRUE only if they are equal.

  2. Execute the same evaluations using the inequality operator (!=) and observe the output:
    >>> 1 != 2
    TRUE
    >>> "statistics" != "calculus"
    TRUE
    >>> TRUE != TRUE
    FALSE
    >>> TRUE != FALSE
    TRUE

    The inequality operator is the exact opposite of the equality operator.

  3. Execute the following evaluations using the greater than and less than operators (> and <) and observe the output:
    >>> 1 < 2
    TRUE
    >>> "statistics" > "calculus"
    TRUE
    >>> TRUE > FALSE
    TRUE

    In the second evaluation, character data is compared alphabetically, character by character, starting from the leftmost character. In this case, the letter s sorts after c, so "statistics" is considered greater than "calculus". In the third example, TRUE is converted into one and FALSE into zero, so the comparison returns TRUE.

  4. Execute the following evaluations using the greater-than-or-equal-to operator (>=) and less-than-or-equal-to operator (<=) and observe the output:
    >>> 1 >= 2
    FALSE
    >>> 2 <= 2
    TRUE

    Note that these operators consist of two conditional evaluations connected via an OR operator (|). We can, therefore, break it down into two evaluations in brackets, resulting in the same output as before:

    >>> (1 > 2) | (1 == 2)
    FALSE
    >>> (2 < 2) | (2 == 2)
    TRUE

    The relational operators also apply to vectors, element by element. We relied on this earlier when using a condition for row-level filtering to subset a data frame.

  5. Compare vec_a with 1 using the greater-than operator:
    >>> vec_a > 1
    FALSE  TRUE  TRUE

    We would get the same result by separately comparing each element and combining the results using c().
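
The element-wise behavior can be checked directly. This is a small sketch, assuming vec_a holds c(1, 2, 3) as in the earlier output; the mask and manual names are introduced only for illustration:

```r
# vec_a as printed earlier in the chapter (rebuilt so the block runs on its own)
vec_a <- c(1, 2, 3)

# Vectorized comparison: one logical value per element
mask <- vec_a > 1

# The same comparison written out element by element and combined with c()
manual <- c(vec_a[1] > 1, vec_a[2] > 1, vec_a[3] > 1)

print(identical(mask, manual))

# A logical vector can be used as an index to keep only the matching elements
print(vec_a[mask])
```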

Logical operators

A logical operator is used to combine the results of multiple relational operators. There are three basic logical operators in R: AND (&), OR (|), and NOT (!). The AND operator returns TRUE only if both operands are TRUE, and the OR operator returns TRUE if at least one operand is TRUE. The NOT operator, on the other hand, flips the evaluation result to the opposite.

Let’s go through an exercise on the use of these logical operators.

Exercise 1.15 – practicing using standard logical operators

We will start with the AND operator, the most widely used control logic to ensure a specific action only happens if multiple conditions are satisfied at the same time:

  1. Execute the following evaluations using the AND operator and observe the output:
    >>> TRUE & FALSE
    FALSE
    >>> TRUE & TRUE
    TRUE
    >>> FALSE & FALSE
    FALSE
    >>> 1 > 0 & 1 < 2
    TRUE

    The result shows that both conditions need to be satisfied to obtain a TRUE output.

  2. Execute the following evaluations using the OR operator and observe the output:
    >>> TRUE | FALSE
    TRUE
    >>> TRUE | TRUE
    TRUE
    >>> FALSE | FALSE
    FALSE
    >>> 1 < 0 | 1 < 2
    TRUE

    The result shows that the output is TRUE if at least one condition is evaluated as TRUE.

  3. Execute the following evaluations using the NOT operator and observe the output:
    >>> !TRUE
    FALSE
    >>> !FALSE
    TRUE
    >>> !(1<0)
    TRUE

    In the third example, the evaluation is the same as 1 >= 0, which returns TRUE. The NOT operator, therefore, reverses the evaluation result after the exclamation sign.

    These operators can also be used to perform pairwise logical evaluations in vectors.

  4. Execute the following evaluations and observe the output:
    >>> c(TRUE, FALSE) & c(TRUE, TRUE)
    TRUE FALSE
    >>> c(TRUE, FALSE) | c(TRUE, TRUE)
    TRUE TRUE
    >>> !c(TRUE, FALSE)
    FALSE  TRUE

There is also a long form for the AND (&&) and OR (||) logical operators. Unlike the element-wise comparison of the short form, the long form returns a single result and evaluates from left to right only until the result is determined. In R 4.1.2, the version used in this chapter, it looks only at the first element of each input vector when given vectors of multiple elements. R 4.2 issues a warning in that case, and from R 4.3.0 onward, operands longer than one element raise an error. The long form is the usual choice for control flow, especially in the conditional if statement.

Let’s look at the following example:

>>> c(TRUE, FALSE) && c(FALSE, TRUE)
FALSE
>>> c(TRUE, FALSE) || c(FALSE, TRUE)
TRUE

Both evaluations are based on the first element of each vector. That is, the second element of each vector is ignored in both evaluations. In addition, the long form short-circuits: the right-hand operand is evaluated only if the left-hand operand does not already determine the result, which saves computation when the right-hand side is expensive to evaluate.

In the first evaluation using &&, comparing the first elements of the two vectors (TRUE and FALSE) returns FALSE, and the second elements (FALSE and TRUE) are never compared. In the second evaluation using ||, comparing the first elements (TRUE and FALSE) gives TRUE, and again the second elements are not examined.
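
The short-circuit behavior of the long form can be observed with single-element operands, which behave the same way across R versions. The following is a minimal sketch; check_rhs() and rhs_called are hypothetical names introduced only to record whether the right-hand side was evaluated:

```r
# Hypothetical helper that records whether it was called
rhs_called <- FALSE
check_rhs <- function() {
  rhs_called <<- TRUE
  TRUE
}

# FALSE && ... is already FALSE, so check_rhs() is never evaluated
res_long <- FALSE && check_rhs()
print(res_long)
print(rhs_called)

# The short form works element by element and returns one result per pair
res_short <- c(TRUE, FALSE) & c(FALSE, TRUE)
print(res_short)

# all() and any() reduce a logical vector to a single value for use in if()
res_all <- all(c(TRUE, FALSE))
res_any <- any(c(TRUE, FALSE))
print(res_all)
print(res_any)
```

When a condition is built from whole vectors, reducing it with all() or any() before passing it to if is one way to keep the code independent of how && and || treat longer operands.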

Conditional statements

A conditional statement, or more specifically, the if-else statement, uses the result of one or more logical evaluations to decide the flow of follow-up actions. It is commonly used to control which parts of a large R program are executed. The if-else statement follows the general structure shown next, where the evaluation condition is checked first. If it returns TRUE, the expression within the curly braces of the if clause is executed and the else clause is skipped. Otherwise, the expression within the else clause is executed:

if(evaluation condition){
  some expression
} else {
  other expression
}

Let’s go through an exercise to see how to use the if-else control statement.

Exercise 1.16 – practicing using the conditional statement

Time for another exercise! Let’s practice using the conditional statement:

  1. Initialize an x variable with a value of 1 and write an if-else condition to determine the output message. Print out "positive" if x is greater than zero, and "not positive" otherwise:
    >>> x = 1
    >>> if(x > 0){
    >>>   print("positive")
    >>> } else {
    >>>   print("not positive")
    >>> }
    "positive"

    The condition within the if clause evaluates to TRUE, and the code inside is executed, printing out "positive" in the console. Note that the else branch is optional and can be removed if we only intend to place one check on the input. Additional if-else control can also be embedded within a branch.

    We can also add more branches using else if, which can be repeated multiple times between the if and else clauses.

  2. Initialize an x variable with 0 and write a control flow to determine and print out its sign:
    >>> x = 0
    >>> if(x > 0){
    >>>  print("positive")
    >>> } else if(x == 0){
    >>>  print("zero")
    >>> } else {
    >>>  print("negative")
    >>> }
    "zero"

    As the conditions are sequentially evaluated, the second statement returns TRUE and so prints out "zero".
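
The exercise mentions that an if-else statement can also be embedded within a branch. A minimal sketch of such nesting follows; the starting value of -3 and the sign_label variable are assumptions made only for illustration:

```r
# Hypothetical input value
x <- -3

# Outer check separates zero from non-zero values;
# the inner if-else then decides between positive and negative
if (x != 0) {
  if (x > 0) {
    sign_label <- "positive"
  } else {
    sign_label <- "negative"
  }
} else {
  sign_label <- "zero"
}

print(sign_label)
```

The nested form gives the same classification as the else if chain in the exercise. The chain is usually easier to read when all branches test the same variable.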

Loops

A loop is similar to the if statement; the code is executed only if the condition evaluates to TRUE. The only difference is that a loop continues to execute the code repeatedly as long as the condition is TRUE. We will cover two types of loops: the while loop and the for loop. The while loop is used when the number of iterations is unknown, and its termination relies on either the evaluation condition or a separate condition within the running expression using the break control statement. The for loop is used when the number of iterations is known.

The while loop follows the general structure shown next, where condition 1 is evaluated first to determine whether the expression within the outer curly braces should be executed. There is an (optional) if statement to decide whether the while loop needs to be terminated based on condition 2. These two conditions control the termination of the while loop, which exits as soon as condition 1 evaluates to FALSE or condition 2 evaluates to TRUE. The if clause containing condition 2 can be placed anywhere within the while block:

while(condition 1){
  some expression
  if(condition 2){
    break
  }
}

Note that condition 1 within the while statement needs to be FALSE at some point; otherwise, the loop will continue indefinitely, which may cause a session expiry error within RStudio.

Let’s go through an exercise to look at how to use the while loop.

Exercise 1.17 – practicing the while loop

Let’s try out the while loop:

  1. Initialize an x variable with a value of 2 and write a while loop that squares x and prints its value as long as x is less than 10:
    >>> x = 2
    >>> while(x < 10){
    >>>   x = x^2
    >>>   print(x)
    >>> }
    4
    16

    The while loop is executed twice, bringing the value of x from 2 to 16. During the third evaluation, x is above 10 and the conditional statement evaluates to be FALSE, thus exiting the loop. We can also print out x to double-check its value:

    >>> x
    16

  2. Add a condition after the squaring to exit the loop if x is greater than 10:
    >>> x = 2
    >>> while(x < 10){
    >>>   x = x^2
    >>>   if(x > 10){
    >>>     break
    >>>   }
    >>>   print(x)
    >>> }
    4

    Only one number is printed out this time. The reason is that when x is changed to 16, the if condition evaluates to be TRUE, thus triggering the break statement to exit the while loop and ignore the print() statement. Let’s verify the value of x:

    >>> x
    16
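
The two exit routes of a while loop, the loop condition and a break, can be combined to guard against running forever. The sketch below is an illustration only; the limit of 1000, the n_iter counter, and the max_iter cap are assumptions that are not part of the exercise:

```r
x <- 2
n_iter <- 0     # counts how many times the loop body has run
max_iter <- 5   # hypothetical safety cap on the number of iterations

while (x < 1000) {
  x <- x^2
  n_iter <- n_iter + 1
  # Second exit route: stop once the cap is reached,
  # even if the loop condition is still TRUE
  if (n_iter >= max_iter) {
    break
  }
}

print(x)
print(n_iter)
```

Here the loop condition becomes FALSE before the cap is reached, so the break is never triggered; the cap only matters when the main condition fails to terminate the loop.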

Let’s look at the for loop, which assumes the following general structure. Here, var is a placeholder that sequentially references the contents of sequence, which can be a vector, a list, or another data structure:

for(var in sequence){
  some expression
}

The same expression will be evaluated for each element in sequence, unless an explicit if condition is triggered to either exit the loop using break, or skip the rest of the code and immediately jump to the next iteration using next. Let’s go through an exercise to put these in perspective.

Exercise 1.18 – practicing using the for loop

Next, let’s try the for loop:

  1. Create a vector to store three strings ("statistics", "and", and "calculus") and print out each element:
    >>> string_a = c("statistics","and","calculus")
    >>> for(i in string_a){
    >>>   print(i)
    >>> }
    "statistics"
    "and"
    "calculus"

    Here, the for loop iterates through each element in the string_a vector by sequentially assigning the element value to the i variable at each iteration. We can also choose to iterate using the vector index, as follows:

    >>> for(i in 1:length(string_a)){
    >>>   print(string_a[i])
    >>> }
    "statistics"
    "and"
    "calculus"

    Here, we created a series of integer indexes from 1 up to the length of the vector and assigned them to the i variable in each iteration, which is then used to reference the element in the string_a vector. This is a more flexible and versatile way of referencing elements in a vector since we can also use the same index to reference other vectors. Directly referencing the element, as in the previous approach, is more concise and readable, but without the looping index it offers less control and flexibility.

  2. Add a condition to break the loop if the current element is "and":
    >>> for(i in string_a){
    >>>   if(i == "and"){
    >>>     break
    >>>   }
    >>>   print(i)
    >>> }
    "statistics"

    The loop is exited upon satisfying the if condition when the current value in i is "and".

  3. Add a condition to jump to the next iteration if the current element is "and":
    >>> for(i in string_a){
    >>>   if(i == "and"){
    >>>     next
    >>>   }
    >>>   print(i)
    >>> }
    "statistics"
    "calculus"

    When the next statement is evaluated, the following print() function is ignored, and the program jumps to the next iteration, printing only "statistics" and "calculus" without the "and" element.
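
The exercise notes that a looping index can be reused to reference other vectors. The following sketch illustrates this with a second vector, n_chars, and an output vector, labels, both of which are hypothetical names introduced here. It uses the base R function seq_along() in place of 1:length(), an optional substitution that avoids the sequence c(1, 0) when the vector is empty:

```r
string_a <- c("statistics", "and", "calculus")

# A second vector aligned with string_a: the number of characters in each string
n_chars <- nchar(string_a)

# Pre-allocate the output, then fill it using the shared index i
labels <- character(length(string_a))
for (i in seq_along(string_a)) {
  labels[i] <- paste(string_a[i], "has", n_chars[i], "characters")
}
print(labels)

# With an empty vector, seq_along() yields no iterations,
# whereas 1:length() would count down from 1 to 0
empty_vec <- character(0)
print(length(seq_along(empty_vec)))
print(1:length(empty_vec))
```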

So far, we have covered some of the most fundamental building blocks in R. We are now ready to come to the last and most widely used building block: functions.

Exploring functions in R

A function is a collection of statements in the form of an object that receives an (optional) input, completes a specific task, and (optionally) generates an output. We may or may not be interested in how a function achieves the task and produces the output. When we only care about utilizing an existing function, which could be built-in and provisioned by R itself or pre-written by someone else, we can treat it as a black box and pass the required input to obtain the output we want. Examples include the sum() and mean() functions we used in the previous exercise. We can also define our own function to operate as an interface that processes a given input signal and produces an output. See Figure 1.9 for an illustration:

Figure 1.9 – Illustration of a function’s workflow

A function can be created using the function keyword with the following format:

function_name = function(argument_1, argument_2, …){
  some statements
}

A function can be decomposed into the following parts:

  • Function name: The name of the functional object registered and stored in the R environment. We use this name followed by a pair of parentheses and (optionally) input arguments within the parentheses to call the function.
  • Input argument: A placeholder used to receive input value when calling the function. An argument can be optional (with a default value assigned) or compulsory (with no default value assigned). Setting all arguments as optional is the same as requiring no compulsory input arguments for the function. However, we will need to pass a specific value to a compulsory argument in order to call the function. In addition, the optional argument can also appear after the compulsory argument, if any.
  • Function body: This is the area where the main statement is executed to complete a specific action and fulfill the purpose of the function.
  • Return value: The value of the last statement evaluated within the function body, or the value explicitly passed to the return() function.
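
The parts listed above can be seen together in a very small function. This is an illustrative sketch only; power_of and its arguments are hypothetical names that do not appear elsewhere in the chapter:

```r
# base is compulsory (no default); exponent is optional (default of 2)
power_of <- function(base, exponent = 2) {
  # The last evaluated expression is returned, even without return()
  base^exponent
}

res_default <- power_of(3)                    # uses the default exponent
res_custom <- power_of(3, exponent = 3)       # overrides the default
res_named <- power_of(exponent = 3, base = 2) # named arguments can be given in any order

print(res_default)
print(res_custom)
print(res_named)
```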

Let’s go through an exercise on creating a user-defined function.

Exercise 1.19 – creating a user-defined function

Now, let’s try it out:

  1. Create a function named test_func to receive an input and return the message "(input) is fun!". Allow the option to return the message in uppercase:
    test_func = function(x, cap=FALSE){
      msg = paste(x,"is fun!")
      if(cap){
        msg = toupper(msg)
      }
      return(msg)
    }

    Note that we used the = sign instead of <- to assign the functional object to the test_func variable. However, the latter is more commonly observed when creating functions in R. In the input, we created two arguments: the compulsory argument, x, to receive the message to be printed, and the optional argument, cap, to determine whether the message needs to be converted into uppercase. The optional argument means that the user can either go with the default setting (that is, leaving the message as it is) by not supplying anything to this argument or overwrite the default behavior by explicitly passing in a value.

    In the function body, we first create a msg variable and assign the message content by calling the paste() function, a built-in function that concatenates its input arguments, separated by a space by default. If the cap argument is FALSE, the if statement will evaluate to FALSE and msg will be directly returned as the function’s output. Otherwise, the statement within the if clause will be triggered to convert the msg variable into uppercase using the toupper() function, another built-in function in R.

  2. Let’s see what happens after calling the function in different ways:
    >>> test_func("r")
    "r is fun!"
    >>> test_func("r",cap=TRUE)
    "R IS FUN!"
    >>> test_func()
    Error in paste(x, "is fun!") : argument "x" is missing, with no default

    The first two cases work as expected. In the third case, we did not supply any value to the x argument, defined as a compulsory argument. This leads to an error and fails to call the function.
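
If a friendlier error message is preferred, the function can check for the compulsory argument itself. The sketch below is one possible variation, not the chapter's own code: test_func_safe is a hypothetical name, and it relies on the base R functions missing(), stop(), and tryCatch():

```r
# Hypothetical variant of test_func that validates its compulsory argument
test_func_safe <- function(x, cap = FALSE) {
  if (missing(x)) {
    stop("please supply a value for x")
  }
  msg <- paste(x, "is fun!")
  if (cap) {
    msg <- toupper(msg)
  }
  return(msg)
}

# tryCatch() captures the error so that the script can continue
err_msg <- tryCatch(test_func_safe(), error = function(e) conditionMessage(e))
ok_msg <- test_func_safe("r", cap = TRUE)

print(err_msg)
print(ok_msg)
```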

Summary

In this chapter, we covered the essential building blocks in R, including how to leverage and navigate the RStudio IDE, basic arithmetic operations (addition, subtraction, multiplication, division, exponentiation, and modulo), and common data structures (vectors, matrices, data frames, and lists). We also covered control logic, including relational operators (>, ==, <, >=, <=, and !=) and logical operators (&, |, !, &&, and ||), conditional statements using if-else, the for and while loops, and finally, functions in R. Understanding these fundamental aspects will greatly benefit our learning in later chapters as we gradually introduce more challenging topics.

In the next chapter, we will cover dplyr, one of the most widely used libraries for data processing and manipulation. Tapping into the various utility functions provided by dplyr will make it much easier to handle most data processing tasks.


Key benefits

  • Advance your ML career with the help of detailed explanations, intuitive illustrations, and code examples
  • Gain practical insights into the real-world applications of statistics and machine learning
  • Explore the technicalities of statistics and machine learning for effective data presentation
  • Purchase of the print or Kindle book includes a free PDF eBook

Description

The Statistics and Machine Learning with R Workshop is a comprehensive resource packed with insights into statistics and machine learning, along with a deep dive into R libraries. The learning experience is further enhanced by practical examples and hands-on exercises that provide explanations of key concepts. Starting with the fundamentals, you’ll explore the complete model development process, covering everything from data pre-processing to model development. In addition to machine learning, you’ll also delve into R's statistical capabilities, learning to manipulate various data types and tackle complex mathematical challenges from algebra and calculus to probability and Bayesian statistics. You’ll discover linear regression techniques and more advanced statistical methodologies to hone your skills and advance your career. By the end of this book, you'll have a robust foundational understanding of statistics and machine learning. You’ll also be proficient in using R's extensive libraries for tasks such as data processing and model training and be well-equipped to leverage the full potential of R in your future projects.

Who is this book for?

This book is for beginner to intermediate-level data scientists, undergraduate to masters-level students, and early to mid-senior data scientists or analysts looking to expand their knowledge of machine learning by exploring various R libraries. Basic knowledge of linear algebra and data modeling is a must.

What you will learn

  • Hone your skills in different probability distributions and hypothesis testing
  • Explore the fundamentals of linear algebra and calculus
  • Master crucial statistics and machine learning concepts in theory and practice
  • Discover essential data processing and visualization techniques
  • Engage in interactive data analysis using R
  • Use R to perform statistical modeling, including Bayesian and linear regression

Product Details

Publication date : Oct 25, 2023
Length: 516 pages
Edition : 1st
Language : English
ISBN-13 : 9781803240305

Table of Contents

19 Chapters
Part 1: Statistics Essentials
Chapter 1: Getting Started with R
Chapter 2: Data Processing with dplyr
Chapter 3: Intermediate Data Processing
Chapter 4: Data Visualization with ggplot2
Chapter 5: Exploratory Data Analysis
Chapter 6: Effective Reporting with R Markdown
Part 2: Fundamentals of Linear Algebra and Calculus in R
Chapter 7: Linear Algebra in R
Chapter 8: Intermediate Linear Algebra in R
Chapter 9: Calculus in R
Part 3: Fundamentals of Mathematical Statistics in R
Chapter 10: Probability Basics
Chapter 11: Statistical Estimation
Chapter 12: Linear Regression in R
Chapter 13: Logistic Regression in R
Chapter 14: Bayesian Statistics
Index
Other Books You May Enjoy

Customer reviews

Rating distribution
4.5 out of 5 (6 Ratings)
5 star 50%
4 star 50%
3 star 0%
2 star 0%
1 star 0%

Top Reviews

Sangita Mahala Nov 27, 2023
Full star icon Full star icon Full star icon Full star icon Full star icon 5
This book is a highly recommended resource for anyone seeking a comprehensive and practical introduction to statistics and machine learning using R. Peng Liu's masterful guidance and engaging approach make this book an essential tool for data scientists of all levels.
Amazon Verified review Amazon
Steven Fernandes Dec 05, 2023
Full star icon Full star icon Full star icon Full star icon Full star icon 5
The book expertly bridges theory and practice in statistics and machine learning, focusing on R for practical application. It starts with probability distributions and hypothesis testing, building a foundation in linear algebra and calculus. The book excels in making complex statistical and machine learning concepts accessible, emphasizing data processing and visualization techniques. Interactive data analysis using R is a key feature, enhancing engagement and understanding. The detailed coverage of statistical modeling, including Bayesian and linear regression in R, makes it an indispensable resource for those aspiring to master data analysis in a hands-on, applied manner.
Amazon Verified review
Gustavo Feb 29, 2024
5 out of 5
I have received this book from Packt to provide my review. The book has a good R foundation for those who are not familiar with the language. It also brings a couple of Math/Algebra/Calculus chapters that don't have very strong application examples, but they are interesting and well explained. The Stats portion in the last part is very good, with many regression examples. Great book.
Amazon Verified review
Papu Siameja Feb 06, 2024
4 out of 5
The combination of a little bit of theory and practical examples makes this book a good introduction to the use of R for statistical analysis and machine learning. The examples are clear and easy to follow as the required R packages are clearly stated at the beginning of each chapter.
Feefo Verified review
H2N Nov 16, 2023
4 out of 5
This book is an excellent introduction for early-career data scientists and undergraduate students with a basic grasp of linear algebra and modeling. It starts with the essentials of R programming, focusing on key data structures and logical operations. The journey continues with data processing techniques using dplyr, covering transformations and aggregations. The reader is then guided through more complex data processing and quality enhancement methods. Data visualization is masterfully explained through ggplot2, from elementary to sophisticated techniques. The book also delves into exploratory data analysis, R Markdown for interactive documents, and advanced topics such as linear algebra, calculus in R, probability, statistical estimation, and regression models, finishing with Bayesian statistics. This comprehensive guide is invaluable for practical R applications in data science, though readers may long for a Python version.
Amazon Verified review

FAQs

What is the delivery time and cost of print books?

Shipping Details

USA:

Economy: Delivery to most addresses in the US within 10-15 business days.

Premium: Trackable delivery to most addresses in the US within 3-8 business days.

UK:

Economy: Delivery to most addresses in the UK within 7-9 business days.
Shipments are not trackable.

Premium: Trackable delivery to most addresses in the UK within 3-4 business days.
Add one extra business day for deliveries to Northern Ireland and the Scottish Highlands and Islands.

EU:

Premium: Trackable delivery to most EU destinations within 4-9 business days.

Australia:

Economy: Can deliver to P.O. boxes and private residences.
Trackable service with delivery to addresses in Australia only.
Delivery time ranges from 7-9 business days for VIC and 8-10 business days for interstate metro areas.
Delivery time is up to 15 business days for remote areas of WA, NT & QLD.

Premium: Delivery to addresses in Australia only.
Trackable delivery to most P.O. boxes and private residences in Australia within 4-5 days of dispatch, depending on the distance to the destination.

India:

Premium: Delivery to most Indian addresses within 5-6 business days

Rest of the World:

Premium: Countries in the American continent: Trackable delivery to most countries within 4-7 business days

Asia:

Premium: Delivery to most Asian addresses within 5-9 business days

Disclaimer:
Orders received before 5 PM UK time start printing on the next business day, so estimated delivery times also start from that day. Orders received after 5 PM UK time (as recorded in our internal systems) on a business day, or at any time on the weekend, begin printing on the second business day after the order. For example, an order placed at 11 AM today will begin printing tomorrow, whereas an order placed at 9 PM tonight will begin printing the day after tomorrow.


Unfortunately, due to several restrictions, we are unable to ship to the following countries:

  1. Afghanistan
  2. American Samoa
  3. Belarus
  4. Brunei Darussalam
  5. Central African Republic
  6. The Democratic Republic of Congo
  7. Eritrea
  8. Guinea-Bissau
  9. Iran
  10. Lebanon
  11. Libyan Arab Jamahiriya
  12. Somalia
  13. Sudan
  14. Russian Federation
  15. Syrian Arab Republic
  16. Ukraine
  17. Venezuela
What is a customs duty/charge?

Customs duties are charges levied on goods when they cross international borders. They are a tax imposed on imported goods. These duties are charged by special authorities and bodies created by local governments and are meant to protect local industries, economies, and businesses.

Do I have to pay customs charges for the print book order?

Orders shipped to countries listed under EU27 will not incur customs charges. Packt pays these charges as part of the order.

List of EU27 countries: www.gov.uk/eu-eea

For shipments to countries outside the EU27, customs duty or local taxes may apply. These are charged by the recipient country, must be paid by the customer, and are not included in the shipping charges on the order.

How do I know my customs duty charges?

The amount of duty payable varies greatly depending on the imported goods, the country of origin, and several other factors, such as the total invoice amount, dimensions such as weight, and other criteria applicable in your country.

For example:

  • If you live in Mexico and the declared value of your ordered items is over $ 50, you will have to pay an additional import tax of 19% to the courier service to receive your package (for example, $ 9.50 on a declared value of $ 50).
  • If you live in Turkey and the declared value of your ordered items is over € 22, you will have to pay an additional import tax of 18% to the courier service to receive your package (for example, € 3.96 on a declared value of € 22).
How can I cancel my order?

Cancellation Policy for Published Printed Books:

You can cancel any order within 1 hour of placing it. Simply contact customercare@packt.com with your order details or payment transaction ID. If your order has already entered the shipment process, we will do our best to stop it. However, if it is already on its way to you, you can contact us at customercare@packt.com once you receive it and use the returns and refund process.

Please understand that Packt Publishing cannot provide refunds or cancel any order except in the cases described in our Return Policy (i.e., Packt Publishing agrees to replace your printed book because it arrives damaged or with a material defect). Otherwise, Packt Publishing will not accept returns.

What is your returns and refunds policy?

Return Policy:

We want you to be happy with your purchase from Packtpub.com. We will not hassle you with returning print books to us. If the print book you receive from us is incorrect, damaged, doesn't work, or is unacceptably late, please contact the Customer Relations Team at customercare@packt.com with the order number and issue details, as explained below:

  1. If you ordered an item (eBook, Video, or Print Book) incorrectly or accidentally, please contact the Customer Relations Team at customercare@packt.com within one hour of placing the order, and we will replace the item or refund its cost.
  2. If your eBook or Video file is faulty, or a fault occurs while the eBook or Video is being made available to you (i.e., during download), contact the Customer Relations Team at customercare@packt.com within 14 days of purchase, and they will resolve the issue for you.
  3. You will have a choice of replacement or refund for the problem items (damaged, defective, or incorrect).
  4. Once the Customer Care Team confirms that you will be refunded, you should receive the refund within 10 to 12 working days.
  5. If you are requesting a refund for only one book from a multi-item order, we will refund the appropriate single item.
  6. Where the items were shipped under a free shipping offer, there will be no shipping costs to refund.

On the off chance your printed book arrives damaged or with a material defect, contact our Customer Relations Team at customercare@packt.com within 14 days of receiving the book, with appropriate evidence of the damage, and we will work with you to secure a replacement copy if necessary. Please note that each printed book you order from us is individually made on a print-on-demand basis by Packt's professional book-printing partner.

What tax is charged?

Currently, no tax is charged on the purchase of any print book (subject to change based on laws and regulations). A localized VAT fee is charged only to our European and UK customers on the eBooks, videos, and subscriptions that they buy. GST is charged to Indian customers for eBook and video purchases.

What payment methods can I use?

You can pay with the following payment methods:

  1. Visa Debit
  2. Visa Credit
  3. MasterCard
  4. PayPal