Machine Learning with R Cookbook, Second Edition
Analyze data and build predictive models

Yu-Wei, Chiu (David Chiu)
Paperback Oct 2017 572 pages 2nd Edition

Practical Machine Learning with R

In this chapter, we will cover the following topics:

  • Downloading and installing R
  • Downloading and installing RStudio
  • Installing and loading packages
  • Understanding basic data structures
  • Basic commands for subsetting
  • Reading and writing data
  • Manipulating data
  • Applying basic statistics
  • Visualizing data
  • Getting a dataset for machine learning

Introduction

The aim of machine learning is to uncover hidden patterns and unknown correlations, and to find useful information from data. In addition to this, through incorporation with data analysis, machine learning can be used to perform predictive analysis. With machine learning, the analysis of business operations and processes is not limited to human scale thinking; machine scale analysis enables businesses to capture hidden values in big data.

Machine learning has similarities to the human reasoning process. Unlike traditional analysis, the generated model can evolve as data is accumulated. Machine learning can learn from the data that is processed and analyzed. In other words, the more data that is processed, the more it can learn.

R, as a dialect of GNU-S, is a powerful statistical language that can be used to manipulate and analyze data. Additionally, R provides many machine learning packages and visualization functions, which enable users to analyze data on the fly. Most importantly, R is open source and free.

Using R greatly simplifies machine learning. All you need to know is how each algorithm can solve your problem and then you can simply use a written package to quickly generate prediction models on data with a few command lines. For example, you can perform Naïve Bayes for spam mail filtering, conduct k-means clustering for customer segmentation, use linear regression to forecast house prices, or implement a hidden Markov model to predict the stock market, as shown in the following screenshot:

Stock market prediction using R

Moreover, you can perform nonlinear dimension reduction to calculate the dissimilarity of image data and visualize the clustered graph, as shown in the following screenshot. All you need to do is follow the recipes provided in this book:

A clustered graph of face image data

This chapter serves as an overall introduction to machine learning and R; the first few recipes introduce how to set up the R environment and the integrated development environment, RStudio. After setting up the environment, the following recipe introduces package installation and loading. In order to understand how data analysis is practiced using R, the next four recipes cover data read/write, data manipulation, basic statistics, and data visualization using R. The last recipe in the chapter lists useful data sources and resources.

Downloading and installing R

To use R, you must first install it on your computer. This recipe gives detailed instructions on how to download and install R.

Getting ready

If you are new to the R language, you can find a detailed introduction, language history, and functionality on the official website (http://www.r-project.org/). When you are ready to download and install R, please access the following link: http://cran.r-project.org/.

How to do it...

Please perform the following steps to download and install R for Windows and macOS:

  1. Go to the R project website, http://www.r-project.org/, and click on the download R link (http://cran.r-project.org/mirrors.html):
R Project home page
  2. Select the mirror location closest to you:
CRAN mirrors
  3. Select the correct download link based on your operating system:
Click on the download link based on your OS

As the installation of R differs for Windows and macOS, the steps required to install R for each OS are provided here.

For Windows:

  1. Click on Download R for Windows, as shown in the following screenshot, and then click on base:
  2. Click on Download R 3.x.x for Windows:
  3. The installation file will be downloaded. Once the download is finished, double-click on the installation file to begin installing R. The installer first asks you to select the setup language:
Installation step - Selecting Language
  4. The next screen will be an installation screen; click on Next on all screens to complete the installation. Once installed, you can see the shortcut icon on the desktop:
R icon for 32 bit and 64 bit on desktop
  5. Double-click on the icon and it will open the R Console:
The Windows R Console

For macOS:

  1. Go to Download R for (Mac) OS X, as shown in the following screenshot.
  2. Click on the latest version (a file with the .pkg extension, such as R-3.4.1.pkg) according to your macOS version:
  3. Double-click on the downloaded installation file (.pkg extension) and begin to install R. Leave all the installation options as the default settings if you do not want to make any changes:
  4. Follow the onscreen instructions through Introduction, Read Me, License, Destination Select, Installation Type, Installation, and Summary, and click on Continue to complete the installation.
  5. After the file is installed, you can use Spotlight search or go to the Applications folder to find R:
Use spotlight search to find R
  6. Click on R to open the R Console:

As an alternative to downloading a Mac .pkg file to install R, Mac users can also install R using Homebrew:

  1. Download XQuartz-2.X.X.dmg from https://xquartz.macosforge.org/landing/.
  2. Double-click on the .dmg file to mount it.
  3. Update brew with the following command line:
        $ brew update
  4. Clone the repository and symlink all its formulae to homebrew/science:
        $ brew tap homebrew/science
  5. Install gfortran:
        $ brew install gfortran
  6. Install R:
        $ brew install R

For Linux users, there are precompiled binaries for Debian, RedHat, SUSE, and Ubuntu. Alternatively, you can install R from source code or through a package manager. Here are the installation steps for Ubuntu and CentOS.

Downloading and installing R on Ubuntu:

  1. Add the entry to the /etc/apt/sources.list file, replacing the values in <> with the appropriate values:
        $ sudo sh -c "echo 'deb http://<cran mirror site url>/bin/linux/ubuntu <ubuntu version>/' >> /etc/apt/sources.list"
  2. Then, update the repository:
        $ sudo apt-get update
  3. Install R with the following command:
        $ sudo apt-get install r-base
  4. Start R in the command line:
        $ R

Downloading and installing R on CentOS 5:

  1. Get the RHEL EPEL repository rpm for CentOS 5:
        $ wget http://dl.fedoraproject.org/pub/epel/5/x86_64/epel-release-5-4.noarch.rpm
  2. Install the CentOS 5 RHEL EPEL repository:
        $ sudo rpm -Uvh epel-release-5-4.noarch.rpm
  3. Update the installed packages:
        $ sudo yum update
  4. Install R through the repository:
        $ sudo yum install R
  5. Start R in the command line:
        $ R

Downloading and installing R on CentOS 6:

  1. Get the RHEL EPEL repository rpm for CentOS 6:
        $ wget http://dl.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
  2. Install the CentOS 6 RHEL EPEL repository:
        $ sudo rpm -Uvh epel-release-6-8.noarch.rpm
  3. Update the installed packages:
        $ sudo yum update
  4. Install R through the repository:
        $ sudo yum install R
  5. Start R in the command line:
        $ R

Downloading and installing R on Fedora [Latest Version]:

$ dnf install R  

This will install R and all its dependencies.

How it works...

CRAN provides precompiled binaries for Linux, macOS X, and Windows. For macOS and Windows users, the installation procedures are straightforward. You can generally follow onscreen instructions to complete the installation. For Linux users, you can use the package manager provided for each platform to install R or build R from the source code.
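
Once the installation finishes, one quick way to confirm that it worked is to ask R itself which version is running. The following is a minimal sketch that uses only base R; the threshold "3.0.0" is an arbitrary value chosen for illustration and is not a requirement stated in this recipe.

```r
# Print the full version string of the running R installation
print(R.version.string)

# getRversion() returns a version object that supports comparisons
current <- getRversion()

# Compare against an illustrative minimum version (assumed, not from the recipe)
meets_minimum <- current >= "3.0.0"
print(meets_minimum)
```

These lines can be typed at the R prompt on any platform, or run from a shell with Rscript.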

See also

Downloading and installing RStudio

To write an R script, one can use the R Console, R Commander, or any text editor (such as Emacs, Vim, or Sublime Text). However, the assistance of RStudio, an integrated development environment (IDE) for R, can make development a lot easier.

RStudio provides comprehensive facilities for software development. Built-in features, such as syntax highlighting, code completion, and smart indentation, help maximize productivity. To make R programming more manageable, RStudio also integrates the main interface into a four-panel layout. It includes an interactive R Console, a tabbed source code editor, a panel for the currently active objects/history, and a tabbed panel for the file browser/plot window/package install window/R help window. Moreover, RStudio is open source and is available for many platforms, such as Windows, macOS X, and Linux. This recipe shows how to download and install RStudio.

Getting ready

RStudio requires a working R installation; when RStudio loads, it must be able to locate a version of R. You must therefore have completed the previous recipe with R installed on your OS before proceeding to install RStudio.

How to do it...

Perform the following steps to download and install RStudio for Windows and macOS users:

  1. Access RStudio's official site by using the following URL: http://www.rstudio.com/products/RStudio/
RStudio home page
  2. For the desktop version installation, click on RStudio Desktop under the Desktop section. It will redirect you to the bottom of the home page:
  3. Click on the DOWNLOAD RSTUDIO DESKTOP button (http://www.rstudio.com/products/rstudio/download/). It will display the download page, with the options of an open source license and a commercial license. Scroll down to RStudio Desktop Open Source License and click on the DOWNLOAD button:
RStudio Download page
  4. The page lists installers for different operating systems. Select the appropriate option and download RStudio:
RStudio Download page
  5. Install RStudio by double-clicking on the downloaded package. For Windows users, follow the onscreen instructions to install the application:
RStudio Installation page
  6. For Mac users, simply drag the RStudio icon to the Applications folder.
  7. Start RStudio:
The RStudio console

Perform the following steps for downloading and installing RStudio for Ubuntu/Debian and RedHat/CentOS users:

For Debian(6+)/Ubuntu(10.04+) 32 bit:

$ wget http://download1.rstudio.org/rstudio-0.98.1091-i386.deb
$ sudo gdebi rstudio-0.98.1091-i386.deb

For Debian(6+)/Ubuntu(10.04+) 64 bit:

$ wget http://download1.rstudio.org/rstudio-0.98.1091-amd64.deb
$ sudo gdebi rstudio-0.98.1091-amd64.deb

For RedHat/CentOS(5.4+) 32 bit:

$ wget http://download1.rstudio.org/rstudio-0.98.1091-i686.rpm
$ sudo yum install --nogpgcheck rstudio-0.98.1091-i686.rpm

For RedHat/CentOS(5.4+) 64 bit:

$ wget http://download1.rstudio.org/rstudio-0.98.1091-x86_64.rpm
$ sudo yum install --nogpgcheck rstudio-0.98.1091-x86_64.rpm

How it works...

The RStudio program can be run on the desktop or through a web browser. The desktop version is available for the Windows, macOS X, and Linux platforms with similar operations across all platforms. For Windows and macOS users, after downloading the precompiled package of RStudio, follow the onscreen instructions, shown in the preceding steps, to complete the installation. Linux users may use the package management system provided for installation.

See also

  • In addition to the desktop version, users may install a server version to provide access to multiple users. The server version provides a URL that users can access to use the RStudio resources. To install RStudio Server, please refer to the following link: http://www.rstudio.com/ide/download/server.html. This page provides installation instructions for the following Linux distributions: Debian (6+), Ubuntu (10.04+), RedHat, and CentOS (5.4+).
  • For other Linux distributions, you can build RStudio from the source code.

Installing and loading packages

After successfully installing R, users can download, install, and update packages from the repositories. As R allows users to create their own packages, official and non-official repositories are provided to manage these user-created packages. CRAN is the official R package repository. Currently, the CRAN package repository features 11,589 available packages (as of 10/11/2017). Through the use of the packages provided on CRAN, users may extend the functionality of R to machine learning, statistics, and related purposes. CRAN is a network of FTP and web servers around the world that store identical, up-to-date versions of code and documentation for R. You may select the closest CRAN mirror to your location to download packages.

Getting ready

Start an R session on your host computer.

How to do it...

Perform the following steps to install and load R packages:

  1. List the installed packages:
        > library()
  2. Set the default CRAN mirror:
        > chooseCRANmirror()

R will return a list of CRAN mirrors, and then ask the user to either type a mirror ID to select it, or enter zero to exit:

  3. Install a package from CRAN; take package e1071 as an example:
        > install.packages("e1071")
  4. Update a package from CRAN; take package e1071 as an example. The package name must be passed through the oldPkgs argument, because the first argument of update.packages is the library location:
        > update.packages(oldPkgs = "e1071")
  5. Load the package:
        > library(e1071)
  6. If you would like to view the documentation of the package, you can use the help function:
        > help(package = "e1071")
  7. If you would like to view the documentation of a function, you can use the help function with the function name and the package name:
        > help(svm, e1071)
  8. Alternatively, you can use the help shortcut, ?, to view the help document for this function:
        > ?e1071::svm
  9. If the function does not provide any documentation, you may want to search the supplied documentation for a given keyword. For example, if you wish to search for documentation related to svm:
        > help.search("svm")
  10. Alternatively, you can use ?? as the shortcut for help.search:
        > ??svm
  11. To view the arguments taken by a function, simply use the args function. For example, if you would like to know the arguments taken by the lm function:
        > args(lm)
  12. Some packages provide examples and demos; you can use example or demo to view them. For example, one can view an example of the lm function and a demo of the graphics package by typing the following commands:
        > example(lm)
        > demo(graphics)
  13. To view all the available demos, call the demo function without arguments:
        > demo()

How it works...

This recipe first introduces how to list installed packages, install packages from CRAN, and load new packages. Before installing packages, those of you who are interested in the listing of the CRAN packages can refer to http://cran.r-project.org/web/packages/available_packages_by_name.html.

When a package is installed, documentation related to the package is also provided. You are, therefore, able to view the documentation or the related help pages of installed packages and functions. Additionally, demos and examples are provided by packages that can help users understand the capability of the installed package.
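
A common pattern that builds on these steps is to install a package only when it is missing. The following is a minimal sketch; the helper name ensure_package is hypothetical (it is not part of R or of this recipe), and it assumes a CRAN mirror is configured and reachable whenever an installation is actually needed.

```r
# Hypothetical helper: install a package only when it is not already available
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # needs a CRAN mirror and an internet connection
  }
  # Report whether the package can now be loaded
  requireNamespace(pkg, quietly = TRUE)
}

# "stats" ships with R, so nothing is downloaded in this call
available <- ensure_package("stats")
print(available)
```

Calling ensure_package("e1071") would follow the same path and download the package only on the first run.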

See also

  • Besides installing packages from CRAN, there are other R package repositories, including Crantastic, a community site for rating and reviewing CRAN packages, and R-Forge, a central platform for the collaborative development of R packages. In addition to this, Bioconductor provides R packages for the analysis of genomic data.
  • If you would like to find relevant functions and packages, please visit the list of task views at http://cran.r-project.org/web/views/, or search for keywords at http://rseek.org.

Understanding basic data structures

Ensure you have completed the previous recipes by installing R on your operating system.

Data types

You need a basic understanding of the data types and structures in R in order to follow the recipes in this book. This section gives you an overview of them and prepares you for using R. R supports the basic data types found in most programming and scripting languages. In simple terms, data can be of the numeric, character, date, or logical type. As the names suggest, numeric covers all types of numbers, while logical allows only TRUE and FALSE. To check the type of data, use the class function, which displays the class of the data.

Perform the following task in the R Console or RStudio:

> x=123 
> class(x) 
Output: 
[1] "numeric"
> x="ABC"
> class(x)
Output:
[1] "character"
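
The same check can be extended to the other basic types mentioned above. The following is a small sketch using base R only; the sample values, including the date, are arbitrary and chosen for illustration.

```r
# A list can hold values of different types, so each keeps its own class
values <- list(42.5, 7L, "ABC", TRUE, as.Date("2017-10-11"))

# Apply class() to every value and collect the results
types <- sapply(values, class)
print(types)

# is.* functions test for a type; integers also count as numeric
int_is_numeric <- is.numeric(7L)
print(int_is_numeric)
```

Note that the L suffix requests an integer; a plain 7 would be stored as numeric.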

Data structures

R supports different types of data structures to store and process data. The following is a list of basic and commonly used data structures used in R:

  • Vectors
  • List
  • Array
  • Matrix
  • DataFrames

Vectors

A vector is a container that stores data of the same type. It can be thought of as a traditional one-dimensional array in other programming languages. It should not be confused with a mathematical row or column vector; an R vector is simply an ordered sequence of elements with no row or column orientation. To create a vector, use the c() function, which combines its arguments. One of the beautiful features of vectors is that any operation performed on a vector is performed on each of its elements. If a vector consists of three elements, adding two will increase every element by two.

How to do it...

Perform the following steps to create and inspect a vector in R:

> x=c(1,2,3,4) # same as c(1:4)
> x
Output:
[1] 1 2 3 4
> z=c(1,2,3,4,"ABC") # mixing types turns every element into a character
> z
Output:
[1] "1" "2" "3" "4" "ABC"
> x * 2 # x is still numeric
Output:
[1] 2 4 6 8
> sqrt(x)
Output:
[1] 1.000000 1.414214 1.732051 2.000000
> y = x==2
> y
Output:
[1] FALSE TRUE FALSE FALSE
> class(y)
Output:
[1] "logical"
> t = c(1:10)
> t
Output:
[1] 1 2 3 4 5 6 7 8 9 10

How it works...

Printing a vector starts with the index [1], which shows that the elements of a vector are indexed and that indexing starts from 1, not from 0 as in many other languages. Any operation performed on a vector is applied to its individual elements, so the multiplication is applied to each element in turn. If a vector is passed as an argument to a built-in function such as sqrt, the function is likewise applied to each element. This is powerful, as it removes the need to write loops for such operations. A vector changes its type on the basis of the data it holds; for example, combining the string "ABC" with numbers stores every element as a character. Using x==2 checks each element of the vector for equality with two and returns a vector of logical values, that is, TRUE or FALSE. There are many other ways of creating a vector; one such way is shown in creating the vector t.
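
The element-wise behavior and the type coercion described above can be seen side by side in a short sketch. The variable names doubled, shifted, big, and mixed are illustrative only.

```r
x <- c(1, 2, 3, 4)

# Arithmetic is applied to every element; no explicit loop is needed
doubled <- x * 2
shifted <- x + 2

# A comparison returns a logical vector that can be used as a filter
big <- x[x > 2]

# Mixing types coerces every element to the most general type
mixed <- c(1, 2, "ABC")

print(doubled)
print(shifted)
print(big)
print(class(mixed))
```

Because mixed is a character vector, an expression such as mixed * 2 would fail with an error, which is why numeric and character data are best kept in separate vectors.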

Lists

Unlike a vector, a list can store any type of data. A list is, again, a container that can store arbitrary data. A list can contain another list, a vector, or any other data structure. To create a list, the list function is used.

How to do it...

Perform the following steps to create and see a list in R:

> y = list(1,2,3,"ABC")
> y
Output:
[[1]]
[1] 1
[[2]]
[1] 2
[[3]]
[1] 3
[[4]]
[1] "ABC"
> y = list(c(1,2,3),c("A","B","C"))
> y
Output:
[[1]]
[1] 1 2 3
[[2]]
[1] "A" "B" "C"

How it works...

A list, as said, can contain anything; we start with a simple example to store some elements in a list using the list function. In the next step, we create a list with a vector as element of the list. So, y is a list with its first element as vector of 1, 2, 3 and its second element as vector of A, B, and C.
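
Once a list exists, its elements can be pulled out again by position or by name. The following is a minimal sketch; the element names numbers and letters are assumptions added for illustration and do not appear in the recipe.

```r
# A list whose elements are vectors of different types
y <- list(numbers = c(1, 2, 3), letters = c("A", "B", "C"))

# Double brackets return the element itself
first <- y[[1]]

# The $ operator selects an element by name; it can then be indexed
second_letter <- y$letters[2]

# Single brackets return a shorter list, not the element
sub <- y[1]

print(first)
print(second_letter)
print(class(sub))
```

The difference between y[[1]] and y[1] matters in practice: the former is a numeric vector, while the latter is still a list of length one.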

Array

An array is nothing but a multidimensional vector, and can store only the same type of data. The dimensions of the array are specified using the dim argument.

How to do it...

Perform the following steps to create and see an array in R:

> t = array(c(1,2,3,4), dim=c(2,2)) # Create a two-dimensional array
> t
Output:
     [,1] [,2]
[1,]    1    3
[2,]    2    4
> arr = array(c(1:24), dim=c(3,4,2)) # Create a three-dimensional array
> arr
Output:
, , 1

     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12

, , 2

     [,1] [,2] [,3] [,4]
[1,]   13   16   19   22
[2,]   14   17   20   23
[3,]   15   18   21   24

How it works...

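Creating an array is straightforward. Use the array function and provide the data along with the dim argument, which gives the size of each dimension; for example, dim=c(2,2) creates a two-dimensional array with two rows and two columns. The values are filled in column by column.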
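
Elements of an array are addressed with one index per dimension. The following sketch rebuilds the three-dimensional array from this recipe and reads a single element and a whole slice; the names elem and slice are illustrative.

```r
# 24 values arranged as 3 rows, 4 columns, and 2 slices
arr <- array(1:24, dim = c(3, 4, 2))

# One index per dimension: row 2, column 3, slice 1
elem <- arr[2, 3, 1]

# Leaving an index empty selects everything along that dimension
slice <- arr[, , 2]

print(dim(arr))
print(elem)
print(dim(slice))
```

Because slice keeps only two dimensions, it behaves like an ordinary matrix.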

Matrix

A matrix is like a DataFrame, with the constraint that every element must be of the same type.

How to do it...

Perform the following steps to create and see a matrix in R:

> m = matrix(c(1,2,3,4,5,6), nrow=3)
> m
Output:
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
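
The output shows that matrix fills the values column by column. As a hedged sketch of the alternative, the byrow argument of matrix fills row by row instead; the name m_by_row is illustrative.

```r
values <- c(1, 2, 3, 4, 5, 6)

# Default: values are placed down the columns
m <- matrix(values, nrow = 3)

# byrow = TRUE places the values across the rows instead
m_by_row <- matrix(values, nrow = 3, byrow = TRUE)

# t() transposes a matrix, swapping rows and columns
m_t <- t(m)

print(m_by_row)
print(dim(m_t))
```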

DataFrame

DataFrame can be seen as an Excel spreadsheet, with rows and columns where every column can have different data types. In R, each column of a DataFrame is a vector.

How to do it...

Perform the following steps:

    > p = c(1,2,3)
    > q = c("A","B","C")
    > r = c(TRUE, FALSE, FALSE)
    > d = data.frame(No=p, Name=q, Attendance=r)
    > d
    Output:
           No       Name   Attendance
    1       1          A      TRUE
    2       2          B      FALSE
    3       3          C      FALSE
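
To confirm that each column keeps its own type, the column classes can be inspected. The following sketch recreates the DataFrame from this recipe; passing stringsAsFactors = FALSE is an assumption made here so that the Name column is stored as character in every R version.

```r
d <- data.frame(No = c(1, 2, 3),
                Name = c("A", "B", "C"),
                Attendance = c(TRUE, FALSE, FALSE),
                stringsAsFactors = FALSE)

# One class per column
col_types <- sapply(d, class)
print(col_types)

# Number of rows and columns
dims <- dim(d)
print(dims)

# Compact overview of the structure
str(d)
```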

Basic commands for subsetting

R provides several ways to slice data or extract a subset of it.

How to do it...

Perform the following steps to see subsetting. It is assumed that the DataFrame d and matrix m exist from the previous exercise:

> d$No   # Slice the column
Output:
[1] 1 2 3
> d$Name  # Slice the column
Output:
[1] A B C
> d$Name[1]
Output:
[1] A
> d[2,]  # Get a row
Output:
  No Name Attendance
2  2    B      FALSE
> temp = c(1:100) # Creates a vector of 100 elements from 1 to 100
> temp[14:16] # Part of the vector
Output:
[1] 14 15 16
> m[,2] # Access the second column of matrix m
Output:
[1] 4 5 6
> m[3,] # Access the third row of matrix m
Output:
[1] 3 6
> m[2,1] # Access a single element of matrix m
Output:
[1] 2
> m[c(1,3), c(2)] # Access [1,2] and [3,2]
Output:
[1] 4 6
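
Besides positions and names, rows can be selected with a logical condition. The following is a minimal sketch that recreates the DataFrame d so that it runs on its own; stringsAsFactors = FALSE and the result names are assumptions made for illustration.

```r
d <- data.frame(No = c(1, 2, 3),
                Name = c("A", "B", "C"),
                Attendance = c(TRUE, FALSE, FALSE),
                stringsAsFactors = FALSE)

# A logical vector in the row position keeps only the TRUE rows
present <- d[d$Attendance, ]

# subset() expresses the same idea and can also pick columns
later <- subset(d, No > 1, select = c(No, Name))

# Combining a condition with a column name returns a plain vector
names_only <- d[d$No > 1, "Name"]

print(present)
print(later)
print(names_only)
```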

Data input

R provides various ways to read data for processing. It supports reading data from CSV files, Excel files, databases, other statistical tools, binary files, and websites. Apart from this, many datasets come bundled with R; execute the data() command in RStudio or at the R prompt to list the available datasets. If you want to create a quick dataset, you can create a blank DataFrame and use the edit command, assigning the result back so that the changes are kept, as shown here:

> temp = data.frame()
> temp = edit(temp)

This will open an Excel-like screen for data manipulation, as shown in the following screenshot:

Reading and writing data

Before starting to explore data, you must load the data into the R session. This recipe introduces methods to load data from a file into memory, use the predefined datasets within R, and read data from a database.

Getting ready

First, start an R session on your machine. As this recipe involves steps toward the file I/O, if the user does not specify the full path, read and write activity will take place in the current working directory. For working with databases, it is assumed you have working PostgreSQL on your system with some data.

You can simply type getwd() in the R session to obtain the current working directory location. However, if you would like to change the current working directory, you can use setwd("<path>"), where <path> can be replaced with your desired path, to specify the working directory.
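
The following short sketch shows one way to switch the working directory temporarily and restore it afterwards. It uses tempdir(), the session's temporary directory, as a stand-in for your own path; that choice is an assumption made so the example runs anywhere.

```r
# Remember where the session currently is
old_dir <- getwd()

# Move to another directory (here, the session's temporary directory)
setwd(tempdir())
in_temp <- normalizePath(getwd()) == normalizePath(tempdir())
print(in_temp)

# Switch back so that later file reads and writes are unaffected
setwd(old_dir)
restored <- normalizePath(getwd()) == normalizePath(old_dir)
print(restored)
```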

How to do it...

Perform the following steps to read and write data with R:

  1. To view the built-in datasets of R, type the following command:
        > data()
     R will return a list of the datasets in the datasets package, and the list comprises the name and description of each dataset.
  2. To load the dataset iris into an R session, type the following command:
        > data(iris)
     The dataset iris is now loaded as a DataFrame, which is a common data structure in R to store a data table.
  3. To view the data type of iris, simply use the class function:
        > class(iris)
        [1] "data.frame"
     The console output shows that the iris dataset is a DataFrame.
  4. Use the save function to store an object in a file. For example, to save the loaded iris data into myData.RData, use the following command:
        > save(iris, file="myData.RData")
  5. Use the load function to read a saved object into an R session. For example, to load iris data from myData.RData, use the following command:
        > load("myData.RData")
  6. In addition to using built-in datasets, R also provides a function to import data from text into a DataFrame. For example, the read.table function can format a given text into a DataFrame:
        > test.data = read.table(header = TRUE, text = "
        + a b
        + 1 2
        + 3 4
        + ")
  7. You can also use row.names and col.names to specify the names of rows and columns:
        > test.data = read.table(text = "
        + 1 2
        + 3 4",
        + col.names=c("a","b"),
        + row.names = c("first","second"))
  8. View the class of the test.data variable:
        > class(test.data)
        [1] "data.frame"
     The class function shows that the test.data variable contains a DataFrame.
  9. In addition to importing data by using the read.table function, you can use the write.table function to export data to a text file:
        > write.table(test.data, file = "test.txt" , sep = " ")
     The write.table function will write the content of test.data into test.txt (the written path can be found by typing getwd()), with white space as the delimiter.
  10. Similar to write.table, write.csv can also export data to a file. However, write.csv uses a comma as the default delimiter:
        > write.csv(test.data, file = "test.csv")
  11. With the read.csv function, the csv file can be imported as a DataFrame. However, the last example writes the column and row names of the DataFrame to the test.csv file. Therefore, setting header to TRUE and specifying the first column as the row names ensures that the resulting DataFrame does not treat the header and the first column as values:
        > csv.data = read.csv("test.csv", header = TRUE, row.names=1)
        > head(csv.data)
               a b
        first  1 2
        second 3 4

This section covers how to work with a database. To connect to PostgreSQL, the RPostgreSQL package is required, which can be installed using this command:

> install.packages("RPostgreSQL")

This installs the package on your system; you need an active internet connection for the command to complete. Once it is installed, you can use the package to access the database. You need a username, password, and database name to access PostgreSQL. Replace the parameter values in the dbConnect function with your own:

> require("RPostgreSQL")
> driver = dbDriver("PostgreSQL")
> connection = dbConnect(driver, dbname="restapp", host="localhost",
         port=5432, user="postgres", password="postgres")
> dbExistsTable(connection, "country")
[1] TRUE

TRUE shows that the table exists in the database. To query the table, use dbGetQuery:

> data = dbGetQuery(connection, "select * from country")
> class(data)
Output:
[1] "data.frame"
> data
Output:
  id code    name
1  1   US     USA
2 43   AS Austria
3 55   BR  Brazil

Reading table data results in a DataFrame in R.

How it works...

Generally, data for collection may be in multiple files and different formats. To exchange data between files and RData, R provides many built-in functions, such as save, load, read.csv, read.table, write.csv, and write.table.

This example first demonstrates how to load the built-in dataset iris into an R session. The iris dataset is one of the most famous and commonly used datasets in the field of machine learning, which is why we use it as an example here. The recipe shows how to save RData and load it with the save and load functions. Furthermore, the example explains how to use read.table, write.table, read.csv, and write.csv to exchange data between files and a DataFrame. The R I/O functions are very important, as most data sources are external. Therefore, you have to use these functions to load data into an R session.
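
The write and read steps can be checked with a small round trip. The following sketch writes a DataFrame to a temporary CSV file and reads it back; using tempfile() instead of a file in the working directory is an assumption made so that the example leaves nothing behind.

```r
# A small DataFrame with named rows, mirroring test.data from the recipe
test.data <- data.frame(a = c(1, 3), b = c(2, 4),
                        row.names = c("first", "second"))

# Write to a temporary file; row names become the first column
csv_path <- tempfile(fileext = ".csv")
write.csv(test.data, file = csv_path)

# Read back, treating the first column as row names
csv.data <- read.csv(csv_path, header = TRUE, row.names = 1)

# Compare values and row names with the original
same_values <- all(test.data == csv.data)
same_rows <- identical(rownames(csv.data), rownames(test.data))
print(same_values)
print(same_rows)

# Remove the temporary file
unlink(csv_path)
```

Omitting row.names = 1 when reading would instead produce an extra column holding "first" and "second".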

To read from a database, you need to install the package for that database. Packages exist for most databases, and once the package is installed, the steps for reading data remain mostly the same.

There's more...

For the load, read.table, and read.csv functions, the file to be read can also be a complete URL (for supported URLs, use ?url for more information).

On some occasions, data may be in an Excel file instead of a flat text file. The WriteXLS package allows writing an object into an Excel file with a given variable in the first argument and the file to be written in the second argument:

  1. Install the WriteXLS package:
        > install.packages("WriteXLS")
  2. Load the WriteXLS package:
        > library("WriteXLS")
  3. Use the WriteXLS function to write the DataFrame iris into a file named iris.xls:
        > WriteXLS("iris", ExcelFileName="iris.xls")

Manipulating data

This recipe will discuss how to use the built-in R functions to manipulate data. As data manipulation is the most time-consuming part of most analysis procedures, you should gain knowledge of how to apply these functions on data.

Getting ready

Ensure you have completed the previous recipes by installing R on your operating system.

How to do it...

Perform the following steps to manipulate the data with R.

Subset the data using the bracelet notation:

  1. Load the dataset iris into the R session:
        > data(iris)  
  1. To select values, you may use a bracket notation that designates the indices of the dataset. The first index is for the rows and the second for the columns:
        > iris[1,"Sepal.Length"]
        Output:
    
        [1] 5.1  
  1. You can also select multiple columns using c():
        > Sepal.iris = iris[, c("Sepal.Length", "Sepal.Width")]  
  1. You can then use str() to summarize and display the internal structure of Sepal.iris:
        > str(Sepal.iris)
        Output:
       'data.frame':  150 obs. of  2 variables:
        $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
        $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ..  
  1. To subset data with the rows of given indices, you can specify the indices at the first index with the bracket notation. In this example, we show you how to subset data with the top five records with the Sepal.Length column and the Sepal.Width selected:
        > Five.Sepal.iris = iris[1:5, c("Sepal.Length", "Sepal.Width")]
        > str(Five.Sepal.iris)
        Output:
        'data.frame':   5 obs. of  2 variables:
        $ Sepal.Length: num  5.1 4.9 4.7 4.6 5
        $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 
  1. It is also possible to filter the data with a condition. The following example returns the records of the setosa species with all five variables. The first index specifies the filtering condition on the rows, and the second index specifies the range of column indices to return:
        > setosa.data = iris[iris$Species=="setosa",1:5]
        > str(setosa.data)
        Output:
        'data.frame':   50 obs. of  5 variables:
        $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
        $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
        $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
        $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
        $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
  1. Alternatively, the which function returns the indices of the elements that satisfy a condition. The following example returns the indices of the iris records whose species is setosa:
        > which(iris$Species=="setosa")
        Output:
        [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18
        [19] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
        [37] 37 38 39 40 41 42 43 44 45 46 47 48 49 50
  1. The indices returned by which can then be used as the row index to select the setosa records. The following example returns the setosa records with all five variables:
        > setosa.data = iris[which(iris$Species=="setosa"),1:5]
        > str(setosa.data)
        Output:
        'data.frame':   50 obs. of  5 variables:
         $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
         $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
         $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
         $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
         $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
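
The two bracket-based filters above return the same rows for iris because Species has no missing values. The following is a minimal sketch of where they differ, assuming a copy of the data (the name iris.na is only for illustration) in which one Species value has been set to NA:

```r
# Illustrative copy of iris with one missing Species value.
iris.na <- iris
iris.na$Species[1] <- NA

# A logical index returns an all-NA row wherever the condition is NA.
by.logical <- iris.na[iris.na$Species == "setosa", ]

# which() returns only the positions where the condition is TRUE,
# so the record with the missing Species is dropped.
by.which <- iris.na[which(iris.na$Species == "setosa"), ]

nrow(by.logical)  # 50: 49 setosa rows plus one all-NA row
nrow(by.which)    # 49
```

When a column may contain missing values, wrapping the condition in which is therefore the safer choice.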

Subset data using the subset function:

  1. Besides using the bracket notation, R provides a subset function that enables users to subset the DataFrame by observations with a logical statement.
  2. First, subset the sepal length and sepal width out of the iris data by specifying the columns to keep in the select argument:
        > Sepal.data = subset(iris, select=c("Sepal.Length", "Sepal.Width"))
        > str(Sepal.data)
        Output:
        'data.frame':   150 obs. of  2 variables:
        $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
        $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...

This reveals that Sepal.data contains 150 observations of two variables, Sepal.Length and Sepal.Width.

  1. You can also pass a condition as the second argument (subset) of the subset function to keep only the rows that satisfy it. The following example keeps the setosa records only:
        > setosa.data = subset(iris, Species =="setosa")
        > str(setosa.data)
        Output:
       'data.frame': 50 obs. of  5 variables:
        $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
        $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
        $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
        $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
        $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
  1. Often you will want to combine several conditions while subsetting data. The OR (|) and AND (&) operators serve this purpose. For example, to retrieve the species of the records with Petal.Width >= 0.2 and Petal.Length <= 1.4:
        > example.data = subset(iris, Petal.Length <= 1.4 & Petal.Width >= 0.2, select = Species)
        > str(example.data)
        Output:
        'data.frame':   21 obs. of  1 variable:
        $ Species: Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
  • Merging data: Merging joins two DataFrames into one by a common column or by row names. The following example merges the flower.type DataFrame with the first three rows of iris on their common Species column:
        > flower.type = data.frame(Species = "setosa", Flower = "iris")
        > merge(flower.type, iris[1:3,], by ="Species")
        Output:
        Species Flower Sepal.Length Sepal.Width Petal.Length Petal.Width
      1  setosa   iris          5.1         3.5          1.4         0.2
      2  setosa   iris          4.9         3.0          1.4         0.2
      3  setosa   iris          4.7         3.2          1.3         0.2
  • Ordering data: The order function returns the indices that sort a vector, which can then be used to reorder the rows of a DataFrame by a chosen column. The following example shows the first six records of the iris data ordered by sepal length in descending order:
        > head(iris[order(iris$Sepal.Length, decreasing = TRUE),])
        Output:
          Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
        132          7.9         3.8          6.4         2.0 virginica
        118          7.7         3.8          6.7         2.2 virginica
        119          7.7         2.6          6.9         2.3 virginica
        123          7.7         2.8          6.7         2.0 virginica
        136          7.7         3.0          6.1         2.3 virginica
        106          7.6         3.0          6.6         2.1 virginica
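
As a sketch of two common variations on the merging and ordering examples above: merge drops unmatched rows by default, and order accepts more than one sort key. The lookup table below deliberately covers only two of the three species, and the name sample.rows is hypothetical, chosen only for this illustration:

```r
# Lookup table covering only two of the three species (illustrative).
flower.type <- data.frame(Species = c("setosa", "versicolor"),
                          Flower  = c("iris", "iris"))

# Rows 1, 51 and 101 of iris hold one flower of each species.
sample.rows <- iris[c(1, 51, 101), ]

# Default (inner join): species missing from flower.type are dropped.
inner <- merge(flower.type, sample.rows, by = "Species")

# all.y = TRUE keeps every row of sample.rows; unmatched rows get NA
# in the Flower column.
outer <- merge(flower.type, sample.rows, by = "Species", all.y = TRUE)

# order() with two keys: sort by Species, then by descending
# Sepal.Length within each species.
sorted <- iris[order(iris$Species, -iris$Sepal.Length), ]
head(sorted, 3)
```

The all.x and all arguments of merge work the same way for the first DataFrame and for both DataFrames, respectively.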
    
  

How it works...

Before conducting data analysis, it is important to organize collected data into a structured format. In R, the DataFrame serves this purpose, and it can be subset, merged, and ordered with built-in functions. This recipe first introduces two methods to subset data: one uses the bracket notation, while the other uses the subset function. You can use either method to select columns and to filter rows with given criteria. The recipe then introduces the merge function to join DataFrames. Last, the recipe shows how to use order to sort the data.
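
To illustrate that the two subsetting methods are interchangeable, the following sketch applies the same OR condition with each of them. The thresholds 1.2 and 6.5 are arbitrary values chosen for illustration:

```r
# OR condition (|): flowers with very short or very long petals.
by.subset <- subset(iris, Petal.Length < 1.2 | Petal.Length > 6.5,
                    select = c(Petal.Length, Species))

# The same selection written with the bracket notation.
by.bracket <- iris[iris$Petal.Length < 1.2 | iris$Petal.Length > 6.5,
                   c("Petal.Length", "Species")]

# Compare the two results.
identical(by.subset, by.bracket)
```

Inside subset, column names can be written without the iris$ prefix, which keeps longer conditions readable.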

There's more...

The sub and gsub functions substitute text that matches a regular expression. The sub function replaces only the first match in each string, while gsub replaces all matches:

> sub("e", "q", names(iris))
Output:
[1] "Sqpal.Length" "Sqpal.Width"  "Pqtal.Length" "Pqtal.Width"  "Spqcies"     
> gsub("e", "q", names(iris))
Output:
[1] "Sqpal.Lqngth" "Sqpal.Width"  "Pqtal.Lqngth" "Pqtal.Width"  "Spqciqs"
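
Because the pattern is a regular expression, some characters have a special meaning. The following sketch shows two cases that often come up with column names; the variable names are only for illustration:

```r
# "." is a metacharacter that matches any character, so a literal dot
# must be escaped, or matched with fixed = TRUE.
underscored <- gsub("\\.", "_", names(iris))
underscored.fixed <- gsub(".", "_", names(iris), fixed = TRUE)
underscored

# "^" anchors the match to the start of the string, so only names
# beginning with "Sepal" are changed.
shortened <- sub("^Sepal", "S", names(iris))
shortened
```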

Applying basic statistics

R provides a wide range of statistical functions, allowing users to obtain the summary statistics of data, generate frequency and contingency tables, produce correlations, and conduct statistical inferences. This recipe covers basic statistics that can be applied to a dataset.

Getting ready

Ensure you have completed the previous recipes by installing R on your operating system.

How to do it...

Perform the following steps to apply statistics to a dataset:

  1. Load the iris data into an R session:
        > data(iris)
  1. Observe the format of the data:
        > class(iris)
        [1] "data.frame"  
  1. The iris dataset is a DataFrame containing four numeric attributes: petal length, petal width, sepal width, and sepal length. For numeric values, you can perform descriptive statistics, such as mean, sd, var, min, max, median, range, and quantile. These can be applied to any of the four attributes in the dataset:
        > mean(iris$Sepal.Length)
        Output:
        [1] 5.843333
        > sd(iris$Sepal.Length)
        Output:
        [1] 0.8280661
        > var(iris$Sepal.Length)
        Output:
        [1] 0.6856935
        > min(iris$Sepal.Length)
        Output:
        [1] 4.3
        > max(iris$Sepal.Length)
        Output:
        [1] 7.9
        > median(iris$Sepal.Length)
        Output:
        [1] 5.8
        > range(iris$Sepal.Length)
        Output:
        [1] 4.3 7.9
        > quantile(iris$Sepal.Length)
        Output:
        0%  25%  50%  75% 100% 
        4.3  5.1  5.8  6.4  7.9
  1. The preceding example demonstrates how to apply descriptive statistics to a single variable. To obtain summary statistics for every numeric attribute of the DataFrame, you can use sapply. For example, the following applies mean to the first four attributes of the iris DataFrame, ignoring NA values by setting na.rm to TRUE:
        > sapply(iris[1:4], mean, na.rm=TRUE)
        Output:
        Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
          5.843333     3.057333     3.758000     1.199333 
    
  
  1. As an alternative to using sapply, R offers the summary function, which provides a range of descriptive statistics at once. In the following example, summary reports the minimum, first quartile, median, mean, third quartile, and maximum of every numeric attribute of the iris dataset, and the count of each level of the Species factor:
        > summary(iris)
        Output:
          Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species  
         Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50  
         1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50  
         Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50  
         Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199                  
         3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800                  
         Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500                  
  1. The preceding examples describe each variable on its own. To investigate the relationships between variables, R provides the cor function. The following example generates a 4x4 matrix by computing the correlation of each attribute pair within iris:
        > cor(iris[,1:4])
        Output:
        Sepal.Length Sepal.Width Petal.Length Petal.Width
        Sepal.Length    1.0000000  -0.1175698    0.8717538   0.8179411
        Sepal.Width    -0.1175698   1.0000000   -0.4284401  -0.3661259
        Petal.Length    0.8717538  -0.4284401    1.0000000   0.9628654
        Petal.Width     0.8179411  -0.3661259    0.9628654   1.0000000
  1. R also provides a function to compute the covariance of each attribute pair within the iris dataset:
        > cov(iris[,1:4])
        Output:
        Sepal.Length Sepal.Width Petal.Length Petal.Width
        Sepal.Length    0.6856935  -0.0424340    1.2743154   0.5162707
        Sepal.Width    -0.0424340   0.1899794   -0.3296564  -0.1216394
        Petal.Length    1.2743154  -0.3296564    3.1162779   1.2956094
        Petal.Width     0.5162707  -0.1216394    1.2956094   0.5810063
  1. Statistical tests are performed to assess the significance of results. Here we demonstrate how to use a t-test to determine whether two samples differ. In this example, we perform a t.test on the petal width of irises in the setosa and versicolor species. If we obtain a p-value less than 0.05, we can reject the null hypothesis of equal means and conclude that the mean petal widths of setosa and versicolor differ significantly:
        > t.test(iris$Petal.Width[iris$Species=="setosa"], 
        +        iris$Petal.Width[iris$Species=="versicolor"])
        Output:
        
                Welch Two Sample t-test
        
        data:  iris$Petal.Width[iris$Species == "setosa"] and iris$Petal.Width[iris$Species == "versicolor"]
        t = -34.0803, df = 74.755, p-value < 2.2e-16
        alternative hypothesis: true difference in means is not equal to 0
        95 percent confidence interval:
         -1.143133 -1.016867
        sample estimates:
        mean of x mean of y 
            0.246     1.326 
  1. Alternatively, you can perform a correlation test on the sepal length to the sepal width of an iris, and then retrieve a correlation score between the two variables. The stronger the positive correlation, the closer the value is to 1. The stronger the negative correlation, the closer the value is to -1:
        > cor.test(iris$Sepal.Length, iris$Sepal.Width)
        Output:
        Pearson's product-moment correlation
        data:  iris$Sepal.Length and iris$Sepal.Width
        t = -1.4403, df = 148, p-value = 0.1519
        alternative hypothesis: true correlation is not equal to 0
        95 percent confidence interval:
       -0.27269325  0.04351158
        sample estimates:
            cor 
       -0.1175698   
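
Rather than reading the p-value off the printed output, you can extract it from the object that t.test returns. The following is a minimal sketch; the variable names are illustrative, and the 0.05 threshold is a conventional significance level rather than a fixed rule:

```r
setosa     <- iris$Petal.Width[iris$Species == "setosa"]
versicolor <- iris$Petal.Width[iris$Species == "versicolor"]

# t.test() returns a list of class "htest"; its elements can be
# accessed by name.
result <- t.test(setosa, versicolor)

result$p.value    # p-value of the test
result$conf.int   # 95 percent confidence interval for the difference
result$estimate   # the two sample means

# Compare the p-value with the chosen significance level.
alpha <- 0.05
result$p.value < alpha
```

The object returned by cor.test has the same class, so its p.value and estimate elements can be read in the same way.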

How it works...

R has built-in statistical functions, which enable the user to compute descriptive statistics on a single variable. The recipe first introduces how to apply mean, sd, var, min, max, median, range, and quantile to a single variable. To apply a statistic to all four numeric variables at once, you can use the sapply function. To examine the relationships between multiple variables, you can compute their correlation and covariance. Finally, the recipe shows how to determine whether two given samples differ by performing a statistical test.
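
Two of the points above can be sketched in a few lines: sapply can return several statistics at once, and correlation is covariance rescaled by the two standard deviations. The names numeric.iris, stats.table, and manual.cor are only for illustration:

```r
numeric.iris <- iris[, 1:4]

# When the applied function returns a named vector, sapply() builds a
# matrix: one row per statistic, one column per attribute.
stats.table <- sapply(numeric.iris, function(x) {
  c(mean = mean(x), sd = sd(x), median = median(x))
})
stats.table

# cor(x, y) = cov(x, y) / (sd(x) * sd(y))
manual.cor <- cov(iris$Sepal.Length, iris$Sepal.Width) /
  (sd(iris$Sepal.Length) * sd(iris$Sepal.Width))
all.equal(manual.cor, cor(iris$Sepal.Length, iris$Sepal.Width))
```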

There's more...

If you need to compute summary statistics for data in different groups, you can use the aggregate function, or the melt and cast functions of the reshape package:

  1. Use aggregate to calculate the mean of each iris attribute, grouped by species:
        > aggregate(x=iris[,1:4],by=list(iris$Species),FUN=mean)  
  1. Use melt and cast from the reshape package to calculate the mean of each iris attribute for the setosa and versicolor species, with a grand-total row:
        >  library(reshape)
        >  iris.melt <- melt(iris,id='Species')
        >  cast(Species~variable,data=iris.melt,mean,
             subset=Species %in% c('setosa','versicolor'),
             margins='grand_row')

For information on reshape and aggregate, refer to the help documents by using help(package = "reshape") or ?aggregate.
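
If the reshape package is not installed, grouped means can also be computed with base R alone. The following sketch uses the formula interface of aggregate and the tapply function; the variable names are illustrative:

```r
# Formula interface: "." stands for every column other than Species.
species.means <- aggregate(. ~ Species, data = iris, FUN = mean)
species.means

# tapply() does the same for a single variable and returns a named
# vector with one element per group.
width.means <- tapply(iris$Petal.Width, iris$Species, mean)
width.means
```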

Visualizing data

Visualization is a powerful way to communicate information through graphical means. Visual presentations make data easier to comprehend. This recipe presents some basic functions to plot charts, and demonstrates how visualizations are helpful in data exploration.

Getting ready

Ensure that you have completed the previous recipes by installing R on your operating system.

How to do it...

Perform the following steps to visualize a dataset:

  1. Load the iris data into the R session:
        > data(iris)  
  1. Calculate the frequency of species within the iris using the table command:
        > table.iris = table(iris$Species)
        > table.iris
        Output:
        
            setosa versicolor  virginica 
                50         50         50 
  1. As the frequency in the table shows, each species represents 1/3 of the iris data. We can draw a simple pie chart to represent the distribution of species within the iris:
        > pie(table.iris)
        Output:
The pie chart of species distribution
  1. A histogram groups the values of a variable into bins along the x-axis and plots the number of observations in each bin. The following example produces a histogram of the sepal length:
        > hist(iris$Sepal.Length)  
The histogram of the sepal length
  1. In the histogram, the x-axis presents the sepal length and the y-axis presents the count for each range of sepal lengths. The histogram shows that the sepal lengths in the dataset range from about 4 cm to 8 cm.
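  2. Boxplots, also named box and whisker graphs, allow you to convey a lot of information in one simple plot. In such a graph, the line inside the box represents the median of the sample. The box itself shows the upper and lower quartiles. By default, the whiskers extend to the most extreme data points that lie within 1.5 times the interquartile range of the box, and points beyond them are drawn individually: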
        > boxplot(Petal.Width ~ Species, data = iris)
The boxplot of the petal width
  1. The preceding screenshot clearly shows that the median and upper range of the petal width of setosa are much smaller than those of versicolor and virginica. Therefore, the petal width can be used as a substantial attribute to distinguish iris species.
  2. A scatter plot is used when there are two variables to plot against one another. This example plots the petal length against the petal width and colors each dot according to the species it belongs to:
        > plot(x=iris$Petal.Length, y=iris$Petal.Width, col=iris$Species) 
The scatter plot of the petal length against the petal width
  1. The preceding screenshot is a scatter plot of the petal length against the petal width. As there are four attributes within the iris dataset, it takes six operations to plot all combinations. However, R provides a function named pairs, which can generate each subplot in one figure:
        > pairs(iris[1:4], main = "Edgar Anderson's Iris Data", pch = 21, 
bg = c("red", "green3", "blue")[unclass(iris$Species)])
Pairs scatterplot of iris data
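
The scatter plot in the steps above colors the dots by species but does not say which color is which. A minimal sketch of adding axis labels and a legend follows; the label text and the legend position are choices made for this illustration:

```r
# Species is a factor, so col = iris$Species uses its integer codes
# (1, 2, 3) as indices into the current color palette.
plot(x = iris$Petal.Length, y = iris$Petal.Width, col = iris$Species,
     xlab = "Petal length (cm)", ylab = "Petal width (cm)")

# Use the same codes in the legend so that colors and labels line up.
species.levels <- levels(iris$Species)
legend("topleft", legend = species.levels,
       col = seq_along(species.levels), pch = 1)
```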

How it works...

R provides many built-in plot functions, which enable users to visualize data with different kinds of plots. This recipe demonstrates the use of a pie chart to present the distribution of categories. A pie chart with equal-sized slices shows that each species has the same number of records. A histogram plots the frequency of different sepal lengths. A box plot can convey a great deal of descriptive statistics, and shows that the petal width can be used to distinguish an iris species. Lastly, we introduced scatter plots, which plot two variables against each other. In order to quickly generate scatter plots for all the pairs of attributes in the iris dataset, you can use the pairs command.
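
The numbers behind a histogram or a box plot can be inspected without drawing anything, which helps when checking what a chart shows. The following sketch relies on the plot = FALSE argument of hist and boxplot; the variable names h and b are illustrative:

```r
# plot = FALSE returns the computed values instead of drawing.
h <- hist(iris$Sepal.Length, plot = FALSE)
h$breaks   # bin boundaries
h$counts   # number of observations in each bin

b <- boxplot(Petal.Width ~ Species, data = iris, plot = FALSE)

# Each column of b$stats holds, for one species: lower whisker,
# lower hinge, median, upper hinge, upper whisker.
colnames(b$stats) <- b$names
b$stats
```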

See also

  • ggplot2 is another plotting system for R, based on Leland Wilkinson's grammar of graphics. It allows users to add, remove, or alter components in a plot at a high level of abstraction. However, this level of abstraction makes it slower than lattice graphics. If you are interested in ggplot2, you can refer to this site: http://ggplot2.org/.

Getting a dataset for machine learning

While R ships with built-in datasets, their sample sizes and fields of application are limited. Apart from generating data with a simulation, another approach is to obtain data from external data repositories. A famous data repository is the UCI machine learning repository, which contains both artificial and real datasets. This recipe introduces how to get a sample dataset from the UCI machine learning repository.

Getting ready

Ensure that you have completed the previous recipes by installing R on your operating system.

How to do it...

Perform the following steps to retrieve data for machine learning:

  1. Access the UCI machine learning repository: http://archive.ics.uci.edu/ml/.
  2. Click on view all data sets. Here you will find a list of datasets containing field names, such as Name, Data Types, Default Task, Attribute Types, #Instances, #Attributes, and Year.
  3. Use Ctrl + F to search for Iris.
  4. Click on Iris. This will display the data folder and the dataset description.
  5. Click on Data Folder, which will display a directory containing the iris dataset.
  6. You can then either download iris.data or use the read.csv function to read the dataset:
        > iris.data = read.csv(url("http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"),
        +     header = FALSE,
        +     col.names = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species"))
        > head(iris.data)
        Output:
          Sepal.Length Sepal.Width Petal.Length Petal.Width     Species
        1          5.1         3.5          1.4         0.2 Iris-setosa
        2          4.9         3.0          1.4         0.2 Iris-setosa
        3          4.7         3.2          1.3         0.2 Iris-setosa
        4          4.6         3.1          1.5         0.2 Iris-setosa
        5          5.0         3.6          1.4         0.2 Iris-setosa
        6          5.4         3.9          1.7         0.4 Iris-setosa

How it works...

Before conducting data analysis, it is important to collect your dataset. However, collecting an appropriate dataset for further exploration and analysis is not easy. We can, therefore, use the prepared datasets from the UCI repository as our data source. Here, we first access the UCI dataset repository and then use the iris dataset as an example. We can find the iris dataset by using the browser's find function (Ctrl + F), and then enter the file directory. Last, we can download the dataset and use the R I/O function, read.csv, to load the iris dataset into an R session.
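
The species labels in the UCI file carry an Iris- prefix, unlike R's built-in iris data. The following offline sketch shows one way to read such data and normalize the labels. It uses the first three records from the head(iris.data) output in this recipe as inline text; the names raw.lines and iris.sample are illustrative, and with the real file you would pass its path or URL to read.csv instead of the text argument:

```r
# First three records of iris.data, supplied inline so that the
# example runs without a network connection.
raw.lines <- "5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa"

iris.sample <- read.csv(text = raw.lines, header = FALSE,
                        col.names = c("Sepal.Length", "Sepal.Width",
                                      "Petal.Length", "Petal.Width",
                                      "Species"))

# Strip the "Iris-" prefix so the labels match the built-in iris
# data, then store the column as a factor.
iris.sample$Species <- factor(sub("^Iris-", "", iris.sample$Species))
str(iris.sample)
```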


Key benefits

  • Apply R to simplify predictive modeling with short and simple code
  • Use machine learning to solve problems ranging from small to big data
  • Build a training and testing dataset, applying different classification methods.

Description

Big data has become a popular buzzword across many industries. An increasing number of people have been exposed to the term and are looking at how to leverage big data in their own businesses, to improve sales and profitability. However, collecting, aggregating, and visualizing data is just one part of the equation. Being able to extract useful information from data is another task, and a much more challenging one. Machine Learning with R Cookbook, Second Edition uses a practical approach to teach you how to perform machine learning with R. Each chapter is divided into several simple recipes. Through the step-by-step instructions provided in each recipe, you will be able to construct a predictive model by using a variety of machine learning packages. In this book, you will first learn to set up the R environment and use simple R commands to explore data. The next topic covers how to perform statistical analysis with machine learning analysis and assess created models, covered in detail later on in the book. You'll also learn how to integrate R and Hadoop to create a big data analysis platform. The detailed illustrations provide all the information required to start applying machine learning to individual projects. With Machine Learning with R Cookbook, machine learning has never been easier.

Who is this book for?

This book is for data science professionals, data analysts, or people who have used R for data analysis and machine learning who now wish to become the go-to person for machine learning with R. Those who wish to improve the efficiency of their machine learning models and need to work with different kinds of data set will find this book very insightful.

What you will learn

  • Create and inspect transaction datasets and perform association analysis with the Apriori algorithm
  • Visualize patterns and associations using a range of graphs and find frequent item-sets using the Eclat algorithm
  • Compare differences between each regression method to discover how they solve problems
  • Detect and impute missing values in air quality data
  • Predict possible churn users with the classification approach
  • Plot the autocorrelation function with time series analysis
  • Use the Cox proportional hazards model for survival analysis
  • Implement the clustering method to segment customer data
  • Compress images with the dimension reduction method
  • Incorporate R and Hadoop to solve machine learning problems on big data
Product Details

Publication date : Oct 23, 2017
Length: 572 pages
Edition : 2nd
Language : English
ISBN-13 : 9781787284395
Table of Contents

14 Chapters
Practical Machine Learning with R
Data Exploration with Air Quality Datasets
Analyzing Time Series Data
R and Statistics
Understanding Regression Analysis
Survival Analysis
Classification 1 - Tree, Lazy, and Probabilistic
Classification 2 - Neural Network and SVM
Model Evaluation
Ensemble Learning
Clustering
Association Analysis and Sequence Mining
Dimension Reduction
Big Data Analysis (R and Hadoop)

Customer reviews

Rating distribution
2 out of 5 stars (1 rating)
5 star 0%
4 star 0%
3 star 0%
2 star 100%
1 star 0%
Chris H Mar 23, 2018
2 out of 5 stars
With better editing and some more meat in the discussion of techniques this could be a useful resource. As such it still may be one for some readers, but the errors and poor editing in general make it hard to recommend this over other similar books. For example, the one sample t-test incorrectly states the null hypothesis (as the sample mean being less than the population mean, which is actually the correct alternative hypothesis) and therefore comes to the wrong conclusion. It is one thing to have a bad typo or omission, but to get the main point backwards in a discussion of the statistical test? How this could make it into a 2nd edition may be hard to fathom, but Packt books tend to be plagued by such poor editing. (However, I do recommend R learners look at the Packt titles by Fischetti, Lantz, Cirillo. There are some good ones.)
Amazon Verified review Amazon

FAQs

What is the delivery time and cost of print book? Chevron down icon Chevron up icon

Shipping Details

USA:

'

Economy: Delivery to most addresses in the US within 10-15 business days

Premium: Trackable Delivery to most addresses in the US within 3-8 business days

UK:

Economy: Delivery to most addresses in the U.K. within 7-9 business days.
Shipments are not trackable

Premium: Trackable delivery to most addresses in the U.K. within 3-4 business days!
Add one extra business day for deliveries to Northern Ireland and Scottish Highlands and islands

EU:

Premium: Trackable delivery to most EU destinations within 4-9 business days.

Australia:

Economy: Can deliver to P. O. Boxes and private residences.
Trackable service with delivery to addresses in Australia only.
Delivery time ranges from 7-9 business days for VIC and 8-10 business days for Interstate metro
Delivery time is up to 15 business days for remote areas of WA, NT & QLD.

Premium: Delivery to addresses in Australia only
Trackable delivery to most P. O. Boxes and private residences in Australia within 4-5 days based on the distance to a destination following dispatch.

India:

Premium: Delivery to most Indian addresses within 5-6 business days

Rest of the World:

Premium: Countries in the American continent: Trackable delivery to most countries within 4-7 business days

Asia:

Premium: Delivery to most Asian addresses within 5-9 business days

Disclaimer:
All orders received before 5 PM U.K time would start printing from the next business day. So the estimated delivery times start from the next day as well. Orders received after 5 PM U.K time (in our internal systems) on a business day or anytime on the weekend will begin printing the second to next business day. For example, an order placed at 11 AM today will begin printing tomorrow, whereas an order placed at 9 PM tonight will begin printing the day after tomorrow.


Unfortunately, due to several restrictions, we are unable to ship to the following countries:

  1. Afghanistan
  2. American Samoa
  3. Belarus
  4. Brunei Darussalam
  5. Central African Republic
  6. The Democratic Republic of Congo
  7. Eritrea
  8. Guinea-bissau
  9. Iran
  10. Lebanon
  11. Libiya Arab Jamahriya
  12. Somalia
  13. Sudan
  14. Russian Federation
  15. Syrian Arab Republic
  16. Ukraine
  17. Venezuela
What is custom duty/charge? Chevron down icon Chevron up icon

Customs duty are charges levied on goods when they cross international borders. It is a tax that is imposed on imported goods. These duties are charged by special authorities and bodies created by local governments and are meant to protect local industries, economies, and businesses.

Do I have to pay customs charges for the print book order? Chevron down icon Chevron up icon

The orders shipped to the countries that are listed under EU27 will not bear custom charges. They are paid by Packt as part of the order.

List of EU27 countries: www.gov.uk/eu-eea:

A custom duty or localized taxes may be applicable on the shipment and would be charged by the recipient country outside of the EU27 which should be paid by the customer and these duties are not included in the shipping charges been charged on the order.

How do I know my custom duty charges? Chevron down icon Chevron up icon

The amount of duty payable varies greatly depending on the imported goods, the country of origin and several other factors like the total invoice amount or dimensions like weight, and other such criteria applicable in your country.

For example:

  • If you live in Mexico, and the declared value of your ordered items is over $ 50, for you to receive a package, you will have to pay additional import tax of 19% which will be $ 9.50 to the courier service.
  • Whereas if you live in Turkey, and the declared value of your ordered items is over € 22, for you to receive a package, you will have to pay additional import tax of 18% which will be € 3.96 to the courier service.
How can I cancel my order?

Cancellation Policy for Published Printed Books:

You can cancel any order within 1 hour of placing it. Simply contact customercare@packt.com with your order details or payment transaction ID. If your order has already started the shipment process, we will do our best to stop it. If it is already on its way to you, contact us at customercare@packt.com once you receive it and use the returns and refund process.

Please understand that Packt Publishing cannot provide refunds or cancel any order except in the cases described in our Return Policy (i.e. Packt Publishing agrees to replace your printed book because it arrives damaged or with a material defect). Otherwise, Packt Publishing will not accept returns.

What is your returns and refunds policy?

Return Policy:

We want you to be happy with your purchase from Packtpub.com. We will not hassle you with returning print books to us. If the print book you receive from us is incorrect, damaged, doesn't work, or is unacceptably late, please contact the Customer Relations Team at customercare@packt.com with the order number and issue details, as explained below:

  1. If you ordered an item (eBook, Video, or Print Book) incorrectly or accidentally, please contact the Customer Relations Team at customercare@packt.com within one hour of placing the order and we will replace the item or refund its cost.
  2. If your eBook or Video file is faulty, or a fault occurs while the eBook or Video is being made available to you (i.e. during download), contact the Customer Relations Team at customercare@packt.com within 14 days of purchase and they will be able to resolve the issue for you.
  3. You will have a choice of replacement or refund for the problem items (damaged, defective, or incorrect).
  4. Once the Customer Care Team confirms that you will be refunded, you should receive the refund within 10 to 12 working days.
  5. If you are only requesting a refund of one book from a multiple order, then we will refund you the appropriate single item.
  6. Where the items were shipped under a free shipping offer, there will be no shipping costs to refund.

On the off chance that your printed book arrives damaged or with a material defect, contact our Customer Relations Team at customercare@packt.com within 14 days of receiving the book, with appropriate evidence of the damage, and we will work with you to secure a replacement copy if necessary. Please note that each printed book you order from us is individually made on a print-on-demand basis by Packt's professional book-printing partner.

What tax is charged?

Currently, no tax is charged on the purchase of any print book (subject to change based on applicable laws and regulations). A localized VAT fee is charged only to our European and UK customers on the eBooks, Videos, and subscriptions that they buy. GST is charged to Indian customers on eBook and Video purchases.

What payment methods can I use?

You can pay with the following payment methods:

  1. Visa Debit
  2. Visa Credit
  3. MasterCard
  4. PayPal
What is the delivery time and cost of print books?

Shipping Details

USA:

Economy: Delivery to most addresses in the US within 10-15 business days

Premium: Trackable Delivery to most addresses in the US within 3-8 business days

UK:

Economy: Delivery to most addresses in the U.K. within 7-9 business days.
Shipments are not trackable.

Premium: Trackable delivery to most addresses in the U.K. within 3-4 business days.
Add one extra business day for deliveries to Northern Ireland and the Scottish Highlands and islands.

EU:

Premium: Trackable delivery to most EU destinations within 4-9 business days.

Australia:

Economy: Can deliver to P. O. Boxes and private residences.
Trackable service with delivery to addresses in Australia only.
Delivery time is 7-9 business days for VIC and 8-10 business days for interstate metro areas.
Delivery time is up to 15 business days for remote areas of WA, NT & QLD.

Premium: Delivery to addresses in Australia only
Trackable delivery to most P. O. Boxes and private residences in Australia within 4-5 days based on the distance to a destination following dispatch.

India:

Premium: Delivery to most Indian addresses within 5-6 business days

Rest of the World:

Premium: Countries in the American continent: Trackable delivery to most countries within 4-7 business days

Asia:

Premium: Delivery to most Asian addresses within 5-9 business days

Disclaimer:
All orders received before 5 PM U.K. time will start printing on the next business day, so the estimated delivery times start from the next day as well. Orders received after 5 PM U.K. time (in our internal systems) on a business day, or at any time on the weekend, will begin printing on the second business day after the order. For example, an order placed at 11 AM today will begin printing tomorrow, whereas an order placed at 9 PM tonight will begin printing the day after tomorrow.

