Preface
Note
About
This section briefly introduces the authors, the coverage of this book, the technical skills you'll need to get started, and the hardware and software required to complete all of the included activities and exercises.
About the Book
R was one of the first programming languages developed for statistical computing and data analysis, with excellent support for visualization. With the rise of data science, R became a popular choice of programming language among data science practitioners. Because R is open source and well suited to building sophisticated statistical models, it quickly found adoption in both industry and academia.
Applied Supervised Learning with R covers the complete process of using R to develop applications using supervised machine learning algorithms that cater to your business needs. You will start by developing your analytical thinking to create a problem statement using business inputs or domain research. You will learn about many evaluation metrics for comparing algorithms, and you will then use these metrics to select the best algorithm for your problem. After finalizing the algorithm you want to use, you will study hyperparameter optimization techniques to find an optimal set of parameters. To avoid overfitting your model, you will also be shown how to add various regularization terms. Finally, you will learn about deploying your model into a production environment.
When you have completed the book, you will be an expert at modeling supervised machine learning algorithms to precisely fulfill your business needs.
About the Authors
Karthik Ramasubramanian completed his M.Sc. in Theoretical Computer Science at PSG College of Technology, India, where he pioneered the application of machine learning, data mining, and fuzzy logic in his research work on computer and network security. He has over seven years of experience leading data science and business analytics in retail, fast-moving consumer goods, e-commerce, information technology, and the hospitality industry for multinational companies and unicorn start-ups.
He is a researcher and a problem solver with diverse experience of the data science life cycle, starting from data problem discovery to creating data science proof of concepts and products for various industry use cases. In his leadership roles, Karthik has been instrumental in solving many ROI-driven business problems via data science solutions. He has mentored and trained hundreds of professionals and students globally in data science through various online platforms and university engagement programs. He has also developed intelligent chatbots based on deep learning models that understand human-like interactions, customer segmentation models, recommendation systems, and many natural language processing models.
He is an author of the book Machine Learning Using R, published by Apress, a publishing house of Springer Science+Business Media. The book was a big success with more than 50,000 online downloads and hardcover sales. The book was subsequently published as a second edition with extended chapters on Deep Learning and Time Series Modeling.
Jojo Moolayil is an artificial intelligence, deep learning, machine learning, and decision science professional with over six years of industrial experience. He is the author of Learn Keras for Deep Neural Networks, published by Apress, and Smarter Decisions – The Intersection of IoT and Decision Science, published by Packt Publishing. He has worked with several industry leaders on high-impact, critical data science and machine learning projects across multiple verticals. He is currently associated with Amazon Web Services as a research scientist in Canada.
Apart from writing books on AI, decision science, and the internet of things, Jojo has been a technical reviewer for various books in the same fields published by Apress and Packt Publishing. He is an active data science tutor and maintains a blog at http://blog.jojomoolayil.com.
Learning Objectives
Develop analytical thinking to precisely identify a business problem
Wrangle data with dplyr, tidyr, and reshape2
Visualize data with ggplot2
Validate your supervised machine learning model using k-fold cross-validation
Optimize hyperparameters with grid search, random search, and Bayesian optimization
Deploy your model on AWS Lambda with Plumber
Improve a model's performance with feature selection and dimensionality reduction
Audience
This book is specially designed for novice and intermediate data analysts, data scientists, and data engineers who want to explore various methods of supervised machine learning and its various use cases. Some background in statistics, probability, calculus, linear algebra, and programming will help you thoroughly understand and follow the content of this book.
Approach
Applied Supervised Learning with R perfectly balances theory and exercises. Each module is designed to build on the learning of the previous module. The book contains multiple activities that use real-life business scenarios for you to practice and apply your new skills in a highly relevant context.
Minimum Hardware Requirements
For the optimal student experience, we recommend the following hardware configuration:
Processor: Intel or AMD 4-core or better
Memory: 8 GB RAM
Storage: 20 GB available space
Software Requirements
You'll need the following software installed in advance:
Operating systems: Windows 7, 8.1, or 10, Ubuntu 14.04 or later, or macOS Sierra or later
Browser: Google Chrome or Mozilla Firefox
RStudio
RStudio Cloud
You'll also need the following software, packages, and libraries installed in advance:
dplyr
tidyr
reshape2
lubridate
ggplot2
caret
mlr
OpenML
Conventions
Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "The Location, WindDir, and RainToday variables and many more are categorical, and the remainder are continuous."
A block of code is set as follows:
temp_df <- as.data.frame(
  sort(
    round(
      sapply(df, function(y) sum(length(which(is.na(y))))) / dim(df)[1], 2)
  )
)
colnames(temp_df) <- "NullPerc"
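The sample block above computes, for each column of a data frame, the fraction of values that are missing. As a minimal sketch of what it produces, the following runs an equivalent calculation on a toy data frame. The values are made up for illustration, and the `colMeans(is.na(df))` formulation is a swapped-in shorthand, not the book's own code.

```r
# Toy data frame with hypothetical values; only the column names
# (Location, WindDir, RainToday) come from the text above.
df <- data.frame(
  Location  = c("A", NA, "B", "C"),
  WindDir   = c("N", "S", NA, NA),
  RainToday = c("No", "Yes", "No", "No"),
  stringsAsFactors = FALSE
)

# is.na(df) is a logical matrix; its column means are the share of
# missing values per column. Round to 2 decimals and sort ascending.
null_perc <- sort(round(colMeans(is.na(df)), 2))

# Row names are taken from the names of the vector.
temp_df <- data.frame(NullPerc = null_perc)
print(temp_df)
```

Columns with the fewest missing values appear first, which makes it easy to see which variables need imputation or removal.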
New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text such as this: "Click on the Next button and navigate to the Details page."
Installation and Setup
To install a package on RStudio Cloud, you can use the following syntax:
install.packages("Package_Name")
For example:
install.packages("ggplot2")
To verify the installation, run the following command:
library(Package_Name)
For example:
library(ggplot2)
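If you prefer to check all of the packages listed under Software Requirements at once, something along the following lines may help. This is only a sketch: `missing_packages` is a hypothetical helper name introduced here, and the installation step is deliberately limited to interactive sessions so that a script does not trigger downloads unexpectedly.

```r
# Packages listed in the Software Requirements section.
required_pkgs <- c("dplyr", "tidyr", "reshape2", "lubridate",
                   "ggplot2", "caret", "mlr", "OpenML")

# Hypothetical helper: returns the names in `pkgs` whose namespace
# cannot be loaded, that is, packages that are not installed.
missing_packages <- function(pkgs) {
  is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!is_installed]
}

to_install <- missing_packages(required_pkgs)

if (length(to_install) == 0) {
  message("All required packages are installed.")
} else {
  message("Missing packages: ", paste(to_install, collapse = ", "))
  # Assumption: only install when a user is present at the console.
  if (interactive()) {
    install.packages(to_install)
  }
}
```

`requireNamespace` checks that a package can be loaded without attaching it, so running this check leaves your search path unchanged.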
Installing the Code Bundle
Copy the code bundle for the class to the C:/Code folder.
Additional Resources
The code bundle for this book is also hosted on GitHub at: https://github.com/TrainingByPackt/Applied-Supervised-Learning-with-R.
We also have other code bundles from our rich catalog of books, videos, and E-learning products available at https://github.com/PacktPublishing/ and https://github.com/TrainingByPackt. Check them out!