Mastering Machine Learning with R: Master machine learning techniques with R to deliver insights for complex projects


Chapter 1. A Process for Success

 

"If you don't know where you are going, any road will get you there."

 
 --Robert Carrol
 

"If you can't describe what you are doing as a process, you don't know what you're doing."

 
 --W. Edwards Deming

At first glance, this chapter may seem to have nothing to do with machine learning, but it has everything to do with it, specifically with implementing it and making change happen. The smartest people, the best software, and the best algorithms do not guarantee success, however success is defined.

In most, if not all, projects, the key to successfully solving problems or improving decision-making is not the algorithm, but the softer, more qualitative skills of communication and influence. The problem many of us have with this is that it is hard to quantify how effective one is at these skills. It is probably safe to say that many of us ended up in analytical roles out of a desire to avoid them. After all, the highly successful TV comedy The Big Bang Theory was built on this premise. This chapter is therefore meant to set you up for success. The intent is to provide a process, and a flexible one at that, through which you can become a change agent: a person who can influence others and turn insights into action without positional power. We will focus on the Cross-Industry Standard Process for Data Mining (CRISP-DM), probably the most well-known and respected process for analytical projects. Even if you use another industry process or something proprietary, there should still be a few gems in this chapter that you can take away.

I will not hesitate to say that all of this is easier said than done, and without question, I'm guilty of every sin, by both commission and omission, that will be discussed in this chapter. With skill and some luck, you can avoid the many physical and emotional scars I've picked up over the last 10 and a half years.

Finally, we will also have a look at a flow chart (a cheat sheet) that you can use to help you identify what methodology to apply to the problem at hand.

The process

The CRISP-DM process was designed specifically for data mining. However, it is flexible and thorough enough to be applied to any analytical project, whether it is predictive analytics, data science, or machine learning. Don't be intimidated by the long list of tasks; you can apply your judgment to the process and adapt it to any real-world situation. The following figure provides a visual representation of the process and shows the feedback loops that make it flexible:

[Figure: the CRISP-DM process and its feedback loops. Source: CRISP-DM 1.0, Step-by-step data mining guide]

The process has the following six phases:

  • Business Understanding
  • Data Understanding
  • Data Preparation
  • Modeling
  • Evaluation
  • Deployment

For an in-depth review of the entire process with all of its tasks and subtasks, you can examine the paper by SPSS, CRISP-DM 1.0, step-by-step data mining guide, available at https://the-modeling-agency.com/crisp-dm.pdf.

I will discuss each of the steps in the process and cover the important tasks, though at a higher level than the guide. We will not skip any critical details, but will focus on the techniques that one can apply to the tasks. Keep in mind that these process steps serve as the framework in later chapters for applying the machine learning methods in general and the R code specifically.

Business understanding

One cannot overestimate how important this first step of the process is in achieving success. It is the foundational step, and failure or success here will likely determine failure or success for the rest of the project. The purpose of this step is to identify the requirements of the business so that you can translate them into analytical objectives. It has the following four tasks:

  1. Identify the business objective
  2. Assess the situation
  3. Determine the analytical goals
  4. Produce a project plan

Identify the business objective

The key to this task is to identify the goals of the organization and frame the problem. An effective question to ask is, what are we going to do differently? This may seem like a benign question, but it can really challenge people to ponder what they need from an analytical perspective, and it can get to the root of the decision that needs to be made. It can also prevent you from doing a lot of unnecessary work on a fishing expedition. As such, the key for you is to identify the decision. A working definition of a decision that you can put to the team is the irrevocable choice to commit or not commit resources. Additionally, remember that the choice to do nothing different is indeed a decision.

This does not mean that a project should not be launched if the choices are not absolutely clear. There will be times when the problem is not or cannot be well-defined; to paraphrase former Defense Secretary Donald Rumsfeld, there are known unknowns. Indeed, there will probably be many times when the problem is ill-defined and the project's main goal is to further the understanding of the problem and generate hypotheses; again calling on Secretary Rumsfeld, there are unknown unknowns, which means that you don't know what you don't know. However, with ill-defined problems, one should go forward with an understanding of what will happen next in terms of resource commitment, based on the various outcomes of hypothesis exploration.

Another thing to consider in this task is managing expectations. There is no such thing as perfect data, no matter what its depth and breadth. This is not the time to make guarantees but to communicate what is possible, given your expertise.

I recommend a couple of outputs from this task. The first is a mission statement. This is not the touchy-feely mission statement of an organization, but it is your mission statement or, more importantly, the mission statement approved by the project sponsor. I stole this idea from my years of military experience and I could write volumes on why it is effective, but that is for another day. Let's just say that in the absence of clear direction or guidance, the mission statement or whatever you want to call it becomes the unifying statement and can help prevent scope creep. It consists of the following points:

  • Who: This is yourself or the team or project name; everyone likes a cool project name, for example, Project Viper, Project Fusion, and so on
  • What: This is the task that you will perform, for example, conduct machine learning
  • When: This is the deadline
  • Where: This could be geographical; by function, department, initiative, and so on
  • Why: This is the purpose of doing the project, that is, the business goal
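The five points above can be captured as a small record so that the statement is written down the same way on every project. The following is only an illustrative sketch, written in Python for brevity; the class name `MissionStatement`, its fields, and the example values are hypothetical and do not come from the CRISP-DM guide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MissionStatement:
    """Hypothetical container for the five mission statement points."""
    who: str    # yourself, the team, or the project name
    what: str   # the task to be performed
    when: str   # the deadline
    where: str  # geography, function, department, or initiative
    why: str    # the business goal

    def render(self) -> str:
        # Join the five points into one unifying sentence.
        return (f"{self.who} will {self.what} by {self.when} "
                f"for {self.where} in order to {self.why}.")


# Illustrative values only; none of them describe a real project.
statement = MissionStatement(
    who="Project Viper",
    what="conduct machine learning on customer churn data",
    when="the end of the third quarter",
    where="the retail banking department",
    why="reduce customer attrition",
)
print(statement.render())
```

Keeping the statement in one place makes it easy to paste into the project plan and to check later whether a new request falls inside or outside the agreed scope.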

The second output is as clear a definition of success as possible. Literally ask, what does success look like? Help the team/sponsor paint a picture of success that you can understand. Your job then is to translate this into modeling requirements.

Assess the situation

This task helps you in project planning by gathering information on the resources available, constraints, and assumptions, identifying the risks, and building contingency plans. I would further add that this is also the time to identify the key stakeholders that will be impacted by the decisions to be made.

A couple of points here. When examining the resources that are available, do not neglect to scour the records of past and current projects. Odds are someone in the organization has worked or is working on the same problem, and it may be essential to synchronize your work with theirs. Don't forget to enumerate the risks in terms of time, people, and money. Do everything in your power to create a list of the stakeholders, both those who can affect your project and those who could be affected by it. Identify who these people are and how they can influence or be impacted by the decision. Once this is done, work with the project sponsor to formulate a communication plan for these stakeholders.

Determine the analytical goals

Here, you are looking to translate the business goal into technical requirements. This includes turning the success criterion defined while identifying the business objective into a technical success criterion, such as a target RMSE or a level of predictive accuracy.
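To make the idea of a technical success criterion concrete, here is a minimal sketch of the two measures mentioned above, written in plain Python. The function names and the helper `meets_goal` are illustrative assumptions, as are the example numbers; the threshold itself should come from the definition of success agreed with the sponsor.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length numeric sequences."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def accuracy(actual, predicted):
    """Fraction of predictions that match the actual labels exactly."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    matches = sum(1 for a, p in zip(actual, predicted) if a == p)
    return matches / len(actual)


def meets_goal(value, threshold, lower_is_better):
    """Hypothetical helper: compare a metric with the agreed threshold."""
    return value <= threshold if lower_is_better else value >= threshold


# Made-up numbers for illustration only.
error = rmse([10.0, 12.0, 14.0], [11.0, 12.0, 13.0])
share_correct = accuracy(["Yes", "No", "Yes", "No"], ["Yes", "No", "No", "No"])
print(meets_goal(error, threshold=1.0, lower_is_better=True))
print(meets_goal(share_correct, threshold=0.8, lower_is_better=False))
```

RMSE is an error, so lower is better, while accuracy is a rate, so higher is better; recording the direction alongside the threshold avoids ambiguity when the goal is reviewed later.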

Produce a project plan

The task here is to build an effective project plan with all the information gathered up to this point. Regardless of what technique you use, whether it be a Gantt chart or some other graphic, produce it and make it a part of your communication plan. Make this plan widely available to the stakeholders and update it on a regular basis and as circumstances dictate.

Data understanding

After enduring the all-important pain of the first step, you can now get your hands on the data. The tasks in this step consist of the following:

  1. Collect the data
  2. Describe the data
  3. Explore the data
  4. Verify the data quality

This step is the classic case of Extract, Transform, Load (ETL). There are some considerations here. You need to make an initial determination that the data available is adequate to meet your analytical needs. As you explore the data, visually and otherwise, determine whether the variables are sparse and identify the extent to which data is missing. This may drive the learning method that you use and/or whether imputation of the missing data is necessary and feasible.
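As a rough illustration of that first look at missingness and sparsity, the sketch below summarizes each variable in a small table held as a list of dictionaries. It is written in plain Python under stated assumptions: missing values are represented as `None`, sparsity is approximated by the share of zeros, and the function name and sample rows are made up for the example.

```python
def column_summary(rows, columns):
    """Return the share of missing (None) and zero values for each column."""
    if not rows:
        raise ValueError("rows must not be empty")
    n = len(rows)
    summary = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        missing = sum(1 for v in values if v is None)
        zeros = sum(1 for v in values if v is not None and v == 0)
        summary[col] = {"missing": missing / n, "zero": zeros / n}
    return summary


# Made-up rows for illustration only.
rows = [
    {"age": 34, "income": 52000, "purchases": 0},
    {"age": None, "income": 48000, "purchases": 0},
    {"age": 51, "income": None, "purchases": 3},
    {"age": 29, "income": None, "purchases": 0},
]
for name, stats in column_summary(rows, ["age", "income", "purchases"]).items():
    print(name, stats)
```

A variable that is half missing, such as `income` here, is a candidate for either imputation or exclusion, and a mostly-zero variable such as `purchases` may call for a method that tolerates sparse inputs.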

Verifying the data quality is critical. Take the time to understand who collects the data, how it is collected, and even why it is collected. You are likely to stumble upon incomplete data collection, cases where unintended IT issues led to errors in the data, or planned changes in the business rules. This is especially critical with time series, where the business rules for classifying the data often change over time. Finally, it is a good idea to begin documenting any code at this step. As a part of the documentation process, if a data dictionary is not available, save yourself the heartache later on and make one.
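If no data dictionary exists, a first draft can be generated from the data itself and then completed by hand. The following is a minimal sketch in plain Python, assuming the data is a list of dictionaries; the function name, the output fields, and the placeholder text are hypothetical choices for illustration.

```python
def build_data_dictionary(rows, descriptions):
    """Draft one dictionary entry per variable found in the rows."""
    # Collect variable names in the order they first appear.
    columns = []
    for row in rows:
        for name in row:
            if name not in columns:
                columns.append(name)

    dictionary = []
    for name in columns:
        values = [row[name] for row in rows if row.get(name) is not None]
        dictionary.append({
            "variable": name,
            "types": sorted({type(v).__name__ for v in values}),
            "example": values[0] if values else None,
            # Flag anything that still needs an answer from the data owner.
            "description": descriptions.get(name, "TODO: ask the data owner"),
        })
    return dictionary


# Made-up rows and descriptions for illustration only.
sample = [
    {"id": 1, "region": "West", "sales": 10.5},
    {"id": 2, "region": None, "sales": 7.0},
]
known = {"id": "Customer identifier", "sales": "Monthly sales in USD"}
for entry in build_data_dictionary(sample, known):
    print(entry)
```

The generated draft only records what the data looks like; who collects each variable, how, and why still has to be filled in by talking to the people who own it.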

Data preparation

Almost there! This step has the following five tasks:

  1. Select the data
  2. Clean the data
  3. Construct the data
  4. Integrate the data
  5. Format the data

These tasks are relatively self-explanatory. The goal is to get the data ready to feed into the algorithms. This includes merging, feature engineering, and transformations. If imputation is needed, then it happens here as well. Additionally, with R, pay attention to how the outcome needs to be labeled. If your outcome/response variable is Yes/No, it may not work in some packages and will require a transformed or new variable coded as 1/0. At this point, you should also split your data into the various sets, if applicable: train, test, and validation. This step can feel like an unforgiving burden, but most experienced people will tell you that it is where you can separate yourself from your peers. With this, let's move on to the money step.
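The two preparation points above, recoding a Yes/No outcome as 1/0 and splitting the data, can be sketched as follows. This is an illustration in plain Python rather than R; the function names, the 30 percent test share, and the seed are assumptions chosen for the example.

```python
import random


def encode_yes_no(values):
    """Recode "Yes"/"No" labels as 1/0; any other label raises KeyError."""
    mapping = {"Yes": 1, "No": 0}
    return [mapping[v] for v in values]


def train_test_split(rows, test_fraction=0.3, seed=123):
    """Randomly assign rows to train and test sets, reproducibly."""
    rng = random.Random(seed)          # fixed seed so the split can be repeated
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_test = int(round(len(rows) * test_fraction))
    test_idx = set(indices[:n_test])
    train = [r for i, r in enumerate(rows) if i not in test_idx]
    test = [r for i, r in enumerate(rows) if i in test_idx]
    return train, test


outcome = encode_yes_no(["Yes", "No", "No", "Yes"])
train, test = train_test_split(list(range(10)))
print(outcome, len(train), len(test))
```

Failing loudly on an unexpected label is deliberate: a silent default would hide exactly the kind of data quality problem that this step is meant to surface.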

Modeling

This is where all the work that you've done up to this point can lead to fist-pumping exuberance or fist-pounding exasperation. But hey, if it was that easy, everyone would be doing it. The tasks are as follows:

  1. Select a modeling technique
  2. Generate a test design
  3. Build a model
  4. Assess a model

Oddly, this process step includes considerations that you have already thought of and prepared for. Even in the first step, one needs at least a modicum of an idea about how the modeling will be done. Remember that this is a flexible, iterative process and not some strict linear flowchart such as an aircrew checklist.

The cheat sheet included in this chapter should help guide you in the right direction for the modeling techniques. A test design refers to the creation of your test and train datasets and/or the use of cross-validation and this should have been thought of and accounted for in the data preparation.
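As an illustration of a test design based on cross-validation, the sketch below generates the row indices for k folds in plain Python. The function name is hypothetical, and the interleaved assignment assumes the rows are in no meaningful order; for ordered data such as time series, a different design would be needed.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    if not 2 <= k <= n_rows:
        raise ValueError("k must be between 2 and the number of rows")
    # Deal the row indices into k folds, like dealing cards.
    folds = [list(range(start, n_rows, k)) for start in range(k)]
    for held_out, test_idx in enumerate(folds):
        train_idx = sorted(
            i
            for fold_no, fold in enumerate(folds)
            if fold_no != held_out
            for i in fold
        )
        yield train_idx, test_idx


for train_idx, test_idx in k_fold_indices(10, 5):
    print("train:", train_idx, "test:", test_idx)
```

Each row is held out exactly once, so every observation contributes to both fitting and assessment across the k rounds.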

Model assessment involves comparing the models with the criteria/criterion that you developed in the business understanding, for example, RMSE, Lift, ROC, and so on.
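For classification problems, two of the criteria named above can be computed directly from the labels and the model's scores. The following is a minimal sketch in plain Python, assuming labels coded 1/0: the area under the ROC curve is calculated as the share of positive-negative pairs that the scores rank correctly, and lift as the response rate in the top-scored fraction divided by the overall rate. The function names and the example numbers are illustrative.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(positives) * len(negatives))


def lift(labels, scores, fraction):
    """Response rate among the top-scored fraction relative to the overall rate."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    ranked = [y for _, y in sorted(zip(scores, labels),
                                   key=lambda pair: pair[0], reverse=True)]
    n_top = max(1, int(len(ranked) * fraction))
    overall_rate = sum(ranked) / len(ranked)
    if overall_rate == 0:
        raise ValueError("lift is undefined when there are no positives")
    return (sum(ranked[:n_top]) / n_top) / overall_rate


# Made-up labels and scores for illustration only.
y = [1, 1, 0, 0]
s = [0.9, 0.4, 0.5, 0.1]
print(auc(y, s), lift(y, s, 0.25))
```

The pairwise calculation is quadratic in the number of observations, which is fine for a sketch but too slow for large data sets, where a rank-based calculation is the usual choice.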

Evaluation

With the evaluation step, the main goal is to confirm that the work done and the model selected so far meet the business objective. Ask yourself and others: have we achieved our definition of success? Let the Netflix prize serve as a cautionary tale here. I'm sure you are aware that Netflix awarded a $1 million prize to the team that could produce the best recommendation algorithm, as defined by the lowest RMSE. However, Netflix did not implement it because the incremental accuracy gained was not worth the engineering effort! Always apply Occam's razor. At any rate, here are the tasks:

  1. Evaluate the results
  2. Review the process
  3. Determine the next steps

In reviewing the process, it may be necessary, as you no doubt determined earlier in the process, to take the results through governance and communicate with the other stakeholders in order to gain their buy-in. As for the next steps, if you want to be a change agent, make sure that you answer the what, so what, and now what in the stakeholders' minds. If you can tie their now what into the decision that you identified earlier, you are money.

Deployment

If everything is done according to the plan up to this point, it might just come down to flipping a switch and your model goes live. Assuming that this is not the case, here are the tasks of this step:

  1. Plan the deployment
  2. Plan the monitoring and maintenance
  3. Produce the final report
  4. Review the project

After the deployment and monitoring/maintenance are underway, it is crucial, for you and for those who will follow in your footsteps, to produce a well-written final report. This report should include a white paper and briefing slides. I have to say that I resisted the drive to put my findings in a white paper, as I was an indentured servant to the military's passion for PowerPoint slides. However, slides can and will be used against you, cherry-picked or misrepresented by various parties for their benefit. Trust me, that just doesn't happen with a white paper, as it becomes an extension of your findings and beliefs.

Now for the all-important process review. You may have your own proprietary way of conducting it, but here is what it should cover, whether you conduct it in a formal or informal way:

  • What was the plan?
  • What actually happened?
  • Why did it happen, or why did it not happen?
  • What should be sustained in future projects?
  • What should be improved upon in future projects?
  • Create an action plan to ensure sustainment and improvement happens

That concludes the review of the CRISP-DM process, which provides a comprehensive and flexible framework to improve your project's chances of success and make you an agent of change.

Algorithm flowchart

The purpose of this section is to create a tool that will help you not just select possible modeling techniques but also think more deeply about the problem. A side benefit is that it may help you frame the problem with the project sponsor/team. The techniques in the flowchart are certainly not comprehensive but are extensive enough to get you started. The flowchart also includes techniques not discussed in this book.

The following figure starts the flow of selecting the potential modeling techniques. As you answer the question(s), it will take you to one of the four additional charts:

[Figure 1: Algorithm flowchart]

If the data is text or a time series, then you will follow the flow in the following figure:

[Figure 2: Algorithm flowchart]

In this branch of the flowchart, you do not have text or time series data. Additionally, you are not trying to predict what category the observations belong to.

[Figure 3: Algorithm flowchart]

To get to this section, you would have data that is not text or time series. You want to categorize the data, but it does not have an outcome label, which brings us to clustering methods, as follows:

[Figure 4: Algorithm flowchart]

This brings us to a situation where we want to categorize the data and it is labeled, that is, classification:

[Figure 5: Algorithm flowchart]
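The top-level questions described in this section can also be written as a short routing function. The sketch below, in plain Python, encodes only what the text states about which figure each answer leads to; the function and parameter names are hypothetical, and the individual techniques within each figure are not reproduced here.

```python
def choose_flowchart(is_text_or_time_series, wants_categories, has_outcome_label):
    """Return the figure to consult, based on the three top-level questions."""
    if is_text_or_time_series:
        return "Figure 2: text or time series data"
    if not wants_categories:
        return "Figure 3: not predicting a category"
    if not has_outcome_label:
        return "Figure 4: clustering (categorize without an outcome label)"
    return "Figure 5: classification (categorize with an outcome label)"


print(choose_flowchart(False, True, False))
```

Writing the questions down in this order also gives you a script for the first conversation with the project sponsor: what kind of data is it, do we want categories, and do we already know the outcome for past cases?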

Summary

This chapter was about how to set yourself and your team up for success in any project that you tackle. The CRISP-DM process was put forward as a flexible and comprehensive framework to facilitate the softer skills of communication and influence. Each process step and the tasks in each step were enumerated. More than that, the commentary provided some techniques and considerations to help in executing the process. By taking heed of the process, you can indeed become an agent of positive change in any organization.

The other item put forth in this chapter was an algorithm flowchart, a cheat sheet to help identify the proper techniques to apply in order to solve the business problem. With this foundation in place, we can now move on to applying these techniques to real-world problems.


Description

Machine learning is a field of Artificial Intelligence concerned with building systems that learn from data. Given the growing prominence of R, a cross-platform, zero-cost statistical programming environment, there has never been a better time to start applying machine learning to your data. The book starts with an introduction to the Cross-Industry Standard Process for Data Mining. It takes you through multivariate regression in detail. Moving on, you will also address classification and regression trees. You will learn a couple of unsupervised techniques. Finally, the book will walk you through text analysis and time series. The book delivers practical, real-world solutions to a variety of problems and tasks, such as complex recommendation systems. By the end of this book, you will have gained expertise in machine learning with R and will be able to build complex ML projects using R and its packages.

What you will learn

  • Gain deep insights to learn the applications of machine learning tools to the industry
  • Manipulate data in R efficiently to prepare it for analysis
  • Master the skill of recognizing techniques for effective visualization of data
  • Understand why and how to create test and training data sets for analysis
  • Familiarize yourself with fundamental learning methods such as linear and logistic regression
  • Comprehend advanced learning methods such as support vector machines
  • Realize why and how to apply unsupervised learning methods

Product Details

Publication date: Oct 28, 2015
Length: 400 pages
Edition: 1st
Language: English
ISBN-13: 9781783984534


Table of Contents

14 Chapters
1. A Process for Success
2. Linear Regression – The Blocking and Tackling of Machine Learning
3. Logistic Regression and Discriminant Analysis
4. Advanced Feature Selection in Linear Models
5. More Classification Techniques – K-Nearest Neighbors and Support Vector Machines
6. Classification and Regression Trees
7. Neural Networks
8. Cluster Analysis
9. Principal Components Analysis
10. Market Basket Analysis and Recommendation Engines
11. Time Series and Causality
12. Text Mining
A. R Fundamentals
Index

Customer reviews

Rating: 4.3 out of 5 (6 Ratings)
5 star 66.7%
4 star 16.7%
3 star 0%
2 star 16.7%
1 star 0%

Top Reviews

Fabien Deneuville Aug 04, 2016
Rating: 5
I enjoyed this book. Overall it is very well done and gives many examples. It is aimed at readers who already know R well and have a solid grounding in analytics. It will help them go further with machine learning, the different types of algorithms, the existing techniques... I learned things from this book.
Amazon verified review
HDFS_Python Jun 11, 2016
Rating: 5
Overall, I think the book was good and I enjoyed reading it; for a statistics book, this is praise. The pros below may seem thin next to the cons, but believe me, that is because the book was good overall and any compliment applies to nearly all of its chapters. When I did see a con, I expanded on it to give full insight into the issue. As in any endeavor of this sort, it is always a challenge to find the right balance between theory and application.

Pros:

The book contains companion code. This means a student can save the code for the future, load it when necessary, and alter it to learn from it. In my honest opinion, this is the best way for me to study and learn a topic.

Each chapter covers a different overarching problem, which is gradually solved as new techniques and strategies are introduced and then implemented to solidify the knowledge through use. This allows the reader to see which scenarios a technique suits and how it is run.

The book covers a wide variety of topics, allowing a student to become a jack-of-all-trades in the use of machine learning and advanced statistical techniques in R.

Cons:

I believe this book is well suited to someone with a mathematical and programming background. Without either, the book would seem challenging and daunting in some areas (e.g., Neural Networks). It would not be impossible for someone without knowledge of R to read it, but it would be advisable to have a passing knowledge of the software before beginning this book.

Lack of mathematical theory. In a few areas, the book shows how to use a technique to reach the end result but does not include the deep mathematical background of how the calculations are run. This risks creating a black-box scenario where someone knows how it works on the outside without a clue as to how it runs on the inside. In my opinion, this isn't always necessary: knowing how to calculate acf, pacf, and eacf by hand is nice but doesn't help when running acf(model). Side note: no reasonable person would calculate acf past five lags, or pacf, by hand.

Overview for all subjects. The way the book was made for ease of learning brings up a small problem. Some challenging data sets may exceed the scope of the book's training material and could leave the reader ill prepared. An example of this problem would be a time series that contains innovative or additive outliers. The student may then arrive at the model with the lowest AIC value, but the formula may not be in the most optimized format. In such a case, a student should recognize when a problem is showing intriguing characteristics and should begin a research process into how to confront these problems, through other reading material, the internet, or a professional network.
Amazon verified review
meitzmann Oct 20, 2016
Rating: 5
Mastering Machine Learning in R provides a great introduction to machine learning and data analysis techniques. It is refreshing to read a statistics/data focused book that is written in an accessible manner by someone with good communication skills. The concepts are laid out in a logical format that includes Data Preparation and Business Cases, two things that are often left out of many similar texts. The author goes into detail on concepts that need it and avoids it on concepts that don't while still providing enough resources for the reader. The code that comes with the book makes it a great resource for students or someone who is looking to teach themselves. Overall I highly recommend this text.
Amazon verified review
Mugdha Hota Oct 18, 2018
Rating: 5
Ok
Amazon verified review
Nick P Jan 31, 2018
Rating: 4
Great introductory book on the subject and no need for me to fumble through other books.
Amazon verified review