Mastering Machine Learning with R, Second Edition: Advanced prediction, algorithms, and learning methods with R 3.x
Book | Apr 2017 | 420 pages | 2nd Edition

Chapter 1. A Process for Success

"If you don't know where you are going, any road will get you there."                                                                                                              - Robert Carrol

"If you can't describe what you are doing as a process, you don't know what you're doing."                                                                                                              - W. Edwards Deming

At first glance, this chapter may seem to have nothing to do with machine learning, but it has everything to do with it (specifically, its implementation and making change happen). The smartest people, the best software, and the best algorithms do not guarantee success, no matter how well the problem is defined.

In most, if not all, projects, the key to successfully solving problems or improving decision-making is not the algorithm, but the softer, more qualitative skills of communication and influence. The problem many of us have with this is that it is hard to quantify how effective one is with these skills. It is probably safe to say that many of us ended up in this position because of a desire to avoid them. After all, the highly successful TV comedy The Big Bang Theory was built on this premise. Therefore, the goal of this chapter is to set you up for success. The intent is to provide a process, and a flexible process no less, whereby you can become a change agent: a person who can influence and turn their insights into action without positional power. We will focus on the Cross-Industry Standard Process for Data Mining (CRISP-DM). It is probably the most well-known and respected of all processes for analytical projects. Even if you use another industry process or something proprietary, there should still be a few gems in this chapter that you can take away.

I will not hesitate to say that this all is easier said than done; without question, I'm guilty of every sin (both commission and omission) that will be discussed in this chapter. With skill and some luck, you can avoid the many physical and emotional scars I've picked up over the last 12 years.

Finally, we will also have a look at a flow chart (a cheat sheet) that you can use to help you identify what methodologies to apply to the problem at hand.

The process


The CRISP-DM process was designed specifically for data mining. However, it is flexible and thorough enough to be applied to any analytical project, whether it is predictive analytics, data science, or machine learning. Don't be intimidated by the numerous lists of tasks as you can apply your judgment to the process and adapt it for any real-world situation. The following figure provides a visual representation of the process and shows the feedback loops that make it so flexible:

Figure 1: CRISP-DM 1.0, Step-by-step data mining guide

The process has the following six phases:

  • Business understanding
  • Data understanding
  • Data preparation
  • Modeling
  • Evaluation
  • Deployment

For an in-depth review of the entire process with all of its tasks and subtasks, you can examine the paper by SPSS, CRISP-DM 1.0, step-by-step data mining guide, available at https://the-modeling-agency.com/crisp-dm.pdf.

I will discuss each of the steps in the process, covering the important tasks. However, the treatment will not be as detailed as in the guide, but more high-level. We will not skip any of the critical details, but will focus more on the techniques that one can apply to the tasks. Keep in mind that these process steps will be used in later chapters as a framework in the actual application of machine learning methods in general, and the R code in particular.

Business understanding


The importance of this first step in achieving success cannot be overstated. It is the foundational step, and failure or success here will likely determine failure or success for the rest of the project. The purpose of this step is to identify the requirements of the business so that you can translate them into analytical objectives. It has the following four tasks:

  1. Identifying the business objective.
  2. Assessing the situation.
  3. Determining analytical goals.
  4. Producing a project plan.

 

Identifying the business objective

The key to this task is to identify the goals of the organization and frame the problem. An effective question to ask is, "What are we going to do differently?" This may seem like a benign question, but it can really challenge people to work out what they need from an analytical perspective, and it can get to the root of the decision that needs to be made. It can also prevent you from doing a lot of unnecessary work on some kind of "fishing expedition." As such, the key for you is to identify the decision. A working definition of a decision that you can put forward to the team is the irrevocable choice to commit, or not to commit, resources. Additionally, remember that the choice to do nothing different is itself a decision.

This does not mean that a project should not be launched if the choices are not absolutely clear. There will be times when the problem is not, or cannot be, well defined; to paraphrase former Defense Secretary Donald Rumsfeld, there are known-unknowns. Indeed, there will probably be many times when the problem is ill defined and the project's main goal is to further the understanding of the problem and generate hypotheses; again calling on Secretary Rumsfeld, unknown-unknowns, which means that you don't know what you don't know. However, with ill-defined problems, one could go forward with an understanding of what will happen next in terms of resource commitment based on the various outcomes from hypothesis exploration.

Another thing to consider in this task is the management of expectations. There is no such thing as perfect data, no matter what its depth and breadth are. This is not the time to make guarantees but to communicate what is possible, given your expertise.

I recommend a couple of outputs from this task. The first is a mission statement. This is not the touchy-feely mission statement of an organization, but it is your mission statement or, more importantly, the mission statement approved by the project sponsor. I stole this idea from my years of military experience and I could write volumes on why it is effective, but that is for another day. Let's just say that, in the absence of clear direction or guidance, the mission statement, or whatever you want to call it, becomes the unifying statement for all stakeholders and can help prevent scope creep. It consists of the following points:

  • Who: This is yourself or the team or project name; everyone likes a cool project name, for example, Project Viper, Project Fusion, and so on
  • What: This is the task that you will perform, for example, conducting machine learning
  • When: This is the deadline
  • Where: This could be geographical, by function, department, initiative, and so on
  • Why: This is the purpose behind implementing the project, that is, the business goal

The second output is to have as clear a definition of success as possible. Literally ask, "What does success look like?" Help the team/sponsor paint a picture of success that you can understand. Your job then is to translate this into modeling requirements.

Assessing the situation

This task helps you in project planning by gathering information on the resources available, constraints, and assumptions; identifying the risks; and building contingency plans. I would further add that this is also the time to identify the key stakeholders that will be impacted by the decision(s) to be made.

A couple of points here. When examining the resources that are available, do not neglect to scour the records of past and current projects. Odds are that someone in the organization has worked on, or is working on, the same problem, and it may be essential to synchronize your work with theirs. Don't forget to enumerate the risks in terms of time, people, and money. Do everything in your power to create a list of stakeholders, both those who impact your project and those who could be impacted by it. Identify who these people are and how they can influence, or be impacted by, the decision. Once this is done, work with the project sponsor to formulate a communication plan with these stakeholders.

Determining the analytical goals

Here, you are looking to translate the business goal into technical requirements. This includes turning the success criterion from the business-objective task into a criterion for technical success. This might be something such as an RMSE threshold or a required level of predictive accuracy.
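As a concrete illustration of a technical success criterion, RMSE can be computed in a few lines of base R. This is a minimal sketch; the function name and the numbers are made up for illustration, not drawn from any dataset used in this book:

```r
# Root mean squared error between observed and predicted values
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Toy illustration with made-up numbers
actual    <- c(3.1, 4.0, 5.2, 6.8)
predicted <- c(2.9, 4.3, 5.0, 7.1)
rmse(actual, predicted)  # about 0.25
```

A lower RMSE is better, and the agreed threshold should come out of the business-understanding discussion rather than be chosen after the fact.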

Producing a project plan

The task here is to build an effective project plan with all the information gathered up to this point. Regardless of what technique you use, whether it be a Gantt chart or some other graphic, produce it and make it a part of your communication plan. Make this plan widely available to the stakeholders and update it on a regular basis and as circumstances dictate.

 

Data understanding


After enduring the all-important pain of the first step, you can now get busy with the data. The tasks in this process consist of the following:

  1. Collecting the data.
  2. Describing the data.
  3. Exploring the data.
  4. Verifying the data quality.

This step is the classic case of Extract, Transform, Load (ETL). There are some considerations here. You need to make an initial determination that the data available is adequate to meet your analytical needs. As you explore the data, visually and otherwise, determine whether the variables are sparse and identify the extent to which data may be missing. This may drive the learning method that you use and/or determine whether the imputation of the missing data is necessary and feasible.

Verifying the data quality is critical. Take the time to understand who collects the data, how it is collected, and even why it is collected. You may well stumble upon incomplete data collection, cases where unintended IT issues led to errors in the data, or planned changes in the business rules. This is especially critical for time series, where the business rules on how the data is classified often change over time. Finally, it is a good idea to begin documenting your code at this step. As a part of the documentation process, if a data dictionary is not available, save yourself potential heartache and make one.
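A few base R functions cover most of the initial exploration and quality checks described above. This is a minimal sketch on a made-up data frame; the column names and values are hypothetical:

```r
# A made-up data frame with deliberately missing values
df <- data.frame(
  age    = c(23, 41, NA, 35, 29),
  income = c(52000, NA, NA, 61000, 48000)
)

str(df)              # variable types and a preview of the values
summary(df)          # distribution summaries, including NA counts
colMeans(is.na(df))  # proportion of missing values per column
```

Checks like these give an early read on sparsity and missingness, which in turn drives the choice of learning method and whether imputation is necessary and feasible.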

Data preparation


Almost there! This step has the following five tasks:

  1. Selecting the data.
  2. Cleaning the data.
  3. Constructing the data.
  4. Integrating the data.
  5. Formatting the data.

These tasks are relatively self-explanatory. The goal is to get the data ready to input into the algorithms. This includes merging, feature engineering, and transformations. If imputation is needed, then it happens here as well. Additionally, with R, pay attention to how the outcome needs to be labeled. If your outcome/response variable is Yes/No, it may not work in some packages and will need to be transformed into a variable coded 1/0. At this point, you should also split your data into train, test, and validation sets, as applicable. This step can be an unmitigated burden, but most experienced people will tell you that it is where you can separate yourself from your peers. With this, let's move on to the payoff, where you earn your money.
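The two R-specific points above, recoding a Yes/No response as 1/0 and splitting the data, can be sketched in base R as follows. The data frame and the 70/30 split ratio are assumptions for illustration only:

```r
set.seed(123)
df <- data.frame(
  x       = rnorm(100),
  outcome = factor(sample(c("Yes", "No"), 100, replace = TRUE))
)

# Recode the Yes/No factor as a numeric 1/0 variable for packages that need it
df$outcome01 <- ifelse(df$outcome == "Yes", 1, 0)

# Simple 70/30 train/test split on row indices
idx   <- sample(seq_len(nrow(df)), size = 0.7 * nrow(df))
train <- df[idx, ]
test  <- df[-idx, ]
```

Setting the seed makes the split reproducible, which matters when you later need to defend or rerun your results.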

Modeling


This is where all the work that you've done up to this point can lead to fist-pumping exuberance or fist-pounding exasperation. But hey, if it was that easy, everyone would be doing it. The tasks are as follows:

  1. Selecting a modeling technique.
  2. Generating a test design.
  3. Building a model.
  4. Assessing a model.

Oddly, this process step includes the considerations that you have already thought of and prepared for. In the first step, you will need at least some idea about how you will be modeling. Remember that this is a flexible, iterative process and not some strict linear flowchart such as an aircrew checklist.

The cheat sheet included in this chapter should help guide you in the right direction for the modeling techniques. Test design refers to the creation of your test and train datasets and/or the use of cross-validation and this should have been thought of and accounted for in the data preparation.
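If cross-validation is part of your test design, fold assignment can be done in base R without any packages. This is a sketch of creating 5-fold indices for 100 hypothetical rows; the model fitting and scoring inside the loop are left as placeholders:

```r
set.seed(7)
n     <- 100
folds <- sample(rep(1:5, length.out = n))  # each row gets a fold label 1..5

for (k in 1:5) {
  train_idx <- which(folds != k)  # fit the model on these rows
  test_idx  <- which(folds == k)  # evaluate it on these rows
  # model fitting and scoring would go here
}
```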

Model assessment involves comparing the models with the criteria/criterion that you developed in the business understanding, for example, RMSE, Lift, ROC, and so on.
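To make the assessment step concrete, here is a sketch of comparing two classifiers by area under the ROC curve. It assumes the pROC package, which is not otherwise used in this chapter, and the outcomes and scores are simulated purely for illustration:

```r
library(pROC)  # assumed to be installed; not part of this chapter's code

set.seed(1)
y      <- rbinom(200, 1, 0.5)         # simulated 0/1 outcomes
p_good <- plogis(2 * y + rnorm(200))  # scores correlated with the outcome
p_weak <- runif(200)                  # uninformative scores

auc(roc(y, p_good))  # well above 0.5
auc(roc(y, p_weak))  # close to 0.5
```

Whatever metric you use, the point is to score every candidate model against the same criterion agreed in the business understanding phase.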

 

Evaluation


With the evaluation process, the main goal is to confirm that the model selected at this point meets the business objective. Ask yourself and others, "Have we achieved our definition of success?" Let the Netflix prize serve as a cautionary tale here. I'm sure you are aware that Netflix awarded a $1 million prize to the team that could produce the best recommendation algorithm, as defined by the lowest RMSE. However, Netflix did not implement it, because the incremental accuracy gained was not worth the engineering effort! Always apply Occam's razor. At any rate, here are the tasks:

  1. Evaluating the results.
  2. Reviewing the process.
  3. Determining the next steps.

In reviewing the process, it may be necessary, as you no doubt determined earlier in the process, to take the results through governance and communicate with the other stakeholders in order to gain their buy-in. As for the next steps, if you want to be a change agent, make sure that you answer the what, so what, and now what in the stakeholders' minds. If you can tie their now what into the decision that you made earlier, you have earned your money.

Deployment


If everything is done according to the plan up to this point, it might just come down to flipping a switch and your model goes live. Assuming that this is not the case, here are the tasks for this step:

  1. Deploying the plan.
  2. Monitoring and maintaining the plan.
  3. Producing the final report.
  4. Reviewing the project.

Once deployment and monitoring/maintenance are underway, it is crucial for you, and for those who will walk in your steps, to produce a well-written final report. This report should include a white paper and briefing slides. I have to say that I resisted the drive to put my findings in a white paper, as I was an indentured servant to the military's passion for PowerPoint slides. However, slides can and will be used against you, cherry-picked or misrepresented by various parties for their benefit. Trust me, that just doesn't happen with a white paper, as it becomes an extension of your findings and beliefs. Use PowerPoint to brief stakeholders, but use the white paper as the document of record and as a preread, should your organization insist on one. It is my standard procedure to create this white paper in R using knitr and LaTeX.

Now for the all-important process review, you may have your own proprietary way of conducting it; but here is what it should cover, whether you conduct it in a formal or informal way:

  • What was the plan?
  • What actually happened?
  • Why did it happen or not happen?
  • What should be sustained in future projects?
  • What should be improved upon in future projects?
  • Create an action plan to ensure sustainment and improvement happen

That concludes the review of the CRISP-DM process, which provides a comprehensive and flexible framework to help ensure the success of your project and make you an agent of change.

Algorithm flowchart


The purpose of this section is to create a tool that will help you not just select possible modeling techniques but also think more deeply about the problem. The residual benefit is that it may help you frame the problem with the project sponsor/team. The techniques in the flowchart are certainly not comprehensive, but are extensive enough to get you started. The flowchart also includes techniques not discussed in this book.

The following figure starts the flow of selecting the potential modeling techniques. As you answer the question(s), it will take you to one of the four additional charts:

Figure 2

If the data is text or in the time series format, then you will follow the flow in the following figure:

Figure 3

In this branch of the algorithm, you do not have text or time series data. You also do not want to predict a category, so you are looking to make recommendations, understand associations, or predict a quantity:

Figure 4

To get to this section, you will have data that is not text or time series. You want to categorize the data, but it does not have an outcome label, which brings us to clustering methods, as follows:

Figure 5

This brings us to the situation where we want to categorize the data and it is labeled, that is, classification:

Figure 6

Summary


This chapter was about how to set you and your team up for success in any project that you tackle. The CRISP-DM process is put forward as a flexible and comprehensive framework to facilitate the softer skills of communication and influence. Each step of the process, and the tasks within each step, were enumerated. More than that, the commentary provides some techniques and considerations to help with process execution. By taking heed of the process, you can indeed become an agent of positive change for any organization.

The other item put forth in this chapter was an algorithm flowchart; a cheat sheet to help identify some of the proper techniques to apply in order to solve the business problem. With this foundation in place, we can now move on to applying these techniques to real-world problems.


Key benefits

  • Understand and apply machine learning methods using an extensive set of R packages such as XGBOOST
  • Understand the benefits and potential pitfalls of using machine learning methods such as Multi-Class Classification and Unsupervised Learning
  • Implement advanced concepts in machine learning with this example-rich guide

Description

This book will teach you advanced techniques in machine learning with the latest code in R 3.3.2. You will delve into statistical learning theory and supervised learning; design efficient algorithms; learn about creating Recommendation Engines; use multi-class classification and deep learning; and more. You will explore, in depth, topics such as data mining, classification, clustering, regression, predictive modeling, anomaly detection, boosted trees with XGBOOST, and more. More than just knowing the outcome, you’ll understand how these concepts work and what they do. With a slow learning curve on topics such as neural networks, you will explore deep learning, and more. By the end of this book, you will be able to perform machine learning with R in the cloud using AWS in various scenarios with different datasets.

What you will learn

  • Gain deep insights into the application of machine learning tools in the industry
  • Manipulate data in R efficiently to prepare it for analysis
  • Master the skill of recognizing techniques for effective visualization of data
  • Understand why and how to create test and training data sets for analysis
  • Master fundamental learning methods such as linear and logistic regression
  • Comprehend advanced learning methods such as support vector machines
  • Learn how to use R in a cloud service such as Amazon

Product Details

Publication date : Apr 24, 2017
Length : 420 pages
Edition : 2nd Edition
Language : English
ISBN-13 : 9781787287471


Table of Contents

23 Chapters

Title Page
Credits
About the Author
About the Reviewers
Packt Upsell
Customer Feedback
Preface
1. A Process for Success
2. Linear Regression - The Blocking and Tackling of Machine Learning
3. Logistic Regression and Discriminant Analysis
4. Advanced Feature Selection in Linear Models
5. More Classification Techniques - K-Nearest Neighbors and Support Vector Machines
6. Classification and Regression Trees
7. Neural Networks and Deep Learning
8. Cluster Analysis
9. Principal Components Analysis
10. Market Basket Analysis, Recommendation Engines, and Sequential Analysis
11. Creating Ensembles and Multiclass Classification
12. Time Series and Causality
13. Text Mining
14. R on the Cloud
15. R Fundamentals
16. Sources

