Machine Learning with R

Chapter 1. Introducing Machine Learning

If science fiction stories are to be believed, the invention of artificial intelligence inevitably leads to apocalyptic wars between machines and their makers. The stories begin with today's reality: computers being taught to play simple games like tic-tac-toe and to automate routine tasks. As the stories go, machines are later given control of traffic lights and communications, followed by military drones and missiles. The machines' evolution takes an ominous turn once the computers become sentient and learn how to teach themselves. Having no more need for human programmers, humankind is then "deleted."

Thankfully, at the time of this writing, machines still require user input.

Though your impressions of machine learning may be colored by these mass-media depictions, today's algorithms are too application-specific to pose any danger of becoming self-aware. The goal of today's machine learning is not to create an artificial brain, but rather to assist us with making sense of the world's massive data stores.

Putting popular misconceptions aside, by the end of this chapter, you will gain a more nuanced understanding of machine learning. You will also be introduced to the fundamental concepts that define and differentiate the most commonly used machine learning approaches. You will learn:

The origins, applications, and pitfalls of machine learning
How computers transform data into knowledge and action
Steps to match a machine learning algorithm to your data

The field of machine learning provides a set of algorithms that transform data into actionable knowledge. Keep reading to see how easy it is to use R to start applying machine learning to real-world problems.

The origins of machine learning

Beginning at birth, we are inundated with data. Our body's sensors—the eyes, ears, nose, tongue, and nerves—are continually assailed with raw data that our brain translates into sights, sounds, smells, tastes, and textures. Using language, we are able to share these experiences with others.

From the advent of written language, human observations have been recorded. Hunters monitored the movement of animal herds; early astronomers recorded the alignment of planets and stars; and cities recorded tax payments, births, and deaths. Today, such observations, and many more, are increasingly automated and recorded systematically in ever-growing computerized databases.

The invention of electronic sensors has additionally contributed to an explosion in the volume and richness of recorded data. Specialized sensors, such as cameras, microphones, chemical noses, electronic tongues, and pressure sensors mimic the human ability to see, hear, smell, taste, and feel. These sensors process the data far differently than a human being would. Unlike a human's limited and subjective attention, an electronic sensor never takes a break and has no emotions to skew its perception.

Tip

Although sensors are not clouded by subjectivity, they do not necessarily report a single, definitive depiction of reality. Some have an inherent measurement error due to hardware limitations. Others are limited by their scope. A black-and-white photograph provides a different depiction of its subject than one shot in color. Similarly, a microscope provides a far different depiction of reality than a telescope.

Between databases and sensors, many aspects of our lives are recorded. Governments, businesses, and individuals are recording and reporting information, from the monumental to the mundane. Weather sensors record temperature and pressure data; surveillance cameras watch sidewalks and subway tunnels; and all manner of electronic behaviors are monitored: transactions, communications, social media relationships, and many others.

This deluge of data has led some to state that we have entered an era of big data, but this may be a bit of a misnomer. Human beings have always been surrounded by large amounts of data. What makes the current era unique is that we have vast amounts of recorded data, much of which can be directly accessed by computers. Larger and more interesting datasets are increasingly accessible at the tips of our fingers, only a web search away. This wealth of information has the potential to inform action, given a systematic way of making sense of it all.

The field of study interested in the development of computer algorithms for transforming data into intelligent action is known as machine learning. This field originated in an environment where the available data, statistical methods, and computing power rapidly and simultaneously evolved. Growth in the volume of data necessitated additional computing power, which in turn spurred the development of statistical methods for analyzing large datasets. This created a cycle of advancement allowing even larger and more interesting data to be collected, and enabling today's environment in which endless streams of data are available on virtually any topic.

Figure 1.1: The cycle of advancement that enabled machine learning

A closely related sibling of machine learning, data mining, is concerned with the generation of novel insight from large databases. As the term implies, data mining involves a systematic hunt for nuggets of actionable intelligence. Although there is some disagreement over how widely machine learning and data mining overlap, a potential point of distinction is that machine learning focuses on teaching computers how to use data to solve a problem, while data mining focuses on teaching computers to identify patterns that humans then use to solve a problem.

Virtually all data mining involves the use of machine learning, but not all machine learning requires data mining. For example, you might apply machine learning to data mine automobile traffic data for patterns related to accident rates. On the other hand, if the computer is learning how to drive the car itself, this is purely machine learning without data mining.

Tip

The phrase "data mining" is also sometimes used as a pejorative to describe the deceptive practice of cherry-picking data to support a theory.

Uses and abuses of machine learning

Most people have heard of Deep Blue, the chess-playing computer that in 1997 was the first to win a game against a world champion. Another famous computer, Watson, defeated two human opponents on the television trivia game show Jeopardy in 2011. Based on these stunning accomplishments, some have speculated that computer intelligence will replace workers in information technology occupations, just as machines replaced workers in fields and assembly lines.

The truth is that even as machines reach such impressive milestones, they are still relatively limited in their ability to thoroughly understand a problem. They are pure intellectual horsepower without direction. A computer may be more capable than a human of finding subtle patterns in large databases, but it still needs a human to motivate the analysis and turn the result into meaningful action.

Tip

Without completely discounting the achievements of Deep Blue and Watson, it is important to note that neither is even as intelligent as a typical five-year-old. For more on why "comparing smarts is a slippery business," see the Popular Science article FYI: Which Computer Is Smarter, Watson Or Deep Blue?, by Will Grunewald, 2012: https://www.popsci.com/science/article/2012-12/fyi-which-computer-smarter-watson-or-deep-blue.

Machines are not good at asking questions, or even knowing what questions to ask. They are much better at answering them, provided the question is stated in a way that the computer can comprehend. Present-day machine learning algorithms partner with people much like a bloodhound partners with its trainer: the dog's sense of smell may be many times stronger than its master's, but without being carefully directed, the hound may end up chasing its tail.

Figure 1.2: Machine learning algorithms are powerful tools that require careful direction

To better understand the real-world applications of machine learning, we'll now consider some cases where it has been used successfully, some places where it still has room for improvement, and some situations where it may do more harm than good.

Machine learning successes

Machine learning is most successful when it augments, rather than replaces, the specialized knowledge of a subject-matter expert. It works with medical doctors at the forefront of the fight to eradicate cancer; assists engineers and programmers with efforts to create smarter homes and automobiles; and helps social scientists to build knowledge of how societies function. Toward these ends, it is employed in countless businesses, scientific laboratories, hospitals, and governmental organizations. Any effort that generates or aggregates data likely employs at least one machine learning algorithm to help make sense of it.

Though it is impossible to list every use case for machine learning, a look at recent success stories identifies several prominent examples:

Identification of unwanted spam messages in email
Segmentation of customer behavior for targeted advertising
Forecasts of weather behavior and long-term climate changes
Reduction of fraudulent credit card transactions
Actuarial estimates of financial damage of storms and natural disasters
Prediction of popular election outcomes
Development of algorithms for auto-piloting drones and self-driving cars
Optimization of energy use in homes and office buildings
Projection of areas where criminal activity is most likely
Discovery of genetic sequences linked to diseases

By the end of this book, you will understand the basic machine learning algorithms that are employed to teach computers to perform these tasks. For now, it suffices to say that no matter what the context is, the machine learning process is the same. Regardless of the task, an algorithm takes data and identifies patterns that form the basis for further action.

The limits of machine learning

Although machine learning is used widely and has tremendous potential, it is important to understand its limits. Machine learning, at this time, emulates a relatively limited subset of the capabilities of the human brain. It offers little flexibility to extrapolate outside of strict parameters and knows no common sense. With this in mind, one should be extremely careful to recognize exactly what an algorithm has learned before setting it loose in the real world.

Without a lifetime of past experiences to build upon, computers are also limited in their ability to make simple inferences about logical next steps. Take, for instance, the banner advertisements seen on many websites. These are served according to patterns learned by data mining the browsing history of millions of users. Based on this data, someone who views websites selling shoes is interested in buying shoes and should therefore see advertisements for shoes. The problem is that this becomes a never-ending cycle in which, even after shoes have been purchased, additional shoe advertisements are served, rather than advertisements for shoelaces and shoe polish.

Many people are familiar with the deficiencies of machine learning's ability to understand or translate language, or to recognize speech and handwriting. Perhaps the earliest example of this type of failure is in a 1994 episode of the television show The Simpsons, which showed a parody of the Apple Newton tablet. For its time, the Newton was known for its state-of-the-art handwriting recognition. Unfortunately for Apple, it would occasionally fail to great effect. The television episode illustrated this through a sequence in which a bully's note to "Beat up Martin" was misinterpreted by the Newton as "Eat up Martha."

Figure 1.3: Screen captures from Lisa on Ice, The Simpsons, 20th Century Fox (1994)

Machine language processing has improved enough in the time since the Apple Newton that Google, Apple, and Microsoft are all confident in their ability to offer voice-activated virtual concierge services such as Google Assistant, Siri, and Cortana. Still, these services routinely struggle to answer relatively simple questions. Furthermore, online translation services sometimes misinterpret sentences that a toddler would readily understand, and the predictive text feature on many devices has led to a number of humorous "autocorrect fail" sites that illustrate computers' ability to understand basic language but completely misunderstand context.

Some of these mistakes are surely to be expected. Language is complicated, with multiple layers of text and subtext, and even human beings sometimes misunderstand context. In spite of the fact that machine learning is rapidly improving at language processing, the consistent shortcomings illustrate the important fact that machine learning is only as good as the data it has learned from. If context is not explicit in the input data, then just like a human, the computer will have to make its best guess from its limited set of past experiences.

Machine learning ethics

At its core, machine learning is simply a tool that assists us with making sense of the world's complex data. Like any tool, it can be used for good or for evil. Where machine learning goes most wrong is when it is applied so broadly, or so callously, that humans are treated as lab rats, automata, or mindless consumers. A process that may seem harmless can lead to unintended consequences when automated by an emotionless computer. For this reason, those using machine learning or data mining would be remiss not to at least briefly consider the ethical implications of the art.

Due to the relative youth of machine learning as a discipline and the speed at which it is progressing, the associated legal issues and social norms are often quite uncertain, and constantly in flux. Caution should be exercised when obtaining or analyzing data in order to avoid breaking laws; violating terms of service or data use agreements; or abusing the trust or violating the privacy of customers or the public.

Tip

The informal corporate motto of Google, an organization that collects perhaps more data on individuals than any other, was at one time, "don't be evil." While this seems clear enough, it may not be sufficient. A better approach may be to follow the Hippocratic Oath, a medical principle that states, "above all, do no harm."

Retailers routinely use machine learning for advertising, targeted promotions, inventory management, or the layout of the items in a store. Many have equipped checkout lanes with devices that print coupons for promotions based on a customer's buying history. In exchange for a bit of personal data, the customer receives discounts on the specific products he or she wants to buy. At first, this appears relatively harmless, but consider what happens when this practice is taken a bit further.

One possibly apocryphal tale concerns a large retailer in the United States that employed machine learning to identify expectant mothers for coupon mailings. The retailer hoped that if these mothers-to-be received substantial discounts, they would become loyal customers who would later purchase profitable items such as diapers, baby formula, and toys.

Equipped with machine learning methods, the retailer identified items in the customer purchase history that could be used to predict with a high degree of certainty not only whether a woman was pregnant, but also the approximate timing for when the baby was due.

After the retailer used this data for a promotional mailing, an angry man contacted the chain and demanded to know why his daughter received coupons for maternity items. He was furious that the retailer seemed to be encouraging teenage pregnancy! As the story goes, when the retail chain called to offer an apology, it was the father who ultimately apologized after confronting his daughter and discovering that she was indeed pregnant!

Whether completely true or not, the lesson learned from the preceding tale is that common sense should be applied before blindly applying the results of a machine learning analysis. This is particularly true in cases where sensitive information, such as health data, is concerned. With a bit more care, the retailer could have foreseen this scenario and used greater discretion when choosing how to reveal the pattern its machine learning analysis had discovered.

Tip

For more detail on how retailers use machine learning to identify pregnancies, see the New York Times Magazine article, titled How Companies Learn Your Secrets, by Charles Duhigg, 2012: https://www.nytimes.com/2012/02/19/magazine/shopping-habits.html.

As machine learning algorithms are more widely applied, we find that computers may learn some unfortunate behaviors of human societies. Sadly, this includes perpetuating race or gender discrimination and reinforcing negative stereotypes. For example, researchers have found that Google's online advertising service is more likely to show ads for high-paying jobs to men than women, and is more likely to display ads for criminal background checks to black people than white people.

Proving that these types of missteps are not limited to Silicon Valley, a Twitter chatbot service developed by Microsoft was quickly taken offline after it began spreading Nazi and anti-feminist propaganda. Often, algorithms that at first seem "content neutral" quickly start to reflect majority beliefs or dominant ideologies. An algorithm created by Beauty.AI to reflect an objective conception of human beauty sparked controversy when it favored almost exclusively white people. Imagine the consequences if this had been applied to facial recognition software for criminal activity!

Tip

For more information about the real-world consequences of machine learning and discrimination see the New York Times article When Algorithms Discriminate, by Claire Cain Miller, 2015: https://www.nytimes.com/2015/07/10/upshot/when-algorithms-discriminate.html.

To limit the ability of algorithms to discriminate illegally, certain jurisdictions have well-intentioned laws that prevent the use of racial, ethnic, religious, or other protected class data for business reasons. However, excluding this data from a project may not be enough because machine learning algorithms can still inadvertently learn to discriminate. If a certain segment of people tends to live in a certain region, buys a certain product, or otherwise behaves in a way that uniquely identifies them as a group, machine learning algorithms can infer the protected information from other factors. In such cases, you may need to completely de-identify these people by excluding any potentially identifying data in addition to the already-protected statuses.

Apart from the legal consequences, inappropriate use of data may hurt the bottom line. Customers may feel uncomfortable or become spooked if aspects of their lives they consider private are made public. In recent years, a number of high-profile web applications have experienced a mass exodus of users who felt exploited when the applications' terms of service agreements changed or their data was used for purposes beyond what the users had originally intended. The fact that privacy expectations differ by context, by age cohort, and by locale adds complexity to deciding the appropriate use of personal data. It would be wise to consider the cultural implications of your work before you begin on your project, in addition to being aware of ever-more-restrictive regulations such as the European Union's newly-implemented General Data Protection Regulation (GDPR) and the inevitable policies that will follow in its footsteps.

Tip

The fact that you can use data for a particular end does not always mean that you should.

Finally, it is important to note that as machine learning algorithms become progressively more important to our everyday lives, there are greater incentives for nefarious actors to work to exploit them. Sometimes, attackers simply want to disrupt algorithms for laughs or notoriety—such as "Google bombing," the crowd-sourced method of tricking Google's algorithms to highly rank a desired page.

Other times, the effects are more dramatic. A timely example of this is the recent wave of so-called fake news and election meddling, propagated via the manipulation of advertising and recommendation algorithms that target people according to their personality. To avoid giving such control to outsiders, when building machine learning systems, it is crucial to consider how they may be influenced by a determined individual or crowd.

Tip

Social media scholar danah boyd (styled lowercase) presented a keynote at the Strata Data Conference 2017 in New York City that discussed the importance of hardening machine learning algorithms to attackers. For a recap, refer to: https://points.datasociety.net/your-data-is-being-manipulated-a7e31a83577b.

The consequences of malicious attacks on machine learning algorithms can also be deadly. Researchers have shown that by creating an "adversarial attack" that subtly distorts a street sign with carefully chosen graffiti, an attacker might cause an autonomous vehicle to misinterpret a stop sign, potentially resulting in a fatal crash. Even in the absence of ill intent, software bugs and human errors have already led to fatal accidents in autonomous vehicle technology from Uber and Tesla. With such examples in mind, it is of the utmost importance and ethical concern that machine learning practitioners should worry about how their algorithms will be used and abused in the real world.

How machines learn

A formal definition of machine learning attributed to computer scientist Tom M. Mitchell states that a machine learns whenever it is able to utilize its experience such that its performance improves on similar experiences in the future. Although this definition is intuitive, it completely ignores the process of exactly how experience can be translated into future action—and, of course, learning is always easier said than done!

Where human brains are naturally capable of learning from birth, the conditions necessary for computers to learn must be made explicit. For this reason, although it is not strictly necessary to understand the theoretical basis of learning, this foundation helps us to understand, distinguish, and implement machine learning algorithms.

Tip

As you compare machine learning to human learning, you may find yourself examining your own mind in a different light.

Regardless of whether the learner is a human or a machine, the basic learning process is similar. It can be divided into four interrelated components:

Data storage utilizes observation, memory, and recall to provide a factual basis for further reasoning.
Abstraction involves the translation of stored data into broader representations and concepts.
Generalization uses abstracted data to create knowledge and inferences that drive action in new contexts.
Evaluation provides a feedback mechanism to measure the utility of learned knowledge and inform potential improvements.

Figure 1.4: The learning process

Although the learning process has been conceptualized here as four distinct components, they are merely organized this way for illustrative purposes. In reality, the entire learning process is inextricably linked. In human beings, the process occurs subconsciously. We recollect, deduce, induct, and intuit within the confines of our mind's eye, and because this process is hidden, any differences from person to person are attributed to a vague notion of subjectivity. In contrast, computers make these processes explicit, and because the entire process is transparent, the learned knowledge can be examined, transferred, utilized for future action, and treated as a data "science."

The data science buzzword suggests a relationship among the data, the machine, and the people who guide the learning process. The term's growing use in job descriptions and academic degree programs reflects its operationalization as a field of study concerned with both statistical and computational theory, as well as the technological infrastructure enabling machine learning and its applications. The field often asks its practitioners to be compelling storytellers, balancing an audacity in the use of data with the limitations of what one may infer and forecast from the data. To be a strong data scientist, therefore, requires a strong understanding of how the learning algorithms work.

Data storage

All learning begins with data. Humans and computers alike utilize data storage as a foundation for more advanced reasoning. In a human being, this consists of a brain that uses electrochemical signals in a network of biological cells to store and process observations for short- and long-term future recall. Computers have similar capabilities of short- and long-term recall using hard disk drives, flash memory, and random-access memory (RAM) in combination with a central processing unit (CPU).

It may seem obvious, but the ability to store and retrieve data alone is insufficient for learning. Stored data is merely ones and zeros on a disk. It is a collection of memories, meaningless without a broader context. Without a higher level of understanding, knowledge is purely recall, limited to what has been seen before and nothing else.

To better understand the nuances of this idea, it may help to think about the last time you studied for a difficult test, perhaps for a university final exam or a career certification. Did you wish for an eidetic (photographic) memory? If so, you may be disappointed to know that perfect recall would unlikely be of much assistance. Even if you could memorize material perfectly, this rote learning would provide no benefit without knowing the exact questions and answers that would appear on the exam. Otherwise, you would need to memorize answers to every question that could conceivably be asked, on a subject in which there is likely to be an infinite number of questions. Obviously, this is an unsustainable strategy.

Instead, a better approach is to spend time selectively, and memorize a relatively small set of representative ideas, while developing an understanding of how the ideas relate and apply to unforeseen circumstances. In this way, important broader patterns are identified, rather than memorizing each and every detail, nuance, and potential application.

Abstraction

This work of assigning a broader meaning to stored data occurs during the abstraction process, in which raw data comes to represent a wider, more abstract concept or idea. This type of connection, say between an object and its representation, is exemplified by the famous René Magritte painting The Treachery of Images:

Figure 1.5: "This is not a pipe." Source: http://collections.lacma.org/node/239578

The painting depicts a tobacco pipe with the caption Ceci n'est pas une pipe ("This is not a pipe"). The point Magritte was illustrating is that a representation of a pipe is not truly a pipe. Yet, in spite of the fact that the pipe is not real, anybody viewing the painting easily recognizes it as a pipe. This suggests that observers are able to connect the picture of a pipe to the idea of a pipe, to a memory of a physical pipe that can be held in the hand. Abstracted connections like this are the basis of knowledge representation, the formation of logical structures that assist with turning raw sensory information into meaningful insight.

During a machine's process of knowledge representation, the computer summarizes stored raw data using a model, an explicit description of the patterns within the data. Just like Magritte's pipe, the model representation takes on a life beyond the raw data. It represents an idea greater than the sum of its parts.

There are many different types of models. You may already be familiar with some. Examples include:

Mathematical equations
Relational diagrams, such as trees and graphs
Logical if/else rules
Groupings of data known as clusters

The choice of model is typically not left up to the machine. Instead, the learning task and the type of data on hand inform model selection. Later in this chapter, we will discuss in more detail the methods for choosing the appropriate model type.

The process of fitting a model to a dataset is known as training. When the model has been trained, the data has been transformed into an abstract form that summarizes the original information.

Tip

You might wonder why this step is called "training" rather than "learning." First, note that the process of learning does not end with data abstraction—the learner must still generalize and evaluate its training. Second, the word "training" better connotes the fact that the human teacher trains the machine student to understand the data in a specific way.

It is important to note that a learned model does not itself provide new data, yet it does result in new knowledge. How can this be? The answer is that imposing an assumed structure on the underlying data gives insight into the unseen. It supposes a new concept that describes a manner in which data elements may be related.

Take, for instance, the discovery of gravity. By fitting equations to observational data, Sir Isaac Newton inferred the concept of gravity, but the force we now know as gravity was always present. It simply wasn't recognized until Newton expressed it as an abstract concept that relates some data to other data—specifically, by becoming the g term in a model that explains observations of falling objects.

Figure 1.6: Models are abstractions that explain observed data

Most models will not result in the development of theories that shake up scientific thought for centuries. Still, your abstraction might result in the discovery of important, but previously unseen, patterns and relationships among data. A model trained on genomic data might find several genes that when combined are responsible for the onset of diabetes, banks might discover a seemingly innocuous type of transaction that systematically appears prior to fraudulent activity, or psychologists might identify a combination of personality characteristics indicating a new disorder. These underlying patterns were always present, but by presenting information in a different format, a new idea is conceptualized.

Generalization

The next step in the learning process is to use the abstracted knowledge for future action. However, among the countless underlying patterns that may be identified during the abstraction process and the myriad ways to model those patterns, some patterns will be more useful than others. Unless the production of abstractions is limited to the useful set, the learner will be stuck where it started, with a large pool of information but no actionable insight.

Formally, the term generalization is defined as the process of turning abstracted knowledge into a form that can be utilized for future action, on tasks that are similar, but not identical, to those the learner has seen before. It acts as a search through the entire set of models (that is, theories or inferences) that could be established from the data during training.

If you can imagine a hypothetical set containing every possible way the data might be abstracted, generalization involves the reduction of this set into a smaller and more manageable set of important findings.

In generalization, the learner is tasked with limiting the patterns it discovers to only those that will be most relevant to its future tasks. Normally, it is not feasible to reduce the number of patterns by examining them one-by-one and ranking them by future utility. Instead, machine learning algorithms generally employ shortcuts that reduce the search space more quickly. To this end, the algorithm will employ heuristics, which are educated guesses about where to find the most useful inferences.

Tip

Heuristics utilize approximations and other rules of thumb, which means they are not guaranteed to find the best model of the data. However, without taking these shortcuts, finding useful information in a large dataset would be infeasible.

Heuristics are routinely used by human beings to quickly generalize experience to new scenarios. If you have ever utilized your gut instinct to make a snap decision prior to fully evaluating your circumstances, you were intuitively using mental heuristics.

The incredible human ability to make quick decisions often relies not on computer-like logic, but rather on emotion-guided heuristics. Sometimes, this can result in illogical conclusions. For example, more people express fear of airline travel than automobile travel, despite automobiles being statistically more dangerous. This can be explained by the availability heuristic, which is the tendency for people to estimate the likelihood of an event by how easily examples can be recalled. Accidents involving air travel are highly publicized. Being traumatic events, they are likely to be recalled very easily, whereas car accidents barely warrant a mention in the newspaper.

The folly of misapplied heuristics is not limited to human beings. The heuristics employed by machine learning algorithms also sometimes result in erroneous conclusions. The algorithm is said to have a bias if the conclusions are systematically erroneous, which implies that they are wrong in a consistent or predictable manner.

For example, suppose that a machine learning algorithm learned to identify faces by finding two dark circles representing eyes, positioned above a straight line indicating a mouth. The algorithm might then have trouble with, or be biased against, faces that do not conform to its model. Faces with glasses, turned at an angle, looking sideways, or with certain skin tones might not be detected by the algorithm. Similarly, it could be biased toward faces with other skin tones, face shapes, or characteristics that conform to its understanding of the world.

Figure 1.7: The process of generalizing a learner's experience results in a bias

In modern usage, the word "bias" has come to carry quite negative connotations. Various forms of media frequently claim to be free from bias, and claim to report the facts objectively, untainted by emotion. Still, consider for a moment the possibility that a little bias might be useful. Without a bit of arbitrariness, might it be a little difficult to decide among several competing choices, each with distinct strengths and weaknesses? Indeed, studies in the field of psychology have suggested that individuals born with damage to the portions of the brain responsible for emotion may be ineffectual at decision-making and might spend hours debating simple decisions, such as what color shirt to wear or where to eat lunch. Paradoxically, bias is what blinds us from some information, while also allowing us to utilize other information for action. It is how machine learning algorithms choose among the countless ways to understand a set of data.

Evaluation

Bias is a necessary evil associated with the abstraction and generalization processes inherent in any learning task. In order to drive action in the face of limitless possibility, all learning must have a bias. Consequently, each learning strategy has weaknesses; there is no single learning algorithm to rule them all. Therefore, the final step in the learning process is to evaluate its success, and to measure the learner's performance in spite of its biases. The information gained in the evaluation phase can then be used to inform additional training if needed.

Tip

Once you've had success with one machine learning technique, you might be tempted to apply it to every task. It is important to resist this temptation because no machine learning approach is best for every circumstance. This fact is described by the No Free Lunch theorem, introduced by David Wolpert in 1996. For more information, visit: http://www.no-free-lunch.org.

Generally, evaluation occurs after a model has been trained on an initial training dataset. Then, the model is evaluated on a separate test dataset in order to judge how well its characterization of the training data generalizes to new, unseen cases. It's worth noting that it is exceedingly rare for a model to perfectly generalize to every unforeseen case—mistakes are almost always inevitable.

In part, models fail to generalize perfectly due to the problem of noise, a term that describes unexplained or unexplainable variations in data. Noisy data is caused by seemingly random events, such as:

Measurement error due to imprecise sensors that sometimes add or subtract a small amount from the readings
Issues with human subjects, such as survey respondents reporting random answers to questions in order to finish more quickly
Data quality problems, including missing, null, truncated, incorrectly coded, or corrupted values
Phenomena that are so complex or so little understood that they impact the data in ways that appear to be random

Trying to model noise is the basis of a problem called overfitting; because most noisy data is unexplainable by definition, attempting to explain the noise will result in models that do not generalize well to new cases. Efforts to explain the noise also typically result in more complex models that miss the true pattern the learner is trying to identify.

Figure 1.8: Modeling noise generally results in more complex models and misses underlying patterns

A model that performs relatively well during training but relatively poorly during evaluation is said to be overfitted to the training dataset because it does not generalize well to the test dataset. In practical terms, this means that it has identified a pattern in the data that is not useful for future action; the generalization process has failed. Solutions to the problem of overfitting are specific to particular machine learning approaches. For now, the important point is to be aware of the issue. How well the methods are able to handle noisy data and avoid overfitting is an important point of distinction among them.

Machine learning in practice

So far, we've focused on how machine learning works in theory. To apply the learning process to real-world tasks, we'll use a five-step process. Regardless of the task, any machine learning algorithm can be deployed by following these steps:

Data collection: The data collection step involves gathering the learning material an algorithm will use to generate actionable knowledge. In most cases, the data will need to be combined into a single source, such as a text file, spreadsheet, or database.
Data exploration and preparation: The quality of any machine learning project is based largely on the quality of its input data. Thus, it is important to learn more about the data and its nuances during a practice called data exploration. Additional work is required to prepare the data for the learning process. This involves fixing or cleaning so-called "messy" data, eliminating unnecessary data, and recoding the data to conform to the learner's expected inputs.
Model training: By the time the data has been prepared for analysis, you are likely to have a sense of what you are capable of learning from the data. The specific machine learning task chosen will inform the selection of an appropriate algorithm, and the algorithm will represent the data in the form of a model.
Model evaluation: Each machine learning model results in a biased solution to the learning problem, which means that it is important to evaluate how well the algorithm learned from its experience. Depending on the type of model used, you might be able to evaluate the accuracy of the model using a test dataset, or you may need to develop measures of performance specific to the intended application.
Model improvement: If better performance is needed, it becomes necessary to utilize more advanced strategies to augment the model's performance. Sometimes it may be necessary to switch to a different type of model altogether. You may need to supplement your data with additional data or perform additional preparatory work, as in step two of this process.

After these steps have been completed, if the model appears to be performing well, it can be deployed for its intended task. As the case may be, you might utilize your model to provide score data for predictions (possibly in real time); for projections of financial data; to generate useful insight for marketing or research; or to automate tasks, such as mail delivery or flying aircraft. The successes and failures of the deployed model might even provide additional data to train your next-generation learner.

Types of input data

The practice of machine learning involves matching the characteristics of the input data to the biases of the available learning algorithms. Thus, before applying machine learning to real-world problems, it is important to understand the terminology that distinguishes between input datasets.

The phrase unit of observation is used to describe the smallest entity with measured properties of interest for a study. Commonly, the unit of observation is in the form of persons, objects or things, transactions, time points, geographic regions, or measurements. Sometimes, units of observation are combined to form units, such as person-years, which denote cases where the same person is tracked over multiple years, and each person-year comprises a person's data for one year.

Tip

The unit of observation is related, but not identical, to the unit of analysis, which is the smallest unit from which inference is made. Although it is often the case, the observed and analyzed units are not always the same. For example, data observed from people (the unit of observation) might be used to analyze trends across countries (the unit of analysis).

Datasets that store the units of observation and their properties can be described as collections of:

Examples: Instances of the unit of observation for which properties have been recorded
Features: Recorded properties or attributes of examples that may be useful for learning

It is easiest to understand features and examples through real-world scenarios. For instance, to build a learning algorithm to identify spam emails, the unit of observation could be email messages, examples would be specific individual messages, and the features might consist of the words used in the messages.

For a cancer detection algorithm, the unit of observation could be patients, the examples might include a random sample of cancer patients, and the features may be genomic markers from biopsied cells in addition to patient characteristics, such as weight, height, or blood pressure.

People and machines differ in the types of complexity they are suited to handle in the input data. Humans are comfortable consuming unstructured data, such as free-form text, pictures, or sound. They are also flexible handling cases in which some observations have a wealth of features, while others have very little.

On the other hand, computers generally require data to be structured, which means that each example of the phenomenon has the same features, and these features are organized in a form that a computer may understand. To use the brute force of the machine on large, unstructured datasets usually requires a transformation of the input data to a structured form.

The following spreadsheet shows data that has been gathered in matrix format. In matrix data, each row in the spreadsheet is an example and each column is a feature. Here, the rows indicate examples of automobiles for sale, while the columns record each automobile's features, such as the price, mileage, color, and transmission type. Matrix format data is by far the most common form used in machine learning. As you will see in later chapters, when forms of data are encountered in specialized applications, they are ultimately transformed into a matrix prior to machine learning.

Figure 1.9: A simple dataset in matrix format describing automobiles for sale

A dataset's features may come in various forms. If a feature represents a characteristic measured in numbers, it is unsurprisingly called numeric. Alternatively, if a feature comprises a set of categories, the feature is called categorical or nominal. A special type of categorical variable is called ordinal, which designates a nominal variable with categories falling in an ordered list. One example of an ordinal variable is clothing sizes, such as small, medium, and large; another is a measurement of customer satisfaction on a scale from "not at all happy" to "somewhat happy" to "very happy." For any given dataset, thinking about what the features represent, their types, and their units, will assist with determining an appropriate machine learning algorithm for the learning task.

Types of machine learning algorithms

Machine learning algorithms are divided into categories according to their purpose. Understanding the categories of learning algorithms is an essential first step toward using data to drive the desired action.

A predictive model is used for tasks that involve, as the name implies, the prediction of one value using other values in the dataset. The learning algorithm attempts to discover and model the relationship between the target feature (the feature being predicted) and the other features.

Despite the common use of the word "prediction" to imply forecasting, predictive models need not necessarily foresee events in the future. For instance, a predictive model could be used to predict past events, such as the date of a baby's conception using the mother's present-day hormone levels. Predictive models can also be used in real time to control traffic lights during rush hour.

Now, because predictive models are given clear instructions on what they need to learn and how they are intended to learn it, the process of training a predictive model is known as supervised learning. The supervision does not refer to human involvement, but rather to the fact that the target values provide a way for the learner to know how well it has learned the desired task. Stated more formally, given a set of data, a supervised learning algorithm attempts to optimize a function (the model) to find the combination of feature values that result in the target output.

The often-used supervised machine learning task of predicting which category an example belongs to is known as classification. It is easy to think of potential uses for a classifier. For instance, you could predict whether:

An email message is spam
A person has cancer
A football team will win or lose
An applicant will default on a loan

In classification, the target feature to be predicted is a categorical feature known as the class, which is divided into categories called levels. A class can have two or more levels, and the levels may or may not be ordinal. Classification is so widely used in machine learning that there are many types of classification algorithms, with strengths and weaknesses suited for different types of input data. We will see examples of these later in this chapter and throughout this book.

Supervised learners can also be used to predict numeric data, such as income, laboratory values, test scores, or counts of items. To predict such numeric values, a common form of numeric prediction fits linear regression models to the input data. Although regression is not the only method for numeric prediction, it is by far the most widely used. Regression methods are widely used for forecasting, as they quantify in exact terms the association between the inputs and the target, including both the magnitude and uncertainty of the relationship.

Tip

Since it is easy to convert numbers to categories (for example, ages 13 to 19 are teenagers) and categories to numbers (for example, assign 1 to all males and 0 to all females), the boundary between classification models and numeric prediction models is not necessarily firm.

A descriptive model is used for tasks that would benefit from the insight gained from summarizing data in new and interesting ways. As opposed to predictive models that predict a target of interest, in a descriptive model, no single feature is more important than any other. In fact, because there is no target to learn, the process of training a descriptive model is called unsupervised learning. Although it can be more difficult to think of applications for descriptive models—after all, what good is a learner that isn't learning anything in particular—they are used quite regularly for data mining.

For example, the descriptive modeling task called pattern discovery is used to identify useful associations within data. Pattern discovery is the goal of market basket analysis, which is applied to retailers' transactional purchase data. Here, retailers hope to identify items that are frequently purchased together, such that the learned information can be used to refine marketing tactics. For instance, if a retailer learns that swimming trunks are commonly purchased at the same time as sunscreen, the retailer might reposition the items more closely in the store or run a promotion to "up-sell" customers on associated items.

Tip

Originally used only in retail contexts, pattern discovery is now starting to be used in quite innovative ways. For instance, it can be used to detect patterns of fraudulent behavior, screen for genetic defects, or identify hotspots for criminal activity.

The descriptive modeling task of dividing a dataset into homogeneous groups is called clustering. This is sometimes used for segmentation analysis, which identifies groups of individuals with similar behavior or demographic information in order to target them with advertising campaigns based on their shared characteristics. With this approach, the machine identifies the clusters, but human intervention is required to interpret them. For example, given a grocery store's five customer clusters, the marketing team will need to understand the differences among the groups in order to create a promotion that best suits each group. Despite this human effort, this is still less work than creating a unique appeal for each customer.

Lastly, a class of machine learning algorithms known as meta-learners is not tied to a specific learning task, but rather is focused on learning how to learn more effectively. A meta-learning algorithm uses the result of past learning to inform additional learning.

This encompasses learning algorithms that learn to work together in teams called ensembles, as well as algorithms that seem to evolve over time in a process called reinforcement learning. Meta-learning can be beneficial for very challenging problems or when a predictive algorithm's performance needs to be as accurate as possible.

Some of the most exciting work being done in the field of machine learning today is in the domain of meta-learning. For instance, adversarial learning involves learning about a model's weaknesses in order to strengthen its future performance or harden it against malicious attack. There is also heavy investment in research and development efforts to make bigger and faster ensembles, which can model massive datasets using high-performance computers or cloud-computing environments.

Matching input data to algorithms

The following table lists the general types of machine learning algorithms covered in this book. Although this covers only a fraction of the entire set of machine learning algorithms, learning these methods will provide a sufficient foundation for making sense of any other methods you may encounter in the future.

Model	Learning task	Chapter
Supervised learning algorithms
k-nearest neighbors	Classification	Chapter 3
Naive Bayes	Classification	Chapter 4
Decision trees	Classification	Chapter 5
Classification rule learners	Classification	Chapter 5
Linear regression	Numeric prediction	Chapter 6
Regression trees	Numeric prediction	Chapter 6
Model trees	Numeric prediction	Chapter 6
Neural networks	Dual use	Chapter 7
Support vector machines	Dual use	Chapter 7
Unsupervised learning algorithms
Association rules	Pattern detection	Chapter 8
k-means clustering	Clustering	Chapter 9
Meta-learning algorithms
Bagging	Dual use	Chapter 11
Boosting	Dual use	Chapter 11
Random forests	Dual use	Chapter 11

To begin applying machine learning to a real-world project, you will need to determine which of the four learning tasks your project represents: classification, numeric prediction, pattern detection, or clustering. The task will drive the choice of algorithm. For instance, if you are undertaking pattern detection, you are likely to employ association rules. Similarly, a clustering problem will likely utilize the k-means algorithm, and numeric prediction will utilize regression analysis or regression trees.

For classification, more thought is needed to match a learning problem to an appropriate classifier. In these cases, it is helpful to consider the various distinctions among the algorithms—distinctions that will only be apparent by studying each of the classifiers in depth. For instance, within classification problems, decision trees result in models that are readily understood, while the models of neural networks are notoriously difficult to interpret. If you were designing a credit scoring model, this could be an important distinction because the law often requires that the applicant must be notified about the reasons he or she was rejected for the loan. Even if the neural network is better at predicting loan defaults, if its predictions cannot be explained, then it is useless for this application.

To assist with algorithm selection, in every chapter the key strengths and weaknesses of each learning algorithm are listed. Although you will sometimes find that these characteristics exclude certain models from consideration, in many cases the choice of algorithm is arbitrary. When this is true, feel free to use whichever algorithm you are most comfortable with. Other times, when predictive accuracy is the primary goal, you may need to test several models and choose the one that fits best, or use a meta-learning algorithm that combines several different learners to utilize the strengths of each.

Machine learning with R

Many of the algorithms needed for machine learning are not included as part of the base R installation. Instead, the algorithms are available via a large community of experts who have shared their work freely. These must be installed on top of base R manually. Thanks to R's status as free open-source software, there is no additional charge for this functionality.

A collection of R functions that can be shared among users is called a package. Free packages exist for each of the machine learning algorithms covered in this book. In fact, this book only covers a small portion of all of R's machine learning packages.

If you are interested in the breadth of R packages, you can view a list at Comprehensive R Archive Network (CRAN), a collection of web and FTP sites located around the world to provide the most up-to-date versions of R software and packages. If you obtained the R software via download, it was most likely from CRAN. The CRAN website is available at http://cran.r-project.org/index.html.

Tip

If you do not already have R, the CRAN website also provides installation instructions and information on where to find help if you have trouble.

The Packages link on the left side of the CRAN page will take you to a page where you can browse the packages in alphabetical order or sorted by publication date. At the time of this writing, a total of 13,904 packages were available—over two times the number since the second edition of this book was written, and over three times since the first edition! Clearly, the R community is thriving, and this trend shows no sign of slowing!

The Task Views link on the left side of the CRAN page provides a curated list of packages by subject area. The task view for machine learning, which lists the packages covered in this book (and many more), is available at https://CRAN.R-project.org/view=MachineLearning.

Installing R packages

Despite the vast set of available R add-ons, the package format makes installation and use a virtually effortless process. To demonstrate the use of packages, we will install and load the RWeka package developed by Kurt Hornik, Christian Buchta, and Achim Zeileis (see Open-Source Machine Learning: R Meets Weka, Computational Statistics Vol. 24, pp 225-232 for more information). The RWeka package provides a collection of functions that give R access to the machine learning algorithms in the Java-based Weka software package by Ian H. Witten and Eibe Frank. For more information on Weka, see http://www.cs.waikato.ac.nz/~ml/weka/.

Tip

To use the RWeka package, you will need to have Java installed, if it isn't already (many computers come with Java preinstalled). Java is a set of programming tools, available for free, that allow for the use of cross-platform applications such as Weka. For more information and to download Java for your system, visit http://www.java.com.

The most direct way to install a package is via the install.packages() function. To install the RWeka package, at the R command prompt simply type:

> install.packages("RWeka")

R will then connect to CRAN and download the package in the correct format for your operating system. Some packages, such as RWeka, require additional packages to be installed before they can be used. These are called dependencies. By default, the installer will automatically download and install any dependencies.

Tip

The first time you install a package, R may ask you to choose a CRAN mirror. If this happens, choose the mirror residing at a location close to you. This will generally provide the fastest download speed.

The default installation options are appropriate for most systems. However, in some cases, you may want to install a package to another location. For example, if you do not have root or administrator privileges on your system, you may need to specify an alternative installation path. This can be accomplished using the lib option as follows:

> install.packages("RWeka", lib = "/path/to/library")

The installation function also provides additional options for installing from a local file, installing from source, or using experimental versions. You can read about these options in the help file by using the following command:

> ?install.packages

More generally, the question mark operator can be used to obtain help on any R function. Simply type ? before the name of the function.

Loading and unloading R packages

In order to conserve memory, R does not load every installed package by default. Instead, packages are loaded by users with the library() function as they are needed.

Tip

The name of this function leads some people to incorrectly use the terms "library" and "package" interchangeably. However, to be precise, a library refers to the location where packages are installed and never to a package itself.

To load the RWeka package installed previously, you can type the following:

> library(RWeka)

Aside from RWeka, there are several other R packages that will be used in later chapters. Installation instructions will be provided as these additional packages are needed.

To unload an R package, use the detach() function. For example, to unload the RWeka package shown previously, use the following command:

> detach("package:RWeka", unload = TRUE)

This will free up any resources used by the package.

Installing RStudio

Before you begin working with R, it is highly recommended to also install the open-source RStudio desktop application. RStudio is an additional interface to R that includes functionalities that make it far easier, more convenient, and more interactive to work with R code. It is available free of charge at https://www.rstudio.com/.

Figure 1.10: The RStudio desktop environment makes R easier and more convenient to use

The RStudio interface includes an integrated code editor, an R command-line console, a file browser, and an R object browser. R code syntax is automatically colorized, and the code's output, plots, and graphics are displayed directly within the environment, which makes it much easier to follow long or complex statements and programs. More advanced features allow R project and package management; integration with source control or version control tools, such as Git and Subversion; database connection management; and the compilation of R output to HTML, PDF, or Microsoft Word formats.

RStudio is a key reason why R is a top choice for data scientists today. It wraps the power of R programming and its tremendous library of machine learning and statistical packages in an easy-to-use and easy-to-install development interface. It is not only ideal for learning R, but can also grow with you as you learn R's more advanced functionality.

Filter reviews by

All

Amazon verified reviews

Mauricio Arriaga Aug 22, 2023

Cuenta con ejemplos suficientes para entender; muy bien explicado... adicionalmente de comparar el libro físico, dieron de regalo una versión nueva electrónica

Amazon Verified review

mnomaha Apr 23, 2023

I honestly hate this book, but I'm sure there are positive points to it somewhere. My biggest problem right now is the index. The author was apparently too lazy to put functions in the index or create an appendix with all the functions covered. I had to make my own functional dictionary for all the functions.I think I spent more time researching functions on the internet than actually reading the book.

Kindle User Apr 17, 2023

Book is really hard and difficult to follow. Examples are unrealistically simplified and misleading. Code doesn’t work most of the time. Carelessly written since much of the code doesn't work much of the time.

ferlux47 Apr 06, 2023

Creo que es un buen libro con un resumen de técnicas analíticas para clasificación y simulación, sin embargo el capítulo 2 es una Intro a datos en R, lo cual no me sirve de mucho y por otro lado los algoritmos que se presentan solo es una definición breve con un caso de estudio como ejemplo

Daniel B.G. Nov 27, 2022