A commonly cited formal definition of machine learning, proposed by computer scientist Tom M. Mitchell, states that a machine learns if it is able to take experience and utilize it such that its performance improves on similar experiences in the future. This definition is fairly exact, yet says little about how machine learning techniques actually learn to transform data into actionable knowledge.
Tip
Although it is not strictly necessary to understand the theoretical basis of machine learning prior to using it, this foundation provides an insight into the distinctions among machine learning algorithms. Because machine learning algorithms are modeled in many ways on human minds, you may even discover yourself examining your own mind in a different light.
Regardless of whether the learner is a human or a machine, the basic learning process is similar. It can be divided into three components as follows:
Data input: It utilizes observation, memory storage, and recall to provide a factual basis for further reasoning.
Abstraction: It involves the translation of data into broader representations.
Generalization: It uses abstracted data to form a basis for action.
To better understand the learning process, think about the last time you studied for a difficult test, perhaps for a university final exam or a career certification. Did you wish for an eidetic (that is, photographic) memory? If so, you may be disappointed to learn that perfect recall is unlikely to save you much effort. Without a higher understanding, your knowledge is limited exactly to the data input, meaning only what you had seen before and nothing more. Therefore, without knowledge of all the questions that could appear on the exam, you would be stuck attempting to memorize answers to every question that could conceivably be asked. Obviously, this is an unsustainable strategy.
Instead, a better strategy is to spend time selectively managing only a smaller set of key ideas. The commonly used learning strategies of creating an outline or a concept map are similar to how a machine performs knowledge abstraction. These tools define relationships among pieces of information and, in doing so, depict difficult ideas without needing to memorize them word-for-word. It is a more advanced form of learning because it requires the learner to put the topic into his or her own words.
It is always a tense moment when the exam is graded and the learning strategies are either vindicated or implicated with a high or low mark. Here, one discovers whether the learning strategies generalized to the questions that the teacher or professor had selected. Generalization requires a breadth of abstracted data, as well as a higher-level understanding of how to apply such knowledge to unforeseen topics. A good teacher can be quite helpful in this regard.
Keep in mind that although we have illustrated the learning process as three distinct steps, they are merely organized this way for illustrative purposes. In reality, the three components of learning are inextricably linked. In particular, the stages of abstraction and generalization are so closely related that it would be impossible to perform one without the other. In human beings, the entire process happens subconsciously. We recollect, deduce, induce, and intuit. Yet for a computer, these processes must be made explicit. On the other hand, this is a benefit of machine learning. Because the process is transparent, the learned knowledge can be examined and utilized for future action.
Abstraction and knowledge representation
Representing raw input data in a structured format is the quintessential task for a learning algorithm. Prior to this point, the data is merely ones and zeros on a disk or in memory; it has no meaning. The work of assigning a meaning to data occurs during the abstraction process.
The connection between ideas and reality is exemplified by the famous René Magritte painting The Treachery of Images shown as follows:
The painting depicts a tobacco pipe with the caption Ceci n'est pas une pipe ("this is not a pipe"). The point Magritte was illustrating is that a representation of a pipe is not truly a pipe. In spite of the fact that the pipe is not real, anybody viewing the painting easily recognizes that the picture is a pipe, suggesting that observers' minds are able to connect the picture of a pipe to the idea of a pipe, which can then be connected to an actual pipe that could be held in the hand. Abstracted connections like this are the basis of knowledge representation, the formation of logical structures that assist with turning raw sensory information into a meaningful insight.
During the process of knowledge representation, the computer summarizes raw inputs in a model, an explicit description of the structured patterns among data. There are many different types of models, and you may already be familiar with some.
The choice of model is typically not left up to the machine. Instead, the model is dictated by the learning task and the type of data being analyzed. Later in this chapter, we will discuss methods for choosing the type of model in more detail.
The process of fitting a particular model to a dataset is known as training. Why is this not called learning? First, note that the learning process does not end with the step of data abstraction. Learning requires an additional step to generalize the knowledge to future data. Second, the term training more accurately describes the actual process undertaken when the model is fitted to the data. Learning implies a sort of inductive, bottom-up reasoning. Training better connotes the fact that the machine learning model is imposed by the human teacher onto the machine student, providing the computer with a structure it attempts to emulate.
When the model has been trained, the data has been transformed into an abstract form that summarizes the original information. It is important to note that the model does not itself provide additional data, yet it is sometimes interesting on its own. How can this be? The reason is that by imposing an assumed structure on the underlying data, it gives insight into the unseen and provides a theory about how the data is related. Take for instance the discovery of gravity. By fitting equations to observational data, Sir Isaac Newton deduced the concept of gravity. But gravity was always present. It simply wasn't recognized as a concept until the model noted it in abstract terms—specifically, by becoming the g term in a model that explains observations of falling objects.
Most models will not result in the development of theories that shake up scientific thought for centuries. Still, your model might result in the discovery of previously unseen relationships among data. A model trained on genomic data might find several genes that when combined are responsible for the onset of diabetes; banks might discover a seemingly innocuous type of transaction that systematically appears prior to fraudulent activity; psychologists might identify a combination of characteristics indicating a new disorder. The underlying relationships were always present; but in conceptualizing the information in a different format, a model presents the connections in a new light.
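To make the gravity illustration concrete, the following is a minimal sketch of training in the sense described above: fitting an assumed model to observations. It assumes the falling-object model d = 0.5 * g * t^2, and the observations are simulated rather than real measurements. The names simulate_drops and fit_g, the drop times, and the size of the measurement error are all hypothetical choices made for illustration.

```python
import random

TRUE_G = 9.81  # m/s^2; used only to simulate the observations


def simulate_drops(times, rng):
    """Return simulated (time, distance) pairs with a small bounded error."""
    return [(t, 0.5 * TRUE_G * t**2 + rng.uniform(-0.1, 0.1)) for t in times]


def fit_g(observations):
    """Least-squares estimate of g in the assumed model d = 0.5 * g * t**2."""
    # With x = t**2 the model is a line through the origin, d = slope * x.
    # Its least-squares slope is sum(x * d) / sum(x * x), and g = 2 * slope.
    xs = [t**2 for t, _ in observations]
    ds = [d for _, d in observations]
    slope = sum(x * d for x, d in zip(xs, ds)) / sum(x * x for x in xs)
    return 2.0 * slope


rng = random.Random(0)  # fixed seed so the simulation is repeatable
observations = simulate_drops([0.5, 1.0, 1.5, 2.0, 2.5, 3.0], rng)
g_hat = fit_g(observations)
print(f"estimated g = {g_hat:.2f} m/s^2")
```

The fitted model adds no new data. It only summarizes the six observations as a single number, the g term, under the structure that was imposed on them.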
Recall that the learning process is not complete until the learner is able to use its abstracted knowledge for future action. Yet an issue remains before the learner can proceed—there are countless underlying relationships that might be identified during the abstraction process and myriad ways to model these relationships. Unless the number of potential theories is limited, the learner will be unable to utilize the information. It would be stuck where it started, with a large pool of information but no actionable insight.
The term generalization describes the process of turning abstracted knowledge into a form that can be utilized for action. Generalization is a somewhat vague process that is a bit difficult to describe. Traditionally, it has been imagined as a search through the entire set of models (that is, theories) that could have been abstracted during training. Specifically, if you imagine a hypothetical set containing every possible theory that could be established from the data, generalization involves the reduction of this set into a manageable number of important findings.
It is usually not feasible to reduce the number of potential concepts by examining them one-by-one and determining which are the most useful. Instead, machine learning algorithms generally employ shortcuts that more quickly divide the set of concepts. Toward this end, the algorithm will employ heuristics, or educated guesses about where to find the most important concepts.
Tip
Because the heuristics utilize approximations and other rules of thumb, they are not guaranteed to find the optimal set of concepts that model the data. However, without utilizing these shortcuts, finding useful information in a large dataset would be infeasible.
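The trade-off between an exhaustive search and a heuristic can be sketched with a toy problem. The example below is not any particular learning algorithm. It assumes a hypothetical set of eight candidate rules, each explaining some of twelve examples, and asks for the two rules that together explain the most. The greedy heuristic (pick whichever rule helps most right now) is a stand-in chosen for illustration, and all names and data are made up.

```python
from itertools import combinations

# Hypothetical candidate concepts: each rule explains a set of example IDs.
RULES = {
    "A": {1, 2, 3, 4, 5, 6},
    "B": {1, 2, 3, 7, 8},
    "C": {4, 5, 6, 9, 10},
    "D": {7, 8, 11},
    "E": {9, 10, 12},
    "F": {11, 12},
    "G": {1, 7},
    "H": {2, 9},
}


def coverage(names):
    """Number of distinct examples explained by the chosen rules."""
    covered = set()
    for name in names:
        covered |= RULES[name]
    return len(covered)


def exhaustive_search(k):
    """Score every combination of k rules: optimal, but costly as k grows."""
    best, best_score, evaluations = None, -1, 0
    for combo in combinations(RULES, k):
        evaluations += 1
        score = coverage(combo)
        if score > best_score:
            best, best_score = combo, score
    return best, best_score, evaluations


def greedy_search(k):
    """Heuristic: repeatedly add the single rule that raises coverage most."""
    chosen, evaluations = [], 0
    for _ in range(k):
        best_name, best_score = None, -1
        for name in RULES:
            if name in chosen:
                continue
            evaluations += 1
            score = coverage(chosen + [name])
            if score > best_score:
                best_name, best_score = name, score
        chosen.append(best_name)
    return tuple(chosen), coverage(chosen), evaluations


best, best_score, best_evals = exhaustive_search(2)
greedy, greedy_score, greedy_evals = greedy_search(2)
print(f"exhaustive: {best} explains {best_score} examples ({best_evals} evaluations)")
print(f"greedy:     {greedy} explains {greedy_score} examples ({greedy_evals} evaluations)")
```

On this data, the greedy search scores 15 candidates and settles on a pair explaining 9 examples, while the exhaustive search scores all 28 pairs and finds one explaining 10. The shortcut is cheaper but misses the optimum, and the gap in cost widens quickly as the number of rules grows.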
Heuristics are routinely used by human beings to quickly generalize experience to new scenarios. If you have ever utilized gut instinct to make a snap decision prior to fully evaluating your circumstances, you were intuitively using mental heuristics.
For example, the availability heuristic is the tendency for people to estimate the likelihood of an event by how easily examples can be recalled. The availability heuristic might help explain the prevalence of the fear of airline travel relative to automobile travel, despite automobiles being statistically more dangerous. Accidents involving air travel are highly publicized and traumatic events, and are likely to be very easily recalled, whereas car accidents barely warrant a mention in the newspaper.
The preceding example illustrates the potential for heuristics to result in illogical conclusions. Browsing a list of common logical fallacies, one is likely to note many that seem rooted in heuristic-based thinking. For instance, the gambler's fallacy, or the belief that a run of bad luck implies that a stretch of better luck is due, may result from the application of the representativeness heuristic, which erroneously led the gambler to believe that all random sequences are balanced since most random sequences are balanced.
The folly of misapplied heuristics is not limited to human beings. The heuristics employed by machine learning algorithms also sometimes result in erroneous conclusions. If the conclusions are systematically inaccurate, the algorithm is said to have a bias. For example, suppose that a machine learning algorithm learned to identify faces by finding two circles, or eyes, positioned side-by-side above a line for a mouth. The algorithm might then have trouble with, or be biased against, faces that do not conform to its model. This may include faces with glasses, turned at an angle, looking sideways, or with darker skin tones. Similarly, it could be biased toward faces with lighter eye colors or other characteristics that do conform to its understanding of the world.
In modern usage, the word bias has come to carry quite negative connotations. Various forms of media frequently claim to be free from bias, and claim to report the facts objectively, untainted by emotion. Still, consider for a moment the possibility that a little bias might be useful. Without a bit of arbitrariness, might it be a bit difficult to decide among several competing choices, each with distinct strengths and weaknesses? Indeed, some recent studies in the field of psychology have suggested that individuals born with damage to portions of the brain responsible for emotion are ineffectual at decision making, and might spend hours debating simple decisions such as what color shirt to wear or where to eat lunch. Paradoxically, bias is what blinds us to some information while also allowing us to utilize other information for action.
Assessing the success of learning
Bias is a necessary evil associated with the abstraction and generalization process inherent in any machine learning task. Every learner has its weaknesses and is biased in a particular way; there is no single model to rule them all. Therefore, the final step in the generalization process is to determine the model's success in spite of its biases.
After a model has been trained on an initial dataset, the model is tested on a new dataset, and judged on how well its characterization of the training data generalizes to the new data. It's worth noting that it is exceedingly rare for a model to perfectly generalize to every unforeseen case.
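The train-then-test procedure can be sketched as follows. The data is simulated and purely illustrative: hypothetical records of hours studied and whether an exam was passed, with some randomness in the outcome. The model is a single learned cutoff on hours, chosen only because it is easy to read, and the 150/50 split is an arbitrary assumption.

```python
import random


def make_data(n, rng):
    """Simulated (hours_studied, passed) records; not real measurements."""
    data = []
    for _ in range(n):
        hours = rng.uniform(0, 10)
        passed = hours + rng.gauss(0, 1) > 5  # noisy pass/fail outcome
        data.append((hours, passed))
    return data


def accuracy(threshold, data):
    """Share of records where 'hours >= threshold' matches the outcome."""
    hits = sum((hours >= threshold) == passed for hours, passed in data)
    return hits / len(data)


def train_threshold(train):
    """Pick the cutoff, among observed hours, with the best training accuracy."""
    return max((hours for hours, _ in train), key=lambda c: accuracy(c, train))


rng = random.Random(42)  # fixed seed so the simulation is repeatable
records = make_data(200, rng)
rng.shuffle(records)
train, test = records[:150], records[150:]  # the model never sees `test`

cutoff = train_threshold(train)
train_acc = accuracy(cutoff, train)
test_acc = accuracy(cutoff, test)
print(f"cutoff = {cutoff:.2f} hours")
print(f"training accuracy = {train_acc:.2f}, test accuracy = {test_acc:.2f}")
```

The test accuracy is the figure of interest, because it is measured on records that played no part in choosing the cutoff. Neither figure should be expected to reach 1.0, since part of each outcome is random.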
In part, the failure of models to perfectly generalize is due to the problem of noise, or unexplained variations in data. Noisy data is caused by seemingly random events, such as:
Measurement error due to imprecise sensors that sometimes add or subtract a bit from the reading
Issues with reporting data, such as respondents reporting random answers to survey questions in order to finish more quickly
Errors caused when data is recorded incorrectly, including missing, null, truncated, incorrectly coded, or corrupted values
Trying to model the noise in data is the basis of a problem called overfitting. Because noise is unexplainable by definition, attempting to explain the noise will result in erroneous conclusions that do not generalize well to new cases. Attempting to generate theories to explain the noise also results in more complex models that are more likely to ignore the true pattern the learner is trying to identify. A model that seems to perform well during training but does poorly during testing is said to be overfitted to the training dataset, because it does not generalize well.
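A small simulation can illustrate the symptom described above. It assumes data generated from y = 2x plus random noise, and compares two hypothetical learners: a straight-line fit, and a "memorizer" that answers with the outcome of the closest training point and therefore reproduces the training noise exactly. Both learners, the data, and the sample sizes are illustrative assumptions, not methods taken from the text.

```python
import random


def make_data(n, rng):
    """Simulated points from y = 2x plus Gaussian noise (illustrative only)."""
    data = []
    for _ in range(n):
        x = rng.uniform(0, 10)
        data.append((x, 2 * x + rng.gauss(0, 1)))
    return data


def fit_line(train):
    """Ordinary least-squares line; returns a prediction function."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in train)
             / sum((x - mean_x) ** 2 for x, _ in train))
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def fit_memorizer(train):
    """Predict the y of the nearest training x, noise included."""
    return lambda x: min(train, key=lambda p: abs(p[0] - x))[1]


def mse(predict, data):
    """Mean squared error of a prediction function on a dataset."""
    return sum((predict(x) - y) ** 2 for x, y in data) / len(data)


rng = random.Random(7)  # fixed seed so the simulation is repeatable
train, test = make_data(400, rng), make_data(400, rng)

results = {}
for name, fit in [("line", fit_line), ("memorizer", fit_memorizer)]:
    model = fit(train)
    results[name] = (mse(model, train), mse(model, test))
    print(f"{name:9s} train MSE = {results[name][0]:.2f}, "
          f"test MSE = {results[name][1]:.2f}")
```

The memorizer has zero error on the training data because it has stored the noise along with the pattern, but its error on the new data is larger than that of the simpler line. That gap between training and test performance is the signature of overfitting.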
Solutions to the problem of overfitting are specific to particular machine learning approaches. For now, the important point is to be aware of the issue. How well models are able to handle noisy data is an important source of distinction among them.