Data mining provides a way for a computer to learn how to make decisions with data. This decision could be predicting tomorrow's weather, blocking a spam email from entering your inbox, detecting the language of a website, or finding a new romance on a dating site. There are many different applications of data mining, with new applications being discovered all the time.
Most data mining applications work with the same high-level view, where a model learns from some data and is applied to other data, although the details often change quite considerably.
Data mining applications involve creating data sets and tuning the algorithm as explained in the following steps
- We start our data mining process by creating a dataset, describing an aspect of the real world. Datasets comprise of the following two aspects:
- Samples: These are objects in the real world, such as a book, photograph, animal, person, or any other object. Samples are also referred to as observations, records or rows, among other naming conventions.
- Features: These are descriptions or measurements of the samples in our dataset. Features could be the length, frequency of a specific word, the number of legs on an animal, date it was created, and so on. Features are also referred to as variables, columns, attributes or covariant, again among other naming conventions.
- The next step is tuning the data mining algorithm. Each data mining algorithm has parameters, either within the algorithm or supplied by the user. This tuning allows the algorithm to learn how to make decisions about the data.
As a simple example, we may wish the computer to be able to categorize people as short or tall. We start by collecting our dataset, which includes the heights of different people and whether they are considered short or tall:
Person | Height | Short or tall? |
1 | 155cm | Short |
2 | 165cm | Short |
3 | 175cm | Tall |
4 | 185cm | Tall |
As explained above, the next step involves tuning the parameters of our algorithm. As a simple algorithm; if the height is more than x, the person is tall. Otherwise, they are short. Our training algorithms will then look at the data and decide on a good value for x. For the preceding data, a reasonable value for this threshold would be 170 cm. A person taller than 170 cm is considered tall by the algorithm. Anyone else is considered short by this measure. This then lets our algorithm classify new data, such as a person with height 167 cm, even though we may have never seen a person with those measurements before.
In the preceding data, we had an obvious feature type. We wanted to know if people are short or tall, so we collected their heights. This feature engineering is a critical problem in data mining. In later chapters, we will discuss methods for choosing good features to collect in your dataset. Ultimately, this step often requires some expert domain knowledge or at least some trial and error.
In this book, we will introduce data mining through Python. In some cases, we choose clarity of code and workflows, rather than the most optimized way to perform every task. This clarity sometimes involves skipping some details that can improve the algorithm's speed or effectiveness.