What are the techniques used in data mining?
Now that we have a sense of where data mining fits in our overall KDD or data science process, we can start to discuss the details of how to get it done.
Since the early days of attempting to define data mining, several broad classes of relevant problems have shown up again and again. Fayyad et al. name six classes of problems in another important 1996 paper (From Data Mining to Knowledge Discovery in Databases), which we can summarize as follows:
- Classification problems: Here, we have data that needs to be divided into predefined classes, based on some features of the data. We need an algorithm that can use previously classified data to learn how to put unknown data into the correct class.
- Clustering problems: With these problems, we have data that needs to be divided into classes based on its features, but we do not know what the classes are in advance. We need an algorithm that can measure the similarity between data points and automatically divide the data up based on these similarities.
- Regression problems: We have data that needs to be mapped onto a real-valued prediction variable, so we need to learn a function that can do this mapping.
- Summarization problems: Suppose we have data that needs to be shortened or summarized in some way. This could be as simple as calculating basic statistics from data, or as complex as learning how to summarize text or finding a topic model for text.
- Dependency modeling problems: For these problems, we have data that might be connected in some way, and we need to develop an algorithm that can calculate the probability of connection or describe the structure of connected data.
- Change and deviation detection problems: In another case, we have data that has changed significantly or where some subset of the data deviates from normative values. To solve these problems, we need an algorithm that can detect these issues automatically.
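To make the difference between the first three problem classes concrete, here is a minimal sketch in plain Python. The data, the function names (classify, cluster, fit_line), and the choice of algorithms (nearest neighbor, a simple distance threshold, and a least-squares line) are all assumptions made for illustration; they are not taken from Fayyad et al., and each class of problem has many other algorithms.

```python
import math

# Hypothetical labeled training data: (feature vector, class label).
labeled = [
    ((1.0, 1.0), "small"),
    ((1.5, 2.0), "small"),
    ((8.0, 8.0), "large"),
    ((9.0, 7.5), "large"),
]


def classify(point, training):
    """Classification: copy the label of the nearest labeled example (1-NN)."""
    nearest = min(training, key=lambda item: math.dist(point, item[0]))
    return nearest[1]


def cluster(points, threshold):
    """Clustering: no labels are given. A point joins the first cluster whose
    first member lies within `threshold` of it, else it starts a new cluster."""
    clusters = []
    for p in points:
        for c in clusters:
            if math.dist(p, c[0]) <= threshold:
                c.append(p)
                break
        else:
            clusters.append([p])
    return clusters


def fit_line(xs, ys):
    """Regression: least-squares fit of y = slope * x + intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


if __name__ == "__main__":
    print(classify((2.0, 1.5), labeled))
    print(cluster([features for features, _ in labeled], threshold=3.0))
    print(fit_line([1, 2, 3, 4], [3, 5, 7, 9]))
```

Notice that classify needs the labels in order to work, cluster never sees them, and fit_line returns a function's parameters rather than a class.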
In a different paper written that same year, those same authors also included a few additional categories:
- Link analysis problems: Here we have data points with relationships between them, and we need to discover and describe these relationships in terms of how much support they have in the data set and how confident we are in the relationship.
- Sequence analysis problems: Imagine that we have data points that follow a sequence, such as a time series or a genome, and we must discover trends or deviations in the sequence, or discover what is causing the sequence or how it will evolve.
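The support and confidence measures mentioned under link analysis can be sketched in a few lines. This is a minimal illustration that assumes the usual definitions from association rule mining (support as the fraction of transactions containing an itemset, confidence as the support of the whole rule divided by the support of its left-hand side); the transactions and function names are invented for the example.

```python
# Hypothetical market-basket transactions, invented for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    matches = sum(1 for t in transactions if itemset <= t)
    return matches / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Of the transactions containing `antecedent`, the fraction that also
    contain `consequent`."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


if __name__ == "__main__":
    # Rule under test: bread -> milk
    print(support({"bread", "milk"}, transactions))
    print(confidence({"bread"}, {"milk"}, transactions))
```

Here the rule bread -> milk is supported by half of the transactions, and it holds in two of the three transactions that contain bread.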
Han, Kamber, and Pei, in the textbook we discussed earlier, describe several classes of problems that data mining can help solve, and they divide them into descriptive and predictive categories. Descriptive data mining means we are finding patterns that help us understand the data we have. Predictive data mining means we are finding patterns that can help us make predictions about data we do not yet have.
In the descriptive category, they list the following data mining problems:
- Data characterization and data discrimination problems, including data summarization or concept characterization or description.
- Frequency mining, including finding frequent patterns, association rules, and correlations in data.
In the predictive category, they list the following:
- Classification, regression
- Clustering
- Outlier detection and anomaly detection
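As a small illustration of the last item, here is one very simple way to flag outliers: measure how many standard deviations each value lies from the mean (its z-score) and report the values beyond a cutoff. This is only a sketch under assumptions of our own. The function name, the sample readings, and the cutoff of 2.0 are hypothetical, and Han et al. describe many other approaches.

```python
import statistics


def zscore_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` population standard
    deviations from the mean. The default threshold is an arbitrary choice."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # all values are identical, so nothing deviates
    return [v for v in values if abs(v - mean) / stdev > threshold]


if __name__ == "__main__":
    # Hypothetical sensor readings with one obviously unusual value.
    readings = [10, 11, 9, 10, 12, 10, 11, 50]
    print(zscore_outliers(readings))
```

One caveat with this approach is that the outlier itself inflates the mean and standard deviation, so on small data sets an extreme value can partly hide itself or others.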
The Fayyad et al. list and the Han et al. list have a lot in common; the authors have mostly just grouped the items differently. Indeed, the items that show up on both lists are exactly the types of data mining problems you are probably already familiar with if you have completed earlier data mining projects. Classification, regression, and clustering are very popular, foundational data mining techniques, so they are covered in nearly every data mining book designed for practitioners.
What techniques are we going to use in this book?
Since this book is about mastering data mining, we are going to tackle a few of the techniques that are not covered quite as often in the standard books. Specifically, we will address link analysis via association rules in Chapter 2, Association Rule Mining, and anomaly detection in Chapter 9, Mining for Data Anomalies. We are also going to apply a few data mining techniques to actually assist in data cleaning and pre-processing efforts, namely, in taking care of missing values in Chapter 9, Mining for Data Anomalies, and some data integration via entity matching in Chapter 3, Entity Matching.
In addition to dividing up data mining problems by technique, people sometimes divide them up by the type of data being mined. For example, you may hear people refer to text mining or social network analysis. These terms describe the type of data being mined rather than the specific technique being used to mine it: text mining refers to any kind of data mining technique applied to text documents, and network mining refers to looking for patterns in network graph data. In this book, we will be doing some network mining in Chapter 4, Network Analysis, different types of text document summarization in Chapter 6, Named Entity Recognition in Text, Chapter 7, Automatic Text Summarization, and Chapter 8, Topic Modeling in Text, and some classification of text by its sentiment (the emotion in the text) in Chapter 5, Sentiment Analysis in Text.
If you are anything like me, right about now you might be thinking, "Enough of this background stuff, I want to write some code." I am glad you are getting excited to work on some actual projects. We are almost ready to start coding, but first we need to get a good working environment set up.