What this book covers
Chapter 1, Expanding Your Data Mining Toolbox, gives an introduction to the field of data mining. In this chapter we pay special attention to how data mining relates to similar topics, such as machine learning and data science. We also review many different data mining methodologies, and talk about their various strengths and weaknesses. This foundational knowledge is important as we transition into the remaining chapters of the book, which are much more technique-oriented and focus on the application of specific data mining tools.
Chapter 2, Association Rule Mining, introduces our first data mining tool: mining for co-occurring sets of items, sometimes called frequent itemsets. We extend our understanding of frequent itemset mining to include mining for association rules, and we learn how to evaluate whether the rules we have found are helpful or not. To put our knowledge into practice, at the end of the chapter we implement a small project wherein we find association rules in the keywords chosen to describe a large set of software projects.
Chapter 3, Entity Matching, focuses on finding matching pairs of data elements that may look slightly different but are actually the same. We learn how to use the attributes of the data to determine whether two items refer to the same thing. At the end of the chapter, we implement an entity matching project where we learn to find the software projects that have moved from one hosting service to another, even after changing their names and other important attributes.
Chapter 4, Network Analysis, is a tour through the basics of network or graph analysis, as used to describe the relationships between various interconnected groups of entities. We investigate the various types of networks and learn how to describe and measure them. Then we put our learning into practice to describe how a network of software developers has changed over time.
Chapter 5, Sentiment Analysis in Text, is the first of four text mining chapters in this book. This chapter serves as an introduction to the growing field of sentiment, or mood, analysis in text. After comparing various approaches to sentiment mining and learning how to evaluate the results, we practice using a machine learning classifier to determine the sentiment of a set of software developer chat logs and e-mail logs.
Chapter 6, Named Entity Recognition in Text, is about finding proper nouns and proper names in text. We spend some time learning why this task is useful, and why finding named entities can sometimes be more difficult than it sounds. At the end of the chapter we implement a named entity recognition system on several different types of real-world text data including e-mail, chat logs, and board meeting minutes. Along the way we apply different techniques for quantifying the success or failure of our results.
Chapter 7, Automatic Text Summarization, presents several strategies for automatically creating condensed summaries of text. This chapter emphasizes extractive summarization tools, which are designed to find the most important sentences in a text sample. We experiment with three different tools for this task, testing the summarization methods and learning how they differ. Following the introduction of each tool, we attempt to summarize a common set of text documents and compare the results.
Chapter 8, Topic Modeling in Text, shows how to use software tools to reveal what topics or concepts are present in a given text. Can we train a computer program to infer the themes that are present in large amounts of text? In a series of experiments, we learn how to use common topic modeling libraries to reveal the topics present in software developer e-mails, and how those topics change over time.
Chapter 9, Mining for Data Anomalies, is where we learn how to use data mining and statistical techniques to improve our own data mining process. While all of the other chapters in this book deal with finding different types of patterns in data, here we focus on finding data that is anomalous or that does not match a particular pattern. Whether it is because the data is empty, missing, or just plain weird, this chapter presents strategies for finding or fixing this type of data so that the rest of your data can be mined more effectively.