Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Conferences
Free Learning
Arrow right icon
Mastering Data Mining with Python ??? Find patterns hidden in your data
Mastering Data Mining with Python ??? Find patterns hidden in your data

Mastering Data Mining with Python ??? Find patterns hidden in your data: Find patterns hidden in your data

Arrow left icon
Profile Icon Megan Squire
Arrow right icon
$19.99 per month
Full star icon Full star icon Half star icon Empty star icon Empty star icon 2.7 (3 Ratings)
Paperback Aug 2016 268 pages 1st Edition
eBook
$9.99 $43.99
Paperback
$54.99
Subscription
Free Trial
Renews at $19.99p/m
Arrow left icon
Profile Icon Megan Squire
Arrow right icon
$19.99 per month
Full star icon Full star icon Half star icon Empty star icon Empty star icon 2.7 (3 Ratings)
Paperback Aug 2016 268 pages 1st Edition
eBook
$9.99 $43.99
Paperback
$54.99
Subscription
Free Trial
Renews at $19.99p/m
eBook
$9.99 $43.99
Paperback
$54.99
Subscription
Free Trial
Renews at $19.99p/m

What do you get with a Packt Subscription?

Free for first 7 days. $19.99 p/m after that. Cancel any time!
Product feature icon Unlimited ad-free access to the largest independent learning library in tech. Access this title and thousands more!
Product feature icon 50+ new titles added per month, including many first-to-market concepts and exclusive early access to books as they are being written.
Product feature icon Innovative learning tools, including AI book assistants, code context explainers, and text-to-speech.
Product feature icon Thousands of reference materials covering every tech concept you need to stay up to date.
Subscribe now
View plans & pricing
Table of content icon View table of contents Preview book icon Preview Book

Mastering Data Mining with Python ??? Find patterns hidden in your data

Chapter 1. Expanding Your Data Mining Toolbox

When faced with sensory information, human beings naturally want to find patterns to explain, differentiate, categorize, and predict. This process of looking for patterns all around us is a fundamental human activity, and the human brain is quite good at it. With this skill, our ancient ancestors became better at hunting, gathering, cooking, and organizing. It is no wonder that pattern recognition and pattern prediction were some of the first tasks humans set out to computerize, and this desire continues in earnest today. Depending on the goals of a given project, finding patterns in data using computers nowadays involves database systems, artificial intelligence, statistics, information retrieval, computer vision, and any number of other various subfields of computer science, information systems, mathematics, or business, just to name a few. No matter what we call this activity – knowledge discovery in databases, data mining, data science – its primary mission is always to find interesting patterns.

Despite this humble-sounding mission, data mining has existed for long enough and has built up enough variation in how it is implemented that it has now become a large and complicated field to master. We can think of a cooking school, where every beginner chef is first taught how to boil water and how to use a knife before moving to more advanced skills, such as making puff pastry or deboning a raw chicken. In data mining, we also have common techniques that even the newest data miners will learn: How to build a classifier and how to find clusters in data. The title of this book, however, is Mastering Data Mining with Python, and so, as a mastering-level book, the aim is to teach you some of the techniques you may not have seen in earlier data mining projects.

In this first chapter, we will cover the following topics:

  • What is data mining? We will situate data mining in the growing field of other similar concepts, and we will learn a bit about the history of how this discipline has grown and changed.
  • How do we do data mining? Here, we compare several processes or methodologies commonly used in data mining projects.
  • What are the techniques used in data mining? In this section, we will summarize each of the data analysis techniques that are typically included in a definition of data mining, and we will highlight the more exotic or underappreciated techniques that we will be covering in this mastering-level book.
  • How do we set up a data mining work environment? Finally, we will walk through setting up a Python-based development environment that we will use to complete the projects in the rest of this book.

What is data mining?

We explained earlier that the goal of data mining is to find patterns in data, but this oversimplification falls apart quickly under scrutiny. After all, could we not also say that finding patterns is the goal of classical statistics, or business analytics, or machine learning, or even the newer practices of data science or big data? What is the difference between data mining and all of these other fields, anyway? And while we are at it, why is it called data mining if what we are really doing is mining for patterns? Don't we already have the data?

It was apparent from the beginning that the term data mining is indeed fraught with many problems. The term was originally used as something of a pejorative by statisticians who cautioned against going on fishing expeditions, where a data analyst is casting about for patterns in data without forming proper hypotheses first. Nonetheless, the term rose to prominence in the 1990s, as the popular press caught wind of exciting research that was marrying the mature field of database management systems with the best algorithms from machine learning and artificial intelligence. The inclusion of the word mining inspires visions of a modern-day Gold Rush, in which the persistent and intrepid miner will discover (and perhaps profit from) previously hidden gems. The idea that data itself could be a rare and precious commodity was immediately appealing to the business and technology press, despite efforts by early pioneers to promote the holistic term knowledge discovery in databases (KDD).

The term data mining persisted, however, and ultimately some definitions of the field attempted to re-imagine the term data mining to refer to just one of the steps in a longer, more comprehensive knowledge discovery process. Today, data mining and KDD are considered very similar, closely related terms.

What about other related terms, such as machine learning, predictive analytics, big data, and data science? Are these the same as data mining or KDD? Let's draw some comparisons between each of these terms:

  • Machine learning is a very specific subfield of computer science that focuses on developing algorithms that can learn from data in order to make predictions. Many data mining solutions will use techniques from machine learning, but not all data mining is trying to make predictions or learn from data. Sometimes we just want to find a pattern in the data. In fact, in this book we will be exploring a few data mining solutions that do use machine learning techniques, and many more that do not.
  • Predictive analytics, sometimes just called analytics, is a general term for computational solutions that attempt to make predictions from data in a variety of domains. We can think of the terms business analytics, media analytics, and so on. Some, but not all, predictive analytics solutions will use machine learning techniques to perform their predictions. But again, in data mining, we are not always interested in prediction.
  • Big data is a term that refers to the problems and solutions of dealing with very large sets of data, irrespective of whether we are searching for patterns in that data, or simply storing it. In terms of comparing big data to data mining, many data mining problems are made more interesting when the data sets are large, so solutions discovered for dealing with big data might come in handy to solve a data mining problem. Nonetheless, these two terms are merely complementary, not interchangeable.
  • Data science is the closest of these terms to being interchangeable with the KDD process, of which data mining is one step. Because data science is an extremely popular buzzword at this time, its meaning will continue to evolve and change as the field continues to mature.

To show the relative search interest for these various terms over time, we can look at Google Trends. This tool shows how frequently people are searching for various keywords over time. In the following figure, the newcomer term data science is currently the hot buzzword, with data mining pulling into second place, followed by machine learning, data science, and predictive analytics. (I tried to include the search term knowledge discovery in databases as well, but the results were so close to zero that the line was invisible.) The y-axis shows the popularity of that particular search term as a 0-100 indexed value. In addition, I combined the weekly index values that Google Trends gives into a monthly average for each month in the period 2004-2015.

What is data mining?

Google Trends search results for five common data-related terms

How do we do data mining?

Since data mining is traditionally seen as one of the steps in the overall KDD process, and increasingly in the data science process, in this section we get acquainted with the steps involved. There are several popular methodologies for doing the work of data mining. Here we highlight four methodologies: Two that are taken from textbook introductions to the theory of data mining, one taken from a very practical process used in industry, and one designed for teaching beginners.

The Fayyad et al. KDD process

One early version of the knowledge discovery and data mining process was defined by Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth in a 1996 article (The KDD Process for Extracting Useful Knowledge from Volumes of Data). This article was important at the time for refining the rapidly changing KDD methodology into a concrete set of steps. The following steps lead from raw data at the beginning to knowledge at the end:

  • Data selection: The input to this step is raw data, and the output of this selection step is a smaller subset of the data, called the target data.
  • Data pre-processing: The target data is cleaned, oddities and outliers are removed, and missing data is accounted for. The output of this step is pre-processed data, or cleaned data.
  • Data transformation: The cleaned data is organized into a format appropriate for the mining step, and the number of features or variables is reduced if need be. The output of this step is transformed data.
  • Data mining: The transformed data is mined for patterns using one or more data mining algorithms appropriate to the problem at hand. The output of this step is the discovered patterns.
  • Data interpretation/evaluation: The discovered patterns are evaluated for their ability to solve the problem at hand. The output of this step is knowledge.

Since this process leads from raw data to knowledge, it is appropriate that these authors were the ones who were really committed to the term knowledge discovery in databases rather than simply data mining.

The Han et al. KDD process

Another version of the knowledge discovery process is described in the popular data mining textbook Data Mining: Concepts and Techniques by Jiawei Han, Micheline Kamber, and Jian Pei as the following steps, which also lead from raw data to knowledge at the end:

  • Data cleaning: The input to this step is raw data, and the output is cleaned data.
  • Data integration: In this step, the cleaned data is integrated (if it came from multiple sources). The output of this step is integrated data.
  • Data selection: The data set is reduced to only the data needed for the problem at hand. The output of this step is a smaller data set.
  • Data transformation: The smaller data set is consolidated into a form that will work with the upcoming data mining step. This is called transformed data.
  • Data mining: The transformed data is processed by intelligent algorithms that are designed to discover patterns in that data. The output of this step is one or more patterns.
  • Pattern evaluation: The discovered patterns are evaluated for their interestingness and their ability to solve the problem at hand. The output of this step is an interestingness measure applied to each pattern, representing knowledge.
  • Knowledge representation: In this step, the knowledge is communicated to users through various means, including visualization.

In both the Fayyad and Han methodologies, it is expected that the process will iterate multiple times over the steps, if such iteration is needed. For example, if, during the transformation step the person doing the analysis realized that another data cleaning or pre-processing step, is needed, both of these methodologies specify that the analyst should double back and complete a second iteration of the incomplete earlier step.

The CRISP-DM process

A third popular version of the KDD process that is used in many business and applied domains is called CRISP-DM, which stands for CRoss-Industry Standard Process for Data Mining. It consists of the following steps:

  1. Business understanding: In this step, the analyst spends time understanding the reasons for the data mining project from a business perspective.
  2. Data understanding: In this step, the analyst becomes familiar with the data and its potential promises and shortcomings, and begins to generate hypotheses. The analyst is tasked to reassess the business understanding (step 1) if needed.
  3. Data preparation: This step includes all the data selection, integration, transformation, and pre-processing steps that are enumerated as separate steps in the other models. The CRISP-DM model has no expectation of what order these tasks will be done in.
  4. Modeling: This is the step in which the algorithms are applied to the data to discover the patterns. This step is closest to the actual data mining steps in the other KDD models. The analyst is tasked to reassess the data preparation step (step 3) if the modeling and mining step requires it.
  5. Evaluation: The model and discovered patterns are evaluated for their value in answering the business problem at hand. The analyst is tasked with revisiting the business understanding (step 1) if necessary.
  6. Deployment: The discovered knowledge and models are presented and put into production to solve the original problem at hand.

One of the strengths of this methodology is that iteration is built in. Between specific steps, it is expected that the analyst will check that the current step is still in agreement with certain previous steps. Another strength of this method is that the analyst is explicitly reminded to keep the business problem front and center in the project, even down in the evaluation steps.

The Six Steps process

When I teach the introductory data science course at my university, I use a hybrid methodology of my own creation. This methodology is called the Six Steps, and I designed it to be especially friendly for teaching. My Six Steps methodology removes some of the ambiguity that inexperienced students may have with open-ended tasks from CRISP-DM, such as Business Understanding, or a corporate-focused task such as Deployment. In addition, the Six Steps method keeps the focus on developing students' critical thinking skills by requiring them to answer Why are we doing this? and What does it mean? at the beginning and end of the process. My Six Steps method looks like this:

  1. Problem statement: In this step, the students identify what the problem is that they are trying to solve. Ideally, they motivate the case for why they are doing all this work.
  2. Data collection and storage: In this step, students locate data and plan their storage for the data needed for this problem. They also provide information about where the data that is helping them answer their motivating question came from, as well as what format it is in and what all the fields mean.
  3. Data cleaning: In this phase, students carefully select only the data they really need, and pre-process the data into the format required for the mining step.
  4. Data mining: In this step, students formalize their chosen data mining methodology. They describe what algorithms they used and why. The output of this step is a model and discovered patterns.
  5. Representation and visualization: In this step, the students show the results of their work visually. The outputs of this step can be tables, drawings, graphs, charts, network diagrams, maps, and so on.
  6. Problem resolution: This is an important step for beginner data miners. This step explicitly encourages the student to evaluate whether the patterns they showed in step 5 are really an answer to the question or problem they posed in step 1. Students are asked to state the limitations of their model or results, and to identify parts of the motivating question that they could not answer with this method.

Which data mining methodology is the best?

A 2014 survey of the subscribers of Gregory Piatetsky-Shapiro's very popular data mining email newsletter KDNuggets included the question What main methodology are you using for your analytics, data mining, or data science projects?

  • 43% of the poll respondents indicated that they were using the CRISP-DM methodology
  • 27% of the respondents were using their own methodology or a hybrid
  • 7% were using the traditional KDD methodology
  • The remaining respondents chose another KDD method

These results are generally similar to the 2007 results from the same newsletter asking the same question.

My best advice is that it does not matter too much which methodology you use for a data mining project, as long as you just pick one. If you do not have any methodology at all, then you run the risk of forgetting important steps. Choose one of the methods that seems like it might work for your project and your needs, and then just do your best to follow the steps.

For this book, we will vary our data mining methodology depending on which technique we are looking at in a given chapter. For example, even though the focus of the book as a whole is on the data mining step, we still need to motivate each chapter-length project with a healthy dose of Business Understanding (CRISP-DM) or Problem Statement (Six Steps) so that we understand why we are doing the tasks and what the results mean. In addition, in order to learn a particular data mining method, we may also have to do some pre-processing, whether we call that data cleaning, integration, or transformation. But in general, we will try to keep these tasks to a minimum so that our focus on data mining remains clear. One prominent exception will be in the final chapter, where we will show specific methods for dealing with missing data and anomalies. Finally, even though data visualization is typically very important for representing the results of your data mining process to your audience, we will also keep these tasks to a minimum so that we can remain focused on the primary job at hand: Data mining.

What are the techniques used in data mining?

Now that we have a sense of where data mining fits in our overall KDD or data science process, we can start to discuss the details of how to get it done.

Since the early days of attempting to define data mining, several broad classes of relevant problems consistently show up again and again. Fayyad et al. name six classes of problems in another important 1996 paper (From Data Mining to Knowledge Discovery in Databases), which we can summarize as follows:

  • Classification problems: Here, we have data that needs to be divided into predefined classes, based on some features of the data. We need an algorithm that can use previously classified data to learn how to put unknown data into the correct class.
  • Clustering problems: With these problems, we have data that needs to be divided into classes based on its features, but we do not know what the classes are in advance. We need an algorithm that can measure the similarity between data points and automatically divide the data up based on these similarities.
  • Regression problems: We have data that needs to be mapped onto a predictor variable, so we need to learn a function that can do this mapping.
  • Summarization problems: Suppose we have data that needs to be shortened or summarized in some way. This could be as simple as calculating basic statistics from data, or as complex as learning how to summarize text or finding a topic model for text.
  • Dependency modeling problems: For these problems, we have data that might be connected in some way, and we need to develop an algorithm that can calculate the probability of connection or describe the structure of connected data.
  • Change and deviation detection problems: In another case, we have data that has changed significantly or where some subset of the data deviates from normative values. To solve these problems, we need an algorithm that can detect these issues automatically.

In a different paper written that same year, those same authors also included a few additional categories:

  • Link analysis problems: Here we have data points with relationships between them, and we need to discover and describe these relationships in terms of how much support they have in the data set and how confident we are in the relationship.
  • Sequence analysis problems: Imagine that we have data points that follow a sequence, such as a time series or a genome, and we must discover trends or deviations in the sequence, or discover what is causing the sequence or how it will evolve.

Han, Kamber, and Pei, in the textbook we discussed earlier, describe four classes of problems that data mining can help solve, and further, they divide them into descriptive and predictive categories. Descriptive data mining means we are finding patterns that help us understand the data we have. Predictive data mining means we are finding patterns that can help us make predictions about data we do not yet have.

In the descriptive category, they list the following data mining problems:

  • Data characterization and data discrimination problems, including data summarization or concept characterization or description.
  • Frequency mining, including finding frequent patterns, association rules, and correlations in data.

In the predictive category, they list the following:

  • Classification, regression
  • Clustering
  • Outlier detection and anomaly detection

It is easy to see that there are many similarities between the Fayyad et al. list and the Han et al. list, but that they have just grouped the items differently. Indeed, the items that show up on both lists are exactly the types of data mining problems you are probably already familiar with by now if you have completed earlier data mining projects. Classification, regression, and clustering are very popular, foundational data mining techniques, so they are covered in nearly every data mining book designed for practitioners.

What techniques are we going to use in this book?

Since this book is about mastering data mining, we are going to tackle a few of the techniques that are not covered quite as often in the standard books. Specifically, we will address link analysis via association rules in Chapter 2, Association Rule Mining, and anomaly detection in Chapter 9, Mining for Data Anomalies. We are also going to apply a few data mining techniques to actually assist in data cleaning and pre-processing efforts, namely, in taking care of missing values in Chapter 9, Mining for Data Anomalies, and some data integration via entity matching in Chapter 3, Entity Matching.

In addition to defining data mining in terms of the techniques, sometimes people divide up the various data mining problems based on what type of data they are mining. For example, you may hear people refer to text mining or social network analysis. These refer to the type of data being mined rather than the specific technique being used to mine it. For example, text mining refers to any kind of data mining technique as applied to text documents, and network mining refers to looking for patterns in network graph data. In this book, we will be doing some network mining in Chapter 4, Network Analysis, different types of text document summarization in Chapter 6, Named Entity Recognition in Text, Chapter 7, Automatic Text Summarization, and Chapter 8, Topic Modeling in Text, and some classification of text by its sentiment (the emotion in the text) in Chapter 5, Sentiment Analysis in Text.

If you are anything like me, right about now you might be thinking enough of this background stuff, I want to write some code. I am glad you are getting excited to work on some actual projects. We are almost ready to start coding, but first we need to get a good working environment set up.

How do we set up our data mining work environment?

The previous sections were included to give us a better sense of what we are going to build and why. Now it is time to begin setting up a development environment that will support us as we work through all of these projects. Since this book is designed to teach us how to build the software to mine data for patterns, we will be writing our programs from scratch using a general purpose programming language. The Python programming language has a very strong – and still growing – community dedicated to data mining. This community has contributed some very handy libraries that we can use for efficient processing, and numerous data types that we can rely on to make our work go faster.

At the time of writing, there are two versions of Python available for download: Python 2 (the latest version is 2.7), now considered legacy, and Python 3 (the latest version is 3.5). We will be using Python 3 in this book. Because we have so many related packages and libraries we need to use to make our data mining experience as painless as possible, and because some of them can be a bit difficult to install, I recommend using a Python distribution designed for scientific and mathematical computing. Specifically, I recommend the Anaconda distribution of Python 3.5 made by Continuum Analytics. Their basic distribution of Python is free, and all the pieces are guaranteed to work together without us having to do the frustrating work of ensuring compatibility.

To download the Anaconda Python distribution, point your browser to the Continuum Analytics web site at https://www.continuum.io and follow the prompts to download the free Anaconda version (currently numbered 3.5 or above) that will work with your operating system.

Upon launching the software, you will see a splash screen that looks like the following screenshot:

How do we set up our data mining work environment?

Continuum Anaconda Navigator

Depending on the version you are using and when you downloaded it, you may notice a few Update buttons in addition to the Launch button for each application within Anaconda. You can click each to update the package if your software version is indicating that you need to do this.

To get started writing Python code, click Spyder to launch the code editor and the integrated development environment. If you would rather use your own text editor, such as TextWrangler on MacOS or Sublime editor on Windows, that is perfectly fine. You can run the Python code from the command line.

Spend a few moments getting Spyder configured to your liking, setting its colors and general layout, or just keep the defaults. For my own workspace, I moved around a few of the console windows, set up a working directory, and made a few customization tweaks that made me feel at home in this new editor. You can do the same to make your development environment comfortable for you.

Now we are ready to test the editor and get our libraries installed. To test the Spyder editor and see how it works, click File and select New File. Then type a simple hello world statement, as follows:

print ('hello world')

Run the program, either by clicking the green arrow, by pressing F5, or by clicking Run from inside the Run menu. Either way, the program will execute and you will see your output in the console output window.

At this point, we know Spyder and Python are working, and we are ready to test and install some libraries.

First, open a new file and save it as packageTest.py. In this test program, we will determine whether Scikit-learn was installed properly with Anaconda. Scikit-learn is a very important package that includes many machine learning functions, as well as canned data sets to test those functions. Many, many books and tutorials use Scikit-learn examples for teaching data mining, so this is a good package to have in our toolkit. We will use this package in several chapters in this book.

Running the following small program from the Scikit-learn tutorial on its website (found at http://scikit-learn.org/stable/tutorial/basic/tutorial.html#loading-an-example-dataset) will tell us if our environment is set up properly:

from sklearn import datasets
iris = datasets.load_iris()
digits = datasets.load_digits()
print (digits.data)

If this program runs properly, it will produce output in the console window showing a series of numbers in a list-like data structure, like this:

[[  0.   0.   5. ...,   0.   0.   0.]
[  0.   0.   0. ...,  10.   0.   0.]
[  0.   0.   0. ...,  16.   9.   0.]
...,
[  0.   0.   1. ...,   6.   0.   0.]
[  0.   0.   2. ...,  12.   0.   0.]
[  0.   0.  10. ...,  12.   1.   0.]

For our purposes, this output is sufficient to show that Scikit-learn is installed properly. Next, add a line that will help us learn about the data type of this digits.data structure, as follows:

print (type(digits.data))

The output is as follows:

<class 'numpy.ndarray'>

From this output, we can confirm that Scikit-learn relies on another important package called Numpy to handle some of its data structures. Anaconda has also installed Numpy properly for us, which is exactly what we wanted to confirm.

Next, we will test whether our network analysis libraries are included. We will use Networkx library later in the network mining we will do in Chapter 4, Network Analysis to build a graphical social network. The following code sample creates a tiny network with one node, and prints its type to the screen:

import networkx as nx
G=nx.Graph()
G.add_node(1)
print (type(G))

The output is as follows:

<class 'networkx.classes.graph.Graph'>

This is exactly the output we wanted to see, as it tells us that Networkx is installed and working properly.

Next we will test some of the text mining software we need in later chapters. Conveniently, the Natural Language Toolkit (NLTK), is also installed with Anaconda. However, it has its own graphical downloader tool for the various corpora and word lists that it uses. Anaconda does not come with these installed, so we will have to do it. To get word lists and dictionaries, we will create a new Python file, import the NLTK module, then prompt NLTK to start the graphical Downloader:

import nltk
nltk.download()

A new Downloader window will open in Anaconda that looks like this:

How do we set up our data mining work environment?

NLTK Downloader dialogue window

Inside this Downloader window, select all from the list of Identifiers, change the Download Directory location (optional), and press the Download button. The red progress bar in the bottom-left of the Downloader window will animate as each collection is installed. This may take several minutes if your connection is slow. This mid-download step is shown in the following screenshot:

How do we set up our data mining work environment?

NLTK Downloader in progress

Once the Downloader has finished installing the NLTK corpora, we can test whether they work properly. Here is a short Python program where we ask NLTK to use the Brown University corpora and print the first 10 words:

from nltk.corpus import brown
print (brown.words()[0:10])

The output of this program is as follows, a list of the first 10 words in the NLTK Brown text corpus, which happens to be from a news story:

['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of']

With this output, we can be confident that NLTK is installed and all the necessary corpora have also been installed.

Next, we will install a text mining module called Gensim that we will need later for doing topic modeling. Gensim does not come pre-installed as part of Anaconda by default, but instead it is one of several hundred packages that are easily added by using Anaconda's built-in conda installer. From the Anaconda Tools menu, choose Open a Terminal and type conda install gensim. If you are prompted to update numpy and scipy, type y for yes, and the installation will proceed.

When the installation is finished, start up a new Python program and type this shortened version of the Gensim test program from its website:

from gensim import corpora, models, similarities
test = [[(0, 1.0), (1, 1.0), (2, 1.0)]]
print (test)

This program does not do much more than test if the module is imported properly and then print a list to the screen, but that is enough for now.

Finally, since this is a book about data mining, or knowledge discovery in databases, having some kind of database software to work with is definitely a good idea. Because it is free software, easy to install, and available for many operating systems, I chose MySQL to implement the projects in this book.

To get MySQL, head to the download page for the free Community Edition, available at http://dev.mysql.com/downloads/mysql/ for whatever OS you are using.

To get Anaconda Python to talk to MySQL, we will need to install some MySQL Python drivers. I like the pymysql drivers since they are fairly robust and lack some of the bugs that come with the standard drivers. From within Anaconda, start up a terminal window and run the following command:

conda install pymysql

It looks like all of our modules are installed and ready to be used as we need them throughout the book. If we decide we need additional modules, or if one of them goes out of date, we now know how to install it or upgrade it as needed.

Summary

In this chapter, we learned what it would take to expand our data mining toolbox to the master level. First we took a long view of the field as a whole, starting with the history of data mining as a piece of the knowledge discovery in databases (KDD) process. We also compared the field of data mining to other similar fields such as data science, machine learning, and big data.

Next, we outlined the common tools and techniques that most experts consider to be most important to the KDD process, paying special attention to the techniques that are used most frequently in the mining and analysis steps. To really master data mining, it is important that we work on problems that are different than simple textbook examples. For this reason, we will be working on more exotic data mining techniques such as generating summaries and finding outliers, and focusing on more unusual data types, such as text and networks.

Finally, in this chapter we put together a robust data mining system for ourselves. Our workspace centers around the powerful, general-purpose programming language, Python, and its many useful data mining packages, such as NLTK, Gensim, Numpy, Networkx, and Scikit-learn, and it is complemented by an easy-to-use and free MySQL database.

Now, all this discussion of software packages has got me thinking: Have you ever wondered what packages are used most frequently together? Is the combination of NLTK and Networkx a common thing to see, or is this a rather unusual pairing of libraries? In the next chapter, we will work on solving exactly that type of problem. In Chapter 2, Association Rule Mining, we will learn how to generate a list of frequently-found pairs, triples, quadruples, and more, and then we will attempt to make predictions based on the patterns we found.

Left arrow icon Right arrow icon

Key benefits

  • Dive deeper into data mining with Python – don’t be complacent, sharpen your skills!
  • From the most common elements of data mining to cutting-edge techniques, we’ve got you covered for any data-related challenge
  • Become a more fluent and confident Python data-analyst, in full control of its extensive range of libraries

Description

Data mining is an integral part of the data science pipeline. It is the foundation of any successful data-driven strategy – without it, you'll never be able to uncover truly transformative insights. Since data is vital to just about every modern organization, it is worth taking the next step to unlock even greater value and more meaningful understanding. If you already know the fundamentals of data mining with Python, you are now ready to experiment with more interesting, advanced data analytics techniques using Python's easy-to-use interface and extensive range of libraries. In this book, you'll go deeper into many often overlooked areas of data mining, including association rule mining, entity matching, network mining, sentiment analysis, named entity recognition, text summarization, topic modeling, and anomaly detection. For each data mining technique, we'll review the state-of-the-art and current best practices before comparing a wide variety of strategies for solving each problem. We will then implement example solutions using real-world data from the domain of software engineering, and we will spend time learning how to understand and interpret the results we get. By the end of this book, you will have solid experience implementing some of the most interesting and relevant data mining techniques available today, and you will have achieved a greater fluency in the important field of Python data analytics.

Who is this book for?

This book is for data scientists who are already familiar with some basic data mining techniques such as SQL and machine learning, and who are comfortable with Python. If you are ready to learn some more advanced techniques in data mining in order to become a data mining expert, this is the book for you!

What you will learn

  • Explore techniques for finding frequent itemsets and association rules in large data sets
  • Learn identification methods for entity matches across many different types of data
  • Identify the basics of network mining and how to apply it to real-world data sets
  • Discover methods for detecting the sentiment of text and for locating named entities in text
  • Observe multiple techniques for automatically extracting summaries and generating topic models for text
  • See how to use data mining to fix data anomalies and how to use machine learning to identify outliers in a data set

Product Details

Country selected
Publication date, Length, Edition, Language, ISBN-13
Publication date : Aug 29, 2016
Length: 268 pages
Edition : 1st
Language : English
ISBN-13 : 9781785889950
Category :
Languages :
Concepts :
Tools :

What do you get with a Packt Subscription?

Free for first 7 days. $19.99 p/m after that. Cancel any time!
Product feature icon Unlimited ad-free access to the largest independent learning library in tech. Access this title and thousands more!
Product feature icon 50+ new titles added per month, including many first-to-market concepts and exclusive early access to books as they are being written.
Product feature icon Innovative learning tools, including AI book assistants, code context explainers, and text-to-speech.
Product feature icon Thousands of reference materials covering every tech concept you need to stay up to date.
Subscribe now
View plans & pricing

Product Details

Publication date : Aug 29, 2016
Length: 268 pages
Edition : 1st
Language : English
ISBN-13 : 9781785889950
Category :
Languages :
Concepts :
Tools :

Packt Subscriptions

See our plans and pricing
Modal Close icon
$19.99 billed monthly
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Simple pricing, no contract
$199.99 billed annually
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Choose a DRM-free eBook or Video every month to keep
Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just $5 each
Feature tick icon Exclusive print discounts
$279.99 billed in 18 months
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Choose a DRM-free eBook or Video every month to keep
Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just $5 each
Feature tick icon Exclusive print discounts

Frequently bought together


Stars icon
Total $ 169.97
Advanced Machine Learning with Python
$48.99
Mastering Data Mining with Python ??? Find patterns hidden in your data
$54.99
Python Machine Learning Cookbook
$65.99
Total $ 169.97 Stars icon
Banner background image

Table of Contents

10 Chapters
1. Expanding Your Data Mining Toolbox Chevron down icon Chevron up icon
2. Association Rule Mining Chevron down icon Chevron up icon
3. Entity Matching Chevron down icon Chevron up icon
4. Network Analysis Chevron down icon Chevron up icon
5. Sentiment Analysis in Text Chevron down icon Chevron up icon
6. Named Entity Recognition in Text Chevron down icon Chevron up icon
7. Automatic Text Summarization Chevron down icon Chevron up icon
8. Topic Modeling in Text Chevron down icon Chevron up icon
9. Mining for Data Anomalies Chevron down icon Chevron up icon
Index Chevron down icon Chevron up icon

Customer reviews

Rating distribution
Full star icon Full star icon Half star icon Empty star icon Empty star icon 2.7
(3 Ratings)
5 star 33.3%
4 star 0%
3 star 0%
2 star 33.3%
1 star 33.3%
Sanjeev Jaiswal Sep 08, 2016
Full star icon Full star icon Full star icon Full star icon Full star icon 5
The title justifies the contents inside this book. Author has done really a commendable job. She has explained the concepts that a reader would need to explore data mining technology.I am one of the reviewer of this book and this way I got chance to read this book before it got published.User should be from Data mining domain and very fluent in Python to understand what she wants to explain. She has taken much time to write this book to explain the complex data structure and other concepts in easy manner.I would personally recommend this book for those who wants to get their hands dirty in data Mining with Python.
Amazon Verified review Amazon
Dimitri Shvorob Oct 11, 2016
Full star icon Full star icon Empty star icon Empty star icon Empty star icon 2
The author's first ouvre, "Clean Data", was a dishonestly marketed atrocity, so when I recently came across a PDF of Prof. Squire's second book, and then saw it supported by a five-star review of the kind that many Packt books get *initially*, I decided to get involved. Not surprisingly, my opinion is not as complimentary. "Mastering Data Mining" is much better than "Clean Data", but three nasty things about "Clean Data" carry over in full. One is deceptive marketing: here, a book that's 80% about text analysis is presented as a "comprehensive guide to advance [sic] data analytics techniques". Another is the book being a recycling/repackaging of the author's past projects, as opposed to having material written for the book. Going back to the first point, if you were writing a book about data mining, you sure would go beyond text analysis, but if you are repackaging what you have, and what you have is text analysis, then that has to do. Finally, there's the bizarre coding. In "Clean Data", this was about a mess of typically un-commented code in different languages - some, like PHP, looking decidedly outmoded. (What kind of a teacher would be teaching her students PHP in 2016?) Here, it's all Python - but, first, it's a bad-beginner Python, and second and most spectacularly, Python used simply for automation. Prof. Squire insists on storing *and manipulating* data in a database - so instead of loading a dataset into Python and manipulating it using Python's vast arsenal, you witness the moronic sight of Python being used to fire SQL queries. Is teaching*this* really helping the beginners? I think not - and point readers to real data-science-with-Python books, by Raschka, Layton, and Coelho and Richert.
Amazon Verified review Amazon
ConBor Oct 27, 2019
Full star icon Empty star icon Empty star icon Empty star icon Empty star icon 1
Rookie attempt at authoring a resource book. Should have just written a paper and called it a day.
Amazon Verified review Amazon
Get free access to Packt library with over 7500+ books and video courses for 7 days!
Start Free Trial

FAQs

What is included in a Packt subscription? Chevron down icon Chevron up icon

A subscription provides you with full access to view all Packt and licnesed content online, this includes exclusive access to Early Access titles. Depending on the tier chosen you can also earn credits and discounts to use for owning content

How can I cancel my subscription? Chevron down icon Chevron up icon

To cancel your subscription with us simply go to the account page - found in the top right of the page or at https://subscription.packtpub.com/my-account/subscription - From here you will see the ‘cancel subscription’ button in the grey box with your subscription information in.

What are credits? Chevron down icon Chevron up icon

Credits can be earned from reading 40 section of any title within the payment cycle - a month starting from the day of subscription payment. You also earn a Credit every month if you subscribe to our annual or 18 month plans. Credits can be used to buy books DRM free, the same way that you would pay for a book. Your credits can be found in the subscription homepage - subscription.packtpub.com - clicking on ‘the my’ library dropdown and selecting ‘credits’.

What happens if an Early Access Course is cancelled? Chevron down icon Chevron up icon

Projects are rarely cancelled, but sometimes it's unavoidable. If an Early Access course is cancelled or excessively delayed, you can exchange your purchase for another course. For further details, please contact us here.

Where can I send feedback about an Early Access title? Chevron down icon Chevron up icon

If you have any feedback about the product you're reading, or Early Access in general, then please fill out a contact form here and we'll make sure the feedback gets to the right team. 

Can I download the code files for Early Access titles? Chevron down icon Chevron up icon

We try to ensure that all books in Early Access have code available to use, download, and fork on GitHub. This helps us be more agile in the development of the book, and helps keep the often changing code base of new versions and new technologies as up to date as possible. Unfortunately, however, there will be rare cases when it is not possible for us to have downloadable code samples available until publication.

When we publish the book, the code files will also be available to download from the Packt website.

How accurate is the publication date? Chevron down icon Chevron up icon

The publication date is as accurate as we can be at any point in the project. Unfortunately, delays can happen. Often those delays are out of our control, such as changes to the technology code base or delays in the tech release. We do our best to give you an accurate estimate of the publication date at any given time, and as more chapters are delivered, the more accurate the delivery date will become.

How will I know when new chapters are ready? Chevron down icon Chevron up icon

We'll let you know every time there has been an update to a course that you've bought in Early Access. You'll get an email to let you know there has been a new chapter, or a change to a previous chapter. The new chapters are automatically added to your account, so you can also check back there any time you're ready and download or read them online.

I am a Packt subscriber, do I get Early Access? Chevron down icon Chevron up icon

Yes, all Early Access content is fully available through your subscription. You will need to have a paid for or active trial subscription in order to access all titles.

How is Early Access delivered? Chevron down icon Chevron up icon

Early Access is currently only available as a PDF or through our online reader. As we make changes or add new chapters, the files in your Packt account will be updated so you can download them again or view them online immediately.

How do I buy Early Access content? Chevron down icon Chevron up icon

Early Access is a way of us getting our content to you quicker, but the method of buying the Early Access course is still the same. Just find the course you want to buy, go through the check-out steps, and you’ll get a confirmation email from us with information and a link to the relevant Early Access courses.

What is Early Access? Chevron down icon Chevron up icon

Keeping up to date with the latest technology is difficult; new versions, new frameworks, new techniques. This feature gives you a head-start to our content, as it's being created. With Early Access you'll receive each chapter as it's written, and get regular updates throughout the product's development, as well as the final course as soon as it's ready.We created Early Access as a means of giving you the information you need, as soon as it's available. As we go through the process of developing a course, 99% of it can be ready but we can't publish until that last 1% falls in to place. Early Access helps to unlock the potential of our content early, to help you start your learning when you need it most. You not only get access to every chapter as it's delivered, edited, and updated, but you'll also get the finalized, DRM-free product to download in any format you want when it's published. As a member of Packt, you'll also be eligible for our exclusive offers, including a free course every day, and discounts on new and popular titles.