Learning Data Mining with Python

Harness the power of Python to analyze data and create insightful predictive models

Product type: Paperback
Published in: Jul 2015
Publisher: Packt
ISBN-13: 9781784396053
Length: 344 pages
Edition: 1st Edition
Author: Robert Layton

Introducing data mining

Data mining provides a way for a computer to learn how to make decisions with data. Such a decision could be predicting tomorrow's weather, blocking a spam email from entering your inbox, detecting the language of a website, or finding a new romance on a dating site. There are many different applications of data mining, with new applications being discovered all the time.

Data mining draws on algorithms, statistics, engineering, optimization, and computer science. We also use concepts and knowledge from other fields such as linguistics, neuroscience, or town planning. Applying data mining effectively usually requires integrating this domain-specific knowledge with the algorithms.

Most data mining applications follow the same high-level process, although the details often change considerably. We start by creating a dataset that describes an aspect of the real world. A dataset comprises two aspects:

  • Samples that are objects in the real world. This can be a book, photograph, animal, person, or any other object.
  • Features that are descriptions of the samples in our dataset. Features could be the length, frequency of a given word, number of legs, date it was created, and so on.
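One common way to hold such a dataset in Python is a table with one row per sample and one column per feature. The following is a minimal sketch, assuming the samples are books; the feature names, the values, and the helper feature_value are all invented for illustration and do not come from a real dataset.

```python
# Hypothetical feature names and invented values, for illustration only.
feature_names = ["length_in_pages", "count_of_word_data", "year_created"]

# Each row is one sample (here, a book); each column is one feature.
samples = [
    [344, 120, 2015],
    [210, 3, 1998],
    [95, 0, 2007],
]

n_samples = len(samples)        # number of rows
n_features = len(samples[0])    # number of columns


def feature_value(sample_index, feature_name):
    """Return the value of the named feature for one sample."""
    column = feature_names.index(feature_name)
    return samples[sample_index][column]
```

Keeping every sample to the same set of features, in the same order, is what lets an algorithm compare samples with each other.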

The next step is tuning the data mining algorithm. Each data mining algorithm has parameters, either within the algorithm or supplied by the user. This tuning allows the algorithm to learn how to make decisions about the data.

As a simple example, we may wish the computer to be able to categorize people as "short" or "tall". We start by collecting our dataset, which includes the heights of different people and whether they are considered short or tall:

Person   Height   Short or tall?
1        155 cm   Short
2        165 cm   Short
3        175 cm   Tall
4        185 cm   Tall

The next step involves tuning our algorithm. As a simple algorithm, we use the following rule: if the height is more than x, the person is tall; otherwise, they are short. Our training algorithm then looks at the data and decides on a good value for x. For the preceding dataset, a reasonable value would be 170 cm, which lies midway between the tallest "short" person (165 cm) and the shortest "tall" person (175 cm). Anyone taller than 170 cm is considered tall by the algorithm; anyone else is considered short.
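The threshold rule described above can be sketched in a few lines of Python. This is only one possible training procedure, not necessarily the one used later in the book: it assumes the two classes can be cleanly separated by height and picks x as the midpoint between them. The function names train_threshold and predict are hypothetical.

```python
# Training data from the dataset above: (height in cm, label).
training_data = [
    (155, "short"),
    (165, "short"),
    (175, "tall"),
    (185, "tall"),
]


def train_threshold(data):
    """Choose x as the midpoint between the two classes.

    Assumption: both labels are present and every "short" height
    is below every "tall" height.
    """
    tallest_short = max(height for height, label in data if label == "short")
    shortest_tall = min(height for height, label in data if label == "tall")
    return (tallest_short + shortest_tall) / 2


def predict(height, x):
    """Apply the rule: more than x is tall, otherwise short."""
    return "tall" if height > x else "short"


x = train_threshold(training_data)
```

With the four samples above, train_threshold returns 170.0, matching the value suggested in the text. A call such as predict(172, x) then gives "tall", while a height of exactly 170 gives "short", because the rule requires the height to be more than x.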

In the preceding dataset, we had an obvious feature type. We wanted to know if people are short or tall, so we collected their heights. Choosing features in this way, known as feature engineering, is an important problem in data mining. In later chapters, we will discuss methods for choosing good features to collect in your dataset. Ultimately, this step often requires some expert domain knowledge or at least some trial and error.

Note

In this book, we will introduce data mining through Python. In some cases, we favor clarity of code and workflows over the most optimized approach. This sometimes means skipping details that could improve an algorithm's speed or effectiveness.
