Learning Data Mining with Python: Harness the power of Python to analyze data and create insightful predictive models


What do you get with eBook?

  • Instant access to your Digital eBook purchase
  • Download this book in EPUB and PDF formats
  • Access this title in our online reader with advanced features
  • DRM FREE - Read whenever, wherever and however you want

Chapter 2. Classifying with scikit-learn Estimators

The scikit-learn library is a collection of data mining algorithms, written in Python and using a common programming interface. This allows users to easily try different algorithms as well as utilize standard tools for doing effective testing and parameter searching. There are a large number of algorithms and utilities in scikit-learn.

In this chapter, we focus on setting up a good framework for running data mining procedures. We will use this framework in later chapters, which focus on applications and the techniques to use in them.

The key concepts introduced in this chapter are as follows:

  • Estimators: These perform classification, clustering, and regression
  • Transformers: These perform preprocessing and data alterations
  • Pipelines: These put together your workflow into a replicable format

scikit-learn estimators

Estimators are scikit-learn's abstraction that gives a large number of classification algorithms a standardized implementation. In this chapter, we use estimators for classification. Estimators have the following two main functions:

  • fit(): This performs the training of the algorithm and sets internal parameters. It takes two inputs, the training sample dataset and the corresponding classes for those samples.
  • predict(): This predicts the class of the testing samples that are given as input. This function returns an array with the prediction for each input testing sample.

Most scikit-learn estimators use NumPy arrays or a related format for input and output.

There are a large number of estimators in scikit-learn. These include support vector machines (SVM), random forests, and neural networks. Many of these algorithms will be used in later chapters. In this chapter, we will use a different estimator from scikit-learn: nearest neighbor.
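The fit() and predict() interface described above can be sketched with scikit-learn's KNeighborsClassifier. This is a minimal illustration rather than the book's own code: the tiny arrays, the variable names, and the choice of n_neighbors=1 are assumptions made for this example.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: four training samples with two features each,
# forming two well-separated groups (class 0 and class 1).
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])

# fit() takes the training samples and their corresponding classes.
estimator = KNeighborsClassifier(n_neighbors=1)
estimator.fit(X_train, y_train)

# predict() takes testing samples and returns one prediction per sample.
X_test = np.array([[0.05, 0.1], [0.95, 1.0]])
y_pred = estimator.predict(X_test)
print(y_pred)
```

Both the inputs and the returned predictions are NumPy arrays, and each test sample receives the class of its closest training sample.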

Note

For this chapter, you will...

Preprocessing using pipelines

When taking measurements of real-world objects, we can often get features in very different ranges. For instance, if we are measuring the qualities of an animal, we might have several features, as follows:

  • Number of legs: This ranges from 0 to 8 for most animals, while some have many more!
  • Weight: This ranges from only a few micrograms all the way to a blue whale, with a weight of 190,000 kilograms!
  • Number of hearts: This can be between zero and five, in the case of the earthworm.

For a mathematically based algorithm to compare each of these features, the differences in the scale, range, and units can be difficult to interpret. If we used the above features in many algorithms, the weight would probably be the most influential feature simply because its numbers are larger, not because of the actual effectiveness of the feature.

One of the methods to overcome this is to use a process called preprocessing to normalize the features so that they...
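As a rough sketch of the scaling problem, the animal features above can be placed in an array and rescaled with scikit-learn's MinMaxScaler, which maps each feature to the range 0 to 1. The rows below are illustrative values chosen for this example (only the blue whale's weight and the earthworm's heart count come from the text), and MinMaxScaler is one possible normalization among several.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Columns: number of legs, weight in kilograms, number of hearts.
# The values are illustrative assumptions, not measured data.
X = np.array([
    [0.0, 190000.0, 1.0],  # blue whale
    [8.0, 0.001, 1.0],     # spider
    [0.0, 0.005, 5.0],     # earthworm
    [4.0, 30.0, 1.0],      # dog
])

# Before scaling, the weight column dwarfs the other two features.
print(X.max(axis=0) - X.min(axis=0))

# After scaling, every feature spans the same 0-to-1 range.
X_transformed = MinMaxScaler().fit_transform(X)
print(X_transformed.min(axis=0), X_transformed.max(axis=0))
```

After the transformation, no single feature dominates a distance calculation merely because of its units.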

Pipelines

As experiments grow, so does the complexity of the operations. We may split up our dataset, binarize features, perform feature-based scaling, perform sample-based scaling, and many more operations.

Keeping track of all of these operations can get quite confusing and can leave us unable to replicate the result. Problems include forgetting a step, incorrectly applying a transformation, or adding a transformation that wasn't needed.

Another issue is the order of the code. In the previous section, we created our X_transformed dataset and then created a new estimator for the cross validation. If we had multiple steps, we would need to track all of these changes to the dataset in the code.

Pipelines are a construct that addresses these problems (and others, which we will see in the next chapter). Pipelines store the steps in your data mining workflow. They can take your raw data in, perform all the necessary transformations, and then create a prediction. This allows us to use...
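A pipeline of this kind might look like the following sketch, which chains a scaling step and a nearest neighbor classifier and evaluates them together with cross-validation. The synthetic dataset, the step names "scale" and "predict", and the variable names are assumptions made for this example, not the book's own code.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic data standing in for a real dataset (an assumption for this sketch).
X, y = make_classification(n_samples=200, n_features=5, random_state=14)
X[:, 0] *= 1000.0  # deliberately put one feature on a much larger scale

# Each step is a (name, object) pair; the last step is the estimator.
scaling_pipeline = Pipeline([
    ("scale", MinMaxScaler()),
    ("predict", KNeighborsClassifier()),
])

# The whole workflow is evaluated as one unit, so the scaler is refit
# on the training portion of every fold.
scores = cross_val_score(scaling_pipeline, X, y, scoring="accuracy", cv=5)
print("Mean accuracy: {:.3f}".format(scores.mean()))
```

Because the transformation and the classifier live in one object, there is no separate transformed copy of the dataset to keep track of in the code.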

Summary

In this chapter, we used several of scikit-learn's methods for building a standard workflow to run and evaluate data mining models. We introduced the Nearest Neighbors algorithm, which is already implemented in scikit-learn as an estimator. Using this class is quite easy; first, we call the fit function on our training data, and second, we use the predict function to predict the class of testing samples.

We then looked at preprocessing by fixing poor feature scaling. This was done using a Transformer object and the MinMaxScaler class. These objects also have a fit method, followed by a transform method, which takes a dataset as an input and returns a transformed dataset as an output.
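The two-step fit and transform behavior can be sketched as follows. The one-column arrays are hypothetical values chosen so the arithmetic is easy to follow.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical one-feature datasets.
X_train = np.array([[0.0], [10.0]])
X_test = np.array([[5.0], [20.0]])

scaler = MinMaxScaler()
scaler.fit(X_train)  # learns the minimum (0) and maximum (10) from the training data

# transform() reuses the learned range, so 5 maps to 0.5 and 20 maps to 2.0.
X_test_scaled = scaler.transform(X_test)
print(X_test_scaled.ravel())
```

Values outside the range seen during fit fall outside 0 to 1, which shows that transform applies the stored parameters rather than recomputing them.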

In the next chapter, we will use these concepts in a larger example, predicting the outcome of sports matches using real-world data.


Description

If you are a programmer who wants to get started with data mining, then this book is for you.

What you will learn

  • Apply data mining concepts to real-world problems
  • Predict the outcome of sports matches based on past results
  • Determine the author of a document based on their writing style
  • Use APIs to download datasets from social media and other online services
  • Find and extract good features from difficult datasets
  • Create models that solve real-world problems
  • Design and develop data mining applications using a variety of datasets
  • Set up reproducible experiments and generate robust results
  • Recommend movies, online celebrities, and news articles based on personal preferences
  • Compute on big data, including real-time data from the Internet

Product Details

Publication date: Jul 29, 2015
Length: 344 pages
Edition: 1st
Language: English
ISBN-13: 9781784391201

Table of Contents

14 Chapters
1. Getting Started with Data Mining
2. Classifying with scikit-learn Estimators
3. Predicting Sports Winners with Decision Trees
4. Recommending Movies Using Affinity Analysis
5. Extracting Features with Transformers
6. Social Media Insight Using Naive Bayes
7. Discovering Accounts to Follow Using Graph Mining
8. Beating CAPTCHAs with Neural Networks
9. Authorship Attribution
10. Clustering News Articles
11. Classifying Objects in Images Using Deep Learning
12. Working with Big Data
A. Next Steps…
Index

Customer reviews

Rating: 3.7 out of 5 (7 Ratings)
5 star 28.6%
4 star 28.6%
3 star 28.6%
2 star 14.3%
1 star 0%

Top Reviews

Anon Oct 24, 2015
5 out of 5 stars
Pretty good book on the subject matter, I especially enjoyed the variety in examples for applications of machine learning. Other books similar to the subject like Mastering Machine Learning with Scikit-Learn are alright, but this is definitely a cool addition to such a library or collection of similar topic books. The author uses scikit-learn, python libraries in general. Pretty easy to understand, and definitely nice as a reference in case you are facing a similar problem at work or school and want to consult with a tutorial in a book. Definitely worth looking into.
Amazon Verified review
Amazon Reader Aug 23, 2015
5 out of 5 stars
This is the most excellent book on Data Mining and Python I have come across. The books comes with plenty of code examples explained in simple and easy to understand language. I would highly recommend this book to novice users and enthusiasts.
Amazon Verified review
Mouha Apr 13, 2016
4 out of 5 stars
The Robert's book is one of those I've finished used and reuse. I have many books on AI, Machine learning /data mining . Very few give me access to the minimum knowledge so I'd be able to use AI by myself. Jeff Heaton book was one of them, now I can add this book because it allows you to understand the main algorithms in this area, in a way that even you are not strong in maths through Python code you can really apply each algorithms in a minute. Really easy and understandable. Some could argue that the author doesn't dive deeply in the explanation: I think this is on purpose, and btw there are so much book about the theory. I didn't put 5 start because of some (small) cons : In chapter 8 "Beating Captcha...." The author would have use a recent framework like FANN instead of Pybrain which seems to be abandoned since years. This is not a showstopper anyway. I was so happy to use NN which for me is a kind of magic sometimes.
Amazon Verified review
Dimitri Shvorob Aug 20, 2016
4 out of 5 stars
Wishing to learn Python's machine-learning toolkit - I am an emigrant from R Country - I rounded up several relevant books, and set out to narrow the field to one or two suitable for further study. My haul included (in no particular order)

  • "Machine Learning in Python" by Bowles, published in 2015 by Wiley, 360 pages, $25 for the cheapest hardcopy now available from Amazon (including shipping)
  • "Designing Machine Learning Systems with Python" by Julian, 2016, Packt, 232 pages, $42
  • "Mastering Python for Data Science" by Madhavan, 2015, Packt, 294 pages, $39
  • "Learning Data Mining with Python" by Layton, 2015, 369 pages, $43
  • "Python Data Science Cookbook" by Subramanian, 2015, 347 pages, $48
  • "Data Science From Scratch" by Grus, 2015, 330 pages, $24
  • "Learning scikit-learn" by Moncecchi and Garreta, 2013, 118 pages, $28
  • "Building Machine Learning Systems with Python" by Coelho and Richert, 2015, 305 pages, $49
  • "Python Machine Learning" by Raschka, 2015, 454 pages, $34

The whittling-down turned out to be harder than expected: Python titles are better than R counterparts, and Madhavan's book alone was easy to dismiss. Subramanian, Moncecchi-Garreta and Julian did not make the cut based on comparison with alternatives, but were not of themselves bad. Grus is the beginner's best bet - beginners can stop reading here - while Bowles is a book which I like a lot, but which may be a bit too specialist. As a reviewer, thinking about what other "intermediate" readers might find useful, I end up pointing to the trio of Raschka, Layton and Coelho-Richert as the books worth choosing from.

I distinguish Raschka, in appreciation of his more pedagogical style - or maybe I am just giving the top spot to the thickest book! - but the other two titles are definitely worth checking out.

Compared to Coelho-Richert (CR), Layton's book surveys a wider range of algorithms - a good third of CR's page count is devoted to text analysis, which means less space for everything else - but strangely neglects regression, my own primary interest. (This is why I dock one star). The writing is more "cohesive" and methodical - but while Coelho and Richert know to "liven up" the early chapters with visualizations, Layton does not use "matplotlib" till page 98. (And after that, you see charts in the chapter on graph mining - notably, a topic you don't find in the other two books). Get both, and see which one you prefer.
Amazon Verified review
Amazon Customer Aug 04, 2016
3 out of 5 stars
Fine for introducing the learner to data mining with Python...but not much else. Many typos in the code and text, key concepts and vocabulary poorly assumed to be understood by the reader. Not good continuity either. Definitely written by a committee.
Amazon Verified review

FAQs

How do I buy and download an eBook?

Where there is an eBook version of a title available, you can buy it from the book details for that title. Add either the standalone eBook or the eBook and print book bundle to your shopping cart. Your eBook will show in your cart as a product on its own. After completing checkout and payment in the normal way, you will receive your receipt on the screen containing a link to a personalised PDF download file. This link will remain active for 30 days. You can download backup copies of the file by logging in to your account at any time.

If you already have Adobe Reader installed, then clicking on the link will download and open the PDF file directly. If you don't, then save the PDF file on your machine and download the Reader to view it.

Please Note: Packt eBooks are non-returnable and non-refundable.

Packt eBook and Licensing

When you buy an eBook from Packt Publishing, completing your purchase means you accept the terms of our licence agreement. Please read the full text of the agreement. In it we have tried to balance the need for the ebook to be usable for you the reader with our needs to protect the rights of us as Publishers and of our authors. In summary, the agreement says:

  • You may make copies of your eBook for your own use onto any machine
  • You may not pass copies of the eBook on to anyone else

How can I make a purchase on your website?

If you want to purchase a video course, eBook or Bundle (Print+eBook), please follow the steps below:

  1. Register on our website using your email address and a password.
  2. Search for the title by name or ISBN using the search option.
  3. Select the title you want to purchase.
  4. Choose the format you wish to purchase the title in; if you order the Print Book, you get a free eBook copy of the same title.
  5. Proceed with the checkout process (payment to be made using Credit Card, Debit Card, or PayPal)

Where can I access support around an eBook?

  • If you experience a problem with using or installing Adobe Reader, then contact Adobe directly.
  • To view the errata for the book, see www.packtpub.com/support and view the pages for the title you have.
  • To view your account details or to download a new copy of the book go to www.packtpub.com/account
  • To contact us directly if a problem is not resolved, use www.packtpub.com/contact-us

What eBook formats do Packt support?

Our eBooks are currently available in a variety of formats such as PDF and ePub. In the future, this may well change with trends and development in technology, but please note that our PDFs are not Adobe eBook Reader format, which has greater restrictions on security.

You will need to use Adobe Reader v9 or later in order to read Packt's PDF eBooks.

What are the benefits of eBooks?

  • You can get the information you need immediately
  • You can easily take them with you on a laptop
  • You can download them an unlimited number of times
  • You can print them out
  • They are copy-paste enabled
  • They are searchable
  • There is no password protection
  • They are lower priced than print
  • They save resources and space

What is an eBook?

Packt eBooks are a complete electronic version of the print edition, available in PDF and ePub formats. Every piece of content down to the page numbering is the same. Because we save the costs of printing and shipping the book to you, we are able to offer eBooks at a lower cost than print editions.

When you have purchased an eBook, simply log in to your account and click on the link in Your Download Area. We recommend saving the file to your hard drive before opening it.

For optimal viewing of our eBooks, we recommend you download and install the free Adobe Reader version 9.