Learning Data Mining with Python
Learning Data Mining with Python: Harness the power of Python to analyze data and create insightful predictive models

eBook
€20.98 €29.99
Paperback
€36.99

Chapter 2. Classifying with scikit-learn Estimators

The scikit-learn library is a collection of data mining algorithms, written in Python and using a common programming interface. This allows users to easily try different algorithms as well as utilize standard tools for doing effective testing and parameter searching. There are a large number of algorithms and utilities in scikit-learn.

In this chapter, we focus on setting up a good framework for running data mining procedures. We will use this framework in later chapters, which focus on applications and the techniques to use in them.

The key concepts introduced in this chapter are as follows:

  • Estimators: These perform classification, clustering, and regression
  • Transformers: These perform preprocessing and data alterations
  • Pipelines: These put your workflow together into a replicable format

scikit-learn estimators

Estimators are scikit-learn's abstraction that allows a large number of classification algorithms to be implemented in a standardized way. Here, estimators are used for classification. Estimators have the following two main functions:

  • fit(): This performs the training of the algorithm and sets internal parameters. It takes two inputs, the training sample dataset and the corresponding classes for those samples.
  • predict(): This predicts the classes of the testing samples given as input. This function returns an array with the prediction for each input testing sample.

Most scikit-learn estimators use NumPy arrays or a related format for input and output.

There are a large number of estimators in scikit-learn. These include support vector machines (SVM), random forests, and neural networks. Many of these algorithms will be used in later chapters. In this chapter, we will use a different estimator from scikit-learn: nearest neighbor.
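
The fit() and predict() calls described above can be sketched as follows. This is a minimal illustration, not the book's own listing: the tiny dataset is made up, and the choice of n_neighbors=3 is an assumption made only to keep the example small.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Made-up training data: two features per sample, two well-separated classes.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_train = np.array([0, 0, 0, 1, 1, 1])

# fit() trains the estimator on the samples and their classes.
estimator = KNeighborsClassifier(n_neighbors=3)  # n_neighbors=3 is an assumption
estimator.fit(X_train, y_train)

# predict() returns one predicted class per testing sample.
X_test = np.array([[0.05, 0.1], [1.05, 0.95]])
y_pred = estimator.predict(X_test)
print(y_pred)  # [0 1]
```

Both inputs and the returned predictions are NumPy arrays, in line with the note above about input and output formats.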

Note

For this chapter, you will...

Preprocessing using pipelines

When taking measurements of real-world objects, we can often get features in very different ranges. For instance, if we are measuring the qualities of an animal, we might have several features, as follows:

  • Number of legs: This ranges from 0 to 8 for most animals, while some have many more!
  • Weight: This ranges from only a few micrograms all the way to a blue whale with a weight of 190,000 kilograms!
  • Number of hearts: This can be between zero and five, in the case of the earthworm.

For a mathematically based algorithm comparing these features, the differences in scale, range, and units can be difficult to interpret. If we used the above features in many algorithms, the weight would probably be the most influential feature simply because its numbers are larger, not because of the actual effectiveness of the feature.

One of the methods to overcome this is to use a process called preprocessing to normalize the features so that they...
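
As a rough sketch of this kind of normalization, the following uses the MinMaxScaler class that the chapter summary mentions. The animal measurements are made up for illustration, and the sketch assumes the scaler's default output range of 0 to 1.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical measurements, one row per animal:
# number of legs, weight in kilograms, number of hearts.
X = np.array([[4, 5.0, 1],
              [8, 0.001, 1],
              [0, 190000.0, 1],
              [0, 0.002, 5]])

# fit_transform() learns each column's minimum and maximum,
# then rescales that column to the range 0 to 1.
scaler = MinMaxScaler()
X_transformed = scaler.fit_transform(X)

print(X_transformed.min(axis=0))  # smallest value in each column
print(X_transformed.max(axis=0))  # largest value in each column
```

After scaling, every feature spans the same range, so the weight column no longer dominates purely because of its magnitude.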

Pipelines

As experiments grow, so does the complexity of the operations. We may split up our dataset, binarize features, perform feature-based scaling, perform sample-based scaling, and many more operations.

Keeping track of all of these operations can get quite confusing and can make a result impossible to replicate. Problems include forgetting a step, incorrectly applying a transformation, or adding a transformation that wasn't needed.

Another issue is the order of the code. In the previous section, we created our X_transformed dataset and then created a new estimator for the cross validation. If we had multiple steps, we would need to track all of these changes to the dataset in the code.

Pipelines are a construct that addresses these problems (and others, which we will see in the next chapter). Pipelines store the steps in your data mining workflow. They can take your raw data in, perform all the necessary transformations, and then create a prediction. This allows us to use...
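
A pipeline of this kind might look like the sketch below. It chains a MinMaxScaler step and a nearest neighbor estimator, then evaluates the whole workflow with cross-validation. The step names, the synthetic data, and the random seed are assumptions for illustration; the import from sklearn.model_selection is the location in current scikit-learn releases, which may differ from the version the book was written against.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic data (an assumption): three features, one on a much larger scale.
rng = np.random.RandomState(14)
X = rng.rand(100, 3)
X[:, 1] *= 1000
y = (X[:, 0] > 0.5).astype(int)  # the class depends only on the first feature

# Each step is a (name, object) pair; the last step is the estimator.
scaling_pipeline = Pipeline([
    ("scale", MinMaxScaler()),
    ("predict", KNeighborsClassifier()),
])

# The pipeline is used like any estimator: scaling is re-fitted on each
# training fold and applied to the matching test fold automatically.
scores = cross_val_score(scaling_pipeline, X, y, scoring="accuracy", cv=5)
print("Mean accuracy: {:.3f}".format(scores.mean()))
```

Because the transformation lives inside the pipeline, there is no separate X_transformed dataset to keep track of in the code.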

Summary

In this chapter, we used several of scikit-learn's methods for building a standard workflow to run and evaluate data mining models. We introduced the Nearest Neighbors algorithm, which is already implemented in scikit-learn as an estimator. Using this class is quite easy; first, we call the fit function on our training data, and second, we use the predict function to predict the class of testing samples.

We then looked at preprocessing as a way to fix poor feature scaling. This was done using a Transformer object, the MinMaxScaler class. Transformers also have a fit method, followed by a transform method, which takes a dataset as input and returns a transformed dataset as output.

In the next chapter, we will use these concepts in a larger example, predicting the outcome of sports matches using real-world data.


Description

If you are a programmer who wants to get started with data mining, then this book is for you.

What you will learn

  • Apply data mining concepts to real-world problems
  • Predict the outcome of sports matches based on past results
  • Determine the author of a document based on their writing style
  • Use APIs to download datasets from social media and other online services
  • Find and extract good features from difficult datasets
  • Create models that solve real-world problems
  • Design and develop data mining applications using a variety of datasets
  • Set up reproducible experiments and generate robust results
  • Recommend movies, online celebrities, and news articles based on personal preferences
  • Compute on big data, including real-time data from the Internet

Product Details

Publication date: Jul 29, 2015
Length: 344 pages
Edition: 1st
Language: English
ISBN-13: 9781784396053

Table of Contents

14 Chapters
1. Getting Started with Data Mining
2. Classifying with scikit-learn Estimators
3. Predicting Sports Winners with Decision Trees
4. Recommending Movies Using Affinity Analysis
5. Extracting Features with Transformers
6. Social Media Insight Using Naive Bayes
7. Discovering Accounts to Follow Using Graph Mining
8. Beating CAPTCHAs with Neural Networks
9. Authorship Attribution
10. Clustering News Articles
11. Classifying Objects in Images Using Deep Learning
12. Working with Big Data
A. Next Steps…
Index

Customer reviews

Rating distribution
3.7 out of 5 (7 ratings)
5 star 28.6%
4 star 28.6%
3 star 28.6%
2 star 14.3%
1 star 0%

Top Reviews

Anon Oct 24, 2015
Rating: 5 out of 5
Pretty good book on the subject matter, I especially enjoyed the variety in examples for applications of machine learning. Other books similar to the subject like Mastering Machine Learning with Scikit-Learn are alright, but this is definitely a cool addition to such a library or collection of similar topic books. The author uses scikit-learn, python libraries in general. Pretty easy to understand, and definitely nice as a reference in case you are facing a similar problem at work or school and want to consult with a tutorial in a book. Definitely worth looking into.
Amazon Verified review
Amazon Reader Aug 23, 2015
Rating: 5 out of 5
This is the most excellent book on Data Mining and Python I have come across. The book comes with plenty of code examples explained in simple and easy to understand language. I would highly recommend this book to novice users and enthusiasts.
Amazon Verified review
Mouha Apr 13, 2016
Rating: 4 out of 5
Robert's book is one of those I've finished, used, and reused. I have many books on AI, machine learning, and data mining. Very few give me the minimum knowledge I need to use AI by myself. Jeff Heaton's book was one of them; now I can add this book, because it lets you understand the main algorithms in this area in such a way that, even if you are not strong in maths, you can apply each algorithm in a minute through Python code. Really easy and understandable. Some could argue that the author doesn't dive deeply into the explanations: I think this is on purpose, and by the way there are so many books about the theory. I didn't give 5 stars because of some (small) cons: in chapter 8, "Beating Captcha....", the author could have used a more recent framework like FANN instead of Pybrain, which seems to have been abandoned for years. This is not a showstopper anyway. I was so happy to use NNs, which for me are a kind of magic sometimes.
Amazon Verified review
Dimitri Shvorob Aug 20, 2016
Rating: 4 out of 5
Wishing to learn Python's machine-learning toolkit - I am an emigrant from R Country - I rounded up several relevant books, and set out to narrow the field to one or two suitable for further study. My haul included (in no particular order)

  • "Machine Learning in Python" by Bowles, published in 2015 by Wiley, 360 pages, $25 for the cheapest hardcopy now available from Amazon (including shipping)
  • "Designing Machine Learning Systems with Python" by Julian, 2016, Packt, 232 pages, $42
  • "Mastering Python for Data Science" by Madhavan, 2015, Packt, 294 pages, $39
  • "Learning Data Mining with Python" by Layton, 2015, 369 pages, $43
  • "Python Data Science Cookbook" by Subramanian, 2015, 347 pages, $48
  • "Data Science From Scratch" by Grus, 2015, 330 pages, $24
  • "Learning scikit-learn" by Moncecchi and Garreta, 2013, 118 pages, $28
  • "Building Machine Learning Systems with Python" by Coelho and Richert, 2015, 305 pages, $49
  • "Python Machine Learning" by Raschka, 2015, 454 pages, $34

The whittling-down turned out to be harder than expected: Python titles are better than R counterparts, and Madhavan's book alone was easy to dismiss. Subramanian, Moncecchi-Garreta and Julian did not make the cut based on comparison with alternatives, but were not of themselves bad. Grus is the beginner's best bet - beginners can stop reading here - while Bowles is a book which I like a lot, but which may be a bit too specialist. As a reviewer, thinking about what other "intermediate" readers might find useful, I end up pointing to the trio of Raschka, Layton and Coelho-Richert as the books worth choosing from.

I distinguish Raschka, in appreciation of his more pedagogical style - or maybe I am just giving the top spot to the thickest book! - but the other two titles are definitely worth checking out.
Compared to Coelho-Richert (CR), Layton's book surveys a wider range of algorithms - a good third of CR's page count is devoted to text analysis, which means less space for everything else - but strangely neglects regression, my own primary interest. (This is why I dock one star). The writing is more "cohesive" and methodical - but while Coelho and Richert know to "liven up" the early chapters with visualizations, Layton does not use "matplotlib" till page 98. (And after that, you see charts in the chapter on graph mining - notably, a topic you don't find in the other two books). Get both, and see which one you prefer.
Amazon Verified review
Amazon Customer Aug 04, 2016
Rating: 3 out of 5
Fine for introducing the learner to data mining with Python...but not much else. Many typos in the code and text, key concepts and vocabulary poorly assumed to be understood by the reader. Not good continuity either. Definitely written by a committee.
Amazon Verified review

FAQs

What is included in a Packt subscription?

A subscription provides you with full access to view all Packt and licensed content online, including exclusive access to Early Access titles. Depending on the tier chosen, you can also earn credits and discounts to use for owning content.

How can I cancel my subscription?

To cancel your subscription with us, simply go to the account page, found in the top right of the page or at https://subscription.packtpub.com/my-account/subscription. From there you will see the ‘cancel subscription’ button in the grey box containing your subscription information.

What are credits?

Credits can be earned by reading 40 sections of any title within the payment cycle, which is a month starting from the day of subscription payment. You also earn a credit every month if you subscribe to our annual or 18-month plans. Credits can be used to buy books DRM-free, the same way that you would pay for a book. Your credits can be found on the subscription homepage, subscription.packtpub.com, by clicking the ‘my library’ dropdown and selecting ‘credits’.

What happens if an Early Access Course is cancelled?

Projects are rarely cancelled, but sometimes it's unavoidable. If an Early Access course is cancelled or excessively delayed, you can exchange your purchase for another course. For further details, please contact us here.

Where can I send feedback about an Early Access title?

If you have any feedback about the product you're reading, or Early Access in general, then please fill out a contact form here and we'll make sure the feedback gets to the right team. 

Can I download the code files for Early Access titles?

We try to ensure that all books in Early Access have code available to use, download, and fork on GitHub. This helps us be more agile in the development of the book, and helps keep the often changing code base of new versions and new technologies as up to date as possible. Unfortunately, however, there will be rare cases when it is not possible for us to have downloadable code samples available until publication.

When we publish the book, the code files will also be available to download from the Packt website.

How accurate is the publication date?

The publication date is as accurate as we can be at any point in the project. Unfortunately, delays can happen. Often those delays are out of our control, such as changes to the technology code base or delays in the tech release. We do our best to give you an accurate estimate of the publication date at any given time, and as more chapters are delivered, the more accurate the delivery date will become.

How will I know when new chapters are ready?

We'll let you know every time there has been an update to a course that you've bought in Early Access. You'll get an email to let you know there has been a new chapter, or a change to a previous chapter. The new chapters are automatically added to your account, so you can also check back there any time you're ready and download or read them online.

I am a Packt subscriber, do I get Early Access?

Yes, all Early Access content is fully available through your subscription. You will need a paid subscription or an active trial subscription in order to access all titles.

How is Early Access delivered?

Early Access is currently only available as a PDF or through our online reader. As we make changes or add new chapters, the files in your Packt account will be updated so you can download them again or view them online immediately.

How do I buy Early Access content?

Early Access is a way of us getting our content to you quicker, but the method of buying the Early Access course is still the same. Just find the course you want to buy, go through the check-out steps, and you’ll get a confirmation email from us with information and a link to the relevant Early Access courses.

What is Early Access?

Keeping up to date with the latest technology is difficult: new versions, new frameworks, new techniques. This feature gives you a head start on our content as it's being created. With Early Access you'll receive each chapter as it's written, and get regular updates throughout the product's development, as well as the final course as soon as it's ready.

We created Early Access as a means of giving you the information you need, as soon as it's available. As we go through the process of developing a course, 99% of it can be ready but we can't publish until that last 1% falls into place. Early Access helps to unlock the potential of our content early, to help you start your learning when you need it most. You not only get access to every chapter as it's delivered, edited, and updated, but you'll also get the finalized, DRM-free product to download in any format you want when it's published. As a member of Packt, you'll also be eligible for our exclusive offers, including a free course every day, and discounts on new and popular titles.