Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Newsletter Hub
Free Learning
Arrow right icon
timer SALE ENDS IN
0 Days
:
00 Hours
:
00 Minutes
:
00 Seconds
Large Scale Machine Learning with Python
Large Scale Machine Learning with Python

Large Scale Machine Learning with Python: Learn to build powerful machine learning models quickly and deploy large-scale predictive applications

Arrow left icon
Profile Icon Sjardin Profile Icon Luca Massaron Profile Icon Alberto Boschetti
Arrow right icon
$38.99 $43.99
Full star icon Full star icon Full star icon Full star icon Empty star icon 4 (3 Ratings)
eBook Aug 2016 420 pages 1st Edition
eBook
$38.99 $43.99
Paperback
$54.99
Subscription
Free Trial
Renews at $19.99p/m
Arrow left icon
Profile Icon Sjardin Profile Icon Luca Massaron Profile Icon Alberto Boschetti
Arrow right icon
$38.99 $43.99
Full star icon Full star icon Full star icon Full star icon Empty star icon 4 (3 Ratings)
eBook Aug 2016 420 pages 1st Edition
eBook
$38.99 $43.99
Paperback
$54.99
Subscription
Free Trial
Renews at $19.99p/m
eBook
$38.99 $43.99
Paperback
$54.99
Subscription
Free Trial
Renews at $19.99p/m

What do you get with eBook?

Product feature icon Instant access to your Digital eBook purchase
Product feature icon Download this book in EPUB and PDF formats
Product feature icon Access this title in our online reader with advanced features
Product feature icon DRM FREE - Read whenever, wherever and however you want
OR
Modal Close icon
Payment Processing...
tick Completed

Billing Address

Table of content icon View table of contents Preview book icon Preview Book

Large Scale Machine Learning with Python

Chapter 2. Scalable Learning in Scikit-learn

Loading a dataset into memory, preparing a data matrix, training a machine learning algorithm, and testing its generalization capabilities using out-of-sample observations are often not such a big deal given the quite powerful and yet affordable computers of this day and age. However, more and more frequently, the scale of the data to be elaborated is so huge that loading it into the core memory of your computer is not possible and, even if manageable, the result is intractable both in terms of data management and machine learning.

Alternative viable strategies beyond the core memory processing are possible: splitting the data into samples, using parallelism, and finally learning in small batches or by single instances. The present chapter will focus on the out-of-the-box solution that the Scikit-learn package offers: the streaming of mini batches of instances (our observations) from data storage and the incremental learning based on...

Out-of-core learning

Out-of-core learning refers to a set of algorithms working with data that cannot fit into the memory of a single computer, but that can easily fit into some data storage such as a local hard disk or web repository. Your available RAM, the core memory on your single machine, may indeed range from a few gigabytes (sometimes 2 GB, more commonly 4 GB, but we assume that you have 2 GB at maximum) up to 256 GB on large server machines. Large servers are like the ones you can get on cloud computing services such as Amazon Elastic Compute Cloud (EC2), whereas your storage capabilities can easily exceed terabytes of capacity using just an external drive (most likely about 1 TB but it can reach up to 4 TB).

As machine learning is based on globally reducing a cost function, many algorithms initially have been thought to work using all the available data and having access to it at each iteration of the optimization process. This is particularly true for all algorithms based on...

Streaming data from sources

Some data is really streaming through your computer when you have a generative process that transmits data, which you can process on the fly or just discard, but not recall afterward unless you have stored it away in some data archival repository somewhere. It is like dragging water from a flowing river—the river keeps on flowing but you can filter and process all the water as it goes. It's a completely different strategy from processing all the data at once, which is more like putting all the water in a dam (an analogy for working with all the data in-memory).

As an example of streaming, we could quote the data flow produced instant by instant by a sensor or, even more simply, a Twitter streamline of tweets. Generally, the main sources of data streams are as follows:

  • Environment sensors measuring temperature, pressure, and humidity
  • GPS tracking sensors recording the location (latitude/longitude)
  • Satellites recording image data
  • Surveillance videos and...

Stochastic learning

Having defined the streaming process, it is now time to glance at the learning process as it is the learning and its specific needs that determine the best way to handle data and transform it in the preprocessing phase.

Online learning, contrary to batch learning, works with a larger number of iterations and gets directions from each single instance at a time, thus allowing a more erratic learning procedure than an optimization made on a batch, which immediately tends to get the right direction expressed from the data as a whole.

Batch gradient descent

The core algorithm for machine learning, gradient descent, is therefore revisited in order to adapt to online learning. When working on batch data, gradient descent can minimize the cost function of a linear regression analysis using much less computations than statistical algorithms. The complexity of gradient descent ranks in the order O(n*p), making learning regression coefficients feasible even in the occurrence of a...

Feature management with data streams

Data streams pose the problem that you cannot evaluate as you would do when working on a complete in-memory dataset. For a correct and optimal approach to feed your SGD out-of-core algorithm, you first have to survey the data (by taking a chuck of the initial instances of the file, for example) and find out the type of data you have at hand.

We distinguish among the following types of data:

  • Quantitative values
  • Categorical values encoded with integer numbers
  • Unstructured categorical values expressed in textual form

When data is quantitative, it could just be fed to the SGD learner but for the fact that the algorithm is quite sensitive to feature scaling; that is, you have to bring all the quantitative features into the same range of values or the learning process won't converge easily and correctly. Possible scaling strategies are converting all the values in the range [0,1], [-1,1] or standardizing the variable by centering its mean to zero and converting...

Summary

In this chapter, we have seen how learning is possible out-of-core by streaming data, no matter how big it is, from a text file or database on your hard disk. These methods certainly apply to much bigger datasets than the examples that we used to demonstrate them (which actually could be solved in-memory using non-average, powerful hardware).

We also explained the core algorithm that makes out-of-core learning possible—SGD—and we examined its strength and weakness, emphasizing the necessity of streams to be really stochastic (which means in a random order) to be really effective, unless the order is part of the learning objectives. In particular, we introduced the Scikit-learn implementation of SGD, limiting our focus to the linear and logistic regression loss functions.

Finally, we discussed data preparation, introduced the hashing trick and validation strategies for streams, and wrapped up the acquired knowledge on SGD fitting two different models—classification...

Left arrow icon Right arrow icon
Download code icon Download Code

Key benefits

  • Design, engineer and deploy scalable machine learning solutions with the power of Python
  • Take command of Hadoop and Spark with Python for effective machine learning on a map reduce framework
  • Build state-of-the-art models and develop personalized recommendations to perform machine learning at scale

Description

Large Python machine learning projects involve new problems associated with specialized machine learning architectures and designs that many data scientists have yet to tackle. But finding algorithms and designing and building platforms that deal with large sets of data is a growing need. Data scientists have to manage and maintain increasingly complex data projects, and with the rise of big data comes an increasing demand for computational and algorithmic efficiency. Large Scale Machine Learning with Python uncovers a new wave of machine learning algorithms that meet scalability demands together with a high predictive accuracy. Dive into scalable machine learning and the three forms of scalability. Speed up algorithms that can be used on a desktop computer with tips on parallelization and memory allocation. Get to grips with new algorithms that are specifically designed for large projects and can handle bigger files, and learn about machine learning in big data environments. We will also cover the most effective machine learning techniques on a map reduce framework in Hadoop and Spark in Python.

Who is this book for?

This book is for anyone who intends to work with large and complex data sets. Familiarity with basic Python and machine learning concepts is recommended. Working knowledge in statistics and computational mathematics would also be helpful.

What you will learn

  • Apply the most scalable machine learning algorithms
  • Work with modern state-of-the-art large-scale machine learning techniques
  • Increase predictive accuracy with deep learning and scalable data-handling techniques
  • Improve your work by combining the MapReduce framework with Spark
  • Build powerful ensembles at scale
  • Use data streams to train linear and non-linear predictive models from extremely large datasets using a single machine

Product Details

Country selected
Publication date, Length, Edition, Language, ISBN-13
Publication date : Aug 03, 2016
Length: 420 pages
Edition : 1st
Language : English
ISBN-13 : 9781785888021
Category :
Languages :
Tools :

What do you get with eBook?

Product feature icon Instant access to your Digital eBook purchase
Product feature icon Download this book in EPUB and PDF formats
Product feature icon Access this title in our online reader with advanced features
Product feature icon DRM FREE - Read whenever, wherever and however you want
OR
Modal Close icon
Payment Processing...
tick Completed

Billing Address

Product Details

Publication date : Aug 03, 2016
Length: 420 pages
Edition : 1st
Language : English
ISBN-13 : 9781785888021
Category :
Languages :
Tools :

Packt Subscriptions

See our plans and pricing
Modal Close icon
$19.99 billed monthly
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Simple pricing, no contract
$199.99 billed annually
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Choose a DRM-free eBook or Video every month to keep
Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just $5 each
Feature tick icon Exclusive print discounts
$279.99 billed in 18 months
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Choose a DRM-free eBook or Video every month to keep
Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just $5 each
Feature tick icon Exclusive print discounts

Frequently bought together


Stars icon
Total $ 169.97
Python Machine Learning Cookbook
$65.99
Advanced Machine Learning with Python
$48.99
Large Scale Machine Learning with Python
$54.99
Total $ 169.97 Stars icon

Table of Contents

11 Chapters
1. First Steps to Scalability Chevron down icon Chevron up icon
2. Scalable Learning in Scikit-learn Chevron down icon Chevron up icon
3. Fast SVM Implementations Chevron down icon Chevron up icon
4. Neural Networks and Deep Learning Chevron down icon Chevron up icon
5. Deep Learning with TensorFlow Chevron down icon Chevron up icon
6. Classification and Regression Trees at Scale Chevron down icon Chevron up icon
7. Unsupervised Learning at Scale Chevron down icon Chevron up icon
8. Distributed Environments – Hadoop and Spark Chevron down icon Chevron up icon
9. Practical Machine Learning with Spark Chevron down icon Chevron up icon
A. Introduction to GPUs and Theano Chevron down icon Chevron up icon
Index Chevron down icon Chevron up icon

Customer reviews

Rating distribution
Full star icon Full star icon Full star icon Full star icon Empty star icon 4
(3 Ratings)
5 star 66.7%
4 star 0%
3 star 0%
2 star 33.3%
1 star 0%
Z.V. Sep 19, 2016
Full star icon Full star icon Full star icon Full star icon Full star icon 5
This is the best book for Python-based data science, focusing on ML and big data I have encountered (and I’ve been around!). The authors cover a wide-range of intermediate and advanced topics, which they explain in terms of theory and applications. I particularly liked the Unsupervised Learning chapter, where they not only covered the quite popular k-means algorithm, but also provided a couple of heuristics for finding the optimum number of clusters while they wrote a few words about one of its most powerful variants (k-means++) too.Although Python falls short when it comes to handling large data sets or multiple CPUs/GPUs on its own, the authors describe the various solutions to these issues via the use of large scale frameworks, such as Spark, making Python a versatile tool for big data scenarios. Also, they introduce the various packages required to accomplish all the analytics-related tasks, making this book also a great reference manual for all data scientists who veer towards this language.Personally I lean towards more elegant and more modern programming tools, such a s Julia and Scala, but I found this book quite refreshing and insightful, definitely a great addition to my data science library. If you are someone who takes data science seriously and has learned the basics, I would highly recommend this book for you.
Amazon Verified review Amazon
Oleg Okun Aug 21, 2016
Full star icon Full star icon Full star icon Full star icon Full star icon 5
Disclosure: I was a technical reviewer of this book.Many books when their subject is Machine Learning with Python concentrate on a few most known and used libraries to explain Machine Learning tasks and solutions. Although I don't want to say that such books are useless for readers, they may still leave gaps in understanding of how a certain method or library would work in real-world scenarios. Authors of the book "Large Scale Machine Learning with Python" set up an ambitious goal to teach readers how to solve real-world Machine Learning problems by employing a variety of libraries, frameworks, and tools relying on Python. This advantageously differentiates a given book from many other books on the same subject.The following practical situations are considered and their solutions are presented:- Tall datasets when the number of cases is large, compared to the number of features.- Wide datasets when the number of features is large, compared to the number of cases.- Both tall and wide datasets when both the number of features and the number of cases are large.- Sparse datasets when there are many zero-valued elements.The book treats the problem of scalability from different angles, such as fast batch (offline) processing, incremental online processing (one instance at a time arrives), streaming processing (a chunk of instances at a time arrives) and distributed processing. Popular libraries and frameworks, such as Gensim, H2O, XGBoost, TensorFlow, Theano, Theanets, Keras, Vowpal Wabbit, and Spark and their applications are explained through numerous Python snippets. In my opinion, this is one of the first books presenting all these tools under one cover.In addition to Python code, the book also covers such advanced topics like Deep Learning, Ensemble Learning, validation of streaming algorithm performance, and GPU processing.I recommend this book as a good companion to any Machine Learning practitioner who already has fairly good understanding of theory behind Machine Learning algorithms.
Amazon Verified review Amazon
M. Athar Aug 31, 2017
Full star icon Full star icon Empty star icon Empty star icon Empty star icon 2
This book is just too all over the place to be useful. Most of the stuff you can learn for free by going through the documentation for the various technologies discussed.No real discussion on RNNs, or calculus on computational graphs (which bascially defeats the purpose of tensorflow).
Amazon Verified review Amazon
Get free access to Packt library with over 7500+ books and video courses for 7 days!
Start Free Trial

FAQs

How do I buy and download an eBook? Chevron down icon Chevron up icon

Where there is an eBook version of a title available, you can buy it from the book details for that title. Add either the standalone eBook or the eBook and print book bundle to your shopping cart. Your eBook will show in your cart as a product on its own. After completing checkout and payment in the normal way, you will receive your receipt on the screen containing a link to a personalised PDF download file. This link will remain active for 30 days. You can download backup copies of the file by logging in to your account at any time.

If you already have Adobe reader installed, then clicking on the link will download and open the PDF file directly. If you don't, then save the PDF file on your machine and download the Reader to view it.

Please Note: Packt eBooks are non-returnable and non-refundable.

Packt eBook and Licensing When you buy an eBook from Packt Publishing, completing your purchase means you accept the terms of our licence agreement. Please read the full text of the agreement. In it we have tried to balance the need for the ebook to be usable for you the reader with our needs to protect the rights of us as Publishers and of our authors. In summary, the agreement says:

  • You may make copies of your eBook for your own use onto any machine
  • You may not pass copies of the eBook on to anyone else
How can I make a purchase on your website? Chevron down icon Chevron up icon

If you want to purchase a video course, eBook or Bundle (Print+eBook) please follow below steps:

  1. Register on our website using your email address and the password.
  2. Search for the title by name or ISBN using the search option.
  3. Select the title you want to purchase.
  4. Choose the format you wish to purchase the title in; if you order the Print Book, you get a free eBook copy of the same title. 
  5. Proceed with the checkout process (payment to be made using Credit Card, Debit Cart, or PayPal)
Where can I access support around an eBook? Chevron down icon Chevron up icon
  • If you experience a problem with using or installing Adobe Reader, the contact Adobe directly.
  • To view the errata for the book, see www.packtpub.com/support and view the pages for the title you have.
  • To view your account details or to download a new copy of the book go to www.packtpub.com/account
  • To contact us directly if a problem is not resolved, use www.packtpub.com/contact-us
What eBook formats do Packt support? Chevron down icon Chevron up icon

Our eBooks are currently available in a variety of formats such as PDF and ePubs. In the future, this may well change with trends and development in technology, but please note that our PDFs are not Adobe eBook Reader format, which has greater restrictions on security.

You will need to use Adobe Reader v9 or later in order to read Packt's PDF eBooks.

What are the benefits of eBooks? Chevron down icon Chevron up icon
  • You can get the information you need immediately
  • You can easily take them with you on a laptop
  • You can download them an unlimited number of times
  • You can print them out
  • They are copy-paste enabled
  • They are searchable
  • There is no password protection
  • They are lower price than print
  • They save resources and space
What is an eBook? Chevron down icon Chevron up icon

Packt eBooks are a complete electronic version of the print edition, available in PDF and ePub formats. Every piece of content down to the page numbering is the same. Because we save the costs of printing and shipping the book to you, we are able to offer eBooks at a lower cost than print editions.

When you have purchased an eBook, simply login to your account and click on the link in Your Download Area. We recommend you saving the file to your hard drive before opening it.

For optimal viewing of our eBooks, we recommend you download and install the free Adobe Reader version 9.