Mastering Machine Learning with Spark 2.x: Harness the potential of machine learning through Spark

Authors: Max Pumperla, Tellez, Malohlava
$54.99
Rating: 5 (1 rating)
Paperback | Aug 2017 | 340 pages | 1st Edition
eBook: $9.99 $43.99
Paperback: $54.99
Subscription: Free Trial (renews at $19.99 p/m)

What do you get with Print?

  • Instant access to your digital eBook copy whilst your print order is shipped
  • Paperback book shipped to your preferred address
  • Download this book in EPUB and PDF formats
  • Access this title in our online reader with advanced features
  • DRM FREE - read whenever, wherever, and however you want

Mastering Machine Learning with Spark 2.x

Introduction to Large-Scale Machine Learning and Spark

"Information is the oil of the 21st century, and analytics is the combustion engine."


--Peter Sondergaard, Gartner Research

By 2018, it is estimated that companies will spend $114 billion on big data-related projects, an increase of roughly 300% compared to 2013 (https://www.capgemini-consulting.com/resource-file-access/resource/pdf/big_data_pov_03-02-15.pdf). Much of this increase in expenditure is due to how much data is being created and how we are better able to store such data by leveraging distributed filesystems such as Hadoop.

However, collecting the data is only half the battle; the other half involves extracting, transforming, and loading the data into a computation system that leverages the power of modern computers to apply various mathematical methods, learn more about the data and its patterns, and extract useful information for making relevant decisions. The entire data workflow has been boosted in the last few years, not only by increased computation power and easily accessible, scalable cloud services (for example, Amazon AWS, Microsoft Azure, and Heroku), but also by a number of tools and libraries that make it easy to manage, control, and scale infrastructure and build applications. This growth in computation power also makes it possible to process larger amounts of data and to apply algorithms that were previously impractical. Finally, various computationally expensive statistical and machine learning algorithms have started to help extract nuggets of information from data.

One of the first well-adopted big data technologies was Hadoop, which enables MapReduce computation by saving intermediate results to disk. However, it still lacks proper big data tools for information extraction. Nevertheless, Hadoop was just the beginning. With the growing size of machine memory, new in-memory computation frameworks appeared, and they also started to provide basic support for conducting data analysis and modeling—for example, SystemML or Spark ML for Spark and FlinkML for Flink. These frameworks represent only the tip of the iceberg—there is a lot more in the big data ecosystem, and it is permanently evolving, since the volume of data is constantly growing, demanding new big data algorithms and processing methods. For example, the Internet of Things (IoT) represents a new domain that produces huge amounts of streaming data from various sources (for example, home security systems, Amazon Echo, or vital-sign sensors) and not only brings unlimited potential to mine useful information from data, but also demands new kinds of data processing and modeling methods.

Nevertheless, in this chapter, we will start from the beginning and explain the following topics:

  • The basic working tasks of data scientists
  • Aspects of big data computation in distributed environments
  • The big data ecosystem
  • Spark and its machine learning support

Data science

Finding a uniform definition of data science is akin to tasting wine and comparing flavor profiles among friends—everyone has their own definition, and no one description is more accurate than another. At its core, however, data science is the art of asking intelligent questions about data and receiving intelligent answers that matter to key stakeholders. Unfortunately, the opposite also holds true—ask lousy questions of the data and get lousy answers! Therefore, careful formulation of the question is the key to extracting valuable insights from your data. For this reason, companies are now hiring data scientists to help formulate and ask these questions.

Figure 1 - Growing Google Trend of big data and data science

The sexiest role of the 21st century – data scientist?

At first, it's easy to paint a stereotypical picture of what a typical data scientist looks like: t-shirt, sweatpants, thick-rimmed glasses, and debugging a chunk of code in IntelliJ... you get the idea. Aesthetics aside, what are some of the traits of a data scientist? One of our favorite posters describing this role is shown here in the following diagram:

Figure 2 - What is a data scientist?

Math, statistics, and a general knowledge of computer science are a given, but one pitfall that we see among practitioners has to do with understanding the business problem, which goes back to asking intelligent questions of the data. It cannot be emphasized enough: asking more intelligent questions of the data is a function of the data scientist's understanding of the business problem and the limitations of the data; without this fundamental understanding, even the most intelligent algorithm would be unable to come to solid conclusions, having been built on a wobbly foundation.

A day in the life of a data scientist

This will probably come as a shock to some of you—being a data scientist is more than reading academic papers, researching new tools, and model building until the wee hours of the morning, fueled on espresso; in fact, this is only a small percentage of the time that a data scientist gets to truly play (the espresso part, however, is 100% true for everyone)! Most of the day is spent in meetings, gaining a better understanding of the business problem(s), crunching the data to learn its limitations (take heart, this book will expose you to a ton of different feature engineering and feature extraction tasks), and working out how best to present the findings to non-data-sciencey people. This is where the true sausage-making process takes place, and the best data scientists are the ones who relish this process because they are gaining more understanding of the requirements and benchmarks for success. In fact, we could literally write a whole new book describing this process from top to tail!

So, what (and who) is involved in asking questions about data? Sometimes, it is a process of saving data into a relational database and running SQL queries to find insights: "for the millions of users that bought this particular product, what are the top 3 OTHER products they also bought?" Other times, the question is more complex, such as, "Given the review of a movie, is this a positive or negative review?" This book is mainly focused on complex questions, like the latter. Answering these types of questions is where businesses really get the most impact from their big data projects, and it is also where we see a proliferation of emerging technologies that look to make this Q and A easier, with more functionality.
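
To make the first kind of question concrete, here is a minimal Spark SQL sketch of the "top 3 other products" query. The table and column names (purchases, user_id, product_id) and the file path are purely illustrative assumptions, not code from this book:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("TopCoPurchases").getOrCreate()

// Hypothetical purchases table with (user_id, product_id) rows
spark.read.parquet("purchases.parquet").createOrReplaceTempView("purchases")

// For buyers of a given product, count the other products they bought and keep the top 3
spark.sql("""
  SELECT p2.product_id, COUNT(*) AS buyers
  FROM purchases p1
  JOIN purchases p2
    ON p1.user_id = p2.user_id AND p2.product_id <> p1.product_id
  WHERE p1.product_id = 'THE_PRODUCT'
  GROUP BY p2.product_id
  ORDER BY buyers DESC
  LIMIT 3
""").show()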

Some of the most popular open source frameworks that look to help answer data questions include R, Python, Julia, and Octave, all of which perform reasonably well with small (roughly < 100 GB) datasets. At this point, it's worth stopping and pointing out a clear distinction between big and small data. Our general rule of thumb in the office goes as follows:

If you can open your dataset using Excel, you are working with small data.

Working with big data

What happens when the dataset in question is so vast that it cannot fit into the memory of a single computer and must be distributed across a number of nodes in a large computing cluster? Can't we just rewrite some R code, for example, and extend it to account for more than a single-node computation? If only things were that simple! There are many reasons why the scaling of algorithms to more machines is difficult. Imagine a simple example of a file containing a list of names:

B
D
X
A
D
A

We would like to compute the number of occurrences of individual names in the file. If the file fits into a single machine, you can easily compute the number of occurrences by using a combination of the Unix tools sort and uniq:

bash> sort file | uniq -c

The output is as follows:

2 A
1 B
1 D
1 X

However, if the file is huge and distributed over multiple machines, it is necessary to adopt a slightly different computation strategy: for example, compute the number of occurrences for every part of the file that fits into memory, and then merge the partial results together. Hence, even simple tasks such as counting the occurrences of names become more complicated in a distributed environment.
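
This merge-the-partial-counts strategy is exactly what Spark's RDD API expresses in a few lines. The following is a minimal sketch of that idea (the input path is a placeholder), not code from this chapter:

import org.apache.spark.sql.SparkSession

object NameCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("NameCount").getOrCreate()
    val counts = spark.sparkContext
      .textFile("hdfs:///data/names.txt")   // placeholder path; the file is split into partitions
      .map(name => (name.trim, 1))          // each partition counts its own records...
      .reduceByKey(_ + _)                   // ...and the partial counts are merged across machines
    counts.collect().foreach { case (name, n) => println(s"$n $name") }
    spark.stop()
  }
}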

Machine learning algorithms in a distributed environment

Machine learning algorithms combine simple tasks into complex patterns that become even more complicated in a distributed environment. Let's take a simple decision tree algorithm, for example. This particular algorithm creates a binary tree that tries to fit the training data and minimize prediction errors. However, in order to do this, it has to decide which branch of the tree to send every data point to (don't worry, we'll cover the mechanics of how this algorithm works, along with some very useful parameters, later in the book). Let's demonstrate it with a simple example:

Figure 3 - Example of red and blue data points covering 2D space.

Consider the situation depicted in the preceding figure: a two-dimensional board with many points colored in two colors, red and blue. The goal of the decision tree is to learn and generalize the shape of the data and to help decide the color of a new point. In our example, we can easily see that the points almost follow a chessboard pattern. However, the algorithm has to figure out the structure by itself. It starts by finding the best position of a vertical or horizontal line that separates the red points from the blue points.

The decision found is stored in the tree root, and the same steps are applied recursively to both partitions. The algorithm ends when a partition contains only a single point:

Figure 4 - The final decision tree and projection of its prediction to the original space of points.

Splitting of data into multiple machines

For now, let's assume that the number of points is huge and cannot fit into the memory of a single machine. Hence, we need multiple machines, and we have to partition the data in such a way that each machine contains only a subset of it. This way, we solve the memory problem; however, it also means that we need to distribute the computation around a cluster of machines. This is the first difference from single-machine computing. If your data fits into a single machine's memory, it is easy to make decisions about the data, since the algorithm can access it all at once; in the case of a distributed algorithm, this is no longer true, and the algorithm has to be "clever" about how it accesses the data. Since our goal is to build a decision tree that predicts the color of a new point on the board, we need to figure out how to build a tree that is the same as the one we would build on a single machine.

The naive solution is to build a trivial tree that separates the points based on machine boundaries. But this is obviously a bad solution, since the way the data is distributed over machines does not reflect the colors of the points at all.

Another solution tries all the possible split decisions along the X and Y axes and picks the one that best separates the two colors, that is, the split that divides the points into two groups while minimizing the number of points of the other color in each group. Imagine that the algorithm is testing a split along the line X = 1.6. This means that the algorithm has to ask each machine in the cluster to report the result of splitting the machine's local data, merge the results, and decide whether this is the right splitting decision. If it finds an optimal split, it needs to inform all the machines about the decision so that each machine can record which partition each of its points belongs to.
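
To make this concrete, here is a rough sketch (our own illustration, not the book's implementation) of how evaluating one candidate split, X = 1.6, might look with Spark RDDs: each partition tallies how many red and blue points fall on each side of the line, and the driver merges the per-machine tallies to score the split. The Point case class is an assumed representation.

import org.apache.spark.rdd.RDD

case class Point(x: Double, y: Double, color: String)   // assumed representation of a colored point

// Tally (side of the split, color) pairs locally on each machine, then merge the tallies
def evaluateSplit(points: RDD[Point], threshold: Double): Map[(Boolean, String), Long] =
  points
    .map(p => ((p.x < threshold, p.color), 1L))
    .reduceByKey(_ + _)          // partial counts are combined across the cluster
    .collectAsMap()
    .toMap

// val tallies = evaluateSplit(pointsRdd, 1.6)
// val blueOnLeft = tallies.getOrElse((true, "blue"), 0L)   // how pure is each side of the split?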

Compared with the single-machine scenario, the distributed algorithm for constructing a decision tree is more complex and requires a way of distributing the computation among machines. Nowadays, with easy access to clusters of machines and an increasing demand for the analysis of larger datasets, this has become a standard requirement.

Even these two simple examples show that larger data requires proper computation and distributed infrastructure, including the following:

  • Distributed data storage, that is, if the data cannot fit on a single node, we need a way to distribute and process it on multiple machines
  • A computation paradigm to process and transform the distributed data and to apply mathematical (and statistical) algorithms and workflows
  • Support to persist and reuse defined workflows and models
  • Support to deploy statistical models in production

In short, we need a framework that supports common data science tasks. This might seem an unnecessary requirement, since data scientists prefer using existing tools, such as R, Weka, or Python's scikit-learn. However, these tools are designed neither for large-scale distributed processing nor for the parallel processing of large data. Even though there are libraries for R or Python that support limited parallel or distributed programming, their main limitation is that the base platforms, that is, R and Python themselves, were not designed for this kind of data processing and computation.

From Hadoop MapReduce to Spark

With a growing amount of data, single-machine tools were no longer able to satisfy industry needs, which created space for new data processing methods and tools, especially Hadoop MapReduce, which is based on an idea originally described in the Google paper MapReduce: Simplified Data Processing on Large Clusters (https://research.google.com/archive/mapreduce.html). On the other hand, MapReduce is a generic framework without any explicit support or libraries for creating machine learning workflows. Another limitation of classical MapReduce is that it performs many disk I/O operations during the computation instead of benefiting from machine memory.

As you have seen, there are several existing machine learning tools and distributed platforms, but none of them is an exact match for performing machine learning tasks on large data in a distributed environment. All of this opens the door for Apache Spark.

Enter the room, Apache Spark!

Created in 2010 at the UC Berkeley AMP Lab (Algorithms, Machines, People), the Apache Spark project was built with an eye for speed, ease of use, and advanced analytics. One key difference between Spark and other distributed frameworks such as Hadoop is that datasets can be cached in memory, which lends itself nicely to machine learning, given its iterative nature (more on this later!) and how data scientists are constantly accessing the same data many times over.

Spark can be run in a variety of ways, such as the following:

  • Local mode: This entails a single Java Virtual Machine (JVM) executed on a single host
  • Standalone Spark cluster: This entails multiple JVMs on multiple hosts
  • Via a resource manager such as YARN/Mesos: application deployment is driven by the resource manager, which controls the allocation of nodes and the distribution and deployment of the application
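
As a minimal illustration (not code from this chapter), the mode is selected by the master URL that the application, or spark-submit's --master flag, passes to Spark; the host name below is a placeholder:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("RunModes")
  .master("local[*]")   // single JVM on this host, using all available cores
  // .master("spark://master-host:7077")   // standalone Spark cluster (placeholder host)
  // .master("yarn")                       // resources managed by YARN; Mesos uses a mesos:// URL
  .getOrCreate()

In practice, the master is usually supplied via spark-submit rather than hard-coded, so the same application can run in any of these modes.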

What is Databricks?

If you know about the Spark project, then chances are high that you have also heard of a company called Databricks. However, you might not know how Databricks and the Spark project are related to one another. In short, Databricks was founded by the creators of the Apache Spark project and accounts for over 75% of the code base for the Spark project. Aside from being a huge force behind the Spark project with respect to development, Databricks also offers various certifications in Spark for developers, administrators, trainers, and analysts alike. However, Databricks is not the only major contributor to the code base; companies such as IBM, Cloudera, and Microsoft also actively participate in Apache Spark development.

As a side note, Databricks also organizes the Spark Summit (in both Europe and the US), which is the premier Spark conference and a great place to learn about the latest developments in the project and how others are using Spark within their ecosystem.

Throughout this book, we will give recommended links that we read daily that offer great insights and also important changes with respect to the new versions of Spark. One of the best resources here is the Databricks blog, which is constantly being updated with great content. Be sure to regularly check this out at https://databricks.com/blog.

Also, here is a link to see the past Spark Summit talks, which you may find helpful:
http://slideshare.net/databricks.

Inside the box

So, you have downloaded the latest version of Spark (depending on how you plan on launching Spark) and you have run the standard Hello, World! example....what now?!

Spark comes equipped with five libraries, which can be used separately--or in unison--depending on the task we are trying to solve. Note that in this book, we plan on using a variety of different libraries, all within the same application so that you will have the maximum exposure to the Spark platform and better understand the benefits (and limitations) of each library. These five libraries are as follows:

  • Core: This is the Spark core infrastructure, providing primitives to represent and store data, called Resilient Distributed Datasets (RDDs), and to manipulate data with tasks and jobs.
  • SQL: This library provides a user-friendly API over core RDDs, introducing DataFrames and SQL to manipulate the stored data.
  • MLlib (Machine Learning Library): This is Spark's own library of machine learning algorithms, developed in-house, that can be used within your Spark application.
  • GraphX: This is used for graphs and graph computations; we will explore this particular library in depth in a later chapter.
  • Streaming: This library allows real-time streaming of data from various sources, such as Kafka, Twitter, Flume, and TCP sockets, to name a few. Many of the applications we will build in this book leverage the MLlib and Streaming libraries.

The Spark platform can also be extended by third-party packages. There are many of them, for example, support for reading CSV or Avro files, integration with Redshift, and Sparkling Water, which encapsulates the H2O machine learning library.
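
As a small sketch of how the Core and SQL libraries work together in one application (the file name and schema are placeholders, not the book's data):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("LibrariesTogether").getOrCreate()

// SQL library: read a CSV file into a DataFrame (CSV support is built into Spark 2.x)
val flights = spark.read.option("header", "true").csv("flights.csv")

// Core library: the DataFrame is backed by an RDD we can drop down to when needed
val sampleRows = flights.rdd.take(5)

// SQL again: register the DataFrame and query it declaratively
flights.createOrReplaceTempView("flights")
spark.sql("SELECT COUNT(*) AS n FROM flights").show()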

Introducing H2O.ai

H2O is an open source machine learning platform that plays extremely well with Spark; in fact, it was one of the first third-party packages deemed "Certified on Spark".

Sparkling Water (H2O + Spark) is H2O's integration of its platform within the Spark project, combining the machine learning capabilities of H2O with all the functionality of Spark. This means that users can run H2O algorithms on Spark RDDs/DataFrames for both exploration and deployment purposes. This is made possible because H2O and Spark share the same JVM, which allows for seamless transitions between the two platforms. H2O stores data in the H2O frame, which is a columnar-compressed representation of your dataset that can be created from a Spark RDD and/or DataFrame. Throughout much of this book, we will be referencing algorithms from Spark's MLlib library and H2O's platform, showing how to use both libraries to get the best results possible for a given task.

The following is a summary of the features Sparkling Water comes equipped with:

  • Use of H2O algorithms within a Spark workflow
  • Transformations between Spark and H2O data structures
  • Use of Spark RDD and/or DataFrame as inputs to H2O algorithms
  • Use of H2O frames as inputs into MLlib algorithms (will come in handy when we do feature engineering later)
  • Transparent execution of Sparkling Water applications on top of Spark (for example, we can run a Sparkling Water application within a Spark stream)
  • The H2O user interface to explore Spark data
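
A hedged sketch of what this looks like in code follows; exact imports and method names can differ between Sparkling Water releases, and the input file is a placeholder:

import org.apache.spark.sql.SparkSession
import org.apache.spark.h2o._

val spark = SparkSession.builder().appName("SparklingWaterIntro").getOrCreate()
val h2oContext = H2OContext.getOrCreate(spark)   // starts H2O services alongside the Spark executors

val df = spark.read.option("header", "true").csv("input.csv")   // placeholder input
val hf = h2oContext.asH2OFrame(df)   // Spark DataFrame -> H2O frame (data is copied and compressed)
// ...run H2O algorithms on hf, then convert results back, for example with h2oContext.asDataFrame(...)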

Design of Sparkling Water

Sparkling Water is designed to be executed as a regular Spark application. Consequently, it is launched inside a Spark executor created after submitting the application. At this point, H2O starts services, including a distributed key-value (K/V) store and memory manager, and orchestrates them into a cloud. The topology of the created cloud follows the topology of the underlying Spark cluster.

As stated previously, Sparkling Water enables transformation between different types of RDDs/DataFrames and H2O's frame, and vice versa. When converting from a hex frame to an RDD, a wrapper is created around the hex frame to provide an RDD-like API. In this case, data is not duplicated but served directly from the underlying hex frame. Converting from an RDD/DataFrame to an H2O frame requires data duplication because it transforms the data from Spark into H2O-specific storage. However, data stored in an H2O frame is heavily compressed and does not need to be preserved as an RDD anymore:

Data sharing between Sparkling Water and Spark

What's the difference between H2O and Spark's MLlib?

As stated previously, MLlib is a library of popular machine learning algorithms built using Spark. Not surprisingly, H2O and MLlib share many of the same algorithms but differ in both their implementation and functionality. One very handy feature of H2O is that it allows users to visualize their data and perform feature engineering tasks, which we will cover in depth in later chapters. The visualization of data is accomplished through a web-based GUI that gives users a friendly interface for seamlessly switching between a code shell and a notebook environment. The following is an example of the H2O notebook - called Flow - that you will become familiar with soon:

One other nice addition is that H2O allows data scientists to grid search the many hyperparameters that ship with its algorithms. Grid search is a way of optimizing an algorithm's hyperparameters to make model configuration easier. It is often difficult to know which hyperparameters to change and how to change them; grid search allows us to explore many hyperparameters simultaneously, measure the output, and select the best models based on our quality requirements. The H2O grid search can be combined with model cross-validation and various stopping criteria, resulting in advanced strategies such as picking 1,000 random parameter combinations from a huge hyperparameter space and finding the best model that can be trained in under two minutes with an AUC greater than 0.7.
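
To illustrate the general idea, here is a sketch using Spark ML's own grid search (ParamGridBuilder and CrossValidator) rather than H2O's API; the pattern is the same: enumerate hyperparameter combinations, cross-validate each one, and keep the best model. The training DataFrame is a placeholder.

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val lr = new LogisticRegression()

// Enumerate 3 x 3 = 9 hyperparameter combinations
val grid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.001, 0.01, 0.1))
  .addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0))
  .build()

// Cross-validate each combination and score it by AUC
val cv = new CrossValidator()
  .setEstimator(lr)
  .setEvaluator(new BinaryClassificationEvaluator().setMetricName("areaUnderROC"))
  .setEstimatorParamMaps(grid)
  .setNumFolds(3)

// val bestModel = cv.fit(trainingDf).bestModel   // trainingDf is a placeholder DataFrame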

Data munging

Raw data for problems often comes from multiple sources with different and often incompatible formats. The beauty of the Spark programming model is its ability to define data operations that process the incoming data and transform it into a regular form that can be used for further feature engineering and model building. This process is commonly referred to as data munging and is where much of the battle is won with respect to data science projects. We keep this section intentionally brief because the best way to showcase the power--and necessity!--of data munging is by example. So, take heart; we have plenty of practice to go through in this book, which emphasizes this essential process.

Data science - an iterative process

Often, the process flow of many big data projects is iterative, which means a lot of back and forth: testing new ideas, trying new features to include, tweaking various hyperparameters, and so on, with a fail-fast attitude. The end result of these projects is usually a model that can answer a question being posed. Notice that we didn't say accurately answer a question being posed! One pitfall for many data scientists these days is their inability to generalize a model to new data, meaning that they have overfit their data so that the model provides poor results when given new data. Accuracy is extremely task-dependent and is usually dictated by the business needs, with some sensitivity analysis being done to weigh the cost-benefits of the model outcomes. However, there are a few standard accuracy measures that we will go over throughout this book so that you can compare various models and see how changes to a model impact the result.

H2O.ai regularly gives meetup talks, and invites others to give them, around the US and Europe. Slides from each meetup or conference are available on SlideShare (http://www.slideshare.com/0xdata) or YouTube. Both sites serve as great sources of information, not only about machine learning and statistics but also about distributed systems and computation. For example, one of the most interesting presentations highlights the "Top 10 pitfalls in a data scientist job" (http://www.slideshare.net/0xdata/h2o-world-top-10-data-science-pitfalls-mark-landry).

Summary

In this chapter, we wanted to give you a brief glimpse into the life of a data scientist, what this entails, and some of the challenges that data scientists consistently face. In light of these challenges, we feel that the Apache Spark project is ideally positioned to help tackle these topics, which range from data ingestion and feature extraction/creation to model building and deployment. We intentionally kept this chapter short and light on verbiage because we feel working through examples and different use cases is a better use of time as opposed to speaking abstractly and at length about a given data science topic. Throughout the rest of this book, we will focus solely on this process while giving best-practice tips and recommended reading along the way for users who wish to learn more. Remember that before embarking on your next data science project, be sure to clearly define the problem beforehand, so you can ask an intelligent question of your data and (hopefully) get an intelligent answer!

One awesome website for all things data science is KDnuggets (http://www.kdnuggets.com). Here's a great article on the language all data scientists must learn in order to be successful (http://www.kdnuggets.com/2015/09/one-language-data-scientist-must-master.html).


Key benefits

  • Process and analyze big data in a distributed and scalable way
  • Write sophisticated Spark pipelines that incorporate elaborate extraction
  • Build and use regression models to predict flight delays

Description

The purpose of machine learning is to build systems that learn from data. Being able to understand trends and patterns in complex data is critical to success; it is one of the key strategies to unlock growth in today's challenging marketplace. With the meteoric rise of machine learning, developers are now keen on finding out how they can make their Spark applications smarter. This book shows you how to transform data into actionable knowledge. It commences by defining machine learning primitives with the MLlib and H2O libraries. You will learn how to use binary classification to detect the Higgs boson particle in the huge amount of data produced by the CERN particle collider, and how to classify daily health activities using ensemble methods for multi-class classification. Next, you will solve a typical regression problem involving flight delay predictions and write sophisticated Spark pipelines. You will analyze Twitter data with the help of the doc2vec algorithm and K-means clustering. Finally, you will build different pattern mining models using MLlib, perform complex manipulation of DataFrames using Spark and Spark SQL, and deploy your app in a Spark streaming environment.

Who is this book for?

Are you a developer with a background in machine learning and statistics who is feeling limited by the current slow and “small data” machine learning tools? Then this is the book for you! In this book, you will create scalable machine learning applications to power a modern data-driven business using Spark. We assume that you already know the machine learning concepts and algorithms and have Spark up and running (whether on a cluster or locally) and have a basic knowledge of the various libraries contained in Spark.

What you will learn

  • Use Spark streams to cluster tweets online
  • Run the PageRank algorithm to compute user influence
  • Perform complex manipulation of DataFrames using Spark
  • Define Spark pipelines to compose individual data transformations
  • Utilize generated models for offline/online prediction
  • Transfer the learning from an ensemble to a simpler Neural Network
  • Understand basic graph properties and important graph operations
  • Use GraphFrames, an extension of DataFrames to graphs, to study graphs using an elegant query language
  • Use the K-means algorithm to cluster a movie reviews dataset

Product Details

Publication date: Aug 31, 2017
Length: 340 pages
Edition: 1st
Language: English
ISBN-13: 9781785283451





Table of Contents

8 Chapters
1. Introduction to Large-Scale Machine Learning and Spark
2. Detecting Dark Matter - The Higgs-Boson Particle
3. Ensemble Methods for Multi-Class Classification
4. Predicting Movie Reviews Using NLP and Spark Streaming
5. Word2vec for Prediction and Clustering
6. Extracting Patterns from Clickstream Data
7. Graph Analytics with GraphX
8. Lending Club Loan Prediction

Customer reviews

Rating distribution: 5 (1 rating) - 5 star: 100%, 4 star: 0%, 3 star: 0%, 2 star: 0%, 1 star: 0%

Canming, Oct 21, 2017 (5 stars, Amazon verified review):
Awesome book to get you started in machine learning.

FAQs

What is the delivery time and cost of the print book?

Shipping Details

USA:


Economy: Delivery to most addresses in the US within 10-15 business days

Premium: Trackable Delivery to most addresses in the US within 3-8 business days

UK:

Economy: Delivery to most addresses in the U.K. within 7-9 business days.
Shipments are not trackable

Premium: Trackable delivery to most addresses in the U.K. within 3-4 business days!
Add one extra business day for deliveries to Northern Ireland and Scottish Highlands and islands

EU:

Premium: Trackable delivery to most EU destinations within 4-9 business days.

Australia:

Economy: Can deliver to P. O. Boxes and private residences.
Trackable service with delivery to addresses in Australia only.
Delivery time ranges from 7-9 business days for VIC and 8-10 business days for Interstate metro
Delivery time is up to 15 business days for remote areas of WA, NT & QLD.

Premium: Delivery to addresses in Australia only
Trackable delivery to most P. O. Boxes and private residences in Australia within 4-5 days based on the distance to a destination following dispatch.

India:

Premium: Delivery to most Indian addresses within 5-6 business days

Rest of the World:

Premium: Countries in the American continent: Trackable delivery to most countries within 4-7 business days

Asia:

Premium: Delivery to most Asian addresses within 5-9 business days

Disclaimer:
All orders received before 5 PM UK time start printing on the next business day, so the estimated delivery times also start from the next day. Orders received after 5 PM UK time (in our internal systems) on a business day, or at any time on the weekend, will begin printing on the second business day after. For example, an order placed at 11 AM today will begin printing tomorrow, whereas an order placed at 9 PM tonight will begin printing the day after tomorrow.


Unfortunately, due to several restrictions, we are unable to ship to the following countries:

  1. Afghanistan
  2. American Samoa
  3. Belarus
  4. Brunei Darussalam
  5. Central African Republic
  6. The Democratic Republic of Congo
  7. Eritrea
  8. Guinea-Bissau
  9. Iran
  10. Lebanon
  11. Libyan Arab Jamahiriya
  12. Somalia
  13. Sudan
  14. Russian Federation
  15. Syrian Arab Republic
  16. Ukraine
  17. Venezuela
What is a customs duty/charge?

Customs duties are charges levied on goods when they cross international borders. They are taxes imposed on imported goods. These duties are charged by special authorities and bodies created by local governments and are meant to protect local industries, economies, and businesses.

Do I have to pay customs charges for the print book order?

Orders shipped to the countries that are listed under the EU27 will not bear customs charges; these are paid by Packt as part of the order.

List of EU27 countries: www.gov.uk/eu-eea:

For shipments to recipient countries outside the EU27, a customs duty or localized taxes may be applicable and will be charged by the recipient country; these duties must be paid by the customer and are not included in the shipping charges on the order.

How do I know my customs duty charges?

The amount of duty payable varies greatly depending on the imported goods, the country of origin and several other factors like the total invoice amount or dimensions like weight, and other such criteria applicable in your country.

For example:

  • If you live in Mexico and the declared value of your ordered items is over $50, then to receive your package you will have to pay an additional import tax of 19% (i.e. $9.50) to the courier service.
  • Whereas if you live in Turkey and the declared value of your ordered items is over €22, then to receive your package you will have to pay an additional import tax of 18% (i.e. €3.96) to the courier service.
How can I cancel my order?

Cancellation Policy for Published Printed Books:

You can cancel any order within 1 hour of placing the order. Simply contact customercare@packt.com with your order details or payment transaction id. If your order has already started the shipment process, we will do our best to stop it. However, if it is already on the way to you then when you receive it, you can contact us at customercare@packt.com using the returns and refund process.

Please understand that Packt Publishing cannot provide refunds or cancel any order except in the cases described in our Return Policy (i.e., where Packt Publishing agrees to replace your printed book because it arrives damaged or with a material defect); otherwise, Packt Publishing will not accept returns.

What is your returns and refunds policy?

Return Policy:

We want you to be happy with your purchase from Packtpub.com. We will not hassle you with returning print books to us. If the print book you receive from us is incorrect, damaged, doesn't work, or is unacceptably late, please contact our Customer Relations Team at customercare@packt.com with the order number and issue details as explained below:

  1. If you ordered (eBook, Video, or Print Book) incorrectly or accidentally, please contact our Customer Relations Team at customercare@packt.com within one hour of placing the order and we will replace/refund the item cost.
  2. Sadly, if your eBook or Video file is faulty, or a fault occurs while the eBook or Video is being made available to you, i.e. during download, then you should contact our Customer Relations Team within 14 days of purchase at customercare@packt.com, who will be able to resolve this issue for you.
  3. You will have a choice of replacement or refund for the problem items (damaged, defective, or incorrect).
  4. Once Customer Care Team confirms that you will be refunded, you should receive the refund within 10 to 12 working days.
  5. If you are only requesting a refund of one book from a multiple order, then we will refund you the appropriate single item.
  6. Where the items were shipped under a free shipping offer, there will be no shipping costs to refund.

On the off chance your printed book arrives damaged or with a material defect, contact our Customer Relations Team at customercare@packt.com within 14 days of receipt of the book with appropriate evidence of the damage, and we will work with you to secure a replacement copy, if necessary. Please note that each printed book you order from us is individually made by Packt's professional book-printing partner on a print-on-demand basis.

What tax is charged?

Currently, no tax is charged on the purchase of any print book (subject to change based on the laws and regulations). A localized VAT fee is charged only to our European and UK customers on eBooks, Video and subscriptions that they buy. GST is charged to Indian customers for eBooks and video purchases.

What payment methods can I use?

You can pay with the following card types:

  1. Visa Debit
  2. Visa Credit
  3. MasterCard
  4. PayPal