Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Conferences
Free Learning
Arrow right icon
Arrow up icon
GO TO TOP
Mastering Machine Learning with Spark 2.x

You're reading from   Mastering Machine Learning with Spark 2.x Harness the potential of machine learning, through spark

Arrow left icon
Product type Paperback
Published in Aug 2017
Publisher Packt
ISBN-13 9781785283451
Length 340 pages
Edition 1st Edition
Languages
Arrow right icon
Authors (3):
Arrow left icon
Alex Tellez Alex Tellez
Author Profile Icon Alex Tellez
Alex Tellez
Michal Malohlava Michal Malohlava
Author Profile Icon Michal Malohlava
Michal Malohlava
Max Pumperla Max Pumperla
Author Profile Icon Max Pumperla
Max Pumperla
Arrow right icon
View More author details
Toc

Introduction to Large-Scale Machine Learning and Spark

"Information is the oil of the 21st century, and analytics is the combustion engine."


--Peter Sondergaard, Gartner Research

By 2018, it is estimated that companies will spend $114 billion on big data-related projects, an increase of roughly 300%, compared to 2013 (https://www.capgemini-consulting.com/resource-file-access/resource/pdf/big_data_pov_03-02-15.pdf). Much of this increase in expenditure is due to how much data is being created and how we are better able to store such data by leveraging distributed filesystems such as Hadoop.

However, collecting the data is only half the battle; the other half involves data extraction, transformation, and loading into a computation system, which leverage the power of modern computers to apply various mathematical methods in order to learn more about data and patterns, and extract useful information to make relevant decisions. The entire data workflow has been boosted in the last few years by not only increasing the computation power and providing easily accessible and scalable cloud services (for example, Amazon AWS, Microsoft Azure, and Heroku) but also by a number of tools and libraries that help to easily manage, control, and scale infrastructure and build applications. Such a growth in the computation power also helps to process larger amounts of data and to apply algorithms that were impossible to apply earlier. Finally, various computation-expensive statistical or machine learning algorithms have started to help extract nuggets of information from data.

One of the first well-adopted big data technologies was Hadoop, which allows for the MapReduce computation by saving intermediate results on a disk. However, it still lacks proper big data tools for information extraction. Nevertheless, Hadoop was just the beginning. With the growing size of machine memory, new in-memory computation frameworks appeared, and they also started to provide basic support for conducting data analysis and modeling—for example, SystemML or Spark ML for Spark and FlinkML for Flink. These frameworks represent only the tip of the iceberg—there is a lot more in the big data ecosystem, and it is permanently evolving, since the volume of data is constantly growing, demanding new big data algorithms and processing methods. For example, the Internet of Things (IoT) represents a new domain that produces huge amount of streaming data from various sources (for example, home security system, Alexa Echo, or vital sensors) and brings not only an unlimited potential to mind useful information from data, but also demands new kind of data processing and modeling methods.

Nevertheless, in this chapter, we will start from the beginning and explain the following topics:

  • Basic working tasks of data scientists
  • Aspect of big data computation in distributed environment
  • The big data ecosystem
  • Spark and its machine learning support
You have been reading a chapter from
Mastering Machine Learning with Spark 2.x
Published in: Aug 2017
Publisher: Packt
ISBN-13: 9781785283451
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $19.99/month. Cancel anytime
Banner background image