Spark for Data Science

Analyze your data and delve deep into the world of machine learning with the latest Spark version, 2.0

Product type: Paperback
Published: Sep 2016
Publisher: Packt
ISBN-13: 9781785885655
Length: 344 pages
Edition: 1st Edition
Authors (2):
Bikramaditya Singhal
Srinivas Duvvuri

Table of Contents (12 chapters)

Preface
1. Big Data and Data Science – An Introduction
2. The Spark Programming Model
3. Introduction to DataFrames
4. Unified Data Access
5. Data Analysis on Spark
6. Machine Learning
7. Extending Spark with SparkR
8. Analyzing Unstructured Data
9. Visualizing Big Data
10. Putting It All Together
11. Building Data Science Applications

What this book covers

Chapter 1, Big Data and Data Science – An Introduction, briefly discusses the various challenges in big data analytics and how Apache Spark addresses them on a single platform. It also explains how data analytics has evolved into its present form and gives a basic overview of the Spark stack.

Chapter 2, The Spark Programming Model, covers the design considerations of Apache Spark and the programming languages it supports. It also explains the Spark core components and covers in detail the RDD API, which is the basic building block of Spark.

Chapter 3, Introduction to DataFrames, explains DataFrames, the handiest and most useful component for data scientists to work with. It explains Spark SQL and the Catalyst optimizer that powers DataFrames, and demonstrates various DataFrame operations with code examples.

Chapter 4, Unified Data Access, covers the various ways to source data from different sources, consolidate it, and work with it in a unified way. It covers the streaming aspect of collecting real-time data and operating on it, and also discusses the under-the-hood fundamentals of these APIs.

Chapter 5, Data Analysis on Spark, discusses the complete data analytics lifecycle. With ample code examples, it explains how to source data from different sources, prepare the data using data cleaning and transformation techniques, and perform descriptive and inferential statistics to uncover hidden insights in the data.

Chapter 6, Machine Learning, explains various machine learning algorithms, how they are implemented in the MLlib library, and how they can be used with the pipeline API for streamlined execution. It covers the fundamentals of every algorithm it discusses, so it can serve as a one-stop reference.

Chapter 7, Extending Spark with SparkR, is primarily intended for R programmers who want to leverage Spark for data analytics. It explains how to program with SparkR and how to use the machine learning algorithms available in R libraries.

Chapter 8, Analyzing Unstructured Data, focuses exclusively on unstructured data analysis. It explains how to source unstructured data, process it, and perform machine learning on it. It also covers some of the dimensionality reduction techniques that were not covered in the “Machine Learning” chapter.

Chapter 9, Visualizing Big Data, presents various visualization techniques that are supported on Spark. It explains the different visualization requirements of data engineers, data scientists, and business users, and suggests the right kinds of tools and techniques for each. It also discusses leveraging IPython/Jupyter notebooks and Zeppelin, an Apache project for data visualization.

Chapter 10, Putting It All Together, builds on the earlier chapters, which discuss most of the data analytics components separately. This chapter stitches together the various steps of a typical data science project and demonstrates a step-by-step approach to executing a full-blown analytics project.

Chapter 11, Building Data Science Applications, follows on from the preceding chapters, which mostly discuss the data science components along with a full-blown execution example. This chapter provides an introduction to building data products that can be deployed in production. It also outlines the current development status of the Apache Spark project and what is in store for it.
