Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Free Learning
Arrow right icon
Arrow up icon
GO TO TOP
PySpark Cookbook

You're reading from   PySpark Cookbook Over 60 recipes for implementing big data processing and analytics using Apache Spark and Python

Arrow left icon
Product type Paperback
Published in Jun 2018
Publisher Packt
ISBN-13 9781788835367
Length 330 pages
Edition 1st Edition
Languages
Concepts
Arrow right icon
Authors (2):
Arrow left icon
Tomasz Drabas Tomasz Drabas
Author Profile Icon Tomasz Drabas
Tomasz Drabas
Denny Lee Denny Lee
Author Profile Icon Denny Lee
Denny Lee
Arrow right icon
View More author details
Toc

Table of Contents (9) Chapters Close

Preface 1. Installing and Configuring Spark FREE CHAPTER 2. Abstracting Data with RDDs 3. Abstracting Data with DataFrames 4. Preparing Data for Modeling 5. Machine Learning with MLlib 6. Machine Learning with the ML Module 7. Structured Streaming with PySpark 8. GraphFrames – Graph Theory with PySpark

Introduction

We cannot begin a book on Spark (well, on PySpark) without first specifying what Spark is. Spark is a powerful, flexible, open source, data processing and querying engine. It is extremely easy to use and provides the means to solve a huge variety of problems, ranging from processing unstructured, semi-structured, or structured data, through streaming, up to machine learning. With over 1,000 contributors from over 250 organizations (not to mention over 3,000 Spark Meetup community members worldwide), Spark is now one of the largest open source projects in the portfolio of the Apache Software Foundation.

The origins of Spark can be found in 2012 when it was first released; Matei Zacharia developed the first versions of the Spark processing engine at UC Berkeley as part of his PhD thesis. Since then, Spark has become extremely popular, and its popularity stems from a number of reasons:

  • It is fast: It is estimated that Spark is 100 times faster than Hadoop when working purely in memory, and around 10 times faster when reading or writing data to a disk.
  • It is flexible: You can leverage the power of Spark from a number of programming languages; Spark natively supports interfaces in Scala, Java, Python, and R. 
  • It is extendible: As Spark is an open source package, you can easily extend it by introducing your own classes or extending the existing ones. 
  • It is powerful: Many machine learning algorithms are already implemented in Spark so you do not need to add more tools to your stack—most of the data engineering and data science tasks can be accomplished while working in a single environment.
  • It is familiar: Data scientists and data engineers, who are accustomed to using Python's pandas, or R's data.frames or data.tables, should have a much gentler learning curve (although the differences between these data types exist). Moreover, if you know SQL, you can also use it to wrangle data in Spark!
  • It is scalable: Spark can run locally on your machine (with all the limitations such a solution entails). However, the same code that runs locally can be deployed to a cluster of thousands of machines with little-to-no changes. 

For the remainder of this book, we will assume that you are working in a Unix-like environment such as Linux (throughout this book, we will use Ubuntu Server 16.04 LTS) or macOS (running macOS High Sierra); all the code provided has been tested in these two environments. For this chapter (and some other ones, too), an internet connection is also required as we will be downloading a bunch of binaries and sources from the internet. 

We will not be focusing on installing Spark in a Windows environment as it is not truly supported by the Spark developers. However, if you are inclined to try, you can follow some of the instructions you will find online, such as from the following link: http://bit.ly/2Ar75ld.

Knowing how to use the command line and how to set some environment variables on your system is useful, but not really required—we will guide you through the steps.

You have been reading a chapter from
PySpark Cookbook
Published in: Jun 2018
Publisher: Packt
ISBN-13: 9781788835367
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $19.99/month. Cancel anytime
Banner background image