Parallelization with Spark RDDs

Now that we know how to create an RDD from a text file that we retrieved from the internet, we can look at a different way to create RDDs. Let's discuss parallelization with Spark RDDs.

In this section, we will cover the following topics:

  • What is parallelization?
  • How do we parallelize Spark RDDs?

Let's start with parallelization.

What is parallelization?

The best way to understand Spark, or any framework, is to look at its documentation. If we look at Spark's documentation, it states that the textFile function, which we used last time, reads a text file from HDFS (or any other Hadoop-supported filesystem).

On the other hand, if we look at the definition of parallelize, we can see that it creates an RDD by distributing a local collection, which, in PySpark, means a local Python collection such as a list.

So, the main difference between creating an RDD with parallelize and creating one with textFile is where the data is sourced from: parallelize takes data that already lives in the driver program, whereas textFile reads data from storage.
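
To make the contrast concrete, here is a minimal sketch of both approaches side by side. It assumes an existing SparkContext named sc (for example, the one available in a PySpark shell), and the file path is purely hypothetical:

# textFile: the data lives outside the driver program
# (the path is hypothetical -- substitute your own file)
file_rdd = sc.textFile("/tmp/some_data.txt")

# parallelize: the data already lives in the driver as a Python collection
local_rdd = sc.parallelize([1, 2, 3, 4, 5])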

Let's look at how this works in practice. Let's go back to the PySpark installation screen, where we left off previously. So far, we imported urllib, used urllib.request to retrieve some data from the internet, and used SparkContext with textFile to load that data into Spark. The other way to do this is to use parallelize.

Let's look at how we can do this. Let's first assume that our data is already in Python, so, for demonstration purposes, we will create a Python range of a hundred numbers as follows:

a = range(100)
a

This gives us the following output:

range(0, 100)

If we look at a, it is simply a range object covering 100 numbers. If we convert it into a list, it shows us the 100 numbers explicitly:

list(a)

This gives us the following output:

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, ...

The following command shows us how to turn this into an RDD:

list_rdd = sc.parallelize(a)

If we look at what list_rdd contains, we can see the reference PythonRDD.scala:52; this tells us that the Scala-backed PySpark instance has recognized this as a Python-created RDD, as follows:

list_rdd

This gives us the following output:

PythonRDD[3] at RDD at PythonRDD.scala:52
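
As an aside that goes beyond the original walkthrough, you can inspect how parallelize split the collection across workers by checking the number of partitions, and you can control it with the optional numSlices argument:

list_rdd.getNumPartitions()   # for example, 4, depending on your local setup

list_rdd_8 = sc.parallelize(a, 8)   # explicitly request 8 slices
list_rdd_8.getNumPartitions()       # 8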

Now, let's look at what we can do with this list. The first thing we can do is count how many elements are present in list_rdd by using the following command:

list_rdd.count()

This gives us the following output:

100

We can see that list_rdd has a count of 100. If we run the count again, notice that, because Spark has to work through the RDD at run time, it is slower than just taking the length of a, which is instant.

The RDD takes some time because Spark needs to go through the parallelized version of the list. So, at a small scale, where there are only a hundred numbers, this trade-off might not seem very helpful, but with larger amounts of data, and larger individual elements, it makes a lot more sense.
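
A rough way to see this trade-off for yourself is to time both calls; this sketch is not from the original text, and the absolute numbers will vary with your machine:

import time

start = time.time()
len(a)              # local Python object: effectively instant
print("local:", time.time() - start)

start = time.time()
list_rdd.count()    # Spark schedules tasks over the RDD's partitions
print("Spark:", time.time() - start)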

We can also take an arbitrary number of elements from the RDD, as follows:

list_rdd.take(10)

This gives us the following output:

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

When we run the preceding command, we can see that PySpark has performed some calculations before returning the first ten elements of the list. Notice that all of this is now backed by PySpark, and we are using Spark's power to manipulate this list of 100 items.
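
One practical aside: take(n) only evaluates as many partitions as it needs to return n elements, whereas collect() materializes the entire RDD on the driver. With 100 numbers either is fine, but on real data prefer take:

list_rdd.take(10)        # cheap: evaluates only the partitions it needs
list_rdd.collect()[:10]  # pulls the whole RDD back to the driver first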

Let's now use the reduce function on list_rdd, and on RDDs in general, to demonstrate what we can do with PySpark's RDDs. We will pass a two-parameter function, written as an anonymous lambda, to the reduce call as follows:

list_rdd.reduce(lambda a, b: a+b)

Here, the lambda takes two parameters, a and b. It simply adds these two numbers together, hence a+b, and returns the result. With the RDD's reduce call, Spark adds the first two numbers of the RDD together, returns the result, then adds the third number to that result, and so on, until eventually all 100 numbers have been folded into a single value.
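
If the folding behaviour feels abstract, here is a small aside (not from the original text): the same lambda run through Python's built-in functools.reduce on the local list produces the identical result, while the Spark version performs the reduction across the RDD's partitions:

from functools import reduce

reduce(lambda a, b: a + b, list(a))    # local fold over the plain Python list
list_rdd.reduce(lambda a, b: a + b)    # the same fold, distributed by Spark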

After Spark has done its work across the distributed dataset, we can see that adding the numbers from 0 to 99 gives us 4950, and it is all done using PySpark's RDD methodology. You might recognize this function from the term MapReduce, and, indeed, it's the same idea.

We have just learned what parallelization is in PySpark, and how we can parallelize Spark RDDs. This effectively gives us another way to create RDDs, which is very useful. Now, let's look at some basics of RDD operations.
