You're reading from Apache Spark 2.x Cookbook Over 70 cloud-ready recipes for distributed Big Data processing and analytics

Product type Paperback

Published in May 2017

ISBN-13 9781787127265

Length 294 pages

Edition 1st Edition

Author (1):

Rishi Yadav

Let's revisit our word count example to understand these five parts. This is how the RDD graph for wordCount looks at the dataset level:

The flow goes as follows, one RDD at a time:

        scala> val words = sc.textFile("hdfs://localhost:9000/user/hduser/words")

The following are the five parts of the words RDD:

Part	Description
Partitions	One partition per HDFS inputsplit/block (`org.apache.spark.rdd.HadoopPartition`)
Dependencies	None
Compute function	To read the block
Preferred location	The HDFS block's location
Partitioner	None
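
To make the table concrete, here is a minimal sketch that prints the five parts of an RDD from its public API (`partitions`, `dependencies`, `preferredLocations`, `partitioner`). It rests on a few assumptions: a small local temp file stands in for the HDFS path, the session runs with a `local[2]` master, and `describe` is a hypothetical helper, not part of Spark. Note that `sc.textFile` returns a thin `map` over an underlying `HadoopRDD`, so the RDD with no dependencies described in the table is the parent that `words` points to.

```scala
import java.nio.file.Files
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// In spark-shell, getOrCreate() returns the session that already exists.
val spark = SparkSession.builder().master("local[2]").appName("rdd-five-parts").getOrCreate()
val sc = spark.sparkContext

// Assumption: a small local temp file replaces the HDFS directory used in the text.
val input = Files.createTempFile("words", ".txt")
Files.write(input, "to be or not to be\nthat is the question\n".getBytes("UTF-8"))

// Hypothetical helper: prints the five parts of any RDD.
def describe(name: String, rdd: RDD[_]): Unit = {
  println(s"== $name ==")
  println(s"partitions:   ${rdd.partitions.map(_.getClass.getSimpleName).mkString(", ")}")
  println(s"dependencies: ${rdd.dependencies.map(_.getClass.getSimpleName).mkString(", ")}")
  println(s"compute:      implemented by ${rdd.getClass.getSimpleName}")
  rdd.partitions.foreach { p =>
    println(s"preferred location of partition ${p.index}: ${rdd.preferredLocations(p)}")
  }
  println(s"partitioner:  ${rdd.partitioner}")
}

val words = sc.textFile(input.toUri.toString)
describe("words", words)

// textFile is a map over a HadoopRDD; that parent is the RDD with no dependencies.
val hadoopRdd = words.dependencies.head.rdd
describe("underlying HadoopRDD", hadoopRdd)
```

On a real cluster the preferred locations would be the hosts holding each HDFS block; with a local file the values are not meaningful for scheduling.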

        scala> val wordsFlatMap = words.flatMap(_.split("\\W+"))

The following are the five parts of the wordsFlatMap RDD:

Part	Description
Partitions	Same as the parent RDD, that is, `words` (`org.apache.spark.rdd.HadoopPartition`)
Dependencies	One-to-one dependency on the parent RDD, that is, `words` (`org.apache.spark.OneToOneDependency`)
Compute function	To compute the parent, split each element, and flatten the results
Preferred location	Ask parent RDD
Partitioner	None
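
The split pattern decides what counts as a word, so it helps to see it in isolation. The sketch below uses plain Scala collections rather than an RDD (`flatMap` behaves the same way on both), and the sample line is made up for illustration. The regular expression `\W+` matches one or more non-word characters; in a Scala string literal the backslash must be escaped.

```scala
// Plain Scala collections, no Spark needed; the sample line is made up for illustration.
val lines = List("Hello, world! Hello Spark")

// \W+ matches one or more characters outside [A-Za-z0-9_].
val tokens = lines.flatMap(_.split("\\W+"))
println(tokens)   // List(Hello, world, Hello, Spark)

// Without the backslashes the pattern means "one or more capital W letters",
// so a line that contains no 'W' comes back as a single unsplit element.
val unsplit = lines.flatMap(_.split("W+"))
println(unsplit)  // List(Hello, world! Hello Spark)
```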

        scala> val wordsMap = wordsFlatMap.map(w => (w, 1))

The following are the five parts of the wordsMap RDD:

Part	Description
Partitions	Same as the parent RDD, that is, `wordsFlatMap` (`org.apache.spark.rdd.HadoopPartition`)
Dependencies	One-to-one dependency on the parent RDD, that is, `wordsFlatMap` (`org.apache.spark.OneToOneDependency`)
Compute function	To compute the parent and map each element to a key-value pair (a pair RDD)
Preferred location	Ask parent RDD
Partitioner	None
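
The narrow-dependency chain described so far can be checked directly. This is a sketch under stated assumptions: an in-memory collection with three partitions replaces the HDFS input (so the partition class is not `HadoopPartition` here), and the session uses a `local[2]` master.

```scala
import org.apache.spark.OneToOneDependency
import org.apache.spark.sql.SparkSession

// In spark-shell, getOrCreate() returns the session that already exists.
val spark = SparkSession.builder().master("local[2]").appName("narrow-deps").getOrCreate()
val sc = spark.sparkContext

// Assumption: an in-memory collection with 3 partitions replaces the HDFS input.
val words = sc.parallelize(Seq("to be or not to be", "that is the question"), numSlices = 3)
val wordsFlatMap = words.flatMap(_.split("\\W+"))
val wordsMap = wordsFlatMap.map(w => (w, 1))

// Narrow transformations reuse the parent's partitions one for one.
println(wordsMap.partitions.length == words.partitions.length)  // true

// Each child points at its parent through a OneToOneDependency.
val dep = wordsMap.dependencies.head
println(dep.isInstanceOf[OneToOneDependency[_]])  // true
println(dep.rdd eq wordsFlatMap)                  // true

// map and flatMap do not set a partitioner, even when the output is key-value pairs.
println(wordsMap.partitioner)                     // None
```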

        scala> val wordCount = wordsMap.reduceByKey(_+_)

The following are the five parts of the wordCount RDD:

Part	Description
Partitions	One per reduce task (`org.apache.spark.rdd.ShuffledRDDPartition`)
Dependencies	Shuffle dependency on each parent (`org.apache.spark.ShuffleDependency`)
Compute function	To perform additions on shuffled data
Preferred location	None
Partitioner	HashPartitioner (`org.apache.spark.HashPartitioner`)
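
The shuffle step is where the five parts change the most, and the same public API shows it. The sketch below assumes an in-memory input line and a `local[2]` master in place of the HDFS setup; the number of reduce partitions depends on the session's configuration, so it is printed rather than asserted.

```scala
import org.apache.spark.{HashPartitioner, ShuffleDependency}
import org.apache.spark.sql.SparkSession

// In spark-shell, getOrCreate() returns the session that already exists.
val spark = SparkSession.builder().master("local[2]").appName("shuffle-parts").getOrCreate()
val sc = spark.sparkContext

// Assumption: a single in-memory line replaces the HDFS input.
val wordsMap = sc.parallelize(Seq("to be or not to be"), numSlices = 2)
  .flatMap(_.split("\\W+"))
  .map(w => (w, 1))
val wordCount = wordsMap.reduceByKey(_ + _)

// reduceByKey installs a HashPartitioner; there is one partition per reduce task.
println(wordCount.partitioner.exists(_.isInstanceOf[HashPartitioner]))  // true
println(s"reduce partitions: ${wordCount.partitions.length}")

// The link to the parent is a ShuffleDependency, which marks a stage boundary.
println(wordCount.dependencies.head.isInstanceOf[ShuffleDependency[_, _, _]])  // true

// The lineage as Spark reports it.
println(wordCount.toDebugString)

// Collect to the driver; safe here only because the data is tiny.
val counts = wordCount.collect().toMap
println(counts)
```

Because the result already carries a `HashPartitioner`, a later key-based operation that uses the same partitioner, such as another `reduceByKey` with the same number of partitions, can run without a second shuffle.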