An RDD can be created in four ways:
- Parallelize a collection: This is one of the easiest ways to create an RDD. You can take an existing collection in your program, such as a List, Array, or Set, and ask Spark to distribute it across the cluster so it can be processed in parallel. A collection is distributed with the help of parallelize(), as shown here:
# Python
# Distribute a local range of numbers across the cluster as an RDD
numberRDD = spark.sparkContext.parallelize(range(1, 10))
numberRDD.collect()
Out[4]: [1, 2, 3, 4, 5, 6, 7, 8, 9]
The following code performs the same operation in Scala (note that Scala's until, like Python's range(), excludes the upper bound):
// Scala
// Distribute a local range of numbers across the cluster as an RDD
val numberRDD = spark.sparkContext.parallelize(1 until 10)
numberRDD.collect()
res4: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9)
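parallelize() also accepts an optional second argument that sets how many partitions the collection is split into across the cluster. The following minimal sketch (in Python; the partition count of 4 is an arbitrary choice for illustration) shows how to request a specific number of partitions and verify it:
# Python
# Ask Spark to split the collection into 4 partitions
# (4 is an illustrative value; Spark picks a cluster-based default otherwise)
partitionedRDD = spark.sparkContext.parallelize(range(1, 10), 4)
partitionedRDD.getNumPartitions()
Out[5]: 4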
- From an external dataset: Though parallelizing a collection is the easiest way to create an RDD, it is not the recommended approach for large datasets. Large datasets...