Implementation of the Apriori algorithm in Apache Spark
We walked through the algorithm in the preceding section; now we will write the entire algorithm in Spark. Spark does not ship with a built-in implementation of the Apriori algorithm, so we will have to write our own, as shown next (refer to the comments in the code as well).
First, we will write the usual boilerplate code to initialize the Spark configuration and context:
// Standard setup: build the configuration, then create the Java Spark context
SparkConf conf = new SparkConf().setAppName(appName).setMaster(master);
JavaSparkContext sc = new JavaSparkContext(conf);
Now, we will load the dataset file using the SparkContext and store the result in a JavaRDD instance. We will then create an instance of the AprioriUtil class, which contains the methods for calculating the support and confidence values. Finally, we will store the total number of transactions in the transactionCount variable so that this value can be broadcast and reused on the different worker nodes when needed:
// Load the dataset; each line of the input file is one transaction.
// The path below is a placeholder -- point it at your own dataset file.
JavaRDD<String> rddX = sc.textFile("data/transactions.txt");
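The rest of this step could then look like the following minimal sketch. The variable names rddX and transactionCount come from the text; the broadcastCount variable is our own illustrative name:

// Helper object with the support/confidence calculations
AprioriUtil util = new AprioriUtil();

// Total number of transactions; broadcasting it lets every executor
// reuse the value instead of reshipping it with each task.
// Requires: import org.apache.spark.broadcast.Broadcast;
long transactionCount = rddX.count();
Broadcast<Long> broadcastCount = sc.broadcast(transactionCount);

For orientation, here is one plausible shape for the AprioriUtil class, assuming support and confidence are computed from raw counts; this is a sketch of what such a helper might contain, not the exact class from the code bundle:

import java.io.Serializable;

public class AprioriUtil implements Serializable {

    // support(X) = count of transactions containing X / total transactions
    public double support(long itemsetCount, long transactionCount) {
        return (double) itemsetCount / transactionCount;
    }

    // confidence(X -> Y) = support(X and Y together) / support(X)
    //                    = count(X and Y together) / count(X)
    public double confidence(long unionCount, long antecedentCount) {
        return (double) unionCount / antecedentCount;
    }
}

The class implements Serializable so that Spark can ship it to the executors when it is referenced inside distributed operations.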