Normally, to take advantage of Spark, data is stored in the Hadoop Distributed File System (HDFS), a distributed filesystem designed to hold large volumes of data, and computation is distributed across multiple nodes in a cluster. For demonstration purposes, we keep the data on a local machine and run Spark locally; apart from that, the workflow is no different from running it on a distributed computing cluster.
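To make this concrete, here is a minimal sketch of creating a SparkSession in local mode; the application name and the example cluster URLs are illustrative assumptions, not details from this setup:

from pyspark.sql import SparkSession

# Local mode: run Spark on this machine using all available CPU cores
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("ctr-on-click-logs") \
    .getOrCreate()

# On a real cluster, only the master URL would change, for example:
#   .master("spark://<master-host>:7077")  or  .master("yarn")

In the PySpark shell used in the next section, such a session (the spark object) is created for us automatically, so switching between local mode and a cluster is purely a matter of the master setting.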
Learning on massive click logs with Spark
Loading click logs
To train a model on massive click logs, we first need to load the data into Spark. We do so by taking the following steps:
- First, we spin up the PySpark shell using the following command:
./bin/pyspark --master local[*] --driver-memory 20G
Here, we specify a large driver memory (20 GB); in local mode the driver process also performs the actual computation, so it needs enough memory to handle the sizeable click-log data. The master local[*] runs Spark locally with as many worker threads as there are CPU cores.
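Once the shell is up, the loading step itself could look roughly like the following sketch; the file name train.csv, the header option, and the use of schema inference are illustrative assumptions rather than details from the original steps:

# Inside the PySpark shell, spark (a SparkSession) is already available.
# "train.csv" is an assumed placeholder for the actual click-log file.
df = spark.read.csv("train.csv", header=True, inferSchema=True)

df.printSchema()   # inspect the inferred column types
print(df.count())  # number of click-log records loaded

Note that schema inference requires an extra pass over the data, so for truly massive logs it is usually better to supply an explicit schema.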