Learning on massive click logs with Spark
Normally, to take full advantage of Spark, data is stored in the Hadoop Distributed File System (HDFS), a distributed file system designed to hold large volumes of data, and computation is distributed across multiple nodes in a cluster. For demonstration purposes, we will keep the data on a local machine and run Spark locally; apart from the master URL, the code is the same as what we would run on a distributed computing cluster.
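As a rough illustration of that point (the application name and cluster URL below are placeholders, not values from this example), the only thing that changes in application code between the two modes is the master URL:

from pyspark.sql import SparkSession

# Local mode: use all available cores on this machine
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("ctr-demo") \
    .getOrCreate()

# On a cluster, only the master URL would differ, for example:
# .master("spark://<cluster-manager-host>:7077") or .master("yarn")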
Loading click logs
To train a model on massive click logs, we first need to load the data into Spark. We do so by taking the following steps:
- We spin up the PySpark shell by using the following command:
./bin/pyspark --master local[*] --driver-memory 20G
Here, we specify a large driver memory as we are dealing with a dataset of more than 6 GB.
A driver program is responsible for collecting and storing processed results from executors. So, a large driver memory helps complete jobs where...
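Once the shell is up, a first cut at reading the logs might look like the following sketch; the file name train.csv is a placeholder for wherever the click-log CSV actually lives, and the options assume a CSV file with a header row:

>>> # spark (a SparkSession) is created automatically by the PySpark shell
>>> df = spark.read.csv("train.csv", header=True, inferSchema=True)  # placeholder path
>>> df.printSchema()
>>> df.count()

Note that inferSchema=True makes Spark scan the file an extra time to guess column types, which is noticeable on a dataset of this size; supplying an explicit schema avoids that extra pass.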