Installing and configuring PySpark
PySpark is installed with Spark. You can find it in the ~/spark3/bin directory, along with the other libraries and tools. To configure PySpark to run, you need to export the following environment variables:
export SPARK_HOME=/home/paulcrickard/spark3
export PATH=$SPARK_HOME/bin:$PATH
export PYSPARK_PYTHON=python3
The preceding commands set the SPARK_HOME variable, which points to wherever you installed Spark. I have pointed it at the head of the Spark cluster because, in a real cluster, the worker nodes would be on other machines. The commands then add $SPARK_HOME/bin to your path; when you type a command, the operating system looks for it in the directories listed in your path, and it will now also search ~/spark3/bin, which is where PySpark lives. Finally, PYSPARK_PYTHON tells Spark to use python3 as its Python interpreter.
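As a quick check that the variables are set correctly, you can print SPARK_HOME and confirm that the shell can find the pyspark launcher; the paths shown here assume the installation location used above:

echo $SPARK_HOME       # should print /home/paulcrickard/spark3
which pyspark          # should resolve to /home/paulcrickard/spark3/bin/pyspark
pyspark --version      # prints the Spark version banner and exits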
Running the preceding commands in a terminal will allow Spark to run only while that terminal is open; you will have to rerun them every time you open a new one. To make them permanent, you can add the commands to a shell startup file such as ~/.bashrc.
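For example, assuming you use bash, you could append the exports to ~/.bashrc and reload it; adjust the Spark path if your installation directory differs:

echo 'export SPARK_HOME=/home/paulcrickard/spark3' >> ~/.bashrc
echo 'export PATH=$SPARK_HOME/bin:$PATH' >> ~/.bashrc
echo 'export PYSPARK_PYTHON=python3' >> ~/.bashrc
source ~/.bashrc   # reload the file so the variables take effect in the current shell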