Special note for Windows installation
Spark (really Hadoop) needs a temporary storage location for its working set of data. Under Windows this defaults to the \tmp\hive directory. If the directory does not exist when Spark/Hadoop starts, it will be created. Unfortunately, under Windows the installation does not include the right built-in tools to set the access privileges on that directory.
You should be able to run chmod under winutils to set the access privileges for the hive directory. However, I have found that the chmod function does not work correctly.
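For reference, the command is usually run from an administrator command prompt along these lines (this sketch assumes winutils.exe sits in %HADOOP_HOME%\bin; adjust the path to your own installation):

    rem attempt to open up the Hive scratch directory via winutils
    rem (in my experience this can report success without actually taking effect)
    %HADOOP_HOME%\bin\winutils.exe chmod 777 \tmp\hive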
A better approach is to create the tmp\hive directory yourself in admin mode, and then grant full privileges on the hive directory to all users, again in admin mode.
Without this change, Hadoop fails right away. When you start pyspark, the output (including any errors) is displayed in the command-line window. One of the errors will be insufficient access to this directory.
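In my experience the failure surfaces as a message resembling the following (exact wording varies by Spark/Hadoop version):

    java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable.
    Current permissions are: rw-rw-rw-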