As with all other scripts we have presented so far, we will begin by setting some global variables.
If you do not know what these mean, check the Installing Spark from sources recipe.
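For reference, the globals at the top of installLivy.sh look roughly like the following sketch; the mirror URL and version shown here are purely illustrative, so check the script itself for the actual values:
# Illustrative values only -- see installLivy.sh for the real ones
_livy_binary="http://mirror.example.com/apache/incubator/livy/livy-0.4.0-incubating-bin.tar.gz"
_livy_archive=$( echo "$_livy_binary" | awk -F '/' '{print $NF}' )   # just the archive file name
_livy_destination="/opt/livy"       # where the Livy binaries will finally live
_hadoop_destination="/opt/hadoop"   # used by installHadoop later in this recipe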
Livy requires some configuration files from Hadoop, so this script can also install Hadoop if it is not already present on your machine. That is why the downloadThePackage, unpack, and moveTheBinaries functions now accept arguments.
The changes to the functions are fairly self-explanatory, so for the sake of space, we will not be pasting the code here. You are more than welcome, though, to peruse the relevant portions of the installLivy.sh script.
Installing Livy boils down to downloading the package, unpacking it, and moving it to its final destination (in our case, this is /opt/livy).
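Sketched out (using the illustrative variable names from earlier), the Livy part of the script amounts to three calls that mirror the installHadoop function shown later in this recipe:
# Sketch only -- the real script derives these names exactly as installHadoop does
_livy_dir=$( echo "${_livy_archive%.*}" )   # strip .gz
_livy_dir=$( echo "${_livy_dir%.*}" )       # strip .tar
downloadThePackage $( echo "${_livy_binary}" )    # fetch the archive
unpack $( echo "${_livy_archive}" )               # extract it
moveTheBinaries $( echo "${_livy_dir}" ) $( echo "${_livy_destination}" )   # move to /opt/livy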
Checking whether Hadoop is installed is the next thing on our to-do list. To run Livy with local sessions, we need two environment variables: SPARK_HOME and HADOOP_CONF_DIR. SPARK_HOME is certainly set by this point, but if you do not have Hadoop installed, you most likely will not have HADOOP_CONF_DIR set:
function checkHadoop() {
    if type -p hadoop; then
        echo "Hadoop executable found in PATH"
        _hadoop=hadoop
    elif [[ -n "$HADOOP_HOME" ]] && [[ -x "$HADOOP_HOME/bin/hadoop" ]]; then
        echo "Found Hadoop executable in HADOOP_HOME"
        _hadoop="$HADOOP_HOME/bin/hadoop"
    else
        echo "No Hadoop found. You should install Hadoop first. You can still continue but some functionality might not be available."
        echo
        echo -n "Do you want to install the latest version of Hadoop? [y/n]: "
        read _install_hadoop

        case "$_install_hadoop" in
            y*) installHadoop ;;
            n*) echo "Will not install Hadoop" ;;
            *)  echo "Will not install Hadoop" ;;
        esac
    fi
}
function installHadoop() {
    _hadoop_binary="http://mirrors.ocf.berkeley.edu/apache/hadoop/common/hadoop-2.9.0/hadoop-2.9.0.tar.gz"
    _hadoop_archive=$( echo "$_hadoop_binary" | awk -F '/' '{print $NF}' )
    _hadoop_dir=$( echo "${_hadoop_archive%.*}" )
    _hadoop_dir=$( echo "${_hadoop_dir%.*}" )

    downloadThePackage $( echo "${_hadoop_binary}" )
    unpack $( echo "${_hadoop_archive}" )
    moveTheBinaries $( echo "${_hadoop_dir}" ) $( echo "${_hadoop_destination}" )
}
The checkHadoop function first checks whether the hadoop binary is present on the PATH. If it is not, it checks whether the HADOOP_HOME variable is set and, if it is, whether the hadoop binary can be found inside the $HADOOP_HOME/bin folder. If both attempts fail, the script asks whether you want to install the latest version of Hadoop; anything other than y is treated as a no, while answering y starts the installation.
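If you skip the Hadoop installation, or install it to a non-standard location, make sure both environment variables are exported before starting Livy. A minimal example, assuming Spark sits in /opt/spark and Hadoop in /opt/hadoop (adjust the paths to your own layout):
# Adjust both paths to wherever Spark and Hadoop live on your machine
export SPARK_HOME=/opt/spark
export HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop   # the conf directory of a Hadoop 2.x install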
Once the installation finishes, we will begin installing the additional kernels for the Jupyter Notebooks.
Here's the function that handles the kernels' installation:
function installJupyterKernels() {
    # install the library
    pip install sparkmagic
    echo

    # enable the JavaScript extension so that ipywidgets work properly
    jupyter nbextension enable --py --sys-prefix widgetsnbextension
    echo

    # install kernels
    # get the location of sparkmagic
    _sparkmagic_location=$(pip show sparkmagic | awk -F ':' '/Location/ {print $2}')

    _temp_dir=$(pwd)            # store current working directory
    cd $_sparkmagic_location    # move to the sparkmagic folder

    jupyter-kernelspec install sparkmagic/kernels/sparkkernel
    jupyter-kernelspec install sparkmagic/kernels/pysparkkernel
    jupyter-kernelspec install sparkmagic/kernels/pyspark3kernel
    echo

    # enable the ability to change clusters programmatically
    jupyter serverextension enable --py sparkmagic
    echo

    # install autovizwidget
    pip install autovizwidget

    cd $_temp_dir
}
First, we install the sparkmagic package for Python. Quoting directly from https://github.com/jupyter-incubator/sparkmagic:
"Sparkmagic is a set of tools for interactively working with remote Spark clusters through Livy, a Spark REST server, in Jupyter Notebooks. The Sparkmagic project includes a set of magics for interactively running Spark code in multiple languages, as well as some kernels that you can use to turn Jupyter into an integrated Spark environment."
The jupyter nbextension enable command that follows turns on the JavaScript extension in Jupyter Notebooks so that ipywidgets can work properly; if you have an Anaconda distribution of Python, the ipywidgets package is installed automatically.
Following this, we install the kernels. We first need to switch to the folder that sparkmagic was installed into. The pip show <package> command displays all the relevant information about an installed package; from its output, we extract only the Location field using awk.
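For example, the pipeline returns the site-packages path that sparkmagic lives in; the path below is only an example and will differ on your machine:
$ pip show sparkmagic | awk -F ':' '/Location/ {print $2}'
 /usr/local/lib/python2.7/site-packages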
To install the kernels, we use the jupyter-kernelspec install <kernel> command. For example, the following command installs the sparkmagic kernel for the Scala API of Spark:
jupyter-kernelspec install sparkmagic/kernels/sparkkernel
Once all the kernels are installed, we enable Jupyter to use sparkmagic so that we can change clusters programmatically. Finally, we install autovizwidget, an auto-visualization library for pandas DataFrames.
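If you want to verify that everything registered correctly, you can ask Jupyter to list what it knows about; both commands below ship with Jupyter, and the sparkmagic kernels and server extension should appear in their output:
# list the registered kernels -- the sparkmagic kernels should show up here
jupyter kernelspec list
# confirm that the sparkmagic server extension is enabled
jupyter serverextension list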
This concludes the Livy and sparkmagic installation part.