Using IPython with PySpark
Python is a popular choice among data scientists because of its high-level syntax and extensive library of packages, so Spark supports it for data analysis through PySpark, a Python API for working with RDDs. IPython Notebook is a valuable tool for data scientists: it lets them present scientific and theoretical work interactively, combining text and Python code in one place.
This recipe shows how to configure IPython for use with PySpark and how to connect the IPython shell to PySpark.
Getting ready
To step through this recipe, you need a machine running Ubuntu 14.04 (a Linux flavor), on which Python comes pre-installed. The python --version command reports the installed Python version. If it reports 2.6.x, upgrade to Python 2.7 as follows:
sudo apt-get install python2.7
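The same version check can be made from inside Python, which is handy for confirming which interpreter a script actually runs under. The following is a minimal sketch, not part of the recipe itself; the helper name meets_minimum is hypothetical, and the (2, 7) threshold is taken from the requirement stated above.

```python
import sys


def meets_minimum(version_info, minimum=(2, 7)):
    """Return True if the (major, minor) version is at least `minimum`.

    `meets_minimum` is a hypothetical helper used only for illustration.
    """
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    # sys.version_info describes the interpreter running this script.
    major_minor = tuple(sys.version_info[:2])
    if meets_minimum(sys.version_info):
        print("Python %d.%d is recent enough for this recipe." % major_minor)
    else:
        print("Python %d.%d is too old; upgrade to 2.7." % major_minor)
```

Comparing tuples rather than version strings avoids mistakes such as treating "2.10" as older than "2.7".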
How to do it…
Install IPython as follows:
sudo pip install ipython
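To confirm that the installation is visible to the Python interpreter you intend to use, you can try importing the package. This is a small sketch under the assumption that pip installed into that same interpreter; the helper name is_importable is hypothetical.

```python
import importlib


def is_importable(module_name):
    """Return True if `module_name` can be imported by this interpreter.

    `is_importable` is a hypothetical helper used only for illustration.
    """
    try:
        importlib.import_module(module_name)
        return True
    except ImportError:
        return False


if __name__ == "__main__":
    # The pip package "ipython" is imported under the module name "IPython".
    print("IPython available: %s" % is_importable("IPython"))
```

If this reports False right after a successful install, pip most likely installed IPython for a different Python version than the one running the script.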
Create an IPython profile for use with PySpark...