Visualization
Many visualization packages exist, but in this section we will use only matplotlib
and Bokeh.
Both packages come preinstalled with Anaconda. First, let's load the modules and set them up:
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')
import bokeh.charts as chrt
from bokeh.io import output_notebook
output_notebook()
The %matplotlib inline and output_notebook() commands make every chart generated with matplotlib or Bokeh, respectively, appear within the notebook rather than in a separate window.
Histograms
Histograms are by far the easiest way to visually gauge the distribution of your features. There are three ways you can generate histograms in PySpark (or a Jupyter notebook):
Aggregate the data in workers and return an aggregated list of bins and counts in each bin of the histogram to the driver
Return all the data points to the driver and allow the plotting libraries' methods...
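The first approach in the list above, aggregating in the workers, can be illustrated without a cluster. The following is a minimal sketch in plain Python, not the book's own code: the partitions list, the bin edges, and the helper names partition_counts and merge_counts are all hypothetical, chosen only to show the idea. Each "worker" counts its own values into shared bins, and only the small lists of counts are merged on the "driver". In PySpark itself, the RDD.histogram method performs this kind of aggregation for you.

```python
from bisect import bisect_right
from functools import reduce

def partition_counts(values, edges):
    """Count one partition's values into the bins defined by edges (hypothetical helper)."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        if edges[0] <= v <= edges[-1]:
            # Bins are [left, right), except the last one, which also
            # includes its right edge.
            i = min(bisect_right(edges, v) - 1, len(counts) - 1)
            counts[i] += 1
    return counts

def merge_counts(a, b):
    """Add two per-partition count lists element by element (hypothetical helper)."""
    return [x + y for x, y in zip(a, b)]

# Made-up data, split the way it might be spread across three workers
partitions = [[1, 2, 2, 9], [3, 5, 5, 10], [0, 7, 8]]
edges = [0, 2, 4, 6, 8, 10]

# Only one short list of counts per partition reaches the "driver"
counts = reduce(merge_counts, (partition_counts(p, edges) for p in partitions))
print(list(zip(edges[:-1], counts)))
```

The merged counts can then be drawn with an ordinary bar chart, for example plt.bar(edges[:-1], counts, width=2, align='edge'). The point of this design is that the amount of data sent to the driver depends on the number of bins, not on the number of data points.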