Visualizing word counts in the dataset
A picture is worth a thousand words, and this section sets out to prove it. Unfortunately, Spark does not have any inherent plotting capabilities as of version 2.2. To plot the values in a dataframe, we must first convert it to a pandas dataframe.
Getting ready
This section requires importing matplotlib for plotting:
import matplotlib.pyplot as plt
%matplotlib inline
How to do it...
This section walks through the steps to convert the Spark dataframe into a visualization that can be seen in the Jupyter notebook.
- Convert the Spark dataframe to a pandas dataframe using the following script:
df_plot = df.select('id', 'word_count').toPandas()
- Plot the dataframe using the following script:
import matplotlib.pyplot as plt
%matplotlib inline
df_plot.set_index('id', inplace=True)
df_plot.plot(kind='bar', figsize=(16, 6))
plt.ylabel('Word Count')
plt.title('Word Count distribution')
plt.show()
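One point worth keeping in mind with the conversion step above: toPandas() collects every selected row onto the driver, so a large dataframe can exhaust the driver's memory. As a minimal sketch, assuming df is a Spark dataframe with id and word_count columns as in this recipe, the hypothetical helper to_pandas_for_plot (not part of Spark or pandas) caps the row count before converting. The default cap of 1000 rows is an arbitrary assumption, not a recommendation from the recipe.

```python
def to_pandas_for_plot(df, columns, max_rows=1000):
    """Select the columns to plot, cap the row count, then convert to pandas.

    `df` is assumed to be a pyspark.sql.DataFrame. This helper and the
    default `max_rows` are illustrative assumptions, not a Spark API.
    """
    # select() keeps only the columns needed for the plot,
    # limit() bounds how many rows are collected onto the driver,
    # toPandas() performs the actual Spark-to-pandas conversion.
    return df.select(*columns).limit(max_rows).toPandas()
```

It could stand in for the first step, for example df_plot = to_pandas_for_plot(df, ['id', 'word_count']). Note that limit returns an arbitrary subset of rows unless the dataframe is ordered first.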
How it works...
This section explains how the Spark dataframe is converted to pandas ...