Exploring the MovieLens datasets
Before any modeling takes place, it is important to get familiar with the source dataset and perform some exploratory data analysis.
Getting ready
We will import the following library to assist with visualizing and exploring the MovieLens dataset: matplotlib.
How to do it...
This section walks through the steps to analyze the movie ratings in the MovieLens dataset:
- Retrieve some summary statistics on the rating_1 column by executing the following script:
mainDF.describe('rating_1').show()
- Build a histogram of the distribution of ratings by executing the following script:
import matplotlib.pyplot as plt
%matplotlib inline
mainDF.select('rating_1').toPandas().hist(figsize=(16, 6), grid=True)
plt.title('Histogram of Ratings')
plt.show()
- Execute the following script to view the values behind the histogram as a dataframe of row counts per rating:
mainDF.groupBy(['rating_1']).agg({'rating_1':'count'})\
    .withColumnRenamed('count(rating_1)', 'Row Count').orderBy(["Row Count"],ascending...
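To make clear what the steps above compute, the following is a minimal sketch in plain Python, not the Spark code from the recipe. The ratings list is a small hypothetical sample standing in for the rating_1 column, and the helper names summarize and rating_counts are made up for illustration. The sketch assumes that the summary reports count, mean, sample standard deviation, min, and max, and that the row counts are listed most frequent first (the ordering in the script above is truncated, so that direction is an assumption).

```python
from collections import Counter
import statistics

# Hypothetical sample of ratings; the real values live in mainDF's rating_1 column.
ratings = [4.0, 3.5, 5.0, 4.0, 2.0, 4.0, 3.5, 5.0]


def summarize(values):
    """Return the five statistics of a describe-style summary."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "stddev": statistics.stdev(values),  # sample standard deviation (assumed)
        "min": min(values),
        "max": max(values),
    }


def rating_counts(values):
    """Count rows per distinct rating, most frequent first (assumed ordering)."""
    return Counter(values).most_common()


summary = summarize(ratings)
counts = rating_counts(ratings)

# Print the summary statistics, one per line.
for name, value in summary.items():
    print(f"{name:>6}: {value}")

# Print the table behind the histogram: one row per distinct rating.
print("rating  Row Count")
for rating, row_count in counts:
    print(f"{rating:>6}  {row_count}")
```

The same two views, a five-number summary and a per-rating row count, are what the describe and groupBy scripts return for the full dataset; the histogram is simply a plot of the second table.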