Manipulating and merging the MovieLens datasets
We are currently working with four separate datasets, but ultimately we would like to combine them into a single dataset. This section focuses on paring our datasets down to one.
Getting ready
This section does not require importing any PySpark libraries, but a background in SQL joins will come in handy, as we will explore multiple approaches to joining dataframes.
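For readers who want a quick refresher on join semantics before the steps below, here is a minimal plain-Python sketch of an inner join next to a left join. The toy rows, the movie titles, and the helper names inner_join and left_join are hypothetical and only for illustration. They are not part of PySpark or of the MovieLens data, and the sketch assumes each movieId appears at most once in the right-hand table.

```python
# Toy rows standing in for the ratings and movies tables (hypothetical data).
ratings = [
    {"userId": 1, "movieId": 10, "rating": 4.0},
    {"userId": 2, "movieId": 20, "rating": 3.5},
    {"userId": 3, "movieId": 99, "rating": 5.0},  # no matching movie
]
movies = [
    {"movieId": 10, "title": "Movie A"},
    {"movieId": 20, "title": "Movie B"},
    {"movieId": 30, "title": "Movie C"},  # never rated
]

# Index the right-hand table by its join key (assumes the key is unique).
movies_by_id = {m["movieId"]: m for m in movies}


def inner_join(left, right_index, key):
    """Keep only the left rows whose key also exists on the right."""
    return [
        {**row, **right_index[row[key]]}
        for row in left
        if row[key] in right_index
    ]


def left_join(left, right_index, key, right_columns):
    """Keep every left row; unmatched rows get None in the right-hand columns."""
    empty = {col: None for col in right_columns}
    return [{**row, **right_index.get(row[key], empty)} for row in left]


inner = inner_join(ratings, movies_by_id, "movieId")
left = left_join(ratings, movies_by_id, "movieId", ["title"])

print(len(inner))  # 2: the rating for movieId 99 is dropped
print(len(left))   # 3: the rating for movieId 99 is kept with title None
```

The inner join silently discards rows without a match on both sides, so it is worth comparing row counts before and after a join when merging datasets.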
How to do it...
This section will walk through the following steps for joining dataframes in PySpark:
- Execute the following script to rename all field names in ratings by appending _1 to the end of each name:
for i in ratings.columns:
    ratings = ratings.withColumnRenamed(i, i + '_1')
- Execute the following script to inner join the movies dataset to the ratings dataset, creating a new table called temp1:
temp1 = ratings.join(movies, ratings.movieId_1 == movies.movieId, how='inner')
- Execute the following script to inner join the temp1 dataset to the links dataset, creating...