Preparing dataset for the deep learning pipeline
We are now ready to prepare our dataset to be fed into the deep learning model that we will build in Keras.
Getting ready
While preparing the dataset for Keras, we will import the following libraries into our notebook:
import pyspark.sql.functions as F
import numpy as np
from pyspark.ml.feature import StringIndexer
import keras.utils
How to do it...
This section walks through the following steps to prepare the dataset for the deep learning pipeline:
- Execute the following script to clean up the column names:
mainDF = mainDF.withColumnRenamed('userId_1', 'userid')
mainDF = mainDF.withColumnRenamed('movieId_1', 'movieid')
mainDF = mainDF.withColumnRenamed('rating_1', 'rating')
mainDF = mainDF.withColumnRenamed('timestamp_1', 'timestamp')
mainDF = mainDF.withColumnRenamed('imdbId', 'imdbid')
mainDF = mainDF.withColumnRenamed('tmdbId', 'tmdbid')
- The rating column currently holds values in 0.5 increments. Round the ratings to whole integers using...
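A note on the rounding step: ratings in 0.5 increments sit exactly on rounding ties, so the tie-breaking rule decides where values such as 0.5 or 2.5 end up. The following is a minimal plain-Python sketch, not the recipe's Spark code. It assumes half-up rounding, which is the mode documented for pyspark.sql.functions.round (pyspark.sql.functions.bround uses half-even instead). The helper name round_half_up and the sample ratings are hypothetical and used only for illustration.

```python
from decimal import Decimal, ROUND_HALF_UP

def round_half_up(rating):
    """Round a rating to the nearest whole number, sending ties upward."""
    # Going through str() keeps the decimal value exactly as written (e.g. "2.5").
    return int(Decimal(str(rating)).quantize(Decimal("1"), rounding=ROUND_HALF_UP))

# Hypothetical sample of ratings in 0.5 increments.
ratings = [0.5, 1.0, 1.5, 2.5, 3.5, 4.0, 4.5, 5.0]

half_up = [round_half_up(r) for r in ratings]
half_even = [round(r) for r in ratings]  # Python's built-in round() sends ties to the even neighbor

print("half-up:  ", half_up)
print("half-even:", half_even)
```

Under half-up rounding a rating of 0.5 becomes 1 and 2.5 becomes 3, whereas half-even rounding would send them to 0 and 2. It is worth checking which mode your rounding function uses before treating the rounded ratings as class labels.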