In the following sections, we will use the goodbooks-10k book rating dataset to illustrate the topics outlined previously. The dataset consists of approximately 6 million ratings of 10,000 books from 53,424 users. More details on the goodbooks-10k dataset can be found at https://www.kaggle.com/zygmunt/goodbooks-10k#books.csv.
In the folder associated with this chapter, you will find two CSV files:
- ratings.csv: Contains book ratings, with user IDs, book IDs, and rating values
- books.csv: Contains book attributes, including title
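As a quick orientation, the relationship between the two files can be sketched as a lookup from book ID to title. This is a minimal illustration rather than the chapter's own method: the helper names `read_rows` and `attach_titles` are hypothetical, and the column names `book_id` and `title` are assumptions that should be checked against the header rows of your copies of the files.

```python
import csv
from pathlib import Path


def read_rows(path):
    """Read a CSV file that has a header row into a list of dicts."""
    with Path(path).open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def attach_titles(ratings, books, key="book_id", title_col="title"):
    """Add the book title to each rating row by matching on the book ID.

    `key` and `title_col` are assumed column names; adjust them to match
    the actual headers in ratings.csv and books.csv.
    """
    # Build a lookup table: book ID -> title.
    titles = {book[key]: book[title_col] for book in books}
    return [
        {**rating, title_col: titles[rating[key]]}
        for rating in ratings
        if rating[key] in titles  # skip ratings with no matching book
    ]
```

With both files in the working directory, `attach_titles(read_rows("ratings.csv"), read_rows("books.csv"))` would return one dictionary per rating with the title added. Reading several million rows into a list of dictionaries is memory-hungry, so this sketch is better suited to inspecting a sample than to processing the full dataset.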
It is now time to wrangle this data to create a dataset for modeling.