In this article, Prabhanjan Tattar, author of the book Practical Data Science Cookbook - Second Edition, explains that Python is an interpreted language (sometimes referred to as a scripting language), much like R. It requires no special IDE or software compilation tools and is therefore as fast as R to develop with and prototype in. Like R, it also makes use of C shared objects to improve computational performance. Additionally, Python is a default system tool on Linux, Unix, and Mac OS X machines, and it is available on Windows. Python comes with batteries included, which means that the standard library is widely inclusive of many modules, from multiprocessing to compression toolsets. Python is a flexible computing powerhouse that can tackle any problem domain. If you find yourself in need of libraries that are outside the standard library, Python also comes with a package manager (like R) that allows the download and installation of other code bases.
Python’s computational flexibility means that some analytical tasks take more lines of code than their counterparts in R. However, Python does have the tools to perform the same statistical computations. This leads to an obvious question: When do we use R over Python and vice versa? This article attempts to answer this question by taking an application-oriented approach to statistical analyses.
From books to movies to people to follow on Twitter, recommender systems carve the deluge of information on the Internet into a more personalized flow, thus improving the performance of e-commerce, web, and social applications. It is no great surprise, given Amazon's success in monetizing recommendations and the Netflix Prize, that any discussion of personalization or data-theoretic prediction would involve a recommender. What is surprising is how simple recommenders are to implement, yet how susceptible they are to the vagaries of sparse data and overfitting.
Consider a non-algorithmic approach to eliciting recommendations: one of the easiest ways to garner a recommendation is to look at the preferences of someone we trust. We are implicitly comparing our preferences to theirs, and the more similarities we share, the more likely we are to discover novel, shared preferences. However, everyone is unique, and our preferences exist across a variety of categories and domains. What if you could leverage the preferences of a great number of people, and not just those you trust? In the aggregate, you would be able to see patterns, not just of people like you, but also anti-recommendations: things to stay away from, cautioned by the people who are not like you. You would, hopefully, also see subtle delineations across the shared preference space of groups of people who share parts of your own unique experience.
Understanding your data is critical to all data-related work. In this recipe, we acquire and take a first look at the data that we will be using to build our recommendation engine.
To prepare for this recipe, and the rest of the article, download the MovieLens data from the GroupLens website of the University of Minnesota. You can find the data at http://grouplens.org/datasets/movielens/.
In this recipe, we will use the smaller MovieLens 100k dataset (4.7 MB in size) in order to load the entire dataset into memory with ease.
Perform the following steps to better understand the data that we will be working with throughout:
head -n 5 u.item
Note that if you are working on a computer running the Microsoft Windows operating system and not using a virtual machine (not recommended), you do not have access to the head command; instead, use the following command:
more u.item 2 n
1|Toy Story (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0
2|GoldenEye (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?GoldenEye%20(1995)|0|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
3|Four Rooms (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Four%20Rooms%20(1995)|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
4|Get Shorty (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Get%20Shorty%20(1995)|0|1|0|0|0|1|0|0|1|0|0|0|0|0|0|0|0|0|0
5|Copycat (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Copycat%20(1995)|0|0|0|0|0|0|1|0|1|0|0|0|0|0|0|0|1|0|0
head -n 5 u.data
more u.data 2 n
196 242 3 881250949
186 302 3 891717742
22 377 1 878887116
244 51 2 880606923
166 346 1 886397596
The two main files that we will be using are u.data, which contains the user ratings, and u.item, which contains the movie information. Both are character-delimited files; u.data, which is the main file, is tab delimited, and u.item is pipe delimited.
For u.data, the first column is the user ID, the second column is the movie ID, the third is the star rating, and the last is the timestamp. The u.item file contains much more information, including the ID, title, release date, and even a URL to IMDB. Interestingly, this file also has a Boolean array indicating the genre(s) of each movie, including (in order) action, adventure, animation, children, comedy, crime, documentary, drama, fantasy, film-noir, horror, musical, mystery, romance, sci-fi, thriller, war, and western.
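For example, the trailing 0/1 columns of a u.item row can be translated back into genre names. The following is a minimal sketch; the GENRES tuple assumes the standard MovieLens 100k genre ordering (a leading "unknown" flag followed by the genres listed above), and genres_for is a helper name of our own, not part of the dataset:

GENRES = ('unknown', 'action', 'adventure', 'animation', 'children', 'comedy',
          'crime', 'documentary', 'drama', 'fantasy', 'film-noir', 'horror',
          'musical', 'mystery', 'romance', 'sci-fi', 'thriller', 'war', 'western')

def genres_for(flags):
    """
    Converts the trailing 0/1 columns of a u.item row into genre names.
    """
    return [name for name, flag in zip(GENRES, flags) if int(flag) == 1]

# The genre flags start after the URL column (the sixth field onwards).
row = "1|Toy Story (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0"
print(genres_for(row.split('|')[5:]))   # expect ['animation', 'children', 'comedy']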
Free, web-scale datasets that are appropriate for building recommendation engines are few and far between. As a result, the MovieLens dataset is a very popular choice for such a task, but there are others as well. The well-known Netflix Prize dataset has been pulled down by Netflix. However, there is a dump of all user-contributed content from the Stack Exchange network (including Stack Overflow) available via the Internet Archive (https://archive.org/details/stackexchange). Additionally, there is a book-crossing dataset that contains over a million ratings of about a quarter million different books (http://www2.informatik.uni-freiburg.de/~cziegler/BX/).
Recommendation engines require large amounts of training data in order to do a good job, which is why they’re often relegated to big data projects. However, to build a recommendation engine, we must first get the required data into memory and, due to the size of the data, must do so in a memory-safe and efficient way. Luckily, Python has all of the tools to get the job done, and this recipe shows you how.
You will need to have the appropriate MovieLens dataset downloaded, as specified in the preceding recipe. If you skipped that setup, go back and ensure that you have NumPy correctly installed.
The following steps guide you through the creation of the functions that we will need in order to load the datasets into memory:
In [1]: import csv
   ...: from datetime import datetime

In [2]: def load_reviews(path, **kwargs):
   ...:     """
   ...:     Loads MovieLens reviews
   ...:     """
   ...:     options = {
   ...:         'fieldnames': ('userid', 'movieid', 'rating', 'timestamp'),
   ...:         'delimiter': '\t',
   ...:     }
   ...:     options.update(kwargs)
   ...:
   ...:     # Per-field parsers: each takes a row dict and a key and returns the converted value.
   ...:     parse_date = lambda r, k: datetime.fromtimestamp(float(r[k]))
   ...:     parse_int = lambda r, k: int(r[k])
   ...:
   ...:     with open(path, 'rb') as reviews:
   ...:         reader = csv.DictReader(reviews, **options)
   ...:         for row in reader:
   ...:             row['movieid'] = parse_int(row, 'movieid')
   ...:             row['userid'] = parse_int(row, 'userid')
   ...:             row['rating'] = parse_int(row, 'rating')
   ...:             row['timestamp'] = parse_date(row, 'timestamp')
   ...:             yield row
In [3]: import os
   ...: def relative_path(path):
   ...:     """
   ...:     Returns a path relative to this code file
   ...:     """
   ...:     dirname = os.path.dirname(os.path.realpath('__file__'))
   ...:     path = os.path.join(dirname, path)
   ...:     return os.path.normpath(path)
In [4]: def load_movies(path, **kwargs):
   ...:     """
   ...:     Loads MovieLens movie information
   ...:     """
   ...:     options = {
   ...:         'fieldnames': ('movieid', 'title', 'release', 'video', 'url'),
   ...:         'delimiter': '|',
   ...:         'restkey': 'genre',
   ...:     }
   ...:     options.update(kwargs)
   ...:
   ...:     parse_int = lambda r, k: int(r[k])
   ...:     parse_date = lambda r, k: datetime.strptime(r[k], '%d-%b-%Y') if r[k] else None
   ...:
   ...:     with open(path, 'rb') as movies:
   ...:         reader = csv.DictReader(movies, **options)
   ...:         for row in reader:
   ...:             row['movieid'] = parse_int(row, 'movieid')
   ...:             row['release'] = parse_date(row, 'release')
   ...:             row['video'] = parse_date(row, 'video')
   ...:             yield row
Finally, we start creating a MovieLens class that will be augmented later:
In [5]: from collections import defaultdict

In [6]: class MovieLens(object):
   ...:     """
   ...:     Data structure to build our recommender model on.
   ...:     """
   ...:
   ...:     def __init__(self, udata, uitem):
   ...:         """
   ...:         Instantiate with a path to u.data and u.item
   ...:         """
   ...:         self.udata = udata
   ...:         self.uitem = uitem
   ...:         self.movies = {}
   ...:         self.reviews = defaultdict(dict)
   ...:         self.load_dataset()
   ...:
   ...:     def load_dataset(self):
   ...:         """
   ...:         Loads the two datasets into memory, indexed on the ID.
   ...:         """
   ...:         for movie in load_movies(self.uitem):
   ...:             self.movies[movie['movieid']] = movie
   ...:
   ...:         for review in load_reviews(self.udata):
   ...:             self.reviews[review['userid']][review['movieid']] = review
Ensure that the functions have been imported into your REPL or the IPython workspace, and type the following, making sure that the path to the data files is appropriate for your system:
In [7]: data = relative_path('../data/ml-100k/u.data')
   ...: item = relative_path('../data/ml-100k/u.item')
   ...: model = MovieLens(data, item)
The methodology that we use for the two data-loading functions (load_reviews and load_movies) is simple, but it takes care of the details of parsing the data from the disk. We create a function that takes a path to the dataset and any optional keywords. We know that we have specific ways in which we need to interact with the csv module, so we create default options, passing in the field names of the rows along with the delimiter, which is a tab (\t) for u.data. The options.update(kwargs) line means that we'll accept whatever options users pass to this function.
We then create internal parsing functions using lambda functions. These simple parsers take a row and a key as input and return the converted value. This is an example of using lambda as internal, reusable code blocks, a common technique in Python. Finally, we open our file and create a csv.DictReader with our options. Iterating through the rows in the reader, we parse the fields that we want to be int and datetime, respectively, and then yield the row.
Note that as we are unsure about the actual size of the input file, we are doing this in a memory-safe manner using Python generators. Using yield instead of return ensures that Python creates a generator under the hood and does not load the entire dataset into memory.
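You can verify this lazy behavior yourself; the following quick check (the path is an assumption about where you placed the data) only parses rows as they are requested:

# Calling load_reviews does not read the file yet; rows are parsed on demand.
reviews = load_reviews(relative_path('../data/ml-100k/u.data'))
first = next(reviews)   # reads and parses just the first record
print("user %(userid)d rated movie %(movieid)d with %(rating)d stars" % first)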
We'll use each of these functions to load the datasets at various points throughout the computations that follow. We'll need to know where these files are at all times, which can be a pain, especially in larger code bases; in the There's more… section, we'll discuss a Python pro-tip to alleviate this concern.
Finally, we created a data structure, the MovieLens class, with which we can hold our review data. This structure takes the udata and uitem paths and then loads the movies and reviews into two Python dictionaries, indexed by movieid and userid, respectively. To instantiate this object, you will execute something like the following:
In [7]: data = relative_path('../data/ml-100k/u.data')
   ...: item = relative_path('../data/ml-100k/u.item')
   ...: model = MovieLens(data, item)
Note that the preceding commands assume that you have your data in a folder called data. We can now load the whole dataset into memory, indexed on the various IDs specified in the dataset.
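As a quick sanity check, you can look up one of the ratings from the u.data sample shown earlier (user 196 rated movie 242 with 3 stars); the exact title printed will depend on the dataset version you downloaded:

print(model.movies[242]['title'])          # the movie that user 196 reviewed above
print(model.reviews[196][242]['rating'])   # should print 3
print(len(model.movies))                   # 1682 movies in the 100k dataset
print(len(model.reviews))                  # 943 distinct reviewers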
Did you notice the use of the relative_path function? When dealing with fixtures such as these to build models, the data is often included with the code. When you specify a path in Python, such as data/ml-100k/u.data, it looks it up relative to the current working directory where you ran the script. To help ease this trouble, you can specify the paths that are relative to the code itself:
import os

def relative_path(path):
    """
    Returns a path relative to this code file
    """
    dirname = os.path.dirname(os.path.realpath('__file__'))
    path = os.path.join(dirname, path)
    return os.path.normpath(path)
Keep in mind that this holds the entire data structure in memory; in the case of the 100k dataset, this will require 54.1 MB, which isn’t too bad for modern machines. However, we should also keep in mind that we’ll generally build recommenders using far more than just 100,000 reviews. This is why we have configured the data structure the way we have—very similar to a database. To grow the system, you will replace the reviews and movies properties with database access functions or properties, which will yield data types expected by our methods.
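As an illustration of that direction, the sketch below shows one way a dict-like, database-backed movies view could present the same interface; the sqlite3 schema and table name here are assumptions made for the sake of the example, not something shipped with MovieLens:

import sqlite3

class MovieTable(object):
    """
    Read-only, dict-like view over a hypothetical movies table, so that
    MovieLens.movies could be swapped for it without changing callers.
    """
    def __init__(self, conn):
        self.conn = conn

    def __getitem__(self, movieid):
        cur = self.conn.execute(
            "SELECT movieid, title, release FROM movies WHERE movieid = ?",
            (movieid,)
        )
        row = cur.fetchone()
        if row is None:
            raise KeyError(movieid)
        return {'movieid': row[0], 'title': row[1], 'release': row[2]}

    def __iter__(self):
        # Iterating yields movie IDs, just as iterating a dict yields its keys.
        for (movieid,) in self.conn.execute("SELECT movieid FROM movies"):
            yield movieid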
If you’re looking for a good movie, you’ll often want to see the most popular or best rated movies overall. Initially, we’ll take a naïve approach to compute a movie’s aggregate rating by averaging the user reviews for each movie. This technique will also demonstrate how to access the data in our MovieLens class.
These recipes are sequential in nature. Thus, you should have completed the previous recipes in the article before starting with this one.
Follow these steps to output numeric scores for all movies in the dataset and compute a top-10 list:
In [8]: class MovieLens(object):
   ...:
   ...:     ...
   ...:
   ...:     def reviews_for_movie(self, movieid):
   ...:         """
   ...:         Yields the reviews for a given movie
   ...:         """
   ...:         for review in self.reviews.values():
   ...:             if movieid in review:
   ...:                 yield review[movieid]
   ...:

In [9]: import heapq
   ...: from operator import itemgetter
   ...: class MovieLens(object):
   ...:
   ...:     ...
   ...:
   ...:     def average_reviews(self):
   ...:         """
   ...:         Averages the star rating for all movies. Yields a tuple of movieid,
   ...:         the average rating, and the number of reviews.
   ...:         """
   ...:         for movieid in self.movies:
   ...:             reviews = list(r['rating'] for r in self.reviews_for_movie(movieid))
   ...:             average = sum(reviews) / float(len(reviews))
   ...:             yield (movieid, average, len(reviews))
   ...:
   ...:     def top_rated(self, n=10):
   ...:         """
   ...:         Yields the n top rated movies
   ...:         """
   ...:         return heapq.nlargest(n, self.average_reviews(), key=itemgetter(1))
   ...:
Note that the … notation just below class MovieLens(object): signifies that we will be appending the average_reviews method to the existing MovieLens class.
In [10]: for mid, avg, num in model.top_rated(10):
    ...:     title = model.movies[mid]['title']
    ...:     print "[%0.3f average rating (%i reviews)] %s" % (avg, num, title)
Out [10]: [5.000 average rating (1 reviews)] Entertaining Angels: The Dorothy Day Story (1996)
[5.000 average rating (2 reviews)] Santa with Muscles (1996)
[5.000 average rating (1 reviews)] Great Day in Harlem, A (1994)
[5.000 average rating (1 reviews)] They Made Me a Criminal (1939)
[5.000 average rating (1 reviews)] Aiqingwansui (1994)
[5.000 average rating (1 reviews)] Someone Else's America (1995)
[5.000 average rating (2 reviews)] Saint of Fort Washington, The (1993)
[5.000 average rating (3 reviews)] Prefontaine (1997)
[5.000 average rating (3 reviews)] Star Kid (1997)
[5.000 average rating (1 reviews)] Marlene Dietrich: Shadow and Light (1996)
The new reviews_for_movie() method that is added to the MovieLens class iterates through our review dictionary values (which are indexed by the userid parameter), checks whether the movieid value has been reviewed by that user, and then yields that review dictionary. We will need such functionality for the next method.
With the average_reviews() method, we have created another generator function that goes through all of our movies and all of their reviews and yields the movie ID, the average rating, and the number of reviews. The top_rated function uses the heapq module to quickly sort the reviews based on the average.
The heapq module implements a heap queue, also known as a priority queue, an abstract data structure with interesting and useful properties. Heaps are binary trees built so that every parent node has a value that is less than or equal to the value of any of its children; thus, the smallest element is always the root of the tree and can be accessed in constant time, which is a very desirable property. With heapq, Python developers have an efficient means to insert new values into an ordered data structure and to retrieve sorted values.
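A tiny, standalone example of the two properties we rely on (the values here are made up for illustration):

import heapq
from operator import itemgetter

ratings = [3.5, 4.9, 2.1, 4.7]                       # made-up average ratings
heap = []
for r in ratings:
    heapq.heappush(heap, r)                          # maintains the heap invariant on every insert
print(heap[0])                                       # 2.1: the smallest value is always at the root

pairs = [(1, 3.5), (2, 4.9), (3, 2.1), (4, 4.7)]     # made-up (movieid, average) tuples
print(heapq.nlargest(2, pairs, key=itemgetter(1)))   # [(2, 4.9), (4, 4.7)], as top_rated uses it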
Here, we run into our first problem: some of the top-rated movies only have one review (and conversely, so do the worst-rated movies). How do you compare Casablanca, which has a 4.457 average rating (243 reviews), with Santa with Muscles, which has a 5.000 average rating (2 reviews)? We are sure that those two reviewers really liked Santa with Muscles, but the high rating for Casablanca is probably more meaningful because more people liked it. Most recommenders with star ratings will simply output the average rating along with the number of reviewers and let the user judge the quality of the rating; however, as data scientists, we can do better in the next recipe.
We have thus pointed out that companies such as Amazon track purchases and page views to make recommendations, Goodreads and Yelp use 5-star ratings and text reviews, and sites such as Reddit or Stack Overflow use simple up/down voting. You can see that preference can be expressed in the data in different ways, from Boolean flags to votes to ratings. Whatever the form, these preferences are compared across users to find groups with similar tastes; this is the core assumption of collaborative filtering.
More formally, suppose two people, Bob and Alice, share a preference for a specific item, say a widget. If Alice also has a preference for a different item, say a sprocket, then Bob has a better-than-random chance of also preferring a sprocket. We believe that Bob and Alice's taste similarities can be expressed in aggregate via a large number of preferences, and by leveraging the collaborative nature of groups, we can filter the world of products.
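As a toy illustration of that assumption (the names and preference sets below are made up, not drawn from the MovieLens data), the overlap between two users' liked items gives a crude similarity score:

# Hypothetical preference sets; a real recommender derives these from ratings.
alice = {'widget', 'sprocket', 'gadget'}
bob = {'widget', 'gadget'}

# Jaccard similarity: size of the intersection over size of the union.
similarity = len(alice & bob) / float(len(alice | bob))
print(similarity)   # about 0.67: Bob looks like a good candidate to be recommended a sprocket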
In these recipes, we learned various ways of understanding the data and of finding the highest-rated movies using IPython.