Creating similar movies from one million ratings – part 3
About 15 minutes after I set off our movie-similarities-1m script on a cluster using EMR, I have some actual results to look at. Let's review what happened.
Assessing the results
Here are the results:
The top similar movie to Star Wars Episode Four was Star Wars Episode Five, which is not too surprising. The next entry is a little more surprising: a little movie called Sanjuro had a very high similarity score. Let's look at what's going on there. Its strength, the number of people who rated it together with Star Wars, was only 60, so I think it's safe to say that this is a spurious result. Now that we're using a million ratings, we probably need to increase the minimum threshold on the number of co-raters a pair must have before we display it as a result. By doing so, we could probably filter out movies like that fairly easily and get Raiders of the Lost Ark as our second result instead of our third. I think the position...
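To make the co-rater threshold idea concrete, here is a minimal sketch in plain Python rather than Spark. The names `Similarity`, `top_similar`, `score_threshold`, and `co_rater_threshold` are hypothetical and are not taken from the movie-similarities-1m script, and the sample values are made up for illustration; only the count of 60 co-raters echoes the case discussed above.

```python
from typing import NamedTuple


class Similarity(NamedTuple):
    title: str
    score: float      # similarity score for the pair
    co_raters: int    # number of users who rated both movies


def top_similar(results, score_threshold, co_rater_threshold, limit=10):
    """Keep pairs that clear both thresholds, highest score first."""
    kept = [
        r for r in results
        if r.score > score_threshold and r.co_raters > co_rater_threshold
    ]
    return sorted(kept, key=lambda r: r.score, reverse=True)[:limit]


# Made-up values for illustration only; NOT the actual output of the EMR run.
sample = [
    Similarity("Movie A", 0.99, 2000),
    Similarity("Movie B", 0.98, 60),    # high score, but very few co-raters
    Similarity("Movie C", 0.97, 1500),
]

# A low co-rater threshold lets the thinly supported pair through.
loose = top_similar(sample, score_threshold=0.95, co_rater_threshold=50)

# A higher threshold drops it, so the next well-supported movie moves up.
strict = top_similar(sample, score_threshold=0.95, co_rater_threshold=500)

print([r.title for r in loose])
print([r.title for r in strict])
```

With the stricter threshold, the pair backed by only 60 co-raters is dropped and the movie below it moves up one position. The right threshold for the one-million-ratings dataset is something to find by experiment; the 500 used here is an arbitrary illustrative value.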