Selecting the highest rated movies by year
One of the most basic and common operations to perform during data analysis is to select rows containing the largest value of some column within a group. Applied to our movie dataset, this could mean finding the highest-rated film of each year or the highest-grossing film by content rating. To accomplish these tasks, we need to sort the groups as well as the column used to rank each member of the group, and then extract the highest member of each group.
In this recipe, we will find the highest-rated film of each year using a combination of pd.DataFrame.sort_values
and pd.DataFrame.drop_duplicates
.
How to do it
Start by reading in the movie dataset and slim it down to just the three columns we care about: movie_title
, title_year
, and imdb_score
:
df = pd.read_csv(
"data/movie.csv",
usecols=["movie_title", "title_year", "imdb_score"],
dtype_backend="numpy_nullable"...