Evaluating Yelp reviews
We read in the processed Yelp reviews using this script and print out some statistics of the data:
reviews <- read.csv("c:/Users/Dan/yelp_academic_dataset_review.csv")
I usually take a look at some of the data once loaded to visually check that things are working as expected. We can do this with a head() function call:
head(reviews)
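Beyond head(), a few other base R calls give a quick structural check of what was read in. The following is a minimal sketch on a small made-up data frame, reviews_sample, standing in for reviews; its values and its choice of columns are assumptions for illustration, not the actual Yelp data:

```r
# Hypothetical stand-in for the loaded reviews data frame;
# the real file has many more rows and columns.
reviews_sample <- data.frame(
  stars  = c(5, 4, 1),
  useful = c(0, 3, 12),
  funny  = c(0, 1, 7),
  cool   = c(1, 0, 2)
)

dim(reviews_sample)    # number of rows and columns
names(reviews_sample)  # column names as read in
str(reviews_sample)    # type and first few values of each column
```

The same three calls can be run on reviews itself once the CSV has loaded.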
Summary data
All of the columns appear to have loaded correctly. Now, we can look at summary statistics for the data:
summary(reviews)
There are several points in the summary worth noting:
- Some of the data points I had assumed would be just TRUE/FALSE, 0/1 have ranges instead; for example, funny has a max value over 600, useful has a max of 1100, and cool has 500.
- All of the IDs (users, businesses) have been mangled. We could use the user file and the business file to come up with exact references.
- Star ratings are 1-5, as expected. However, the mean and median are both about 4, which I take to mean that many people only take the time to write good reviews.
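The points in this list can also be checked directly rather than read off the summary() output. This is a minimal sketch, using a small made-up data frame, reviews_sample, in place of reviews; the column name stars is an assumption (the text only names funny, useful, and cool), and the values are invented for illustration:

```r
# Hypothetical stand-in for the reviews data frame
reviews_sample <- data.frame(
  stars  = c(5, 4, 4, 1, 5),
  useful = c(0, 3, 12, 1, 0),
  funny  = c(0, 1, 7, 0, 0),
  cool   = c(1, 0, 2, 0, 0)
)

# Maximum of each vote column: values above 1 show these are counts, not flags
vote_max <- sapply(reviews_sample[c("funny", "useful", "cool")], max)
print(vote_max)

# Center of the star ratings
star_mean   <- mean(reviews_sample$stars)
star_median <- median(reviews_sample$stars)
print(c(mean = star_mean, median = star_median))

# Share of reviews at each star level
star_share <- prop.table(table(reviews_sample$stars))
print(star_share)
```

On the real data, replacing reviews_sample with reviews should show whether the high ratings dominate as the mean and median suggest.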