Evaluating Yelp reviews
We read in the processed Yelp reviews using this script and print out some statistics of the data:
reviews <- read.csv("c:/Users/Dan/yelp_academic_dataset_review.csv")
I usually take a look at some of the data once it is loaded to visually check that things are working as expected. We can do this with a `head()` function call:
head(reviews)
![](https://static.packt-cdn.com/products/9781785880070/graphics/69426c3a-266a-44f4-a5ea-04779b7b0c57.png)
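Beyond `head()`, a few base R calls give a quick structural check of a freshly loaded data frame. The sketch below is only illustrative: it uses a tiny, made-up data frame named `reviews_demo` as a stand-in for `reviews` so that it runs without the CSV file. The column names follow the ones discussed in this section, and all of the values are invented.

```r
# Hypothetical stand-in for the loaded reviews data frame; values are invented.
reviews_demo <- data.frame(
  user_id     = c("u1", "u2", "u3"),
  business_id = c("b1", "b1", "b2"),
  stars       = c(5L, 4L, 1L),
  funny       = c(0L, 3L, 1L),
  useful      = c(2L, 0L, 7L),
  cool        = c(1L, 0L, 0L),
  stringsAsFactors = FALSE
)

# Number of rows and columns.
dim(reviews_demo)

# Type of each column along with a few sample values.
str(reviews_demo)

# Count of missing values per column.
colSums(is.na(reviews_demo))
```

Running the same three calls on the real `reviews` data frame should show whether any column was read with an unexpected type or contains missing values.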
Summary data
All of the columns appear to have loaded correctly. Now, we can look at summary statistics for the data:
summary(reviews)
![](https://static.packt-cdn.com/products/9781785880070/graphics/06be3241-9afc-49b7-a6fd-d6b760d7922f.png)
There are several points in the summary worth noting:
- Some of the data points I had assumed would be just `TRUE`/`FALSE` or `0`/`1` have ranges instead; for example, `funny` has a max value over 600, `useful` has a max of 1100, and `cool` has a max of 500.
- All of the IDs (users, businesses) have been mangled. We could use the user file and the business file to come up with exact references.
- Star ratings are `1`-`5`, as expected. However, the mean and median are about `4`, which I take to mean that many people only take the time to write good reviews.
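The points above can be checked directly rather than read off the summary output. The following is a minimal sketch under stated assumptions: `reviews_demo` and `business_demo` are hypothetical stand-ins with invented values, and the business file is assumed to share a `business_id` key with the reviews and to carry a `name` column. Treat it as an illustration, not as the exact layout of the Yelp files.

```r
# Hypothetical stand-ins; all values are invented for illustration.
reviews_demo <- data.frame(
  business_id = c("b1", "b1", "b2", "b2", "b3"),
  stars  = c(5L, 4L, 4L, 1L, 5L),
  funny  = c(0L, 3L, 1L, 12L, 0L),
  useful = c(2L, 0L, 7L, 25L, 1L),
  cool   = c(1L, 0L, 0L, 9L, 0L),
  stringsAsFactors = FALSE
)
business_demo <- data.frame(
  business_id = c("b1", "b2", "b3"),
  name = c("Cafe A", "Diner B", "Bar C"),  # assumed column name
  stringsAsFactors = FALSE
)

# Vote columns are counts, not flags: inspect the min and max of each one.
vote_ranges <- sapply(reviews_demo[c("funny", "useful", "cool")], range)

# Distribution of star ratings, plus the mean and median.
star_counts <- table(reviews_demo$stars)
star_mean   <- mean(reviews_demo$stars)
star_median <- median(reviews_demo$stars)

# Resolve opaque business IDs by joining on the shared key.
named_reviews <- merge(reviews_demo, business_demo, by = "business_id")

print(vote_ranges)
print(star_counts)
print(c(mean = star_mean, median = star_median))
print(named_reviews[c("name", "stars")])
```

On the real data, `table(reviews$stars)` would show how heavily the ratings lean toward the high end, which is a more direct test of the "mostly good reviews" reading than the mean and median alone.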