Cross-validating the results
As I've already mentioned, the dataset for this chapter is a manually coded group of 500 hotel reviews taken from the OpinRank dataset. For this experiment, we'll break these into 10 chunks of 50 reviews each.
These chunks will allow us to use K-fold cross-validation to test how our system is doing. Cross-validation checks your algorithm and procedures by splitting your data into equally sized chunks, or folds. You train your system on all of the folds but one; those folds make up the training set, and the held-out fold is the validation set. You then calculate the error by running the trained system on the validation set. Next, you hold out a different fold as the validation set and start over. Finally, you average the error across all of the trials.
For example, suppose the validation procedure uses four folds: A, B, C, and D. For the first run, A, B, and C would be the training set, and D would be the validation set. Next, A, B, and D would be the training set, and C would be the validation set. This would continue until every fold has served as the validation set once, and the four error measurements would then be averaged.
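The procedure described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the chapter's actual implementation: the `train_and_error` callback stands in for whatever training and error-measurement steps your system uses.

```python
def k_fold_splits(data, k):
    """Split the data into k equally sized folds (chunks)."""
    fold_size = len(data) // k
    return [data[i * fold_size:(i + 1) * fold_size] for i in range(k)]

def cross_validate(data, k, train_and_error):
    """Hold out each fold in turn as the validation set,
    train on the rest, and average the per-fold errors.

    train_and_error is a hypothetical callback: it takes
    (training, validation) and returns an error measurement."""
    folds = k_fold_splits(data, k)
    errors = []
    for i in range(k):
        validation = folds[i]
        training = [item for j, fold in enumerate(folds)
                    if j != i
                    for item in fold]
        errors.append(train_and_error(training, validation))
    return sum(errors) / k
```

With the chapter's 500 reviews and `k=10`, each run trains on 450 reviews and validates on the remaining 50.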