Notice how we did not use the test set until the last step. Why not? This is a pretty important detail to ensure that the test remains a good one. As we iterate through the training set and nudge our classifier one way or another, we can sometimes wrap the classifier around the images or overtrain. This happens when you learn the training set rather than learn the features inside each of the classes.
When we overtrain, our accuracy on the iterative rounds of the training set will look promising, but that is all false hope. Having a never-before-seen test set should introduce reality back into the process. Great accuracy on the training set followed by poor results on the test set suggests overfitting.
This is why we've kept a separate test set. It helps indicate the real accuracy of our classifier. This is also why you should never shuffle your dataset...