This project was all about building a convolutional neural network (CNN) classifier to solve the problem of estimating 3D human poses using frames captured from movies. Our hypothetical use case was to enable visual effects specialists to easily estimate the pose of actors (from their shoulders, necks, and heads from the frames in a video. Our task was to build the intelligence for this application.
The modified VGG16 architecture we built using transfer learning has a test mean squared error loss of 454.81 squared units over 200 test images for each of the 14 coordinates (that is, the 7(x, y) pairs). We can also say that the test root mean squared error over 200 test images for each of the 14 coordinates is 21.326 units. What does this mean?
The root mean squared error (RMSE), in this case, is a measure of how far off the predicted joint coordinates/joint pixel location...