Model evaluation and hyperparameter tuning
After each epoch of our data-parallel model training, we need to evaluate how well training is progressing. We use these evaluation results to tune hyperparameters such as the learning rate and the per-GPU batch size.
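The per-epoch evaluation step can be sketched as a small helper that measures validation accuracy; this is a minimal illustration, not the book's exact code, and the function name `validate` is a hypothetical choice:

```python
import torch

def validate(model, val_loader, device):
    """Compute accuracy on the validation set after one training epoch."""
    model.eval()  # disable dropout/batch-norm updates during evaluation
    correct = total = 0
    with torch.no_grad():  # no gradients needed for evaluation
        for x, y in val_loader:
            pred = model(x.to(device)).argmax(dim=1)
            correct += (pred == y.to(device)).sum().item()
            total += y.size(0)
    return correct / total
```

A falling or plateauing validation accuracy across epochs is the signal that would prompt adjusting the learning rate or batch size.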
Note that the validation set for hyperparameter tuning comes from the training set, not the test set: we split the full training data with a 5:1 ratio, so 5/6 of the training data is used for model training and the remaining 1/6 for model validation. This can be implemented as follows:
train_all_set = datasets.MNIST('./mnist_data', download=True, train=True,
                               transform=transforms.Compose([
                                   transforms.ToTensor(), ...
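The 5:1 split itself can be done with `torch.utils.data.random_split`. The sketch below substitutes a synthetic 60,000-sample dataset for MNIST so it runs without a download; the split arithmetic is the same:

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Synthetic stand-in for the 60,000-sample MNIST training set.
full_set = TensorDataset(torch.randn(60000, 1, 28, 28),
                         torch.randint(0, 10, (60000,)))

# 5:1 split: 1/6 of the data for validation, the rest for training.
val_len = len(full_set) // 6          # 10,000 validation samples
train_len = len(full_set) - val_len   # 50,000 training samples
train_set, val_set = random_split(
    full_set, [train_len, val_len],
    generator=torch.Generator().manual_seed(0))  # fixed seed for reproducibility
```

Fixing the generator seed keeps the train/validation partition identical across runs, so evaluation results from different hyperparameter settings remain comparable.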