Even if you were to follow my advice above and settle on a good-enough architecture, you can and should still search for ideal hyperparameters within that architecture. Some of the hyperparameters we might want to search include the following (a sketch of one such search appears after this list):
- Our choice of optimizer. Thus far, I've been using Adam, but RMSprop or a well-tuned SGD may do better.
- Each optimizer's own set of hyperparameters, such as the learning rate, momentum, and decay.
- The network's weight initialization scheme.
- The neuron activation function.
- Regularization parameters, such as the dropout probability or the regularization strength used in L2 regularization.
- Batch size.
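
To make this more concrete, here is a minimal sketch of what a random search over a few of these hyperparameters might look like in Keras. The `build_model` helper, the values in the search space, the number of random draws, and the training and validation arrays (`X_train`, `y_train`, `X_val`, `y_val`) are all illustrative assumptions, as is the binary classification setup; treat this as a sketch of the idea rather than a definitive implementation.

```python
import random

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Dense, Dropout
from tensorflow.keras.optimizers import Adam, RMSprop, SGD


def build_model(input_dim, units, activation, dropout, optimizer_name, learning_rate):
    """Build a small MLP whose width, activation, dropout, and optimizer are hyperparameters."""
    optimizers = {
        'adam': Adam(learning_rate=learning_rate),
        'rmsprop': RMSprop(learning_rate=learning_rate),
        'sgd': SGD(learning_rate=learning_rate, momentum=0.9),
    }
    model = Sequential([
        Input(shape=(input_dim,)),
        Dense(units, activation=activation),
        Dropout(dropout),
        Dense(units, activation=activation),
        Dropout(dropout),
        Dense(1, activation='sigmoid'),  # assumes a binary classification task
    ])
    model.compile(optimizer=optimizers[optimizer_name],
                  loss='binary_crossentropy',
                  metrics=['accuracy'])
    return model


# The search space: one entry per hyperparameter we want to explore.
search_space = {
    'units': [32, 64, 128],
    'activation': ['relu', 'tanh'],
    'dropout': [0.0, 0.25, 0.5],
    'optimizer_name': ['adam', 'rmsprop', 'sgd'],
    'learning_rate': [1e-2, 1e-3, 1e-4],
    'batch_size': [32, 64, 128],
}

best_score, best_params = 0.0, None
for _ in range(20):  # 20 random draws from the search space
    params = {name: random.choice(values) for name, values in search_space.items()}
    batch_size = params.pop('batch_size')  # batch size is passed to fit(), not to the model
    model = build_model(input_dim=X_train.shape[1], **params)
    history = model.fit(X_train, y_train,
                        validation_data=(X_val, y_val),
                        batch_size=batch_size, epochs=20, verbose=0)
    score = max(history.history['val_accuracy'])
    if score > best_score:
        best_score, best_params = score, dict(params, batch_size=batch_size)

print(best_score, best_params)
```

Random draws like these are only one way to explore the space; the same `build_model` function could just as easily be driven by an exhaustive grid or a more sophisticated search strategy.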
As implied above, this is not an exhaustive list. There are certainly more options you could try, including varying the number of neurons in each hidden layer, ...