The use of state-of-the-art architectures, such as FCN-8s and U-Net, is key to building performant systems for semantic segmentation. However, even the most advanced models need an appropriate loss function to converge properly. While cross-entropy is the default loss for training models on both coarse (image-level) and dense (pixel-level) classification, precautions should be taken in the latter case.
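To make the dense case concrete, here is a minimal NumPy sketch of pixel-wise cross-entropy: instead of one probability vector per image, the model outputs one per pixel, and the loss averages the negative log-probability of the true class over all pixels. The function name, array shapes, and the toy 2x2 example are illustrative assumptions, not taken from any particular framework.

```python
import numpy as np

def pixelwise_cross_entropy(probs, labels, eps=1e-7):
    """Mean per-pixel cross-entropy (illustrative sketch).

    probs:  (H, W, C) predicted class probabilities (softmax output).
    labels: (H, W) integer ground-truth class indices.
    eps:    small constant to avoid log(0).
    """
    h, w = labels.shape
    # Gather the predicted probability of the true class at every pixel.
    true_probs = probs[np.arange(h)[:, None], np.arange(w)[None, :], labels]
    # Average the negative log-likelihood over all H*W pixels.
    return -np.mean(np.log(true_probs + eps))

# Toy 2x2 "image" with 2 classes:
probs = np.array([[[0.9, 0.1], [0.8, 0.2]],
                  [[0.3, 0.7], [0.6, 0.4]]])
labels = np.array([[0, 0],
                   [1, 1]])
loss = pixelwise_cross_entropy(probs, labels)
```

In practice, frameworks compute this from raw logits for numerical stability, but the averaging over pixels is the same, and it is exactly that uniform average that makes the loss vulnerable to the class-imbalance issue discussed next.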
For both image-level and pixel-level classification tasks, class imbalance is a common problem. Imagine training a model on a dataset of 990 cat pictures and 10 dog pictures. A model that learned to always output cat would achieve 99% training accuracy, yet be useless in practice. For image classification, this can be avoided by adding or removing pictures so that all classes appear in the same proportions. The problem is trickier for pixel-level classification. Some classes may appear in every image but span only a handful of pixels, while other classes may...