Building encoder-decoders to directly output label maps—where each pixel value represents a class (for instance, 1 for dog and 2 for cat)—would yield poor results. As with classifiers, we need a better way to output categorical values.
To classify images among N categories, we learned to build networks whose final layers output N logits, representing the predicted per-class scores. We also learned how to convert these scores into probabilities using the softmax operation, and how to return the most probable class(es) by picking the highest values (for instance, using argmax). The same mechanism can be applied to semantic segmentation, at the pixel level instead of the image level. Instead of outputting a column vector of N logits containing the per-class scores for the whole image, our network is built to return an H × W × N tensor with scores for each pixel (refer to Figure 6-10):
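As a minimal sketch of this per-pixel conversion (assuming a TensorFlow setup, with a placeholder tensor standing in for the network's output), the softmax and argmax operations are simply applied along the class axis of the prediction tensor:

```python
import tensorflow as tf

# Placeholder for the logits a segmentation network would return:
# shape (batch, H, W, N), with N per-class scores for every pixel.
logits = tf.random.normal((1, 128, 128, 3))

# Per-pixel probabilities: softmax over the last (class) axis.
probs = tf.nn.softmax(logits, axis=-1)    # shape (1, 128, 128, 3)

# Label map: for each pixel, the index of the most probable class.
label_map = tf.argmax(probs, axis=-1)     # shape (1, 128, 128), dtype int64
```

Since softmax is monotonic, applying argmax directly to the logits yields the same label map; the softmax step matters when the actual probabilities are needed, for instance to compute the training loss.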