If the elements captured by a semantic mask are well separated and non-overlapping, splitting the mask to distinguish each instance is not too complicated a task. Plenty of algorithms are available to estimate the contours of distinct blobs in binary matrices, or to provide a separate mask for each blob. For multi-class instance segmentation, this process can simply be repeated for each class mask returned by object segmentation methods, splitting each one further into instances.
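For illustration, here is a minimal sketch of this idea using connected-component labeling from SciPy (the helper name `split_instances` is our own; any equivalent blob-labeling routine would do):

```python
import numpy as np
from scipy import ndimage


def split_instances(semantic_mask):
    """Split a binary semantic mask into one binary mask per connected blob."""
    # Assign a distinct integer label to each connected blob (0 = background):
    labeled, num_instances = ndimage.label(semantic_mask)
    # Return a separate boolean mask for each labeled instance:
    return [labeled == i for i in range(1, num_instances + 1)]


# Toy 6x6 semantic mask containing two well-separated square blobs:
mask = np.zeros((6, 6), dtype=bool)
mask[0:2, 0:2] = True
mask[4:6, 4:6] = True

instances = split_instances(mask)
print(len(instances))  # the two blobs become two separate instance masks
```

For multi-class results, the same call would simply be applied to each class's binary mask in turn.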
However, precise semantic masks must be obtained first, or elements that are too close to each other may be returned as a single blob. So, how can we ensure that segmentation models pay enough attention to generating masks with precise contours, at least for non-overlapping elements? We already know the answer: the only way to teach networks to do something specific is to adapt their training loss accordingly.
U-Net was developed for biomedical applications, to segment neuronal structures in microscope...