While we could simply run a pretrained detection network followed by a pretrained segmentation network, the whole pipeline would likely work better if the two networks were stitched together and trained end to end. Backpropagating the segmentation loss through the shared layers helps ensure that the extracted features are meaningful for both the detection and the segmentation tasks. This is essentially the idea behind Mask R-CNN, proposed by Kaiming He et al. from Facebook AI Research (FAIR) in 2017 (Mask R-CNN, Proceedings of the IEEE International Conference on Computer Vision, ICCV).
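To make the argument about shared layers concrete, here is a deliberately tiny sketch. It is not Mask R-CNN: it assumes a single scalar weight standing in for the shared layers, two scalar heads, and squared-error losses. All names (`joint_gradients`, `w_shared`, `w_det`, `w_seg`) are hypothetical. It only shows that, when the two losses are summed, the shared weight receives a gradient contribution from each task.

```python
# Toy illustration only: one shared scalar weight feeds two scalar heads.
# Names and loss choice (squared error) are assumptions for this sketch.

def joint_gradients(w_shared, w_det, w_seg, x, y_det, y_seg):
    """Return the gradient of each task loss with respect to w_shared."""
    feature = w_shared * x              # output of the shared "layers"
    det_err = w_det * feature - y_det   # detection head error
    seg_err = w_seg * feature - y_seg   # segmentation head error
    # Chain rule: d/dw_shared of err**2 is 2 * err * w_head * x.
    grad_from_det = 2.0 * det_err * w_det * x
    grad_from_seg = 2.0 * seg_err * w_seg * x
    return grad_from_det, grad_from_seg


g_det, g_seg = joint_gradients(
    w_shared=0.5, w_det=1.0, w_seg=2.0, x=2.0, y_det=0.0, y_seg=1.0
)
# With a summed loss, the shared weight is updated using both contributions.
g_total = g_det + g_seg
print(g_det, g_seg, g_total)
```

If the two networks were trained separately, the segmentation term would never reach the detector's layers, so nothing would push those features to be useful for segmentation.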
Mask R-CNN is mostly based on Faster R-CNN. Like Faster R-CNN, Mask R-CNN is composed of a region-proposal network, followed by two branches predicting the class and the box offset for each proposed region (refer to Chapter 5, Object Detection Models)....
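As a rough sketch of what the two branches produce for one proposed region, the snippet below pairs a list of class scores with a box offset and applies that offset to the proposal. It assumes the usual R-CNN-style parameterization of offsets (center shifts scaled by the proposal size, log-scale size changes). The `Box` and `RegionPrediction` types and the helper names are hypothetical and do not come from any library.

```python
import math
from dataclasses import dataclass


@dataclass
class Box:
    cx: float  # center x
    cy: float  # center y
    w: float   # width
    h: float   # height


@dataclass
class RegionPrediction:
    class_scores: list  # output of the classification branch
    box_offset: tuple   # (dx, dy, dw, dh) from the box-regression branch


def apply_offset(proposal, offset):
    """Refine a proposal with an offset (assumed R-CNN-style encoding)."""
    dx, dy, dw, dh = offset
    return Box(
        cx=proposal.cx + dx * proposal.w,  # shift scaled by proposal width
        cy=proposal.cy + dy * proposal.h,  # shift scaled by proposal height
        w=proposal.w * math.exp(dw),       # log-scale width change
        h=proposal.h * math.exp(dh),       # log-scale height change
    )


def predicted_class(pred):
    """Index of the highest-scoring class."""
    scores = pred.class_scores
    return max(range(len(scores)), key=scores.__getitem__)


proposal = Box(cx=50.0, cy=40.0, w=20.0, h=10.0)
pred = RegionPrediction(class_scores=[0.1, 0.7, 0.2],
                        box_offset=(0.1, -0.2, 0.0, 0.0))
refined = apply_offset(proposal, pred.box_offset)
label = predicted_class(pred)
print(label, refined)
```

In a real model both outputs would come from learned layers applied to the features of each proposed region; here they are hard-coded so the data flow stays visible.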