To understand how to code Mask R-CNN for instance segmentation, we will use a dataset that provides masks for the people present in each image. The dataset has been created from a subset of the ADE20K dataset, which is available at https://groups.csail.mit.edu/vision/datasets/ADE20K/. We will use only those images that have masks for people.
The strategy that we'll adopt is as follows:
- Fetch the dataset and then create datasets and dataloaders from it.
- Create the ground truth in the format required by PyTorch's official implementation of Mask R-CNN.
- Download the pre-trained Faster R-CNN model and attach a Mask R-CNN head to it.
- Train the model with a PyTorch code snippet that has been standardized for training Mask R-CNN.
- Infer on an image by first performing non-max suppression and then identifying the bounding boxes and masks corresponding to the people in the image.
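Before moving on, the non-max suppression step in the last bullet can be illustrated with a minimal, pure-Python sketch. It greedily keeps the highest-scoring box and drops any remaining box whose intersection over union (IoU) with an already kept box exceeds a threshold. The function names `iou` and `non_max_suppression`, the threshold of 0.5, and the example boxes and scores are all assumptions made for illustration; in practice, one would more likely call `torchvision.ops.nms` on the predicted tensors.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle (if any)
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Clamp at zero so that disjoint boxes give zero intersection
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes to keep, highest score first."""
    # Visit candidates in order of decreasing confidence
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Hypothetical predictions: two near-duplicate boxes around one person
# and a third box around a different person.
boxes = [(10, 10, 110, 210), (12, 14, 112, 208), (200, 30, 280, 220)]
scores = [0.95, 0.80, 0.90]
kept = non_max_suppression(boxes, scores)
print(kept)
```

Here the second box overlaps heavily with the first and has a lower score, so it is suppressed, while the third box does not overlap with either and is retained. The masks that correspond to the kept indices are the ones we would then display.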
Let's code up the preceding...