Introducing SAM
Imagine a scenario where you are given an image and asked to predict the mask corresponding to a given text prompt (say, the dog in an image that contains multiple objects: a dog, a cat, a person, and so on). How would you go about solving this problem?
In a traditional setting, this is an object detection and segmentation problem: we either need labeled data to fine-tune a model on the given dataset or have to leverage a pre-trained model. We cannot leverage CLIP, as it classifies the overall picture rather than individual objects within it.
Further, in this scenario, we want to do all of this without even training a model. This is where the Segment Anything Model (SAM, https://arxiv.org/pdf/2304.02643) from Meta helps solve the problem.
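To make the idea concrete, here is a minimal sketch of promptable segmentation with a pre-trained SAM, assuming Meta's `segment_anything` package (`pip install segment-anything`) and a downloaded ViT-H checkpoint. Note that SAM itself consumes geometric prompts (points or boxes) rather than raw text; going from a word like "dog" to such a prompt typically requires an auxiliary grounding model, which the SAM paper explores only as a preliminary experiment. The checkpoint filename, image path, and click coordinates below are placeholders, not values from the text.

```python
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load the pre-trained model -- no training or fine-tuning involved.
# "sam_vit_h_4b8939.pth" is the publicly released ViT-H checkpoint.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

# SAM expects an RGB uint8 image of shape (H, W, 3).
image = cv2.cvtColor(cv2.imread("scene.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# Prompt with a single foreground click (label 1) placed on the dog.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[320, 240]]),  # (x, y) pixel coordinates
    point_labels=np.array([1]),           # 1 = foreground, 0 = background
    multimask_output=True,                # return 3 candidate masks
)
best_mask = masks[np.argmax(scores)]      # boolean (H, W) mask
```

With `multimask_output=True`, SAM returns three candidate masks at different granularities (e.g., the dog's head vs. the whole dog), along with predicted quality scores; picking the highest-scoring one is a simple default.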
How SAM works
SAM is trained on a corpus of 1 billion masks generated from 11 million images. These 1 billion masks (the SA-1B dataset) come from a data engine that Meta developed in the following stages:
- Assisted manual –...