From few- to zero-shot learning
As you’ll remember, a key model we’ve been referring back to is GPT-3, the Generative Pre-trained Transformer. The paper that gave us the third version of this model is called Language Models Are Few-Shot Learners. (1) Why? Because the primary goal of the paper was to develop a model capable of performing well without extensive fine-tuning. This is an advantage because it means you can use one model to cover a much broader array of use cases without needing to develop custom code or curate custom datasets. Said another way, the unit economics are much stronger for few- and zero-shot learning than they are for fine-tuning. In a fine-tuning world, every new use case requires extra work before your base model can solve it. This is in contrast to a few-shot world, where the base model can take on additional use cases with little more than a handful of examples in the prompt. This makes the few-shot approach more valuable, because fine-tuning a separate model for every use case becomes too expensive at scale. While in practice fine-tuning...