Using Textual Inversion
Textual inversion (TI) is another way to add capabilities to a pretrained model. Unlike Low-Rank Adaptation (LoRA), discussed in Chapter 8, which fine-tunes the text encoder and the UNet attention weights, TI leaves the model weights untouched: it learns a new embedding vector, representing a new token, from the training data.
In the context of Stable Diffusion, a text embedding is a representation of text as a numerical vector in a high-dimensional space, which allows the text to be manipulated and processed by machine learning algorithms. In Stable Diffusion, text embeddings are produced by the Contrastive Language-Image Pretraining (CLIP) [6] model.
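To make this concrete, here is a minimal sketch of how a prompt becomes a set of embedding vectors, assuming the transformers library and the openai/clip-vit-large-patch14 checkpoint (the text encoder used by Stable Diffusion v1.x; other checkpoints ship their own encoder weights):

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

# Text encoder used by Stable Diffusion v1.x (chosen here for illustration).
model_id = "openai/clip-vit-large-patch14"
tokenizer = CLIPTokenizer.from_pretrained(model_id)
text_encoder = CLIPTextModel.from_pretrained(model_id)

# Tokenize the prompt, then map each token ID to its embedding vector.
inputs = tokenizer("a photo of a cat", return_tensors="pt")
with torch.no_grad():
    outputs = text_encoder(inputs.input_ids)

# One 768-dimensional vector per token: shape [1, num_tokens, 768].
print(outputs.last_hidden_state.shape)
```

A TI embedding is simply one more learned vector in this same space, associated with a new token of your choosing.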
To train a TI model, you only need a minimal set of three to five images, resulting in a compact .pt or .bin file, typically just a few kilobytes in size. This makes TI a highly efficient method for incorporating new elements, concepts, or styles into your pretrained checkpoint model.