How TI works
Simply put, training a TI means finding the text embedding that best matches the target image, whether that is a style, an object, or a face. The key is that this is a new embedding, one that does not yet exist in the current text encoder. As Figure 9.3, from the original paper [1], shows:
Figure 9.3: The outline of the text embedding and inversion process
The training's only job is to find a new embedding, represented by v*, with S* serving as a placeholder token string; the placeholder can later be replaced by any string that does not already exist in the tokenizer. Once the corresponding embedding vector is found, training is done. The output of the training is usually a single vector of 768 numbers. That is why the TI file is tiny; it is just a couple of kilobytes.
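To make the idea concrete, here is a toy sketch of what "finding one vector" looks like. This is not the real Stable Diffusion training pipeline: in the actual method the gradient comes from the frozen UNet's denoising loss, while here a stand-in quadratic loss against a fixed target direction is assumed purely to illustrate that the only trainable parameter is a single 768-dimensional vector.

```python
import numpy as np

EMBED_DIM = 768                      # CLIP ViT-L/14 text embedding size
rng = np.random.default_rng(0)

# Hypothetical stand-in for the signal the real UNet denoising loss provides.
target = rng.normal(size=EMBED_DIM)

# v* is the ONLY trainable parameter; the UNet and text encoder stay frozen.
v_star = rng.normal(size=EMBED_DIM)

lr = 0.1
for step in range(200):
    grad = 2 * (v_star - target)     # gradient of the toy loss ||v* - target||^2
    v_star -= lr * grad              # plain gradient descent on one vector

# The trained artifact is just this vector: 768 float32 values,
# 768 * 4 bytes = 3,072 bytes, which is why TI files are only kilobytes.
print(v_star.astype(np.float32).nbytes)  # 3072
```

After training, saving v* under the placeholder string S* is all a TI file needs to contain, which matches the tiny file sizes mentioned above.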
Think of the pretrained UNet as a pile of magic matrix boxes: one key (an embedding) can unlock a box containing a pattern, a style, or an object. The number of boxes is far greater than the limited set of existing keys...