Overcoming the 77-token limitation
Fortunately, the Stable Diffusion UNet doesn’t enforce this 77-token limitation; its cross-attention layers accept text embeddings of any sequence length. If we encode the prompt in chunks, concatenate the chunked embeddings into one tensor, and provide that tensor to the UNet, we can overcome the 77-token limitation. Here’s an overview of the process:
- Extract the text tokenizer and text encoder from the Stable Diffusion pipeline.
- Tokenize the input prompt, regardless of its size.
- Remove the beginning-of-text and end-of-text tokens that the tokenizer adds.
- Pop out the first 77 tokens and encode them into embeddings; repeat until all tokens are processed.
- Stack the chunk embeddings into a single tensor of size [1, x, 768].
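The chunking logic behind these steps can be sketched with plain Python lists; this is a minimal illustration (the helper name `chunk_token_ids` is hypothetical, not part of the pipeline):

```python
# A minimal sketch of the chunking step. In the real pipeline, `ids`
# would be the tokenizer output with the beginning-of-text and
# end-of-text tokens already stripped.
def chunk_token_ids(ids, chunk_size=77):
    """Split a token-id list into chunks of at most `chunk_size` tokens."""
    return [ids[i:i + chunk_size] for i in range(0, len(ids), chunk_size)]

# Each chunk is encoded by the text encoder into a tensor of shape
# [1, len(chunk), 768]; concatenating those results along dimension 1
# yields the [1, x, 768] tensor that is handed to the UNet.
```

The last chunk may be shorter than 77 tokens, which is fine: concatenation along the sequence dimension does not require equal chunk lengths.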
Now, let’s implement this idea using Python code:
- Take out the text tokenizer and text encoder:
# step 1. take out the tokenizer and text encoder
tokenizer = pipe.tokenizer
text_encoder = pipe.text_encoder
We can reuse the tokenizer and text encoder from the Stable Diffusion pipeline.
- Tokenize an input prompt of any size...