Generating text embeddings using CLIP
To generate the text embeddings (the vectors that capture the semantic features of the prompt), we first need to tokenize the input text or prompt and then encode the token IDs into embeddings. Here are the steps to achieve this:
- Get the prompt token IDs:
input_prompt = "a running dog"
# import the tokenizer and the CLIP text encoder model
import torch
from transformers import CLIPTokenizer, CLIPTextModel
# initialize the tokenizer (a tokenizer only maps text to integer IDs, so no dtype is needed)
clip_tokenizer = CLIPTokenizer.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    subfolder = "tokenizer"
)
# convert the prompt text into token IDs as a PyTorch tensor
input_tokens = clip_tokenizer(
    input_prompt,
    return_tensors = "pt"
)["input_ids"]
input_tokens
The preceding code will convert the a running dog text prompt into a list of token IDs as a torch tensor object: tensor([[49406, 320, 2761,...
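The leading 49406 is the tokenizer's start-of-text marker, and an end-of-text marker (49407) closes the sequence, with the word tokens in between. The second step mentioned above, encoding these token IDs into embeddings with CLIPTextModel, can be sketched as follows; this is a minimal sketch that assumes the text_encoder subfolder of the same runwayml/stable-diffusion-v1-5 checkpoint, and the clip_text_encoder and prompt_embeds names are chosen here for illustration:
# a minimal sketch (assumption): encode the token IDs into text embeddings
clip_text_encoder = CLIPTextModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    subfolder = "text_encoder"
)
with torch.no_grad():
    # the encoder returns one 768-dimensional embedding per token;
    # [0] selects last_hidden_state from the model output
    prompt_embeds = clip_text_encoder(input_tokens)[0]
prompt_embeds.shape   # (1, number_of_tokens, 768)
The resulting tensor holds one embedding vector per token and is the text condition that the rest of the Stable Diffusion pipeline consumes.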