Text to video
Imagine a scenario where you provide a text prompt and expect to generate a video from it. How do you implement this?
So far, we have generated images from a text prompt. Generating videos from text requires us to control two aspects:
- Temporal consistency across frames (the subject in one frame should look similar to the subject in a subsequent frame)
- Action consistency across frames (if the text prompt is a rocket shooting into the sky, the rocket should follow a consistent upward trajectory across successive frames)
We should address both of these aspects while training a text-to-video model, and once again we do so using diffusion models.
To understand the model-building process, we will learn about the text-to-video model built by damo-vilab. It leverages the UNet3DConditionModel instead of the UNet2DConditionModel that we saw in the previous chapter.
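Before diving into the architecture, here is a minimal sketch of generating a short clip from a text prompt with the Hugging Face diffusers pipeline. The specific checkpoint name (damo-vilab/text-to-video-ms-1.7b), the number of frames and inference steps, and the output filename are assumptions for illustration and may differ from the exact setup used later in this chapter:

```python
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

# Load the damo-vilab text-to-video pipeline
# (checkpoint name assumed here for illustration)
pipe = DiffusionPipeline.from_pretrained(
    "damo-vilab/text-to-video-ms-1.7b",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

# Generate a batch of frames for the prompt; all frames are denoised
# jointly, which is what enforces temporal and action consistency
prompt = "a rocket shooting into the sky"
result = pipe(prompt, num_inference_steps=25, num_frames=16)
# Depending on the diffusers version, frames for the first prompt are
# either result.frames or result.frames[0]
video_frames = result.frames[0]

# Stitch the generated frames into a video file
video_path = export_to_video(video_frames, output_video_path="rocket.mp4")
```

When the pipeline is called, the text embedding and the noisy frame latents are routed through the UNet3DConditionModel, whose building blocks we examine next.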
Workflow
The UNet3DConditionModel contains the CrossAttnDownBlock3D...