Summary
Every day, the quality and duration of text-to-video samples circulating on social networks improve. It is likely that, by the time you read this chapter, the capabilities of the technologies mentioned here will have surpassed what is described. One constant, however, is the core idea: training a model to apply an attention mechanism across a sequence of images. A minimal sketch of that idea follows.
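As a rough illustration (this is not the AnimateDiff implementation itself; the module name, shapes, and head count are our own choices), the following PyTorch sketch shows temporal self-attention: each spatial position of a video latent attends across the frame axis, which is how a motion module injects temporal coherence into an otherwise image-only model.

```python
import torch
import torch.nn as nn


class TemporalSelfAttention(nn.Module):
    """Self-attention along the frame axis of a video latent.

    Input shape: (batch, frames, channels, height, width). Each spatial
    position attends across frames rather than across pixels.
    """

    def __init__(self, channels: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, f, c, h, w = x.shape
        # Fold the spatial grid into the batch so attention runs over frames only.
        tokens = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, f, c)
        tokens = self.norm(tokens)
        out, _ = self.attn(tokens, tokens, tokens)
        out = out.reshape(b, h, w, f, c).permute(0, 3, 4, 1, 2)
        return x + out  # residual connection keeps the base model's output intact


frames = torch.randn(1, 16, 64, 32, 32)  # 16 frames of 32x32 latents
print(TemporalSelfAttention(64)(frames).shape)  # torch.Size([1, 16, 64, 32, 32])
```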
At the time of writing, OpenAI's Sora [9] has just been released. It generates high-quality videos and is built on the Diffusion Transformer (DiT) architecture, a methodology similar to that of AnimateDiff, which likewise combines Transformer attention with a diffusion model.
What sets AnimateDiff apart is its openness and adaptability: its motion module can be applied to any community model that shares the same base checkpoint version, a feature no other solution currently offers. Furthermore, the paper's authors have fully open-sourced both the code and the model weights.
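If your environment includes Hugging Face's diffusers library (which ships an AnimateDiff integration), the checkpoint-swapping idea looks roughly like the sketch below. The community model ID is only an illustrative Stable Diffusion v1.5 fine-tune; any other SD v1.5 derivative could be substituted.

```python
import torch
from diffusers import AnimateDiffPipeline, DDIMScheduler, MotionAdapter
from diffusers.utils import export_to_gif

# The motion module is trained once against the SD v1.5 base checkpoint...
adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2")

# ...and can then be paired with any community fine-tune of that same base.
pipe = AnimateDiffPipeline.from_pretrained(
    "emilianJR/epiCRealism",  # an arbitrary SD v1.5 derivative, for illustration
    motion_adapter=adapter,
    torch_dtype=torch.float16,
).to("cuda")
pipe.scheduler = DDIMScheduler.from_config(
    pipe.scheduler.config, beta_schedule="linear", clip_sample=False
)

output = pipe(
    prompt="a rocket launching at sunset, cinematic",
    num_frames=16,
    num_inference_steps=25,
    guidance_scale=7.5,
)
export_to_gif(output.frames[0], "rocket.gif")
```

Because the motion module lives in a separate adapter rather than inside the UNet weights, swapping the base checkpoint changes the visual style while the temporal behavior carries over unchanged.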
This chapter primarily discussed...