Initializing the Stable Diffusion UNet
The UNet architecture [5] was introduced by Ronneberger et al. for biomedical image segmentation. Before UNet, convolutional networks were mostly used for image classification, where the output is a single class label for the whole image. Many visual tasks, however, also require localization, that is, a class label assigned to each pixel, and UNet was designed to solve this problem.
The U-shaped architecture of UNet enables efficient learning of features at different scales. Its skip connections pass feature maps from the downsampling stages directly to the matching upsampling stages, allowing the model to propagate information across scales. This is crucial for denoising, as it ensures the model retains both fine-grained details and global context during noise removal. These properties make UNet a good candidate for the denoising model.
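To make the role of a skip connection concrete, here is a minimal sketch of a two-level U-shaped network in PyTorch. It is not the Stable Diffusion UNet: the class name TinyUNet, the layer widths, and the 4-channel 64x64 input are assumptions chosen purely for illustration. The point is only to show the full-resolution feature map being concatenated with the upsampled coarse features before the final convolution.

```python
import torch
import torch.nn as nn


class TinyUNet(nn.Module):
    """Hypothetical two-level U-shaped network, for illustration only."""

    def __init__(self, channels: int = 4, width: int = 16):
        super().__init__()
        # Encoder: extracts fine-grained features at full resolution.
        self.enc = nn.Conv2d(channels, width, kernel_size=3, padding=1)
        # Downsample by 2 so the middle block sees a coarser, more global view.
        self.down = nn.MaxPool2d(2)
        self.mid = nn.Conv2d(width, width * 2, kernel_size=3, padding=1)
        # Upsample back to the input resolution.
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        # The decoder receives the upsampled coarse features concatenated
        # with the encoder features (the skip connection).
        self.dec = nn.Conv2d(width * 2 + width, channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        skip = torch.relu(self.enc(x))                   # (B, width, H, W)
        coarse = torch.relu(self.mid(self.down(skip)))   # (B, 2*width, H/2, W/2)
        upsampled = self.up(coarse)                      # (B, 2*width, H, W)
        merged = torch.cat([upsampled, skip], dim=1)     # (B, 3*width, H, W)
        return self.dec(merged)                          # (B, channels, H, W)


model = TinyUNet()
noisy = torch.randn(1, 4, 64, 64)  # stand-in for a noisy input tensor
with torch.no_grad():
    out = model(noisy)
print(out.shape)
```

The output has the same shape as the input, which is what a denoising model needs: it predicts one value per input element. Without the concatenation, the decoder would only see features that had passed through the downsampling step, so fine spatial detail would have to be reconstructed from the coarse representation alone.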
In the Diffusers library, there is a class named UNet2DConditionModel; this is a conditional 2D...