Diffusion model
In machine learning, diffusion models, also known as diffusion probabilistic models, are a class of latent variable models. They are Markov chains trained using variational inference.[1] The goal of diffusion models is to learn the latent structure of a dataset by modeling the way in which data points diffuse through the latent space. In computer vision, this means that a neural network is trained to denoise images blurred with Gaussian noise by learning to reverse the diffusion process.[2][3] Three examples of generic diffusion modeling frameworks used in computer vision are denoising diffusion probabilistic models, noise conditioned score networks, and stochastic differential equations.[4]
Diffusion models were introduced in 2015 with a motivation from non-equilibrium thermodynamics.[5]
Diffusion models can be applied to a variety of tasks, including image denoising, inpainting, super-resolution, and image generation. For example, an image generation model would start with a random noise image and then, after having been trained to reverse the diffusion process on natural images, would be able to generate new natural images. Announced on 13 April 2022, OpenAI's text-to-image model DALL-E 2 is a recent example. It uses diffusion models for both the model's prior (which produces an image embedding given a text caption) and the decoder that generates the final image.[6]
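To make the reversal concrete, the following is a minimal sketch of DDPM-style ancestral sampling as in Ho et al.[1], written in Python/PyTorch; the noise-prediction network `eps_model`, the variance schedule `betas`, and the tensor shapes are illustrative assumptions, not details from the article.

```python
import torch

def ddpm_sample(eps_model, shape, betas):
    """Generate images by reversing the diffusion process (sketch).

    eps_model(x_t, t) is assumed to predict the Gaussian noise that was
    added to a clean image to produce x_t; betas is the noise schedule."""
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)                          # start from pure noise
    for t in reversed(range(len(betas))):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps = eps_model(x, t_batch)                 # predicted noise
        # posterior mean of the denoising step (Ho et al., 2020)
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        x = (x - coef * eps) / torch.sqrt(alphas[t])
        if t > 0:                                   # add noise except at the last step
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return x
```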
Mathematical principles
Generating an image in the space of all images
Consider the problem of image generation. Let $x$ represent an image, and let $q(x)$ be the probability distribution over all possible images. If we have $q(x)$ itself, then we can say for certain how likely a certain image is. However, this is intractable in general.
Most often, we are uninterested in knowing the absolute probability of a certain image -- when, if ever, are we interested in how likely an image is in the space of all possible images? Instead, we are usually only interested in knowing how likely a certain image is compared to its immediate neighbors -- how much more likely is this image of a cat, compared to some small variants of it? Is it more likely if the image contains two whiskers, or three, or with some Gaussian noise added?
Consequently, we are actually quite uninterested in $q(x)$ itself, but rather in $\nabla_x \ln q(x)$. This has two effects:
- One, we no longer need to normalize $q(x)$, but can use any $\tilde{q}(x) = C q(x)$, where $C > 0$ is any unknown constant that is of no concern to us.
- Two, we are comparing $q(x)$ with its neighbors $q(x + dx)$, by $\frac{q(x + dx)}{q(x)} = e^{\langle \nabla_x \ln q(x),\, dx \rangle}$.
Let the score function be $s(x) := \nabla_x \ln q(x)$; then consider what we can do with $s(x)$.
As it turns out, $s(x)$ allows us to sample from $q(x)$ using stochastic gradient Langevin dynamics (SGLD), which is essentially an infinitesimal version of Markov chain Monte Carlo.[2]
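To make this concrete, here is a minimal sketch of an (unadjusted) Langevin sampler, assuming a known score function; the name `score_fn`, the step size, and the step count are illustrative assumptions, and practical samplers anneal the noise level rather than using one fixed score.

```python
import torch

def langevin_sample(score_fn, shape, n_steps=1000, step_size=1e-4):
    """Draw an approximate sample from q(x) via Langevin dynamics,
    given score_fn(x) = grad_x log q(x). A sketch, not a full sampler."""
    x = torch.randn(shape)  # arbitrary initialization
    for _ in range(n_steps):
        z = torch.randn_like(x)
        # x <- x + (eta / 2) * s(x) + sqrt(eta) * z, with z ~ N(0, I)
        x = x + 0.5 * step_size * score_fn(x) + (step_size ** 0.5) * z
    return x
```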
Learning the score function
The score function can be learned by noising-denoising: Gaussian noise is added to clean images, and a network is trained to predict the added noise, whose conditional expectation determines (up to scaling) the score of the noised distribution.[1]
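A minimal sketch of one such training step, in the noise-prediction parametrization of Ho et al.[1]; the network `eps_model` and the cumulative schedule `alpha_bars` are assumed inputs, not part of the article.

```python
import torch

def ddpm_loss(eps_model, x0, alpha_bars):
    """One noising-denoising training step (sketch).

    Clean images x0 are noised to x_t, and eps_model is trained to
    recover the added noise eps; the score of the noised distribution
    is then approximately -eps_model(x_t, t) / sqrt(1 - alpha_bar_t)."""
    B = x0.shape[0]
    t = torch.randint(0, len(alpha_bars), (B,))        # random noise level
    a_bar = alpha_bars[t].view(B, *([1] * (x0.dim() - 1)))
    eps = torch.randn_like(x0)                         # Gaussian noise
    x_t = torch.sqrt(a_bar) * x0 + torch.sqrt(1.0 - a_bar) * eps
    return ((eps_model(x_t, t) - eps) ** 2).mean()     # simple MSE objective
```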
Main variants
Classifier guidance
Suppose we wish to sample not from the entire distribution of images, but conditional on the image description. We don't want to sample a generic image, but an image that fits the description "black cat with red eyes". Generally, we want to sample from the distribution $q(x|y)$, where $x$ ranges over images, and $y$ ranges over classes of images (a description "black cat with red eyes" is just a very detailed class, and a class "cat" is just a very vague description).
Taking the perspective of the noisy channel model, we can understand the process as follows: To generate an image conditional on description $y$, we imagine that the requester really had in mind an image $x$, but the image is passed through a noisy channel and came out garbled, as $y$. Image generation is then nothing but inferring which $x$ the requester had in mind.
In other words, conditional image generation is simply "translating from a textual language into a pictorial language". Then, as in the noisy-channel model, we use Bayes' theorem to get

$$q(x|y) \propto q(y|x)\, q(x);$$
in other words, if we have a good model of the space of all images, and a good image-to-class translator, we get a class-to-image translator "for free". The SGLD uses

$$\nabla_x \ln q(x|y) = \nabla_x \ln q(x) + \nabla_x \ln q(y|x),$$

where $\nabla_x \ln q(x)$ is the score function, trained as previously described, and $\nabla_x \ln q(y|x)$ is found by using a differentiable image classifier.
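As a sketch, the guided score can be assembled from the unconditional score network and any differentiable classifier; the helper names below are assumptions, and the classifier is assumed to return class logits.

```python
import torch

def classifier_grad(classifier, x, y):
    """grad_x log q(y|x), via autograd through a differentiable
    classifier that maps images to class logits (sketch)."""
    x = x.detach().requires_grad_(True)
    log_probs = classifier(x).log_softmax(dim=-1)
    log_py = log_probs[torch.arange(x.shape[0]), y]
    return torch.autograd.grad(log_py.sum(), x)[0]

def guided_score(score_fn, classifier, x, y):
    """grad_x log q(x|y) = grad_x log q(x) + grad_x log q(y|x)."""
    return score_fn(x) + classifier_grad(classifier, x, y)
```

For a fixed description $y$, `lambda x: guided_score(score_fn, classifier, x, y)` can then be dropped into the Langevin sampler sketched earlier in place of the unconditional score.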
With temperature
The classifier-guided diffusion model samples from $q(x|y)$, which is concentrated around the maximum a posteriori estimate $\arg\max_x q(x|y)$. If we want to force the model to move towards the maximum likelihood estimate $\arg\max_x q(y|x)$, we can use

$$q_\gamma(x|y) \propto q(y|x)^\gamma\, q(x),$$

where $\gamma > 0$ is interpretable as inverse temperature. In the context of diffusion models, it is usually called the guidance scale. A high $\gamma$ would force the model to sample from a distribution concentrated around $\arg\max_x q(y|x)$. This often improves the quality of generated images.[7] This can be done simply by SGLD with

$$\nabla_x \ln q_\gamma(x|y) = \nabla_x \ln q(x) + \gamma \nabla_x \ln q(y|x).$$
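In code, this is a one-line change to the previous sketch, reusing the hypothetical `classifier_grad` helper defined there:

```python
def scaled_guided_score(score_fn, classifier, x, y, gamma):
    """grad_x log q_gamma(x|y) = grad_x log q(x) + gamma * grad_x log q(y|x).

    gamma = 1.0 recovers plain classifier guidance; larger gamma
    concentrates samples around argmax_x q(y|x) (sketch)."""
    return score_fn(x) + gamma * classifier_grad(classifier, x, y)
```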
Further reading
- Guidance: a cheat code for diffusion models. Good overview up to 2022.
References
- Ho, Jonathan; Jain, Ajay; Abbeel, Pieter (19 June 2020). "Denoising Diffusion Probabilistic Models". arXiv:2006.11239 [cs.LG].
- Song, Yang; Sohl-Dickstein, Jascha; Kingma, Diederik P.; Kumar, Abhishek; Ermon, Stefano; Poole, Ben (2021-02-10). "Score-Based Generative Modeling through Stochastic Differential Equations". arXiv:2011.13456 [cs.LG].
- Gu, Shuyang; Chen, Dong; Bao, Jianmin; Wen, Fang; Zhang, Bo; Chen, Dongdong; Yuan, Lu; Guo, Baining (2021). "Vector Quantized Diffusion Model for Text-to-Image Synthesis". arXiv:2111.14822 [cs.CV].
- Croitoru, Florinel-Alin; Hondru, Vlad; Ionescu, Radu Tudor; Shah, Mubarak (2022). "Diffusion models in vision: A survey". arXiv:2209.04747 [cs.CV].
- Sohl-Dickstein, Jascha; Weiss, Eric; Maheswaranathan, Niru; Ganguli, Surya (2015-06-01). "Deep Unsupervised Learning using Nonequilibrium Thermodynamics" (PDF). Proceedings of the 32nd International Conference on Machine Learning. PMLR. 37: 2256–2265.
- Ramesh, Aditya; Dhariwal, Prafulla; Nichol, Alex; Chu, Casey; Chen, Mark (2022). "Hierarchical Text-Conditional Image Generation with CLIP Latents". arXiv:2204.06125 [cs.CV].
- Dhariwal, Prafulla; Nichol, Alex (2021-06-01). "Diffusion Models Beat GANs on Image Synthesis". arXiv:2105.05233 [cs.LG].
- Ho, Jonathan; Salimans, Tim (2022-07-25). "Classifier-Free Diffusion Guidance". arXiv:2207.12598 [cs.LG].
- Nichol, Alex; Dhariwal, Prafulla; Ramesh, Aditya; Shyam, Pranav; Mishkin, Pamela; McGrew, Bob; Sutskever, Ilya; Chen, Mark (2022-03-08). "GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models". arXiv:2112.10741 [cs.CV].
- Saharia, Chitwan; Chan, William; Saxena, Saurabh; Li, Lala; Whang, Jay; Denton, Emily; Ghasemipour, Seyed Kamyar Seyed; Ayan, Burcu Karagol; Mahdavi, S. Sara; Lopes, Rapha Gontijo; Salimans, Tim; Ho, Jonathan; Fleet, David J.; Norouzi, Mohammad (2022-05-23). "Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding". arXiv:2205.11487 [cs.CV].