Diffusion model
In machine learning, diffusion models, also known as diffusion probabilistic models, are a class of latent variable models. They are Markov chains trained using variational inference.[1] The goal of diffusion models is to learn the latent structure of a dataset by modeling the way in which data points diffuse through the latent space. In computer vision, this means that a neural network is trained to denoise images blurred with Gaussian noise by learning to reverse the diffusion process.[2][3] Three examples of generic diffusion modeling frameworks used in computer vision are denoising diffusion probabilistic models, noise conditioned score networks, and stochastic differential equations.[4]
Diffusion models were introduced in 2015 with a motivation from non-equilibrium thermodynamics.[5]
Diffusion models can be applied to a variety of tasks, including image denoising, inpainting, super-resolution, and image generation. For example, an image generation model would start with a random noise image and then, after having been trained to reverse the diffusion process on natural images, would be able to generate new natural images. Announced on 13 April 2022, OpenAI's text-to-image model DALL-E 2 is a recent example. It uses diffusion models for both the model's prior (which produces an image embedding given a text caption) and the decoder that generates the final image.[6]
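To make the reversal concrete, the following is a minimal sketch of DDPM-style ancestral sampling as in Ho et al.[1], written in Python/PyTorch; the noise-prediction network `eps_model`, the variance schedule `betas`, and the tensor shapes are illustrative assumptions, not details from the article.

```python
import torch

def ddpm_sample(eps_model, shape, betas):
    """Generate images by reversing the diffusion process (sketch).

    eps_model(x_t, t) is assumed to predict the Gaussian noise that was
    added to a clean image to produce x_t; betas is the noise schedule."""
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)                          # start from pure noise
    for t in reversed(range(len(betas))):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps = eps_model(x, t_batch)                 # predicted noise
        # posterior mean of the denoising step (Ho et al., 2020)
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        x = (x - coef * eps) / torch.sqrt(alphas[t])
        if t > 0:                                   # add noise except at the last step
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return x
```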
Mathematical principles
Generating an image in the space of all images
Consider the problem of image generation. Let $x$ represent an image, and let $q(x)$ be the probability distribution over all possible images. If we have $q(x)$ itself, then we can say for certain how likely a certain image is. However, this is intractable in general.
Most often, we are uninterested in knowing the absolute probability of a certain image -- when, if ever, are we interested in how likely an image is in the space of all possible images? Instead, we are usually only interested in knowing how likely a certain image is compared to its immediate neighbors -- how much more likely is this image of a cat, compared to some small variants of it? Is it more likely if the image contains two whiskers, or three, or with some Gaussian noise added?
Consequently, we are actually quite uninterested in $q(x)$ itself, but rather in $\nabla_x \ln q(x)$. This has two effects:
- One, we no longer need to normalize $q(x)$, but can use any $\tilde{q}(x) = C q(x)$, where $C > 0$ is any unknown constant that is of no concern to us.
- Two, we are comparing $q(x)$ with its neighbors $q(x + dx)$, by $\frac{q(x + dx)}{q(x)} = e^{\langle \nabla_x \ln q(x),\, dx \rangle}$.
Let the score function be $s(x) := \nabla_x \ln q(x)$; then consider what we can do with $s(x)$.
As it turns out, $s(x)$ allows us to sample from $q(x)$ using stochastic gradient Langevin dynamics (SGLD), which is essentially an infinitesimal version of Markov chain Monte Carlo.[2]
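To make this concrete, here is a minimal sketch of an (unadjusted) Langevin sampler, assuming a known score function; the name `score_fn`, the step size, and the step count are illustrative assumptions, and practical samplers anneal the noise level rather than using one fixed score.

```python
import torch

def langevin_sample(score_fn, shape, n_steps=1000, step_size=1e-4):
    """Draw an approximate sample from q(x) via Langevin dynamics,
    given score_fn(x) = grad_x log q(x). A sketch, not a full sampler."""
    x = torch.randn(shape)  # arbitrary initialization
    for _ in range(n_steps):
        z = torch.randn_like(x)
        # x <- x + (eta / 2) * s(x) + sqrt(eta) * z, with z ~ N(0, I)
        x = x + 0.5 * step_size * score_fn(x) + (step_size ** 0.5) * z
    return x
```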
Learning the score function
The score function can be learned by noising-denoising: Gaussian noise is added to clean images, and a network is trained to predict the added noise, whose conditional expectation determines (up to scaling) the score of the noised distribution.[1]
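A minimal sketch of one such training step, in the noise-prediction parametrization of Ho et al.[1]; the network `eps_model` and the cumulative schedule `alpha_bars` are assumed inputs, not part of the article.

```python
import torch

def ddpm_loss(eps_model, x0, alpha_bars):
    """One noising-denoising training step (sketch).

    Clean images x0 are noised to x_t, and eps_model is trained to
    recover the added noise eps; the score of the noised distribution
    is then approximately -eps_model(x_t, t) / sqrt(1 - alpha_bar_t)."""
    B = x0.shape[0]
    t = torch.randint(0, len(alpha_bars), (B,))        # random noise level
    a_bar = alpha_bars[t].view(B, *([1] * (x0.dim() - 1)))
    eps = torch.randn_like(x0)                         # Gaussian noise
    x_t = torch.sqrt(a_bar) * x0 + torch.sqrt(1.0 - a_bar) * eps
    return ((eps_model(x_t, t) - eps) ** 2).mean()     # simple MSE objective
```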
Main variants
Classifier guidance
Suppose we wish to sample not from the entire distribution of images, but conditional on the image description. We don't want to sample a generic image, but an image that fits the description "black cat with red eyes". Generally, we want to sample from the distribution $q(x|y)$, where $x$ ranges over images, and $y$ ranges over classes of images (a description "black cat with red eyes" is just a very detailed class, and a class "cat" is just a very vague description).
Taking the perspective of the noisy channel model, we can understand the process as follows: To generate an image conditional on description $y$, we imagine that the requester really had in mind an image $x$, but the image is passed through a noisy channel and came out garbled, as $y$. Image generation is then nothing but inferring which $x$ the requester had in mind.
In other words, conditional image generation is simply "translating from a textual language into a pictorial language". Then, as in the noisy-channel model, we use Bayes' theorem to get

$$q(x|y) \propto q(y|x)\, q(x);$$
in other words, if we have a good model of the space of all images, and a good image-to-class translator, we get a class-to-image translator "for free". The SGLD uses

$$\nabla_x \ln q(x|y) = \nabla_x \ln q(x) + \nabla_x \ln q(y|x),$$

where $\nabla_x \ln q(x)$ is the score function, trained as previously described, and $\nabla_x \ln q(y|x)$ is found by using a differentiable image classifier.
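As a sketch, the guided score can be assembled from the unconditional score network and any differentiable classifier; the helper names below are assumptions, and the classifier is assumed to return class logits.

```python
import torch

def classifier_grad(classifier, x, y):
    """grad_x log q(y|x), via autograd through a differentiable
    classifier that maps images to class logits (sketch)."""
    x = x.detach().requires_grad_(True)
    log_probs = classifier(x).log_softmax(dim=-1)
    log_py = log_probs[torch.arange(x.shape[0]), y]
    return torch.autograd.grad(log_py.sum(), x)[0]

def guided_score(score_fn, classifier, x, y):
    """grad_x log q(x|y) = grad_x log q(x) + grad_x log q(y|x)."""
    return score_fn(x) + classifier_grad(classifier, x, y)
```

For a fixed description $y$, `lambda x: guided_score(score_fn, classifier, x, y)` can then be dropped into the Langevin sampler sketched earlier in place of the unconditional score.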
With temperature
The classifier-guided diffusion model samples from $q(x|y)$, which is concentrated around the maximum a posteriori estimate $\arg\max_x q(x|y)$. If we want to force the model to move towards the maximum likelihood estimate $\arg\max_x q(y|x)$, we can use

$$q_\gamma(x|y) \propto q(y|x)^\gamma\, q(x),$$

where $\gamma > 0$ is interpretable as inverse temperature. In the context of diffusion models, it is usually called the guidance scale. A high $\gamma$ would force the model to sample from a distribution concentrated around $\arg\max_x q(y|x)$. This often improves the quality of generated images.[7] This can be done simply by SGLD with

$$\nabla_x \ln q_\gamma(x|y) = \nabla_x \ln q(x) + \gamma \nabla_x \ln q(y|x).$$
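In code, this is a one-line change to the previous sketch, reusing the hypothetical `classifier_grad` helper defined there:

```python
def scaled_guided_score(score_fn, classifier, x, y, gamma):
    """grad_x log q_gamma(x|y) = grad_x log q(x) + gamma * grad_x log q(y|x).

    gamma = 1.0 recovers plain classifier guidance; larger gamma
    concentrates samples around argmax_x q(y|x) (sketch)."""
    return score_fn(x) + gamma * classifier_grad(classifier, x, y)
```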
Further reading
- Guidance: a cheat code for diffusion models. Good overview up to 2022.
References
- Ho, Jonathan; Jain, Ajay; Abbeel, Pieter (19 June 2020). "Denoising Diffusion Probabilistic Models". arXiv:2006.11239 [cs.LG].
- Song, Yang; Sohl-Dickstein, Jascha; Kingma, Diederik P.; Kumar, Abhishek; Ermon, Stefano; Poole, Ben (2021-02-10). "Score-Based Generative Modeling through Stochastic Differential Equations". arXiv:2011.13456 [cs.LG].
- Gu, Shuyang; Chen, Dong; Bao, Jianmin; Wen, Fang; Zhang, Bo; Chen, Dongdong; Yuan, Lu; Guo, Baining (2021). "Vector Quantized Diffusion Model for Text-to-Image Synthesis". arXiv:2111.14822 [cs.CV].
- Croitoru, Florinel-Alin; Hondru, Vlad; Ionescu, Radu Tudor; Shah, Mubarak (2022). "Diffusion models in vision: A survey". arXiv:2209.04747 [cs.CV].
- Sohl-Dickstein, Jascha; Weiss, Eric; Maheswaranathan, Niru; Ganguli, Surya (2015-06-01). "Deep Unsupervised Learning using Nonequilibrium Thermodynamics" (PDF). Proceedings of the 32nd International Conference on Machine Learning. PMLR. 37: 2256–2265.
- Ramesh, Aditya; Dhariwal, Prafulla; Nichol, Alex; Chu, Casey; Chen, Mark (2022). "Hierarchical Text-Conditional Image Generation with CLIP Latents". arXiv:2204.06125 [cs.CV].
- Dhariwal, Prafulla; Nichol, Alex (2021-06-01). "Diffusion Models Beat GANs on Image Synthesis". arXiv:2105.05233 [cs.LG].
- Ho, Jonathan; Salimans, Tim (2022-07-25). "Classifier-Free Diffusion Guidance". arXiv:2207.12598 [cs.LG].
- Nichol, Alex; Dhariwal, Prafulla; Ramesh, Aditya; Shyam, Pranav; Mishkin, Pamela; McGrew, Bob; Sutskever, Ilya; Chen, Mark (2022-03-08). "GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models". arXiv:2112.10741 [cs.CV].
- Saharia, Chitwan; Chan, William; Saxena, Saurabh; Li, Lala; Whang, Jay; Denton, Emily; Ghasemipour, Seyed Kamyar Seyed; Ayan, Burcu Karagol; Mahdavi, S. Sara; Lopes, Rapha Gontijo; Salimans, Tim; Ho, Jonathan; Fleet, David J.; Norouzi, Mohammad (2022-05-23). "Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding". arXiv:2205.11487 [cs.CV].