Review

Artificial-Intelligence-Generated Content with Diffusion Models: A Literature Review

by Xiaolong Wang 1, Zhijian He 2 and Xiaojiang Peng 2,*
1 College of Applied Science, Shenzhen University, Shenzhen 518052, China
2 College of Big Data and Internet, Shenzhen Technology University, Shenzhen 518118, China
* Author to whom correspondence should be addressed.
Mathematics 2024, 12(7), 977; https://doi.org/10.3390/math12070977
Submission received: 7 March 2024 / Revised: 20 March 2024 / Accepted: 23 March 2024 / Published: 25 March 2024
(This article belongs to the Section Mathematics and Computer Science)

Abstract:
Diffusion models have swiftly taken the lead in generative modeling, establishing unprecedented standards for producing high-quality, varied outputs. Unlike Generative Adversarial Networks (GANs)—once considered the gold standard in this realm—diffusion models bring several unique benefits to the table. They are renowned for generating outputs that more accurately reflect the complexity of real-world data, showcase a wider array of diversity, and are based on a training approach that is comparatively more straightforward and stable. This survey aims to offer an exhaustive overview of both the theoretical underpinnings and practical achievements of diffusion models. We explore and outline three core approaches to diffusion modeling: denoising diffusion probabilistic models, score-based generative models, and stochastic differential equations. Subsequently, we delineate the algorithmic enhancements of diffusion models across several pivotal areas. A notable aspect of this review is an in-depth analysis of leading generative models, examining how diffusion models relate to and evolve from previous generative methodologies, offering critical insights into their synergy. A comparative analysis of the merits and limitations of different generative models is a vital component of our discussion. Moreover, we highlight the applications of diffusion models across computer vision, multi-modal generation, and beyond, culminating in significant conclusions and suggesting promising avenues for future investigation.

1. Introduction

Diffusion models [1,2,3,4,5,6,7,8], heralding a significant leap in deep generative modeling, have captivated a broad audience with their exceptional capabilities. These models stand out for their ability to create images of astounding realism and quality, marking a new chapter in digital creativity and content creation. In comparison to Generative Adversarial Networks (GANs) [9], diffusion models excel in their controllability and precision, more accurately meeting user-defined criteria during the generation process. This advantage is vividly demonstrated by the outputs from the Stable Diffusion model [10], for example, in Figure 1, which leverages text inputs to steer the image generation process, achieving images of remarkable fidelity that closely match the provided textual descriptions.
The last four years have seen a remarkable increase in publications related to diffusion modeling, featuring numerous innovative theories and landmark studies. This explosion of research presents a daunting challenge for newcomers, complicating their ability to find a foothold within this broad and swiftly changing field. With this context in mind, our survey seeks to offer a detailed overview of diffusion modeling’s present state, charting the evolution of its foundational principles and techniques, and highlighting its current avant-garde applications. Our intention is to facilitate the journey of future researchers into the realm of diffusion modeling by providing a well-structured and accessible gateway into this complex area of study.
Building on this foundation, a distinctive highlight of our review is its comprehensive examination and nuanced understanding of diffusion models in relation to other mainstream generative models. We delve deeply into the distinctions, advantages, and limitations of each approach, offering a comparative analysis that underscores their interconnectedness and the unique ways in which diffusion models build upon and diverge from prior generative methodologies. This critical exploration not only elucidates the synergistic relationships among these models but also equips readers with valuable insights into selecting the most appropriate methods for their specific research goals.
The exceptional generative power of diffusion models has swiftly positioned them as key players in a plethora of vision-centric applications, including but not limited to image editing, inpainting [11,12,13], semantic segmentation [14,15,16], and anomaly detection [17,18]. This surge in popularity can be traced back to the significant advancements made since the introduction of diffusion probabilistic models [3], which have expanded upon the foundational concepts of diffusion modeling [4]. The resulting wave of research enthusiasm is marked by the daily unveiling of novel models and an expanding corpus of scholarly work. Public and academic intrigue in diffusion models particularly intensified following the release of groundbreaking text-to-image generators like DALL-E [19], Imagen [20], and Stable Diffusion [10], which have redefined the benchmarks for generating images from textual descriptions. Moreover, the field has seen remarkable strides in text-to-video generation [21,22], showcasing highly advanced videos and amplifying the excitement surrounding diffusion models. A statistical and chronological analysis showcased in Figure 2 underlines the rising prominence of diffusion models, especially within the vision community. This illustration emphasizes their growing importance in the expansive field of generative modeling, marking a notable shift in focus and enthusiasm towards these models.
In this paper, we will first introduce the framework of diffusion models by providing a succinct description of the three main formulations: denoising diffusion probabilistic models (DDPMs) [1,2,3,4,7], score-based generative models (SGMs) [5,8], and stochastic differential equations (Score SDEs) [5,23]. The cornerstone of these methods is the incorporation of random noise in the forward diffusion process and its subsequent removal in the reverse diffusion process to synthesize new samples. We will elucidate the functioning of these models within the diffusion process and clarify their interrelations (Section 2).
Then, we contextualize our work with an overview of the rapidly expanding body of research on diffusion models, categorizing it into three key areas: efficient sampling [1,24,25], improved likelihood [1,26], and handling data with special structures [10,27,28] (Section 3).
Next, we juxtapose diffusion models with other generative models (Section 4), shedding light on the connections and distinctions between diffusion models and their contemporaries, such as variational autoencoders (VAEs) [29,30], generative adversarial networks (GANs) [9], and flow-based models [31,32,33,34]. A comprehensive analysis will highlight the merits and limitations inherent to each model type and spotlight seminal contributions that have successfully integrated diffusion models with alternative generative frameworks to achieve enhanced performance.
Subsequently, we will systematically review the applications of diffusion models across various fields (Section 5), including computer vision [11,12,13,14,15,16,17,18,35,36,37,38], multi-modal generation [10,19,20,39,40,41,42,43,44,45,46,47,48,49,50], and interdisciplinary fields [51,52,53,54,55,56]. For each application, we will define the task and discuss how diffusion models offer solutions to challenges identified in previous works. In conclusion, this review will synthesize core insights, draw significant conclusions, and identify promising directions for the continued exploration of diffusion models in future research endeavors (Section 6 and Section 7).

2. Framework of Diffusion Models

Diffusion models represent a sophisticated class of probabilistic generative models designed to perturb data by systematically introducing noise through a forward process and subsequently learning to reverse this process to generate new samples. Figure 3 conceptually illustrates the diffusion process, offering a visual understanding of the underlying mechanisms. Research on diffusion modeling has converged on three principal formulations: denoising diffusion probabilistic models (DDPMs), which focus on iteratively denoising data; score-based generative models (SGMs), which leverage gradients of the data distribution’s score for sample generation; and stochastic differential equations (Score SDEs), which frame the diffusion process within the continuous domain of differential equations. This section aims to provide a comprehensive overview of these formulations, elucidating their interconnections and unique contributions to the field of generative modeling.

2.1. Denoising Diffusion Probabilistic Models (DDPMs)

2.1.1. Forward Process

The forward process of DDPMs [3,7] involves a Markov chain [57], which gradually adds Gaussian noise to an initial data distribution over a series of time steps. Given an original data point $x_0$, the forward process can be expressed as follows:
$$q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\right)$$
where $t$ indexes the time step from 1 to $T$ and the $\beta_t$ are the variance parameters of the noise added at each step. The term $\mathcal{N}$ denotes the Gaussian distribution, and $\mathbf{I}$ is the identity matrix.
The sequence of transformations from $x_0$ to $x_T$ (where $x_T$ is nearly indistinguishable from pure noise) is modeled by the following joint distribution:
$$q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1}) = \prod_{t=1}^{T} \mathcal{N}\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\right)$$
Here, each step is a Gaussian diffusion process that is conditional on the previous step, effectively creating a smooth transition from structured data to random noise.
The noise schedule is defined by the variance parameters $\beta_t$, which typically start small and gradually increase but are bounded such that $0 < \beta_t < 1$. This ensures that the data are not overwhelmed by noise too quickly and allows the model to learn meaningful representations of the data at each step of the diffusion process.
The parameters $\alpha_t$ are defined as
$$\alpha_t = 1 - \beta_t$$
and their cumulative product is given by
$$\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i$$
The variables $\alpha_t$ and $\bar{\alpha}_t$ represent the proportion of the original data that remains at step $t$ and the overall proportion of the data that remains up to step $t$, respectively. We can then directly compute $x_t$ at any time step from $x_0$ and $\bar{\alpha}_t$ as follows:
$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \hat{z}_t$$
where $\hat{z}_t \sim \mathcal{N}(0, \mathbf{I})$.
Through this process, the forward diffusion effectively corrupts the data by incrementally adding noise, moving the data distribution towards a Gaussian distribution with a mean of zero and a variance of one.
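To make the closed-form sampling of $x_t$ concrete, the following minimal Python sketch implements the forward process under an assumed linear $\beta_t$ schedule (the schedule values and the function name forward_diffuse are illustrative, not taken from any specific implementation):

import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)      # variance schedule with 0 < beta_t < 1
alphas = 1.0 - betas                    # alpha_t = 1 - beta_t
alpha_bars = np.cumprod(alphas)         # cumulative product alpha_bar_t

def forward_diffuse(x0, t, rng):
    """Sample x_t directly from x_0: x_t = sqrt(alpha_bar_t) x_0 + sqrt(1 - alpha_bar_t) z."""
    z = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * z

rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 2))            # toy batch of 2-D data points
x_mid = forward_diffuse(x0, T // 2, rng)    # partially corrupted sample
x_end = forward_diffuse(x0, T - 1, rng)     # nearly pure Gaussian noise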

2.1.2. Reverse Process

The reverse process in Denoising Diffusion Probabilistic Models is conceptualized as an iterative denoising of the sequence of latent variables, transitioning from a diffused state back to the original distribution of the data. This inverse mapping is achieved through a Markov chain characterized by Gaussian transitions, where each step aims to estimate the preceding latent variable $x_{t-1}$ from the current noisy state $x_t$.
At the heart of the reverse process is the transition probability, modeled as a Gaussian distribution:
$$q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\left(x_{t-1};\ \tilde{\mu}_t(x_t, x_0),\ \tilde{\beta}_t \mathbf{I}\right),$$
where the posterior mean $\tilde{\mu}_t(x_t, x_0)$ is a weighted combination of the current noisy observation and the original data point, which can be expressed as
$$\tilde{\mu}_t(x_t, x_0) := \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1 - \bar{\alpha}_t}\, x_0 + \frac{\sqrt{\alpha_t}\,(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t}\, x_t$$
However, since $q(x_{t-1} \mid x_t)$ depends on the entire data distribution, we approximate it using a neural network as follows:
$$p_\theta(x_{t-1} \mid x_t) := \mathcal{N}\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)$$
The combination of $q$ and $p$ forms a variational auto-encoder [29], and the variational lower bound (VLB) can be expressed as
$$L_{\mathrm{vlb}} := L_0 + L_1 + \cdots + L_{T-1} + L_T$$
where
$$L_0 := -\log p_\theta(x_0 \mid x_1)$$
and for $t = 2, \ldots, T$,
$$L_{t-1} := D_{\mathrm{KL}}\left(q(x_{t-1} \mid x_t, x_0)\ \|\ p_\theta(x_{t-1} \mid x_t)\right)$$
with
$$L_T := D_{\mathrm{KL}}\left(q(x_T \mid x_0)\ \|\ p(x_T)\right)$$
representing the Kullback–Leibler divergences at each time step of the diffusion process. Thus, we can estimate $L_{t-1}$ using the prior (Equation (8)) and posterior (Equation (6)).
There exist multiple strategies to parameterize $\mu_\theta(x_t, t)$ in the prior. The most straightforward method is to directly predict $\mu_\theta(x_t, t)$ with the neural network. As an alternative, the network might estimate $x_0$, utilizing this prediction within Equation (7) to calculate $\mu_\theta(x_t, t)$. Furthermore, the network has the option to estimate the noise $\epsilon$, applying both Equations (7) and (5) to compute $\mu_\theta(x_t, t)$:
$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(x_t, t) \right).$$
To train the neural network, we minimize a loss function that encapsulates a variational lower bound on the negative log-likelihood of the observed data. This involves the optimization of the following objective function:
$$L_\theta = \mathbb{E}_{q(x_{0:T})}\left[ \sum_{t=1}^{T} \frac{\|\tilde{\mu}_t(x_t, x_0) - \mu_\theta(x_t, t)\|^2}{2\beta_t} \right],$$
which effectively measures the fidelity of the predicted noise and the accuracy of the reverse transitions.
For computational efficiency, a simplified loss function is often employed, focusing directly on the accuracy of the noise prediction:
$$L_{\mathrm{simple}} = \mathbb{E}_{x_0, \epsilon}\left[ \left\|\epsilon - \epsilon_\theta\left(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,\ t\right)\right\|^2 \right].$$
Here, $\bar{\alpha}_t$ and $\beta_t$ are time-dependent factors derived from the forward process, jointly determining the proportion of noise added to the data and the proportion of the data signal retained at each step. The optimization of parameters $\theta$ is performed via backpropagation and stochastic gradient descent, with the aim of enabling the model to accurately reconstruct the data distribution from its diffused latent variables.
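As an illustration of how $L_{\mathrm{simple}}$ is used in practice, the following PyTorch sketch computes the simplified objective for one mini-batch; eps_model stands for any noise-prediction network $\epsilon_\theta(x_t, t)$ and is an assumption of this example rather than a specific architecture:

import torch

def l_simple(eps_model, x0, alpha_bars):
    """Sample a random step t, diffuse x0 to x_t, and regress the network output onto the true noise."""
    b = x0.shape[0]
    t = torch.randint(0, len(alpha_bars), (b,), device=x0.device)
    a_bar = alpha_bars[t].view(b, *([1] * (x0.dim() - 1)))     # broadcast alpha_bar_t over data dims
    eps = torch.randn_like(x0)
    x_t = torch.sqrt(a_bar) * x0 + torch.sqrt(1.0 - a_bar) * eps
    return torch.mean((eps - eps_model(x_t, t)) ** 2)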

2.2. Score-Based Generative Models (SGMs)

Score-Based Generative Models (SGMs) [5,8], which are closely related to energy-based models, offer a framework for generating complex high-dimensional data by learning the gradient field of log probabilities, commonly referred to as the score. At the core of SGMs lies the concept of the score function [58]:
$$s_x(x) = \nabla_x \log p_x(x),$$
where $s_x(x)$ denotes the score function for the data $x$ under the probability density function $p_x(x)$. This function points in the direction of steepest ascent in the probability density of the data, providing a method to refine synthetic data points towards higher probability regions.
In SGMs, a neural network is trained to approximate the score function across different levels of noise perturbation. Given a data point $x_0$ from the data distribution $q(x_0)$, we perturb it with Gaussian noise:
$$q(x_t \mid x_0) = \mathcal{N}\left(x_t;\ x_0,\ \sigma_t^2 \mathbf{I}\right),$$
to obtain a sequence of noisier data representations $x_1, x_2, \ldots, x_T$. The neural network, parameterized by $\theta$, is then trained to estimate the score function $\nabla_{x_t} \log q(x_t)$ at each noise level $\sigma_t$. Numerous established methodologies, including score matching [58], denoising score matching [59,60,61], and sliced score matching [62], exist for inferring score functions from datasets, commonly referred to as score estimation. These techniques enable us to effectively train noise-conditional score networks on data points that have been subject to perturbation.
Sampling plays a crucial role in generating high-quality data samples from the learned distributions: new data samples are produced by progressively reducing the noise level, guided by the learned score functions. Here, we introduce the initial sampling method for SGMs, called Annealed Langevin Dynamics (ALD) [8]. ALD is an iterative process that effectively navigates from a high-noise state to the original data distribution. It begins with an initial sample drawn from a Gaussian distribution representing the highest noise level. Then, it iteratively refines this sample through a series of steps, each corresponding to a lower noise level. The iterative update at each step $t$ is governed by the following Langevin dynamics equation:
$$x_t^{(i+1)} = x_t^{(i)} + \frac{s_t}{2}\, s_\theta\left(x_t^{(i)}, t\right) + \sqrt{s_t}\, \epsilon^{(i)},$$
where
  • $x_t^{(i+1)}$ is the sample at iteration $i+1$ and noise level $t$;
  • $s_t$ is the step size for noise level $t$;
  • $s_\theta(x_t^{(i)}, t)$ is the score function estimated by the trained model;
  • $\epsilon^{(i)}$ is a random sample drawn from a normal distribution $\mathcal{N}(0, \mathbf{I})$, representing the stochastic nature of the process.
The score function $s_\theta(x_t^{(i)}, t)$ plays a pivotal role in guiding the sampling process. It provides the direction and magnitude of the update to the current sample $x_t^{(i)}$, aiming to reduce the difference between the sample’s distribution and the true data distribution at each noise level. Essentially, the score function captures the gradient of the log probability density of the data with respect to the sample, directing the sampling towards regions of higher probability density. This guidance is crucial for effectively navigating the complex landscape of the data distribution, enabling the model to generate samples that closely resemble the original data. This process is repeated for each noise level, starting from the highest $\sigma_T$ to the lowest $\sigma_1$, thereby annealing the noise and refining the sample towards the true data distribution. The ALD method’s effectiveness lies in its ability to leverage the score function at each noise level, providing a guided path for sample refinement.
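A minimal sketch of the ALD sampler described above is given below, assuming a trained score network score_model(x, t) and a decreasing sequence of noise scales sigmas; the per-level step-size rule $s_t = \epsilon \cdot \sigma_t^2 / \sigma_{\min}^2$ is a common heuristic and is an assumption of this example:

import torch

def annealed_langevin(score_model, shape, sigmas, n_steps=100, eps=2e-5):
    x = torch.randn(shape) * sigmas[0]                 # start at the highest noise level sigma_T
    for t, sigma in enumerate(sigmas):                 # anneal towards the lowest level sigma_1
        step = eps * (sigma / sigmas[-1]) ** 2         # step size s_t for this noise level
        for _ in range(n_steps):
            noise = torch.randn_like(x)
            x = x + 0.5 * step * score_model(x, t) + (step ** 0.5) * noise
    return x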

2.3. Stochastic Differential Equations (Score SDEs)

Stochastic Differential Equations (SDEs) [5,23] form the backbone of continuous-time models in Denoising Diffusion Probabilistic Models (DDPMs) and Score-Based Generative Models (SGMs), particularly in the infinite time steps or noise levels scenario. This generalization, known as Score SDE [5], leverages SDEs for both noise perturbation and sample generation, with a significant focus on estimating score functions of noisy data distributions.
The diffusion process in Score SDEs is described by a specific SDE:
$$\mathrm{d}x = f(x, t)\,\mathrm{d}t + g(t)\,\mathrm{d}w,$$
where $f(x, t)$ is the drift function, $g(t)$ is the diffusion coefficient of the SDE, and $w$ is a standard Wiener process (or Brownian motion). This equation encapsulates the forward processes in both DDPMs and SGMs, which are essentially discretizations of this continuous-time model.
In the context of DDPMs, the corresponding SDE, as demonstrated in Song et al. [5], takes the following form:
$$\mathrm{d}x = -\tfrac{1}{2}\beta(t)\, x\,\mathrm{d}t + \sqrt{\beta(t)}\,\mathrm{d}w,$$
where $\beta(t/T) = T\beta_t$ as $T$ approaches infinity. For SGMs, the SDE is expressed as
$$\mathrm{d}x = \sqrt{\frac{\mathrm{d}\left[\sigma(t)^2\right]}{\mathrm{d}t}}\,\mathrm{d}w,$$
where $\sigma(t/T) = \sigma_t$ as $T$ tends towards infinity.
Interestingly, any diffusion process of the form of Equation (15) can be reversed, as shown by Anderson [63], by solving the reverse-time SDE:
$$\mathrm{d}x = \left[ f(x, t) - g(t)^2\, \nabla_x \log q_t(x) \right] \mathrm{d}t + g(t)\,\mathrm{d}\bar{w},$$
where $\bar{w}$ is a standard Wiener process in reverse time and $\mathrm{d}t$ denotes an infinitesimal negative time step. The marginal densities of the solution trajectories of this reverse SDE mirror those of the forward SDE but evolve in the opposite time direction [5].
Furthermore, Song et al. [5] established the existence of a Probability Flow Ordinary Differential Equation (ODE), whose trajectories share the same marginals as the reverse-time SDE. This Probability Flow ODE is formulated as follows:
$$\mathrm{d}x = \left[ f(x, t) - \tfrac{1}{2} g(t)^2\, \nabla_x \log q_t(x) \right] \mathrm{d}t.$$
Both the reverse-time SDE and the Probability Flow ODE facilitate sampling from the same data distribution, affirming their equivalence in terms of marginal distributions. These mathematical formulations not only elucidate the underpinnings of continuous-time generative models but also highlight the intricate interplay between diffusion and reverse-diffusion processes in generating complex data distributions.
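For intuition, the following sketch discretizes the reverse-time SDE with an Euler–Maruyama scheme for the DDPM-type (variance-preserving) SDE above; the linear $\beta(t)$ range and the score-network interface score_model(x, t) are assumptions of this example rather than a reference implementation:

import torch

def reverse_sde_sample(score_model, shape, n_steps=1000, beta_min=0.1, beta_max=20.0):
    dt = -1.0 / n_steps                                       # negative time increment
    x = torch.randn(shape)                                    # sample from the prior at t = 1
    for i in range(n_steps, 0, -1):
        t = i / n_steps
        beta_t = beta_min + t * (beta_max - beta_min)         # linear beta(t)
        drift = -0.5 * beta_t * x - beta_t * score_model(x, t)   # f(x, t) - g(t)^2 * score
        x = x + drift * dt + (beta_t ** 0.5) * ((-dt) ** 0.5) * torch.randn_like(x)
    return x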

3. Advanced Diffusion Variants

Despite the remarkable achievements of diffusion models in generating high-quality samples, the original diffusion models face several constraints, such as inefficient sampling methods, poorer likelihood performance compared to other generative models, and challenges in handling data with unique structures. On this foundation, numerous studies have been proposed to address these issues, aiming to enhance the performance of diffusion models. Efforts to overcome the inherent limitations of diffusion models have led to the development of innovative approaches that refine their sampling efficiency, improve likelihood measures, and extend their applicability to a wider array of structured data. These advancements underscore the dynamic nature of research in the field of generative models, highlighting the continuous pursuit of optimized performance and broader utility.

3.1. Efficient Sampling in Diffusion Models

Denoising Diffusion Implicit Models (DDIM) [2] stand out as an innovative approach that accelerates the sampling of diffusion models. The formulation of DDIM redefines the reverse diffusion process in a non-Markovian setting, which allows for deterministic sampling when the variance $\sigma_t^2$ is set to zero. This is a significant departure from the Denoising Diffusion Probabilistic Models (DDPMs), where the process is inherently stochastic. The key to DDIM’s accelerated sampling lies in this deterministic property, which can be described mathematically as follows:
For a single step of the reverse process in DDIM, the sample at the previous time step $x_{t-1}$ can be derived from the current step $x_t$ using the following equation:
$$x_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\left( \frac{x_t - \sqrt{1 - \bar{\alpha}_t}\,\epsilon_\theta^{(t)}(x_t)}{\sqrt{\bar{\alpha}_t}} \right) + \sqrt{1 - \bar{\alpha}_{t-1} - \sigma_t^2}\cdot \epsilon_\theta^{(t)}(x_t)$$
where $\bar{\alpha}_t$ represents the cumulative product of $1 - \beta_t$, the noise schedule of the forward process, and $\epsilon_\theta^{(t)}$ is the predicted noise at step $t$. By setting $\sigma_t^2 = 0$, the DDIM simplifies to
$$x_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\left( \frac{x_t - \sqrt{1 - \bar{\alpha}_t}\,\epsilon_\theta^{(t)}(x_t)}{\sqrt{\bar{\alpha}_t}} \right) + \sqrt{1 - \bar{\alpha}_{t-1}}\cdot \epsilon_\theta^{(t)}(x_t)$$
The deterministic nature of this process eliminates the need for stochastic sampling, enabling the direct computation of $x_{t-1}$ without additional noise terms. This facilitates faster sampling by using a sequence of deterministic steps, essentially ‘skipping’ through the diffusion process.
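A single deterministic DDIM update ($\sigma_t^2 = 0$) can be sketched as follows; eps_model and the cumulative schedule alpha_bars are assumed to come from a trained DDPM, and t_prev may skip several steps, which is what makes the sampler fast:

import torch

def ddim_step(eps_model, x_t, t, t_prev, alpha_bars):
    """Deterministic DDIM update from step t to an earlier step t_prev (t_prev < t)."""
    eps = eps_model(x_t, t)
    a_t, a_prev = alpha_bars[t], alpha_bars[t_prev]
    x0_pred = (x_t - torch.sqrt(1.0 - a_t) * eps) / torch.sqrt(a_t)   # predicted clean sample
    return torch.sqrt(a_prev) * x0_pred + torch.sqrt(1.0 - a_prev) * eps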
Watson et al. [24] made significant strides in the optimization of sampling procedures for Denoising Diffusion Probabilistic Models (DDPMs). They proposed an optimization strategy that leverages dynamic programming to effectively reduce the number of refinement steps needed in the DDPM sampling process. Recognizing the decomposition of the DDPM objective into individual KL divergence terms, their method identifies the optimal discretization scheme, thereby striking a balance between computational cost and model performance. This dynamic programming solution is particularly novel as it does not require additional hyperparameters or model retraining, making it an immediately applicable technique for enhancing the performance of pre-trained DDPMs.
Adversarial Diffusion Distillation (ADD) [25] offers a novel training framework that significantly enhances the sampling efficiency of foundational image diffusion models. This methodology incorporates score distillation from a pre-trained diffusion model, termed the teacher, to a student model that astonishingly requires only one to four steps to generate images of high quality. Complementing the distillation, an adversarial loss implemented with a well-crafted discriminator aims to ensure the fidelity of the synthesized images. The student samples, once processed by the teacher’s forward diffusion, become the targets for the distillation loss, as delineated in Section 3.3. The overarching objective combines the adversarial loss with the distillation loss, formalized as
$$L = L_{\mathrm{adv}}^{G}\left(\hat{x}_\theta(x_s, s), \phi\right) + \lambda\, L_{\mathrm{distill}}\left(\hat{x}_\theta(x_s, s), \psi\right)$$
While applicable in pixel space, the method seamlessly adapts to Latent Diffusion Models (LDMs) [10] operating in a shared latent space between teacher and student, allowing the choice of computing the distillation loss in pixel or latent space. The method yields more stable gradients and superior results when applied to latent diffusion models.

3.2. Improved Likelihood in Diffusion Models

The training objective for diffusion models is framed as a variational lower bound (VLB) on the negative log-likelihood (Equation (9)). Nevertheless, this bound can frequently be loose, which may result in diffusion models yielding log-likelihoods that are not fully optimized.
Improved Denoising Diffusion Probabilistic Models (IDDPM) [1] represent a significant evolution in the generative modeling landscape by refining diffusion processes. IDDPM innovates upon the standard DDPM practice of using fixed variances, $\Sigma_\theta(x_t, t) = \sigma_t^2 \mathbf{I}$ with $\sigma_t$ being a constant, by introducing a learnable variance mechanism. This mechanism is mathematically formulated as
$$\Sigma_\theta(x_t, t) = \exp\left( v \log \beta_t + (1 - v) \log \tilde{\beta}_t \right)$$
where $v$ denotes a vector that is trained to adapt the model’s noise level dynamically, thus aiming to optimize the log-likelihood of the generated samples.
In addition, IDDPM modifies the linear noise schedule to a cosine noise schedule for a nuanced noise reduction that preserves information integrity throughout the diffusion process:
$$\bar{\alpha}_t = \frac{f(t)}{f(0)}, \qquad f(t) = \cos\left( \frac{t/T + s}{1 + s} \cdot \frac{\pi}{2} \right)^2$$
This cosine schedule is particularly effective in maintaining image fidelity in low-step regimes, where traditional schedules may falter.
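The cosine schedule can be written in a few lines; the small offset s = 0.008 and the clipping of $\beta_t$ at 0.999 follow the IDDPM paper, while the function names are illustrative:

import numpy as np

def cosine_alpha_bar(T, s=0.008):
    t = np.arange(T + 1)
    f = np.cos(((t / T + s) / (1 + s)) * np.pi / 2) ** 2
    return f / f[0]                                   # alpha_bar_t = f(t) / f(0)

def cosine_betas(T, s=0.008):
    a_bar = cosine_alpha_bar(T, s)
    betas = 1.0 - a_bar[1:] / a_bar[:-1]              # beta_t = 1 - alpha_bar_t / alpha_bar_{t-1}
    return np.clip(betas, 0.0, 0.999)                 # clip to avoid singularities near t = T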
The primary objective of IDDPM is to enhance the likelihood of traditional DDPMs, facilitating a more accurate representation of data distributions. By leveraging advanced variance learning and noise scheduling techniques, IDDPM significantly improves generative performance and efficiency.
The work of Song et al. [26] demonstrated that score-based diffusion models, which synthesize samples by inverting a stochastic diffusion process, can be optimized more effectively by focusing on the likelihood of the data. Traditionally trained by minimizing a weighted combination of score matching losses, these models have not directly optimized log-likelihood. Song et al. [26] reveal that with a special weighting function, the training objective can upper bound the negative log-likelihood, enabling approximate maximum likelihood training of score-based diffusion models. This is formalized as follows:
$$D_{\mathrm{KL}}\left(q_0\ \|\ p_\theta^{\mathrm{sde}}\right) \le \mathcal{L}\left(\theta; g(\cdot)^2\right) + D_{\mathrm{KL}}\left(q_T\ \|\ \pi\right),$$
where $\mathcal{L}(\theta; g(\cdot)^2)$ is the score matching objective with weighting $g(\cdot)^2$ and the $D_{\mathrm{KL}}$ terms represent the Kullback–Leibler divergences at the beginning and end of the diffusion process, respectively. Empirical evidence shows that this methodology consistently enhances the likelihoods of diffusion models, indicating an improved learning of the data distribution.

3.3. Handling Data with Special Structures

Effective deployment of diffusion models often requires consideration of the unique structures inherent in critical data domains. To address such complexities, modifications to the diffusion models are necessary.
Diffusion models are inherently designed for continuous data, employing Gaussian noise perturbations. This poses challenges for discrete data applications. A novel solution, VQ-Diffusion [27], addresses this by introducing a random walk in the discrete data space, replacing Gaussian noise with a transition kernel suited for discrete structures. The transition kernel for the forward diffusion process is given by
$$q(x_t \mid x_{t-1}) = v^{\top}(x_t)\, Q_t\, v(x_{t-1}),$$
where $v(x)$ is a one-hot vector representing the discrete data state and $Q_t$ is the transition matrix of a lazy random walk at step $t$. This approach adapts the diffusion process to discrete data domains, enabling the generation of discrete data samples while respecting the intrinsic structure of the data.
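To illustrate such a discrete transition kernel, the sketch below uses a simple "lazy" matrix that keeps a token with probability $1 - \beta_t$ and otherwise resamples it uniformly over $K$ states; this is a simplified stand-in for the mask-and-replace kernel actually used by VQ-Diffusion:

import numpy as np

def lazy_transition_matrix(K, beta_t):
    """Q_t: stay in place with probability 1 - beta_t, otherwise jump uniformly over K states."""
    return (1.0 - beta_t) * np.eye(K) + (beta_t / K) * np.ones((K, K))

def discrete_forward_step(x_prev, beta_t, K, rng):
    Q_t = lazy_transition_matrix(K, beta_t)
    # Column x_prev of Q_t is the categorical distribution q(x_t | x_{t-1} = x_prev).
    return rng.choice(K, p=Q_t[:, x_prev])

rng = np.random.default_rng(0)
x_t = discrete_forward_step(x_prev=3, beta_t=0.1, K=16, rng=rng)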
For data characterized by invariant structures, numerous works have endowed diffusion models with the capability to interpret this type of data. For example, Xu et al. [28] introduced a pivotal approach that leverages Markov chains with an invariant prior, coupled with equivariant Markov kernels, to ensure the generated data respects the intrinsic invariance properties. Specifically, for a transformation T representing rotation or translation, they establish that
$$p(x_T) = p(T(x_T)),$$
$$p_\theta(x_{t-1} \mid x_t) = p_\theta\left(T(x_{t-1}) \mid T(x_t)\right),$$
guaranteeing the invariance of the sample distribution to $T$ such that
$$p_0(x) = p_0(T(x)).$$
This framework empowers diffusion models to generate data, like molecular conformations, that remain invariant under rotations and translations, ensuring the physical and chemical validity of the synthesized structures regardless of their spatial orientation.
Building upon the manifold hypothesis, which posits that natural data predominantly occupy manifolds with significantly lower intrinsic dimensionality, recent advancements have focused on leveraging learned manifolds for diffusion model training. This methodology involves using autoencoders to reduce the data to a manageable latent space, thus enabling diffusion models to operate more efficiently due to the reduced data complexity.
The Latent Diffusion Model (LDM) [10] and DALLE-2 [41] serve as prominent examples of this approach. LDM segregates the process into two distinct phases: initially employing an autoencoder for dimensionality reduction, then training a diffusion model to generate latent codes. Similarly, DALLE-2 trains a diffusion model on the CLIP image embedding space, with a subsequent phase dedicated to decoding these embeddings back into images. This strategy underscores the efficiency and potential of training diffusion models on learned manifolds, optimizing the generative process within a simplified data domain.

4. Relation to Other Generative Models

Common generative models include GANs [9], VAEs [29,30], and flow-based Models [31,32,33,34]. A generative model typically operates by taking a series of random noise as input and transforming it through a probabilistic model to produce data with specific semantic information, such as images or text. Additionally, algorithms that integrate diffusion models with other generative models are summarized in Table 1.

4.1. Variational Autoencoders (VAEs)

Variational Autoencoders (VAEs) [29,30] are generative models that offer a probabilistic description of observations in latent space. Essentially, VAEs represent latent attributes as probability distributions. A typical autoencoder has two similar networks: an encoder and a decoder. The encoder processes the input into a compressed representation, which the decoder then uses to reconstruct the original input. VAEs benefit from a continuous latent space, facilitating the ease of random sampling and interpolation. The encoder does not output an encoding vector directly but instead produces two equally sized vectors: one for means μ and another for standard deviations σ . Each hidden node is modeled as a Gaussian distribution, where sampling from these output vectors to feed into the decoder constitutes stochastic generation:
$$z = \mu + \sigma \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \mathbf{I}).$$
This means that, even with constant means and standard deviations, the encoding derived from the same input can differ across multiple forward passes due to the stochastic nature of the sampling process.
The training process involves minimizing the reconstruction loss, which reflects the similarity between the output and the input, and the latent loss, which measures how closely the hidden nodes adhere to a standard normal distribution. The latent loss is given by the Kullback–Leibler divergence:
$$L_{\mathrm{latent}} = D_{\mathrm{KL}}\left( \mathcal{N}(\mu, \sigma^2)\ \|\ \mathcal{N}(0, \mathbf{I}) \right).$$
A balance must be struck between the latent loss, which when minimized reduces the amount of information that can be encoded, and the reconstruction loss. When latent loss is low, the generated images may bear excessive resemblance to the training images, affecting quality negatively. Conversely, with minimal reconstruction loss, while training reconstructions are accurate, newly generated images may vary substantially, necessitating the discovery of an optimal balance.
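The reparameterization trick and the two loss terms can be summarized in the following sketch; encoder and decoder stand for arbitrary networks returning (mu, log_var) and a reconstruction, respectively, and are assumptions of this example:

import torch

def vae_loss(encoder, decoder, x):
    mu, log_var = encoder(x)
    eps = torch.randn_like(mu)
    z = mu + torch.exp(0.5 * log_var) * eps                          # z = mu + sigma * eps
    recon = decoder(z)
    recon_loss = torch.mean((recon - x) ** 2)                        # reconstruction term
    kl = -0.5 * torch.mean(1 + log_var - mu ** 2 - log_var.exp())    # latent (KL) term, averaged
    return recon_loss + kl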
One of the main shortcomings of VAEs is the tendency to produce blurred outputs, a result of the method of data distribution recovery and loss function computation. Additionally, they are susceptible to mode collapse, a phenomenon where the model generates a limited variety of samples, thus failing to capture the full diversity of the data distribution. Furthermore, the KL divergence term in VAEs, integral for regularizing the latent space, can lead to an oversimplified latent representation. This simplification restricts the model’s capacity to encode complex data variations, potentially undermining the richness of the generated outputs.
The Denoising Diffusion Probabilistic Model (DDPM) can be regarded as a Markovian Variational Autoencoder (VAE) with a predefined encoding strategy. In essence, the forward process of DDPM acts as the encoder, conforming to a linear Gaussian model. This encoding sequence is mathematically formalized by Equation (5) within the original literature. Conversely, the reverse process of DDPM is analogous to the decoder, which operates iteratively across several decoding phases. Notably, all latent variables in the decoder’s architecture are dimensionally equivalent to the input data, maintaining a consistent scale throughout the model’s generative process. Compared to VAEs, diffusion models exhibit superior performance in both the quality and diversity of generated samples. However, this advantage comes at the cost of increased computational requirements and longer sampling durations, as diffusion models operate iteratively over multiple time steps. Additionally, the architecture of diffusion models is inherently more complex than that of VAEs, necessitating extensive fine-tuning to optimize performance.
To elaborate further, the strengths of VAEs lie in their architectural simplicity and efficiency in training, making them suitable for a wide range of applications with limited computational resources. VAEs are particularly adept at learning compact latent representations, facilitating tasks such as anomaly detection and latent space arithmetic. However, VAEs often struggle with producing high-resolution images and may exhibit a trade-off between sample quality and diversity due to the imposed KL divergence constraint in their objective function.
On the other hand, diffusion models, while computationally intensive, are celebrated for their ability to generate highly detailed and diverse samples, outperforming many generative models in tasks requiring high fidelity and variation. Their iterative refinement process allows for the generation of samples that are often indistinguishable from real data. Nevertheless, the iterative nature of diffusion models leads to longer generation times, making real-time applications challenging. Moreover, their complexity requires careful tuning of hyperparameters and a deep understanding of the underlying stochastic processes to achieve optimal results.

4.2. Generative Adversarial Networks (GANs)

Generative Adversarial Networks (GANs) [9] are a class of machine learning frameworks where two neural networks contest with each other in a game-theoretic setting. The system comprises a generator G that creates samples from noise and a discriminator D that evaluates them against real data.
The training of GANs is an adversarial process: starting from synthetic data, the generator and discriminator engage in a continuous dynamic of computation and backpropagation, aiming to produce synthetic outputs that are indistinguishable from authentic data. The generator G synthesizes fake data from a noise signal through a series of computations, while the discriminator D assesses both real and fake data against its judgment criteria.
During training, the generator strives to fabricate increasingly realistic data, whereas the discriminator endeavors to differentiate between real and fake samples. This process unfolds iteratively until the generator fabricates data of such verisimilitude that the discriminator can no longer reliably label it as fake. The overall training objective is formulated as a minimax game in which the value function is maximized with respect to $D$, while $G$ minimizes $\log(1 - D(G(z)))$, as per the following function:
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\left[ \log D(x) \right] + \mathbb{E}_{z \sim p_z(z)}\left[ \log\left(1 - D(G(z))\right) \right]$$
In this equation, $x$ represents real data, $z$ is a point sampled from the noise distribution $p_z$, and $G(z)$ denotes the fake data generated by $G$. The expectations are taken over the real data distribution $p_{\mathrm{data}}$ and the noise distribution $p_z$, respectively. The discriminator $D$ aims to assign the correct label to both real and fake data, while the generator $G$ aims to produce data that are indistinguishable by $D$, culminating in a Nash equilibrium for the adversarial game.
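One alternating update of this minimax game is sketched below; G and D are assumed to be arbitrary generator and discriminator networks (with D outputting probabilities), and the optimizers and latent dimension are illustrative:

import torch

def gan_step(G, D, real, opt_g, opt_d, z_dim=64):
    b = real.shape[0]
    # Discriminator update: ascend log D(x) + log(1 - D(G(z))).
    fake = G(torch.randn(b, z_dim)).detach()
    d_loss = -(torch.log(D(real) + 1e-8).mean() + torch.log(1.0 - D(fake) + 1e-8).mean())
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()
    # Generator update: the non-saturating variant maximizes log D(G(z)).
    g_loss = -torch.log(D(G(torch.randn(b, z_dim))) + 1e-8).mean()
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()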
Generative Adversarial Networks (GANs) manifest a critical limitation wherein an overly proficient discriminator yields near-binary feedback, severely curtailing the generator’s gradient-based learning; conversely, an adept generator can exploit weaknesses in the discriminator, leading to gradient saturation. This necessitates a delicate balance in training dynamics to maintain parity in performance between the two networks, contributing to the widely acknowledged training challenges associated with GANs. However, GANs offer significant advantages over diffusion models in certain aspects. Notably, GANs provide precise control over the positioning and boundaries of objects within generated images, a crucial attribute for creative content generation tasks such as photo editing and patching, as well as for data augmentation in discriminative learning. Another key advantage of GAN models relative to diffusion models is their rapid inference speed. GANs require only a single forward pass to generate an image, whereas diffusion models necessitate multiple iterative denoising steps, leading to slower inference speeds that may impact the practical usability of the models.
In stark contrast, diffusion models benefit from a more transparent training loss function that facilitates model convergence and mitigates the mode collapse issue inherent to GANs. Moreover, diffusion models are adept at capturing the intricacies of complex, non-linear distributions that extend beyond the capabilities of GANs, which excel in generating homogeneous data, such as singular-class imagery. GANs often struggle with the multifaceted distributions present in diverse-class image datasets, a domain where diffusion models demonstrate superior modeling prowess, capturing a broader spectrum of intricate image distributions. Dhariwal et al. [6] have proven this. Despite these advantages, the distinct strengths of GANs, particularly in terms of object placement precision and rapid image generation, underscore the importance of selecting the appropriate generative model framework based on the specific requirements and goals of the task at hand.

4.3. Flow-Based Generative Models

Flow-based generative models [31,32,33,34] are conceptually attractive due to the tractability of the exact log-likelihood, the tractability of exact latent-variable inference, and the parallelizability of both training and synthesis. A flow-based generative model is an exact log-likelihood model: it applies a series of invertible transformations to samples from a prior so that the exact log-likelihood of an observation can be computed. Unlike the previous two classes of models, the flow-based model directly optimizes the objective function, namely the log-likelihood, so the loss function is the negative log-likelihood. The flow model $f$ is constructed as an invertible transform that maps a high-dimensional random variable $x$ to a standard Gaussian latent variable $z$. This model can be represented by an arbitrary bijective function and can be formed by composing individual simple invertible transforms.
Assuming a model with a generator $G$, let $x$ denote a data sample. $P_G(x)$ represents the distribution of $x$ as predicted by $G$, and $P_{\mathrm{data}}(x)$ is the true distribution of the data. The objective in adjusting the generator is to bring $P_G(x)$ as close as possible to $P_{\mathrm{data}}(x)$, that is, to minimize the divergence between the two distributions. In this context, $x^i$ represents a sample drawn from the distribution $P_{\mathrm{data}}$. Therefore, solving for the generator $G$ amounts to maximum likelihood estimation, which is equivalent to maximizing the probability of observing each sampled data point, or equivalently, minimizing the Kullback–Leibler (KL) divergence between the two distributions. This is formally expressed as
$$G^* = \arg\max_G \sum_{i=1}^{m} \log P_G(x^i) \approx \arg\min_G \mathrm{KL}\left(P_{\mathrm{data}}\ \|\ P_G\right)$$
Here, the KL divergence measures how one probability distribution diverges from a second, expected probability distribution.
In flow-based generative models, the objective function is crafted to facilitate the computation of complex probability densities through a series of invertible transformations. For a given data sample $x^i$, the log-likelihood under the model’s distribution $p_X$ is given by the transformation of $x^i$ through the inverse of the generative model $G$ and the log-determinant of the Jacobian matrix of $G^{-1}$. The objective function can be expressed as
$$\log p_X(x^i) = \log p_Z\left(G^{-1}(x^i)\right) + \log\left| \det J_{G^{-1}}(x^i) \right|,$$
where $p_Z$ is the prior probability distribution in the latent space and $J_{G^{-1}}$ is the Jacobian matrix of the inverse of $G$.
Maximizing this log-likelihood is equivalent to minimizing the Kullback–Leibler divergence between the data distribution $p_{\mathrm{data}}$ and the model’s distribution $p_X$. However, computing the determinant of the Jacobian matrix can be highly time-consuming, especially for large matrices. Additionally, calculating the inverse of the generator $G^{-1}$ requires that the dimensions of the input and output be identical. This necessitates an ingenious network architecture design that simplifies the computation of both the determinant of the Jacobian matrix and the inverse of the generator.
In practical flow-based models, the architecture may involve multiple transformations $G_k$. Since it is cumbersome to impose various constraints on a single transformation, these constraints are distributed across multiple mappings. The transformations are then concatenated, and the overall objective function becomes
$$\log p_X(x^i) = \log p_Z\left(G^{-1}(x^i)\right) + \sum_{k=1}^{K} \log\left| \det J_{G_k^{-1}}(x^i) \right|,$$
where each $G_k$ is an individual transformation within the sequence and $K$ is the total number of transformations. This formulation allows for the modular design of the network and eases the imposition of constraints across the multiple mappings.
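The composed objective can be evaluated as follows; the sketch assumes each transform exposes an inverse(x) method returning (z, log_det_jacobian), a common normalizing-flow interface rather than the API of any particular library, and a standard Gaussian prior over a flattened latent of shape (batch, dim):

import torch

def flow_log_likelihood(transforms, x):
    """log p_X(x) = log p_Z(G^{-1}(x)) + sum_k log |det J_{G_k^{-1}}(x)|."""
    log_det_sum = torch.zeros(x.shape[0])
    z = x
    for T_k in reversed(transforms):              # apply G_K^{-1}, ..., G_1^{-1}
        z, log_det = T_k.inverse(z)
        log_det_sum = log_det_sum + log_det       # accumulate log-determinant terms
    log_pz = -0.5 * (z ** 2 + torch.log(torch.tensor(2.0 * torch.pi))).sum(dim=1)
    return log_pz + log_det_sum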
The invertible nature of the transformations used in flow-based models ensures that sampling can be performed efficiently and that no information about the input data is lost. This is a significant advantage over models like VAEs and GANs. However, the requirement for transformations to be invertible also introduces limitations. It often leads to increased computational complexity, both in terms of memory and processing power, which can make flow-based models less scalable to very high-dimensional data or very complex distributions. In practice, the time required for flow-based models to generate images of a given resolution is several times greater than that of diffusion models. In addition, designing transformations that are both invertible and expressive enough to capture complex data distributions is challenging, and the requirement for invertibility restricts the choice of neural network architectures that can be used in flow-based models. Flow-based generative models offer unique advantages in terms of exact likelihood computation and efficient, invertible sampling. However, their practical application is often limited by computational demands, the challenge of designing expressive yet invertible transformations, and architectural constraints.

5. Applications of Diffusion Models

We conducted a multi-angle classification of diffusion models applied in computer vision. To categorize the existing models, we considered three criteria: task, denoising conditions, and foundational methods (architecture). The primary basis for classification is the application of the model. We categorize these applications into three distinct groups, each defined by the nature of the task: computer vision [11,12,13,14,35,36,37,38,72], multi-modal generation [10,19,20,21,22,39,40,41,48,49], and interdisciplinary fields [51,52,53,54,55,73,74]. Table 2 delineates our structured classification of diffusion models, arrayed according to the specified criteria. For every category, an initial overview introduces the scope and significance of the task at hand. This is succeeded by an in-depth discussion of the pivotal role diffusion models play in enhancing task performance and achieving state-of-the-art results.

5.1. Computer Vision

5.1.1. Image Editing and Inpainting

Image editing and inpainting are tasks within the field of computer vision and graphics that involve modifying and reconstructing images. AnyDoor [11] is a novel application of diffusion models in the domain of zero-shot object-level image customization. Unlike conventional approaches, AnyDoor does not necessitate a task-specific dataset or training modifications. Instead, it begins with an image that includes some initial guidance—such as sketched strokes that indicate shapes and colors—and leverages the forward process of diffusion models to add noise, thereby preserving these properties while smoothing out deformations. Following this, AnyDoor employs the reverse process of the diffusion model to denoise the image. This process is guided by the initial strokes to synthesize a realistic image that adheres to the provided specifications. The strength of AnyDoor lies in its generic diffusion model, which, through the solution of the reverse Stochastic Differential Equation (SDE) [5,23], can generate images that fulfill the zero-shot guidance without the need for custom datasets or specialized training regimens. In Figure 4, we show example applications of AnyDoor.
In their influential work, Sdedit [12] adeptly demonstrated the application of diffusion models to a suite of guided image generation tasks, such as stroke-based painting, editing, and image composition. The process commences with an image imbued with preliminary guidance features—like outlines and color schemes—which the model is trained to retain. Through the forward diffusion process, incidental distortions within the image are methodically mitigated by the incremental infusion of noise. Subsequently, through the reverse diffusion process, the noise is meticulously purged, culminating in the synthesis of a realistic image that adheres to the initial guidance parameters. This synthesis is achieved using a universal diffusion model that resolves the reverse Stochastic Differential Equation (SDE) [5,23], remarkably without the necessity for a specialized dataset or tailored training alterations.
Repaint [13] addressed the challenge of inpainting in the domain of image processing with an innovative approach that harnesses denoising diffusion probabilistic models. They presented a novel technique that leverages the strengths of diffusion models to effectively reconstruct missing or corrupted portions of images. Utilizing a series of learned noise-reduction steps, the proposed method progressively refines the inpainting results, yielding high-quality and coherent image restorations. Their method was shown to not only produce visually plausible inpaintings but also to do so with a remarkable adherence to the contextual integrity of the source material. The capability of their approach to handle a wide range of inpainting tasks marks a substantial advancement in the field, setting new benchmarks for future research endeavors.
The mentioned studies comprehensively demonstrate the efficacy of diffusion models in addressing tasks related to image editing and inpainting.

5.1.2. Semantic Segmentation

Semantic segmentation poses a unique challenge within the field of computer vision, requiring fine-grained classification at the pixel level. Baranchuk et al. [36] provided compelling evidence on the utility of diffusion models in the domain of semantic segmentation. By ingeniously extracting feature maps at varying scales from the decoder segment of a U-Net architecture, which is integral to the denoising process, they lay the groundwork for pixel-wise classification. These feature maps are then uniformly upscaled to match dimensions and concatenated, forming a rich, multi-scale representation that feeds into an ensemble of multi-layer perceptrons. The richness of the representations, particularly those obtained in the latter stages of the denoising sequence, is underscored by their substantial contribution to the model’s performance. We provide this process in Figure 5. Empirical results furnished by the authors convincingly demonstrate that semantic segmentation leveraging diffusion models not only competes with but also frequently surpasses the majority of established baselines, marking a significant stride in image processing and analysis.
Furthermore, Decoder Denoising Pretraining (DDeP) [14] has successfully integrated diffusion models with the principles of denoising autoencoders [15], yielding promising results in label-efficient segmentation. This technique underscores the utility of diffusion processes in enhancing the feature extraction capabilities of autoencoders, thus improving segmentation with fewer labels. Concurrently, ODISE [16] forays into the realm of open-vocabulary segmentation tasks, pioneering the use of diffusion models to navigate the challenges associated with segmenting images without predefined classes. It introduces an innovative implicit captioner, which generates descriptive captions for images. This novel approach facilitates the effective leveraging of pre-trained large-scale text-to-image diffusion models, thereby broadening the applicability and effectiveness of diffusion models in more diverse and complex segmentation scenarios.

5.1.3. Anomaly Detection

Anomaly detection stands as a critical and challenging frontier in machine learning and computer vision, where generative models have been increasingly recognized for their potent anomaly detection capabilities, especially in modeling normal or healthy reference data. Wyatt et al. [17] innovatively train a Denoising Diffusion Probabilistic Model (DDPM) exclusively on healthy medical images, leveraging the model’s ability to detect anomalies during inference by contrasting the reconstructed image against the original. They also empirically demonstrate that employing simplex noise, as opposed to traditional Gaussian noise, enhances performance for anomaly detection tasks.
In a parallel vein, Wolleb et al. [18] introduced a weakly supervised method anchored in diffusion models, adept at identifying anomalies within medical imagery. Their approach, requiring a pair of unaligned images—one healthy and one with lesions—utilizes the diffusion process to perturb the healthy image. The subsequent denoising step is adeptly steered by the gradient of a binary classifier, aiming to regenerate the healthy image. The final stage involves contrasting the regenerated healthy image with the lesion-containing counterpart to generate a precise anomaly map. Both contributions underscore the efficacy of DDPMs over adversarial training-based alternatives, particularly in scenarios with smaller datasets, due to their superior modeling capacity and the stability of their training paradigms.

5.2. Multi-Modal Generation

5.2.1. Text-to-Image Generation

Text-to-image generation involves creating a visual representation from a given textual description. The diffusion model has garnered significant attention for its remarkable achievements in the text-to-image generation domain. Diffusion models are capable of integrating multiple concepts such as objects, textures, and shapes to synthesize high-quality samples. To validate this assertion, we employed the Stable Diffusion model [10] to produce images from a range of textual prompts. The outcomes of this process are depicted in Figure 1.
GLIDE [44] emerges as a groundbreaking text-to-image synthesis model, ingeniously combining the strengths of pre-trained Denoising Diffusion Probabilistic Models (DDPM) [6] with the capabilities of CLIP [43] models. This integration forms the backbone of GLIDE’s approach to region-based image editing, targeting a wide range of general-purpose applications. By harnessing natural language guidance, GLIDE adeptly navigates the complexities of editing real and diverse images. Its architecture is particularly notable for leveraging the descriptive power of language, employing CLIP’s robust language–image embeddings to align textual prompts with corresponding visual elements accurately. Meanwhile, the DDPM component ensures the generation of high-fidelity images. The synergy between these two models enables GLIDE to produce highly coherent and contextually relevant image modifications, making it a versatile tool for various image editing tasks that require both precision and creativity.
Building upon the foundations laid by GLIDE [44], Imagen [20] adopts classifier-free guidance for text-to-image generation. A crucial distinction between GLIDE and Imagen is evident in their choice of text encoders, as depicted in Figure 6. Specifically, GLIDE integrates the training of its text encoder with the diffusion prior using paired image-text data. In contrast, Imagen [20] opts for a pretrained, frozen, large language model as its text encoder. This approach of using a frozen encoder facilitates offline text embedding, significantly reducing the computational load during the online training phase of the text-to-image diffusion prior. Moreover, the text encoder in Imagen can be trained on either image-text data, like CLIP [43], or text-only corpora, such as BERT [42], GPT [45], and T5 [46]. Text-only corpora are substantially larger than paired image-text data, thus exposing the language models to a richer and more diverse text distribution. For instance, the text corpus for BERT [42] is approximately 20 GB, while that for T5 [46] amounts to about 800 GB. Imagen’s experiments with various T5 [46] variants as text encoders demonstrate that increasing the size of the language model significantly enhances both image fidelity and image-text alignment, outperforming increases in the diffusion model size.
Stable Diffusion represents a notable advancement in the training of diffusion models within latent space, scaling up the principles of the Latent Diffusion Model (LDM) [10]. It follows in the footsteps of Dall-E [19], which uses a VQ-VAE to develop a visual codebook, by applying VQ-GAN [77] for latent representation in its initial phase. Notably, VQ-GAN [77] refines VQ-VAE by incorporating an adversarial objective, thereby enhancing the realism of the generated images. Stable Diffusion cleverly utilizes a pretrained VAE [77] to reverse the forward diffusion process, which originally introduces noise into the latent space. Additionally, it integrates cross-attention mechanisms, serving as a versatile conditioning tool for various inputs, such as text. The findings in [10] emphasize that diffusion modeling in latent space surpasses pixel-space counterparts, particularly in reducing complexity and preserving intricate details. This approach has been further explored in VQ-diffusion [27], which employs a mask-then-replace strategy in diffusion, and similar to pixel-space methods, classifier-free guidance has been shown to significantly enhance text-to-image models in latent space [10], marking a promising direction in generative model development (Figure 7).
DALL-E 2, also known as unCLIP [41], marks a significant advancement in text-to-image synthesis by combining diffusion models with CLIP-style multimodal contrastive representations, building on its predecessor DALL-E [19] (Figure 8). Unique in its approach, DALL-E 2 utilizes the CLIP [43] text encoder to interpret and encode text inputs, while inverting the CLIP image encoder through a diffusion model. Generation thus involves two key processes: encoding the text into a latent representation and translating this latent into an image from the CLIP latent space. Because the diffusion decoder is non-deterministic, it adds variability and richness to the imagery while capturing the text’s essence with high accuracy and creativity. Central to DALL-E 2’s architecture is the text–image latent prior, which aligns the text and image latent spaces and ensures that generated images closely match the textual narrative. DALL-E 2 [41] finds that this prior can be learned either autoregressively or with a diffusion model; the diffusion prior achieves superior performance, and removing the prior component significantly degrades results, underscoring its importance.
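The two-stage structure of unCLIP can be summarized in a few lines; every callable below is a stand-in for the corresponding component described above, not the actual DALL-E 2 implementation.

```python
import torch

def unclip_generate(clip_text_encoder, diffusion_prior, decoder, prompt_tokens):
    # Two-stage unCLIP-style generation: text -> CLIP text embedding ->
    # diffusion prior samples a CLIP image embedding -> diffusion decoder turns
    # that embedding (plus the text embedding) into pixels.
    with torch.no_grad():
        z_text = clip_text_encoder(prompt_tokens)   # CLIP text latent
        z_image = diffusion_prior(z_text)           # sampled CLIP image latent
        image = decoder(z_image, z_text)            # conditional image synthesis
    return image
```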

5.2.2. Text-to-Audio Generation

Text-to-audio generation, the task of transforming textual information into vocal or acoustic output, has seen significant advances through models like Grad-TTS and its successors. Grad-TTS [75], a notable text-to-speech model, pairs a text encoder with a score-based diffusion decoder: the encoder predicts a noisy prior aligned to the input text via Monotonic Alignment Search, and the decoder gradually transforms this noise into a mel-spectrogram closely matching the text. Extending this line of work, Guided-TTS 2 [78] targets high-quality adaptive text-to-speech, improving the flexibility of diffusion-based synthesis by exploiting untranscribed data. In a parallel development, Diffsound [76] employs a non-autoregressive decoder based on discrete diffusion models [79], a novel approach that predicts all mel-spectrogram tokens simultaneously and then refines them over successive steps. Another innovative approach is EdiTTS [56], which utilizes a score-based text-to-speech model to refine a coarsely edited mel-spectrogram, providing greater control and precision in the generation process.
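The decoder side of such diffusion-based TTS systems can be sketched as a conditional reverse-diffusion loop over a mel-spectrogram. The snippet below follows the standard DDPM update rule and conditions each denoising step on the text-encoder output; the actual models use score-based SDE formulations and schedules that differ in detail, so this is only an illustrative sketch.

```python
import torch

@torch.no_grad()
def sample_mel(decoder, text_cond, shape, betas):
    # Generic conditional reverse-diffusion loop over a mel-spectrogram.
    # `decoder` predicts noise given the noisy mel, the timestep, and the
    # text-encoder output; `betas` is a 1-D tensor holding the noise schedule.
    alphas = 1.0 - betas
    alphas_bar = torch.cumprod(alphas, dim=0)
    mel = torch.randn(shape)                                     # start from pure noise
    for t in reversed(range(len(betas))):
        eps = decoder(mel, torch.tensor([t]), text_cond)         # conditional noise estimate
        mel = (mel - betas[t] / (1 - alphas_bar[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:
            mel = mel + betas[t].sqrt() * torch.randn_like(mel)  # stochastic term
    return mel
```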

5.2.3. Text-to-Video Generation

Text-to-video generation is the task of creating coherent and contextually relevant video sequences from textual descriptions. The recent advancements in enhancing the efficiency of diffusion models have paved the way for their successful application in the realm of video generation and processing. The “Make-A-Video” method [48] innovatively extends diffusion-based text-to-image models to generate videos, employing a spatiotemporally factorized diffusion model for seamless text-to-video synthesis. This approach leverages a joint text–image prior, removing the dependency on paired text–video datasets and enabling efficient generation of contextually accurate videos from textual descriptions. Furthermore, the model incorporates super-resolution strategies, enhancing the output to high-definition videos with improved frame rates.
Imagen Video [39] marks a notable advancement in high-definition video generation by implementing cascaded video diffusion models. This approach transfers successful strategies from text-to-image synthesis, such as employing a frozen T5 text encoder for robust text understanding and integrating classifier-free guidance for improved sample quality. These technical choices enable Imagen Video to produce videos that are not only visually striking but also faithful to the input text. Tune-A-Video [80] introduces one-shot video tuning for text-to-video generation, removing the need for training on large-scale video datasets. The model achieves this through efficient attention tuning, which focuses adaptation on the attention layers linking text and video features, and through structural inversion, a key technique for improving the temporal consistency of generated videos. This approach not only streamlines the training process but also helps produce coherent and contextually aligned videos from textual descriptions.
Text2Video-Zero [47] represents a notable leap in the field of generative models, achieving zero-shot text-to-video synthesis by adapting a pre-trained text-to-image diffusion model. The innovation lies in its approach to maintaining temporal consistency, which it accomplishes by enriching the latent codes with motion dynamics and by applying cross-frame attention. These techniques help ensure that the generated videos are not only visually coherent but also accurately reflect the narrative and dynamics implied by the input text. The primary objective of Text2Video-Zero is to make text-guided video generation and editing accessible, eliminating the need for extensive fine-tuning or reliance on large-scale video datasets. This work stands out for its ability to bridge the gap between text descriptions and video content, paving the way for efficient and effective video synthesis.
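The cross-frame attention idea can be illustrated with a short sketch in which every frame attends to the keys and values of the first frame, encouraging a consistent appearance across the clip; this is a simplified rendition under stated assumptions, not the released implementation.

```python
import torch

def cross_frame_attention(q, k, v):
    # Sparse cross-frame attention: all frames query the first frame's keys and
    # values, which ties their appearance together. Tensors are assumed to be
    # shaped (frames, tokens, dim).
    k0 = k[:1].expand_as(k)          # reuse first-frame keys for all frames
    v0 = v[:1].expand_as(v)          # reuse first-frame values for all frames
    attn = torch.softmax(q @ k0.transpose(-1, -2) / q.shape[-1] ** 0.5, dim=-1)
    return attn @ v0
```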

5.2.4. Text-to-3D Generation

Text-to-3D generation is an emerging task in computer graphics and machine learning in which three-dimensional models or scenes are created directly from textual descriptions. DreamFusion [50] represents a notable advancement in synthesizing 3D models from text, leveraging Imagen [20] as a frozen 2D text-to-image prior. The text input guides the visual output, with directional phrases in the prompt steering the viewpoint of the generated renderings. For the 3D representation, DreamFusion optimizes a randomly initialized Neural Radiance Field (NeRF), rendered with Mip-NeRF, using a probability density distillation loss known as Score Distillation Sampling. This loss uses the pretrained 2D diffusion model as a prior for optimizing the parametric image generator, enabling the creation of detailed and contextually accurate 3D representations from text descriptions. DreamFusion’s methodology highlights both the potential of text-to-image models for 3D synthesis and the effectiveness of combining them with advanced rendering techniques such as Mip-NeRF.
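The core of this optimization can be sketched as follows: a rendered view is noised, the frozen 2D model predicts the noise, and the residual is pushed back through the differentiable renderer only (the diffusion U-Net’s Jacobian is omitted, following the paper’s derivation). Names and the timestep range are illustrative.

```python
import torch

def sds_grad(diffusion_eps, rendered, text_emb, alphas_cumprod, w=1.0):
    # Score Distillation Sampling gradient: noise the rendered view, query the
    # frozen text-conditioned diffusion model, and return the weighted residual.
    t = torch.randint(20, len(alphas_cumprod) - 20, (1,))    # avoid extreme timesteps
    a_bar = alphas_cumprod[t]
    noise = torch.randn_like(rendered)
    x_t = a_bar.sqrt() * rendered + (1 - a_bar).sqrt() * noise
    with torch.no_grad():                                    # do not backprop through the U-Net
        eps_pred = diffusion_eps(x_t, t, text_emb)
    return w * (eps_pred - noise)
```

In practice, the returned tensor would be supplied as the upstream gradient of the rendered image, e.g., rendered.backward(gradient=sds_grad(...)), so that only the NeRF parameters are updated.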
Magic3D [49] emerges as an innovative solution to the challenges faced by DreamFusion in text-to-3D synthesis, namely the slow optimization of Neural Radiance Fields (NeRF) and the low resolution of image-space supervision, which lead to prolonged processing times and lower-quality 3D models. Magic3D addresses these issues through a two-stage optimization framework. In the first stage, it employs a low-resolution diffusion prior alongside a sparse 3D hash-grid structure to quickly derive a coarse model, significantly accelerating the initial modeling phase. The second stage takes this coarse representation as a starting point and further optimizes a textured 3D mesh, which interacts with a high-resolution latent diffusion model through an efficient differentiable renderer, enhancing the detail and quality of the final output. Magic3D also sidesteps the limited output resolution of Imagen [20], which constrained DreamFusion; by raising the supervision resolution from 64 × 64 to 512 × 512, it produces models with significantly greater detail. Furthermore, replacing Mip-NeRF with the more efficient Instant-NGP leads to improved efficiency.
ProlificDreamer [81] introduces Variational Score Distillation (VSD), a technique that represents a significant advancement in synthesizing 3D scenes from textual prompts. VSD optimizes a distribution of 3D scenes, treated as a random variable, so that the distribution of rendered images, viewed from all perspectives, closely matches the output distribution of a pretrained 2D diffusion model. To measure and optimize this alignment, ProlificDreamer employs the Kullback–Leibler (KL) divergence, a statistical tool that quantifies the difference between two probability distributions. This approach allows ProlificDreamer to bridge the gap between textual descriptions and their 3D visualizations efficiently, helping to ensure that the generated scenes are not only visually coherent but also contextually faithful to the input prompts.
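In a simplified form (omitting the weighting schedule and regularization details of the original formulation), the VSD objective can be read as minimizing, over a distribution $\mu$ of 3D scene parameters, the KL divergence between noised renderings and the pretrained text-conditioned 2D model at every noise level:

$$
\min_{\mu}\; \mathbb{E}_{t,\,c}\left[ w(t)\, D_{\mathrm{KL}}\big( q_t^{\mu}(\mathbf{x}_t \mid c) \,\|\, p_t(\mathbf{x}_t \mid y) \big) \right],
$$

where $c$ denotes a camera pose, $y$ the text prompt, $\mathbf{x}_t$ a rendered view noised to level $t$, $q_t^{\mu}$ the induced distribution of noised renderings, $p_t$ the pretrained model’s marginal at the same noise level, and $w(t)$ a time-dependent weight. This is a paraphrase for exposition rather than the paper’s exact objective.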

5.3. Interdisciplinary Applications

5.3.1. Medical Image Generation and Segmentation

In the interdisciplinary realm of medical imaging, diffusion models have carved a niche for themselves, demonstrating exceptional prowess in both generation and segmentation tasks. Song et al. [73] presented an innovative approach to address inverse problems in medical imaging using score-based generative models. Their methodology is particularly tailored for reconstructing medical images from various types of measurements, which is a critical challenge in the field. The process begins with the training of an unconditional score model, establishing a foundational generative framework. The novelty of their approach lies in the derivation of a stochastic process corresponding to the medical measurements. This process facilitates the infusion of conditional information into the model through a proximal optimization step, effectively incorporating the specific constraints of medical imaging. A crucial aspect of their method involves the decomposition of the matrix that maps the original signal to its measurements. This decomposition enables sampling in a closed-form, which is a significant advancement that allows for more efficient and accurate image reconstruction. The authors demonstrate the efficacy of their approach through a series of experiments across various medical imaging modalities, including computed tomography (CT), low-dose CT, and MRI. Their results showcase the model’s capability to reconstruct high-fidelity images consistent with both the prior and observed measurements.
Chung et al. [51] introduced a novel approach in medical imaging for reconstructing accelerated MRI scans using score-based diffusion models. Their technique involves pretraining a score model on unconditional magnitude images and then employing a variance exploding SDE [5] solver for the sampling process. The reconstruction is further refined using a Predictor–Corrector algorithm [5], integrated with a data consistency mapping. This enables effective conditioning on the split real and imaginary parts of the image. Additionally, they extend the method to accommodate multiple coil-varying measurements, enhancing the model’s adaptability for diverse MRI scenarios. This work showcases the potential of diffusion models in effectively reconstructing high-fidelity MRI images.
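A schematic predictor–corrector loop with a simple data-consistency correction is shown below; A and At stand for the measurement operator and its adjoint, y for the observed k-space data, and the step sizes are heuristic. The published method applies the consistency mapping more carefully (e.g., to the split real and imaginary parts), so this is only an illustrative sketch.

```python
import torch

@torch.no_grad()
def pc_sample_with_dc(score_fn, A, At, y, sigmas, n_corrector=1, snr=0.16, lam=1.0):
    # Predictor-corrector sampling for a variance-exploding SDE, interleaved with
    # a data-consistency step that pulls the iterate toward the measurements.
    # `sigmas` is a decreasing tensor of noise levels.
    x = torch.randn_like(At(y)) * sigmas[0]
    for i in range(len(sigmas) - 1):
        # Corrector: Langevin steps at the current noise level (heuristic step size).
        for _ in range(n_corrector):
            g = score_fn(x, sigmas[i])
            eps = 2 * (snr * sigmas[i]) ** 2
            x = x + eps * g + (2 * eps) ** 0.5 * torch.randn_like(x)
        # Predictor: one reverse-diffusion (VE) step.
        g = score_fn(x, sigmas[i])
        step = sigmas[i] ** 2 - sigmas[i + 1] ** 2
        x = x + step * g + step ** 0.5 * torch.randn_like(x)
        # Data consistency: enforce agreement with the observed k-space data.
        x = x - lam * At(A(x) - y)
    return x
```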
Peng et al. [52] made significant strides in MRI reconstruction by introducing a method that guides the reverse-diffusion process using observed k-space signals. Their approach is characterized by a novel coarse-to-fine sampling algorithm, which significantly enhances the efficiency of the sampling process in MRI reconstruction. This method allows for a more precise and gradual reconstruction of MR images, leading to improvements in both the quality and accuracy of the final images.
Wolleb et al. [55] introduced a novel method for brain tumor segmentation using diffusion models. Their technique involves diffusing and then denoising segmentation maps to reconstruct original images, with a crucial step of concatenating the brain MR image during denoising. This process, conducted through a U-Net model, conditions the denoising on the MR image, enhancing segmentation accuracy. Additionally, they generate multiple samples per input due to stochasticity in the diffusion process, facilitating the creation of an ensemble mean segmentation map along with a variance measure, quantifying the segmentation’s uncertainty. This approach not only enhances the precision of tumor segmentation but also yields quantitative insights regarding the confidence level of the results.
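The ensembling step is easy to express: several conditional samples are drawn for the same MR image, and their mean and variance serve as the segmentation and its per-pixel uncertainty map. sample_fn below is a placeholder for the conditional diffusion sampler.

```python
import torch

@torch.no_grad()
def ensemble_segmentation(sample_fn, mr_image, n_samples=5):
    # Draw several stochastic segmentation samples conditioned on the same MR
    # image; report their mean as the segmentation map and their variance as an
    # uncertainty estimate.
    samples = torch.stack([sample_fn(mr_image) for _ in range(n_samples)])
    return samples.mean(dim=0), samples.var(dim=0)
```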

5.3.2. Molecule Generation and Drug Design

The design of molecules, crucial in both biomedical research and drug development, is adeptly addressed by diffusion models, and fragment-based drug design, a key strategy in 3D molecular discovery, is effectively advanced by them. DiffLinker [54], an E(3)-equivariant 3D-conditional diffusion model, generates molecular linkers: a graph neural network predicts the linker size, and an equivariant diffusion model generates the linker that connects molecular fragments into complete molecules. Its ability to handle multiple fragments, including determining the atom count and attachment points, marks a significant advancement in the field.
Hoogeboom et al. [74] presented the E(3)-Equivariant Diffusion Model (EDM) for molecule generation in 3D. This method significantly expands the scope of small molecule generation, enabling the creation of structures with up to 29 atoms, surpassing the previous limit of nine heavy atoms. By combining an equivariant graph neural network with the diffusion process, this model achieves enhanced performance and scalability, modeling molecular structures with geometric symmetries and simplifying the training process.
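The key building block, an E(n)-equivariant coordinate update, can be sketched as follows: edge messages depend only on rotation- and translation-invariant quantities (node features and pairwise distances), and positions are moved along relative displacement vectors, so the update commutes with rigid motions. phi_m and phi_x denote small MLPs supplied by the caller; this is a simplified illustration rather than EDM’s exact network.

```python
import torch

def egnn_coordinate_update(x, h, phi_m, phi_x):
    # x: (N, 3) atom coordinates, h: (N, D) invariant node features.
    # Messages use only invariant inputs; coordinates move along relative
    # position vectors, keeping the update E(n)-equivariant.
    n = x.shape[0]
    diff = x[:, None, :] - x[None, :, :]              # (N, N, 3) relative positions
    dist2 = (diff ** 2).sum(-1, keepdim=True)         # (N, N, 1) squared distances
    h_pairs = torch.cat([h[:, None, :].expand(-1, n, -1),
                         h[None, :, :].expand(n, -1, -1),
                         dist2], dim=-1)              # (N, N, 2D + 1)
    m = phi_m(h_pairs)                                # invariant edge messages
    weights = phi_x(m)                                # (N, N, 1) scalar per edge
    return x + (diff * weights).mean(dim=1)           # equivariant position update
```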
Wu et al. [53] introduced a novel diffusion-based generative model, the Diffusion Informative Prior Bridge, which utilizes physical and statistical prior information to guide the diffusion process. This model integrates several energy functions to improve molecule generation and promote uniformity in 3D point cloud generation.

6. Future Trends

Research into diffusion models stands at the threshold of exciting developments, with theoretical and practical advancements yet to be explored. As these models establish themselves as a powerful generative framework rivaling adversarial networks without the need for adversarial training, the focus turns to deepening our understanding of their operational efficacy across varied applications. Crucial to this endeavor is unraveling the characteristics that distinguish diffusion models from other generative mechanisms, such as variational autoencoders and flow-based models. This differentiation could illuminate their ability to generate high-quality samples while achieving competitive likelihoods.
Moreover, improving latent space representations remains an area ripe for innovation. Unlike some of their generative counterparts, current diffusion models do not provide readily manipulable semantic representations; their latents typically mirror the dimensionality of the data space, which also limits sampling efficiency.
The advent of Artificial-Intelligence-Generated Content (AIGC) and the proliferation of Diffusion Foundation Models signify a paradigm shift towards models that are pre-trained to produce content that resonates with human perception. The generative pre-training techniques utilized in models such as GPT [45] and Visual ChatGPT [82], which have shown emergent abilities and surprising generation performance, hold promise for the future of diffusion models. By integrating these techniques, diffusion models could potentially be transformed to perform generatively at scale, offering new avenues for AIGC and beyond.
Recent advancements in diffusion models have markedly improved text-to-video generation, overcoming previous limitations in quality, resolution, and synthesis control. Gen-1 [21] has set a new standard by building on text-conditioned image diffusion models such as DALL-E 2 and Stable Diffusion to produce videos with unprecedented detail from text prompts. Notably, it incorporates monocular depth estimates and embeddings from pre-trained neural networks for enhanced structural and content fidelity. Following this, Gen-2 has pushed the boundaries further by generating videos from scratch and supporting ultra-realistic 4K output, showcasing the potential of diffusion models in video synthesis.
The introduction of the Sora model, utilizing a Diffusion transformer (DiT) architecture, represents a leap forward, generating videos at 1920 × 1080 resolution for up to a minute, surpassing previous models in both quality and length. These developments signal a transformative era in video generation from text, hinting at a future where creating complex, high-quality video content from simple text descriptions could become commonplace. However, the journey does not end here. The field is ripe for exploration, especially in modeling long-term temporal dynamics and interactions, to further extend video duration and realism. Future research could unlock even more sophisticated applications, potentially revolutionizing storytelling, education, and entertainment by making detailed, lifelike video generation accessible to everyone.
With systematic exploration and innovative thinking, the trajectory of diffusion models is set to redefine the boundaries of generative AI, edging closer to the multifaceted goal of artificial general intelligence.

7. Conclusions

In this review, we have navigated the evolving landscape of diffusion models, elucidating the distinctions and synergies among three foundational formulations: DDPMs, SGMs, and Score SDEs. These models have showcased their prowess in image generation, setting a benchmark for quality and diversity. However, it is crucial to acknowledge that original diffusion models possess certain limitations, such as slower sampling speeds, often requiring thousands of evaluation steps to generate a single sample. They also lag in maximum likelihood estimation when compared with likelihood-based models and exhibit limited generalizability across various data types. Despite these challenges, recent strides have been made from practical and theoretical standpoints to mitigate these constraints, enhancing their applicability and analytical robustness.
Yang et al. [83] enhanced diffusion models by streamlining the architecture for greater efficiency, significantly lowering computational costs and making these models viable for limited-resource settings. They achieve this through reducing latent space dimensionality and employing a layer-pruning algorithm to maintain generative quality with fewer resources. Furthermore, they improve the diffusion process with an adaptive noise conditioning technique, tailoring the number of steps to the data complexity for effective image synthesis. Lee et al. [84] tackle spectral inconsistencies in diffusion model outputs with a spectrum translation method grounded in contrastive learning. This approach effectively aligns the frequency components of generated images with those of real images, thereby elevating the visual quality and enabling the generation of complex textures and details with improved fidelity.
Through a comparative lens, we have identified avenues for integrating diffusion models with other generative frameworks, suggesting a path towards enriched generative capabilities. Our exploration across four distinct domains underscores the versatility of diffusion models and their potential to revolutionize not only image generation but also extend into multimodal and interdisciplinary applications. While recognizing the innate constraints of diffusion models, such as computational intensity and the complexity of architecture that currently hinder their scalability and broad application, it is important to note that substantial efforts have been dedicated to overcoming these barriers. The horizon for future research within diffusion models is vast and vibrant, with many opportunities to refine their efficiency and expand their utility. The promise of diffusion models to advance generative AI tasks is undeniable, beckoning a concerted effort to unearth their full potential and chart new territories in generative modeling. With ongoing research and development, we anticipate diffusion models to not only evolve in terms of their theoretical foundations but also in their practical executions, marking a new epoch in the field of generative AI.

Author Contributions

Conceptualization, X.W.; methodology, X.W. and Z.H.; formal analysis, Z.H.; resources, X.P. All authors have read and agreed to the published version of the manuscript.

Funding

This work is partially supported by the National Natural Science Foundation of China (62176165), the Stable Support Projects for Shenzhen Higher Education Institutions (20220718110918001), and the Natural Science Foundation of Top Talent of SZTU (GDRC202131).

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Nichol, A.Q.; Dhariwal, P. Improved denoising diffusion probabilistic models. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 8162–8171. [Google Scholar]
  2. Song, J.; Meng, C.; Ermon, S. Denoising diffusion implicit models. arXiv 2020, arXiv:2010.02502. [Google Scholar]
  3. Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 2020, 33, 6840–6851. [Google Scholar]
  4. Sohl-Dickstein, J.; Weiss, E.; Maheswaranathan, N.; Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In Proceedings of the International Conference on Machine Learning, PMLR, Lille, France, 6–11 July 2015; pp. 2256–2265. [Google Scholar]
  5. Song, Y.; Sohl-Dickstein, J.; Kingma, D.P.; Kumar, A.; Ermon, S.; Poole, B. Score-based generative modeling through stochastic differential equations. arXiv 2020, arXiv:2011.13456. [Google Scholar]
  6. Dhariwal, P.; Nichol, A. Diffusion models beat gans on image synthesis. Adv. Neural Inf. Process. Syst. 2021, 34, 8780–8794. [Google Scholar]
  7. Watson, D.; Chan, W.; Ho, J.; Norouzi, M. Learning fast samplers for diffusion models by differentiating through sample quality. arXiv 2022, arXiv:2202.05830. [Google Scholar]
  8. Song, Y.; Ermon, S. Generative modeling by estimating gradients of the data distribution. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Volume 32. [Google Scholar]
  9. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; Volume 27. [Google Scholar]
  10. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10684–10695. [Google Scholar]
  11. Chen, X.; Huang, L.; Liu, Y.; Shen, Y.; Zhao, D.; Zhao, H. Anydoor: Zero-shot object-level image customization. arXiv 2023, arXiv:2307.09481. [Google Scholar]
  12. Meng, C.; He, Y.; Song, Y.; Song, J.; Wu, J.; Zhu, J.Y.; Ermon, S. Sdedit: Guided image synthesis and editing with stochastic differential equations. arXiv 2021, arXiv:2108.01073. [Google Scholar]
  13. Lugmayr, A.; Danelljan, M.; Romero, A.; Yu, F.; Timofte, R.; Van Gool, L. Repaint: Inpainting using denoising diffusion probabilistic models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11461–11471. [Google Scholar]
  14. Brempong, E.A.; Kornblith, S.; Chen, T.; Parmar, N.; Minderer, M.; Norouzi, M. Denoising pretraining for semantic segmentation. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 4175–4186. [Google Scholar]
  15. Vincent, P.; Larochelle, H.; Bengio, Y.; Manzagol, P.A. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 5–9 July 2008; pp. 1096–1103. [Google Scholar]
  16. Xu, J.; Liu, S.; Vahdat, A.; Byeon, W.; Wang, X.; De Mello, S. Open-vocabulary panoptic segmentation with text-to-image diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 2955–2966. [Google Scholar]
  17. Wyatt, J.; Leach, A.; Schmon, S.M.; Willcocks, C.G. Anoddpm: Anomaly detection with denoising diffusion probabilistic models using simplex noise. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 650–656. [Google Scholar]
  18. Wolleb, J.; Bieder, F.; Sandkühler, R.; Cattin, P.C. Diffusion models for medical anomaly detection. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Singapore, 8–12 September 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 35–45. [Google Scholar]
  19. Ramesh, A.; Pavlov, M.; Goh, G.; Gray, S.; Voss, C.; Radford, A.; Chen, M.; Sutskever, I. Zero-shot text-to-image generation. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 8821–8831. [Google Scholar]
  20. Saharia, C.; Chan, W.; Saxena, S.; Li, L.; Whang, J.; Denton, E.L.; Ghasemipour, K.; Gontijo Lopes, R.; Karagol Ayan, B.; Salimans, T.; et al. Photorealistic text-to-image diffusion models with deep language understanding. Adv. Neural Inf. Process. Syst. 2022, 35, 36479–36494. [Google Scholar]
  21. Esser, P.; Chiu, J.; Atighehchian, P.; Granskog, J.; Germanidis, A. Structure and content-guided video synthesis with diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 7346–7356. [Google Scholar]
  22. Sheynin, S.; Polyak, A.; Singer, U.; Kirstain, Y.; Zohar, A.; Ashual, O.; Parikh, D.; Taigman, Y. Emu edit: Precise image editing via recognition and generation tasks. arXiv 2023, arXiv:2311.10089. [Google Scholar]
  23. Karras, T.; Aittala, M.; Aila, T.; Laine, S. Elucidating the design space of diffusion-based generative models. Adv. Neural Inf. Process. Syst. 2022, 35, 26565–26577. [Google Scholar]
  24. Watson, D.; Ho, J.; Norouzi, M.; Chan, W. Learning to efficiently sample from diffusion probabilistic models. arXiv 2021, arXiv:2106.03802. [Google Scholar]
  25. Sauer, A.; Lorenz, D.; Blattmann, A.; Rombach, R. Adversarial diffusion distillation. arXiv 2023, arXiv:2311.17042. [Google Scholar]
  26. Song, Y.; Durkan, C.; Murray, I.; Ermon, S. Maximum likelihood training of score-based diffusion models. Adv. Neural Inf. Process. Syst. 2021, 34, 1415–1428. [Google Scholar]
  27. Gu, S.; Chen, D.; Bao, J.; Wen, F.; Zhang, B.; Chen, D.; Yuan, L.; Guo, B. Vector quantized diffusion model for text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10696–10706. [Google Scholar]
  28. Xu, M.; Yu, L.; Song, Y.; Shi, C.; Ermon, S.; Tang, J. Geodiff: A geometric diffusion model for molecular conformation generation. arXiv 2022, arXiv:2203.02923. [Google Scholar]
  29. Kingma, D.P.; Welling, M. Auto-encoding variational bayes. arXiv 2013, arXiv:1312.6114. [Google Scholar]
  30. Rezende, D.J.; Mohamed, S.; Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the International Conference on Machine Learning, PMLR, Beijing, China, 22–24 June 2014; pp. 1278–1286. [Google Scholar]
  31. Dinh, L.; Krueger, D.; Bengio, Y. Nice: Non-linear independent components estimation. arXiv 2014, arXiv:1410.8516. [Google Scholar]
  32. Dinh, L.; Sohl-Dickstein, J.; Bengio, S. Density estimation using real nvp. arXiv 2016, arXiv:1605.08803. [Google Scholar]
  33. Papamakarios, G.; Nalisnick, E.; Rezende, D.J.; Mohamed, S.; Lakshminarayanan, B. Normalizing flows for probabilistic modeling and inference. J. Mach. Learn. Res. 2021, 22, 2617–2680. [Google Scholar]
  34. Adams, R.P. High-dimensional probability estimation with deep density models. arXiv 2013, arXiv:1302.5125. [Google Scholar]
  35. Amit, T.; Shaharbany, T.; Nachmani, E.; Wolf, L. Segdiff: Image segmentation with diffusion probabilistic models. arXiv 2021, arXiv:2112.00390. [Google Scholar]
  36. Baranchuk, D.; Rubachev, I.; Voynov, A.; Khrulkov, V.; Babenko, A. Label-efficient semantic segmentation with diffusion models. arXiv 2021, arXiv:2112.03126. [Google Scholar]
  37. Li, H.; Yang, Y.; Chang, M.; Chen, S.; Feng, H.; Xu, Z.; Li, Q.; Chen, Y. Srdiff: Single image super-resolution with diffusion probabilistic models. Neurocomputing 2022, 479, 47–59. [Google Scholar] [CrossRef]
  38. Zimmermann, R.S.; Schott, L.; Song, Y.; Dunn, B.A.; Klindt, D.A. Score-based generative classifiers. arXiv 2021, arXiv:2110.00473. [Google Scholar]
  39. Ho, J.; Chan, W.; Saharia, C.; Whang, J.; Gao, R.; Gritsenko, A.; Kingma, D.P.; Poole, B.; Norouzi, M.; Fleet, D.J.; et al. Imagen video: High definition video generation with diffusion models. arXiv 2022, arXiv:2210.02303. [Google Scholar]
  40. Zhang, L.; Rao, A.; Agrawala, M. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 3836–3847. [Google Scholar]
  41. Ramesh, A.; Dhariwal, P.; Nichol, A.; Chu, C.; Chen, M. Hierarchical text-conditional image generation with clip latents. arXiv 2022, arXiv:2204.06125. [Google Scholar]
  42. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  43. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
  44. Nichol, A.; Dhariwal, P.; Ramesh, A.; Shyam, P.; Mishkin, P.; McGrew, B.; Sutskever, I.; Chen, M. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv 2021, arXiv:2112.10741. [Google Scholar]
  45. Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I. Improving Language Understanding by Generative Pre-Training. 2018. Available online: https://www.mikecaptain.com/resources/pdf/GPT-1.pdf (accessed on 11 June 2018).
  46. Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 2020, 21, 5485–5551. [Google Scholar]
  47. Khachatryan, L.; Movsisyan, A.; Tadevosyan, V.; Henschel, R.; Wang, Z.; Navasardyan, S.; Shi, H. Text2video-zero: Text-to-image diffusion models are zero-shot video generators. arXiv 2023, arXiv:2303.13439. [Google Scholar]
  48. Singer, U.; Polyak, A.; Hayes, T.; Yin, X.; An, J.; Zhang, S.; Hu, Q.; Yang, H.; Ashual, O.; Gafni, O.; et al. Make-a-video: Text-to-video generation without text-video data. arXiv 2022, arXiv:2209.14792. [Google Scholar]
  49. Lin, C.H.; Gao, J.; Tang, L.; Takikawa, T.; Zeng, X.; Huang, X.; Kreis, K.; Fidler, S.; Liu, M.Y.; Lin, T.Y. Magic3d: High-resolution text-to-3d content creation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 300–309. [Google Scholar]
  50. Poole, B.; Jain, A.; Barron, J.T.; Mildenhall, B. Dreamfusion: Text-to-3d using 2d diffusion. arXiv 2022, arXiv:2209.14988. [Google Scholar]
  51. Chung, H.; Ye, J.C. Score-based diffusion models for accelerated MRI. Med. Image Anal. 2022, 80, 102479. [Google Scholar] [CrossRef] [PubMed]
  52. Peng, C.; Guo, P.; Zhou, S.K.; Patel, V.M.; Chellappa, R. Towards performant and reliable undersampled MR reconstruction via diffusion model sampling. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Singapore, 8–12 September 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 623–633. [Google Scholar]
  53. Wu, L.; Gong, C.; Liu, X.; Ye, M.; Liu, Q. Diffusion-based molecule generation with informative prior bridges. Adv. Neural Inf. Process. Syst. 2022, 35, 36533–36545. [Google Scholar]
  54. Igashov, I.; Stärk, H.; Vignac, C.; Satorras, V.G.; Frossard, P.; Welling, M.; Bronstein, M.; Correia, B. Equivariant 3d-conditional diffusion models for molecular linker design. arXiv 2022, arXiv:2210.05274. [Google Scholar]
  55. Wolleb, J.; Sandkühler, R.; Bieder, F.; Valmaggia, P.; Cattin, P.C. Diffusion models for implicit image segmentation ensembles. In Proceedings of the International Conference on Medical Imaging with Deep Learning, PMLR, Zurich, Switzerland, 6–8 July 2022; pp. 1336–1348. [Google Scholar]
  56. Tae, J.; Kim, H.; Kim, T. EdiTTS: Score-based editing for controllable text-to-speech. arXiv 2021, arXiv:2110.02584. [Google Scholar]
  57. Koller, D.; Friedman, N. Probabilistic Graphical Models: Principles and Techniques; MIT Press: Cambridge, MA, USA, 2009. [Google Scholar]
  58. Hyvärinen, A.; Dayan, P. Estimation of non-normalized statistical models by score matching. J. Mach. Learn. Res. 2005, 6, 695–709. [Google Scholar]
  59. Raphan, M.; Simoncelli, E. Learning to be Bayesian without supervision. Adv. Neural Inf. Process. Syst. 2006, 19, 1145–1146. [Google Scholar]
  60. Raphan, M.; Simoncelli, E.P. Least squares estimation without priors or supervision. Neural Comput. 2011, 23, 374–420. [Google Scholar] [CrossRef]
  61. Vincent, P. A connection between score matching and denoising autoencoders. Neural Comput. 2011, 23, 1661–1674. [Google Scholar] [CrossRef]
  62. Song, Y.; Garg, S.; Shi, J.; Ermon, S. Sliced score matching: A scalable approach to density and score estimation. In Proceedings of the Uncertainty in Artificial Intelligence, PMLR, Online, 3–6 August 2020; pp. 574–584. [Google Scholar]
  63. Anderson, B.D. Reverse-time diffusion equation models. Stoch. Process. Their Appl. 1982, 12, 313–326. [Google Scholar] [CrossRef]
  64. Huang, C.W.; Lim, J.H.; Courville, A.C. A variational perspective on diffusion-based generative models and score matching. Adv. Neural Inf. Process. Syst. 2021, 34, 22863–22876. [Google Scholar]
  65. Vahdat, A.; Kreis, K.; Kautz, J. Score-based Generative Modeling in Latent Space. Adv. Neural Inf. Process. Syst. 2021, 34, 11287–11302. [Google Scholar]
  66. Luo, C. Understanding diffusion models: A unified perspective. arXiv 2022, arXiv:2208.11970. [Google Scholar]
  67. Xiao, Z.; Kreis, K.; Vahdat, A. Tackling the generative learning trilemma with denoising diffusion gans. arXiv 2021, arXiv:2112.07804. [Google Scholar]
  68. Wang, Z.; Zheng, H.; He, P.; Chen, W.; Zhou, M. Diffusion-gan: Training gans with diffusion. arXiv 2022, arXiv:2206.02262. [Google Scholar]
  69. Zhang, Q.; Chen, Y. Diffusion normalizing flow. Adv. Neural Inf. Process. Syst. 2021, 34, 16280–16291. [Google Scholar]
  70. Gong, W.; Li, Y. Interpreting diffusion score matching using normalizing flow. arXiv 2021, arXiv:2107.10072. [Google Scholar]
  71. Kim, D.; Na, B.; Kwon, S.J.; Lee, D.; Kang, W.; Moon, I.C. Maximum Likelihood Training of Implicit Nonlinear Diffusion Model. Adv. Neural Inf. Process. Syst. 2022, 35, 32270–32284. [Google Scholar]
  72. Saharia, C.; Chan, W.; Chang, H.; Lee, C.; Ho, J.; Salimans, T.; Fleet, D.; Norouzi, M. Palette: Image-to-image diffusion models. In Proceedings of the ACM SIGGRAPH 2022 Conference Proceedings, Vancouver, BC, Canada, 7–11 August 2022; pp. 1–10. [Google Scholar]
  73. Song, Y.; Shen, L.; Xing, L.; Ermon, S. Solving inverse problems in medical imaging with score-based generative models. arXiv 2021, arXiv:2111.08005. [Google Scholar]
  74. Hoogeboom, E.; Satorras, V.G.; Vignac, C.; Welling, M. Equivariant diffusion for molecule generation in 3d. In Proceedings of the International Conference on Machine Learning, PMLR, Zurich, Switzerland, 6–8 July 2022; pp. 8867–8887. [Google Scholar]
  75. Popov, V.; Vovk, I.; Gogoryan, V.; Sadekova, T.; Kudinov, M. Grad-tts: A diffusion probabilistic model for text-to-speech. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 8599–8608. [Google Scholar]
  76. Yang, D.; Yu, J.; Wang, H.; Wang, W.; Weng, C.; Zou, Y.; Yu, D. Diffsound: Discrete diffusion model for text-to-sound generation. IEEE/ACM Trans. Audio Speech Lang. Process. 2023, 31, 1720–1733. [Google Scholar] [CrossRef]
  77. Esser, P.; Rombach, R.; Ommer, B. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 12873–12883. [Google Scholar]
  78. Kim, S.; Kim, H.; Yoon, S. Guided-tts 2: A diffusion model for high-quality adaptive text-to-speech with untranscribed data. arXiv 2022, arXiv:2205.15370. [Google Scholar]
  79. Austin, J.; Johnson, D.D.; Ho, J.; Tarlow, D.; Van Den Berg, R. Structured denoising diffusion models in discrete state-spaces. Adv. Neural Inf. Process. Syst. 2021, 34, 17981–17993. [Google Scholar]
  80. Wu, J.Z.; Ge, Y.; Wang, X.; Lei, S.W.; Gu, Y.; Shi, Y.; Hsu, W.; Shan, Y.; Qie, X.; Shou, M.Z. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 7623–7633. [Google Scholar]
  81. Wang, Z.; Lu, C.; Wang, Y.; Bao, F.; Li, C.; Su, H.; Zhu, J. ProlificDreamer: High-Fidelity and Diverse Text-to-3D Generation with Variational Score Distillation. arXiv 2023, arXiv:2305.16213. [Google Scholar]
  82. Wu, C.; Yin, S.; Qi, W.; Wang, X.; Tang, Z.; Duan, N. Visual chatgpt: Talking, drawing and editing with visual foundation models. arXiv 2023, arXiv:2303.04671. [Google Scholar]
  83. Yang, X.; Zhou, D.; Feng, J.; Wang, X. Diffusion probabilistic model made slim. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 22552–22562. [Google Scholar]
  84. Lee, S.; Jung, S.W.; Seo, H. Spectrum Translation for Refinement of Image Generation (STIG) Based on Contrastive Learning and Spectral Filter Profile. arXiv 2024, arXiv:2403.05093. [Google Scholar]
Figure 1. Images generated by Stable Diffusion [10] based on various text prompts.
Figure 2. A chronological summary of important progress in the field of diffusion modeling.
Figure 3. The denoising diffusion probabilistic model involves a forward process known as diffusion and a reverse process referred to as reverse-diffusion or denoising. The diffusion process (denoted as $q(x_t \mid x_{t-1})$) incrementally adds Gaussian noise to the data. In the reverse-diffusion process (denoted as $q(x_{t-1} \mid x_t)$), this noise is estimated and reversed to recover the original data distribution. The denoising step ($p_\theta(x_{t-1} \mid x_t)$) is the estimated process that projects the noised samples back toward their original distribution, completing the model’s cycle from noise to data.
Figure 4. Applications of AnyDoor without any parameter tuning. Besides customizing an image via placing single (top row) or multiple (middle row) objects at user-specified locations, the model also supports harmoniously moving or swapping objects within real scenes (bottom row).
Figure 5. Overview of the method proposed by Baranchuk et al. [36]. (1) Adding noise into an actual image following the forward diffusion protocol denoted by q. (2) Deriving pixel-level image features from the pre-trained Deep Diffusion Probabilistic Model (DDPM). (3) Utilizing a collection of Multi-Layer Perceptrons (MLPs) to assign a class label to every pixel feature.
Figure 6. Model diagram from Imagen [20].
Figure 7. Overview of Stable Diffusion [10].
Figure 8. Overview of DALL-E2 [41].
Figure 9. Results and applications of Magic3D [49], high-resolution text-to-3D generation.
Table 1. Diffusion models are integrated within various generative modeling frameworks.
Model | Article | Year
VAE | Huang et al. [64] | 2021
VAE | Vahdat et al. [65] | 2021
VAE | Luo [66] | 2022
GAN | Xiao et al. [67] | 2021
GAN | Wang et al. [68] | 2022
Flow-based models | Zhang et al. [69] | 2021
Flow-based models | Gong et al. [70] | 2021
Flow-based models | Kim et al. [71] | 2022
Table 2. Our classification systematically arranges diffusion models utilized in computer vision, dissecting them based on the task, the denoising condition, and the architecture. The principal criterion for our taxonomy is the model’s application scope. The three colors in the table each represent the application domain to which the work belongs, namely computer vision, multi-modal generation, and interdisciplinary applications.
Paper | Task | Architecture | Denoising Condition | Year
Chen et al. [11] | image editing | DDPM | conditioned on image | 2022
Meng et al. [12] | image synthesis and editing | Score SDE, DDPM, Improved DDPM | conditioned on image | 2021
Lugmayr et al. [13] | image inpainting | DDPM | unconditional | 2022
Baranchuk et al. [36] | image semantic segmentation | Improved DDPM | conditioned on image | 2021
Brempong et al. [14] | image semantic segmentation | DDeP | conditioned on image | 2022
Xu et al. [16] | image segmentation | DDPM | conditioned on image | 2023
Wyatt et al. [17] | medical image anomaly detection | ADM | conditioned on image | 2022
Wolleb et al. [18] | medical image anomaly detection | DDIM | conditioned on image | 2022
Rombach et al. [10] | multi-task (image generation, inpainting, editing) | VQ-DDM | unconditional, conditioned on image | 2022
Nichol et al. [44] | multi-task (image generation, inpainting, editing) | ADM | conditioned on image, text guidance | 2021
Saharia et al. [20] | text-to-image generation | Imagen | conditioned on text | 2022
Ramesh et al. [19] | text-to-image generation | ADM | conditioned on text | 2021
Popov et al. [75] | text-to-speech generation | score-based decoder, DDPM | conditioned on frames | 2021
Yang et al. [76] | text-to-sound generation | DDPM | conditioned on text | 2023
Singer et al. [48] | text-to-video generation | T2V, DDPM | unconditional, conditioned on image | 2022
Ho et al. [39] | text-to-video generation | DDPM | unconditional, conditioned on image | 2022
Poole et al. [50] | text-to-3D generation | DDPM | unconditional | 2022
Lin et al. [49] | text-to-3D generation | improved DDPM | unconditional | 2023
Song et al. [73] | medical image generation | NCSN++ | conditioned on measurements | 2021
Chung et al. [51] | medical image generation | NCSN++ | conditioned on measurements | 2022
Wolleb et al. [55] | medical image generation | improved DDPM | conditioned on image | 2022
Hoogeboom et al. [74] | molecule generation | EDM | conditioned on image | 2022
Wu et al. [53] | molecule generation | improved DDPM | conditioned on physical information | 2022