1. Introduction
Diffusion models [1,2,3,4,5,6,7,8], heralding a significant leap in deep generative modeling, have captivated a broad audience with their exceptional capabilities. These models stand out for their ability to create images of astounding realism and quality, marking a new chapter in digital creativity and content creation. In comparison to Generative Adversarial Networks (GANs) [9], diffusion models excel in their controllability and precision, more accurately meeting user-defined criteria during the generation process. This advantage is vividly demonstrated by the outputs of the Stable Diffusion model [10], shown for example in Figure 1, which leverages text inputs to steer the image generation process, achieving images of remarkable fidelity that closely match the provided textual descriptions.
The last four years have seen a remarkable increase in publications related to diffusion modeling, featuring numerous innovative theories and landmark studies. This explosion of research presents a daunting challenge for newcomers, complicating their ability to find a foothold within this broad and swiftly changing field. With this context in mind, our survey seeks to offer a detailed overview of diffusion modeling’s present state, charting the evolution of its foundational principles and techniques, and highlighting its current avant-garde applications. Our intention is to facilitate the journey of future researchers into the realm of diffusion modeling by providing a well-structured and accessible gateway into this complex area of study.
Building on this foundation, a distinctive highlight of our review is its comprehensive examination and nuanced understanding of diffusion models in relation to other mainstream generative models. We delve deeply into the distinctions, advantages, and limitations of each approach, offering a comparative analysis that underscores their interconnectedness and the unique ways in which diffusion models build upon and diverge from prior generative methodologies. This critical exploration not only elucidates the synergistic relationships among these models but also equips readers with valuable insights into selecting the most appropriate methods for their specific research goals.
The exceptional generative power of diffusion models has swiftly positioned them as key players in a plethora of vision-centric applications, including but not limited to image editing, inpainting [11,12,13], semantic segmentation [14,15,16], and anomaly detection [17,18]. This surge in popularity can be traced back to the significant advancements made since the introduction of diffusion probabilistic models [3], which have expanded upon the foundational concepts of diffusion modeling [4]. The resulting wave of research enthusiasm is marked by the daily unveiling of novel models and an expanding corpus of scholarly work. Public and academic intrigue in diffusion models particularly intensified following the release of groundbreaking text-to-image generators like DALL-E [19], Imagen [20], and Stable Diffusion [10], which have redefined the benchmarks for generating images from textual descriptions. Moreover, the field has seen remarkable strides in text-to-video generation [21,22], showcasing highly advanced videos and amplifying the excitement surrounding diffusion models. A statistical and chronological analysis showcased in Figure 2 underlines the rising prominence of diffusion models, especially within the vision community. This illustration emphasizes their growing importance in the expansive field of generative modeling, marking a notable shift in focus and enthusiasm towards these models.
In this paper, we will first introduce the framework of diffusion models by providing a succinct description of the three main formulations: denoising diffusion probabilistic models (DDPMs) [1,2,3,4,7], score-based generative models (SGMs) [5,8], and stochastic differential equations (Score SDEs) [5,23]. The cornerstone of these methods is the incorporation of random noise in the forward diffusion process and its subsequent removal in the reverse diffusion process to synthesize new samples. We will elucidate the functioning of these models within the diffusion process and clarify their interrelations (Section 2).
Then, we situate our work within the rapidly expanding body of research on diffusion models, categorizing it into three key areas: efficient sampling [1,24,25], improved likelihood [1,26], and handling data with special structures [10,27,28] (Section 3).
Next, we juxtapose diffusion models with other generative models (Section 4), shedding light on the connections and distinctions between diffusion models and their contemporaries, such as variational autoencoders (VAEs) [29,30], generative adversarial networks (GANs) [9], and flow-based models [31,32,33,34]. A comprehensive analysis will highlight the merits and limitations inherent to each model type and spotlight seminal contributions that have successfully integrated diffusion models with alternative generative frameworks to achieve enhanced performance.
Subsequently, we will systematically review the applications of diffusion models across various fields (Section 5), including computer vision [11,12,13,14,15,16,17,18,35,36,37,38], multi-modal generation [10,19,20,39,40,41,42,43,44,45,46,47,48,49,50], and interdisciplinary fields [51,52,53,54,55,56]. For each application, we will define the task and discuss how diffusion models offer solutions to challenges identified in previous works. In conclusion, this review will synthesize core insights, draw significant conclusions, and identify promising directions for the continued exploration of diffusion models in future research endeavors (Section 6 and Section 7).
2. Framework of Diffusion Models
Diffusion models represent a sophisticated class of probabilistic generative models designed to perturb data by systematically introducing noise through a forward process and subsequently learning to reverse this process to generate new samples. Figure 3 conceptually illustrates the diffusion process, offering a visual understanding of the underlying mechanisms. Research on diffusion modeling has converged on three principal formulations: denoising diffusion probabilistic models (DDPMs), which focus on iteratively denoising data; score-based generative models (SGMs), which leverage the gradients of the data distribution's log-density (the score) for sample generation; and stochastic differential equations (Score SDEs), which frame the diffusion process within the continuous domain of differential equations. This section aims to provide a comprehensive overview of these formulations, elucidating their interconnections and unique contributions to the field of generative modeling.
2.1. Denoising Diffusion Probabilistic Models (DDPMs)
2.1.1. Forward Process
The forward process of DDPMs [3,7] involves a Markov chain [57], which gradually adds Gaussian noise to an initial data distribution over a series of time steps. Given an original data point $\mathbf{x}_0 \sim q(\mathbf{x}_0)$, the forward process can be expressed as follows:

$$q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}\big(\mathbf{x}_t;\, \sqrt{1-\beta_t}\,\mathbf{x}_{t-1},\, \beta_t \mathbf{I}\big),$$

where $t$ indexes the time step from 1 to $T$, and $\beta_t$ are variance parameters of the noise added at each step. The term $\mathcal{N}$ denotes the Gaussian distribution, and $\mathbf{I}$ is the identity matrix.
The sequence of transformations from $\mathbf{x}_0$ to $\mathbf{x}_T$ (where $\mathbf{x}_T$ is nearly indistinguishable from pure noise) is modeled by the following joint distribution:

$$q(\mathbf{x}_{1:T} \mid \mathbf{x}_0) = \prod_{t=1}^{T} q(\mathbf{x}_t \mid \mathbf{x}_{t-1}).$$

Here, each step is a Gaussian diffusion process that is conditional on the previous step, effectively creating a smooth transition from structured data to random noise.
The noise schedule is defined by the variance parameters $\beta_t$, which typically start small and gradually increase but are bounded such that $0 < \beta_1 < \beta_2 < \cdots < \beta_T < 1$. This ensures that the data are not overwhelmed by noise too quickly and allows the model to learn meaningful representations of the data at each step of the diffusion process.
The parameters $\alpha_t$ are defined as $\alpha_t = 1 - \beta_t$, and their cumulative product is given by $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$. The variables $\alpha_t$ and $\bar{\alpha}_t$ represent the proportion of the original data that remains at step $t$ and the overall proportion of the data that remains up to step $t$, respectively. We can then expeditiously compute $\mathbf{x}_t$ at any moment based on $\mathbf{x}_0$ and $\bar{\alpha}_t$ as follows:

$$\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon},$$

where $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$.
Through this process, the forward diffusion effectively corrupts the data by incrementally adding noise, moving the data distribution towards a Gaussian distribution with a mean of zero and a variance of one.
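To make the closed-form noising step above concrete, the following minimal PyTorch sketch implements it; the linear schedule endpoints and tensor shapes are illustrative assumptions rather than values from the text.

```python
import torch

# Illustrative linear beta schedule (endpoints are common defaults, not from the text).
T = 1000
betas = torch.linspace(1e-4, 0.02, T)       # beta_t: small and increasing
alphas = 1.0 - betas                        # alpha_t = 1 - beta_t
alpha_bars = torch.cumprod(alphas, dim=0)   # alpha-bar_t = prod_s alpha_s

def q_sample(x0: torch.Tensor, t: torch.Tensor):
    """Sample x_t directly from x_0 via
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = torch.randn_like(x0)
    ab = alpha_bars[t].view(-1, *([1] * (x0.dim() - 1)))  # broadcast over batch dims
    x_t = ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps
    return x_t, eps

# Usage: corrupt a batch of images to random timesteps in one shot.
x0 = torch.randn(8, 3, 32, 32)              # stand-in for a data batch
t = torch.randint(0, T, (8,))
x_t, eps = q_sample(x0, t)
```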
2.1.2. Reverse Process
The reverse process in Denoising Diffusion Probabilistic Models is conceptualized as an iterative denoising of the sequence of latent variables, transitioning from a diffused state back to the original distribution of the data. This inverse mapping is achieved through a Markov chain characterized by Gaussian transitions, where each step aims to estimate the preceding latent variable $\mathbf{x}_{t-1}$ from the current noisy state $\mathbf{x}_t$.
At the heart of the reverse process is the transition probability, modeled as a Gaussian distribution:

$$p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) = \mathcal{N}\big(\mathbf{x}_{t-1};\, \mu_\theta(\mathbf{x}_t, t),\, \Sigma_\theta(\mathbf{x}_t, t)\big),$$

where the mean $\mu_\theta(\mathbf{x}_t, t)$ is inferred from the noisy data point $\mathbf{x}_t$ and is a function of the neural network parameters $\theta$. The mean is computed as a weighted combination of the current noisy observation and the original data point, which can be expressed as

$$\tilde{\mu}_t(\mathbf{x}_t, \mathbf{x}_0) = \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\,\mathbf{x}_0 + \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\,\mathbf{x}_t.$$

However, since $\mathbf{x}_0$ depends on the entire data distribution, we approximate it using a neural network as follows:

$$\mu_\theta(\mathbf{x}_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(\mathbf{x}_t, t)\right).$$
The combination of $q$ and $p$ forms a variational auto-encoder [29], and the variational lower bound (VLB) can be expressed as

$$L_{\mathrm{VLB}} = \mathbb{E}_q\!\left[L_T + \sum_{t=2}^{T} L_{t-1} + L_0\right],$$

where $L_0 = -\log p_\theta(\mathbf{x}_0 \mid \mathbf{x}_1)$ and, for $2 \le t \le T$,

$$L_{t-1} = D_{\mathrm{KL}}\big(q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)\,\|\,p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)\big),$$

with $D_{\mathrm{KL}}$ representing the Kullback–Leibler divergences at each time step of the diffusion process. Thus, we can estimate $L_{t-1}$ using the prior (Equation (8)) and posterior (Equation (6)).
There exist multiple strategies to parameterize $\mu_\theta(\mathbf{x}_t, t)$ in the prior. The most straightforward method is to directly predict $\mu_\theta(\mathbf{x}_t, t)$ with the neural network. As an alternative, the network might estimate $\mathbf{x}_0$, utilizing this prediction within Equation (7) to calculate $\mu_\theta(\mathbf{x}_t, t)$. Furthermore, the network has the option to estimate the noise $\boldsymbol{\epsilon}$, applying both Equations (7) and (5) to compute $\mu_\theta(\mathbf{x}_t, t)$.
To train the neural network, we minimize a loss function that encapsulates a variational lower bound on the negative log-likelihood of the observed data. This involves the optimization of the following objective function:

$$\mathbb{E}\big[-\log p_\theta(\mathbf{x}_0)\big] \le \mathbb{E}_q\!\left[-\log \frac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T} \mid \mathbf{x}_0)}\right] = L_{\mathrm{VLB}},$$

which effectively measures the fidelity of the predicted noise and the accuracy of the reverse transitions.
For computational efficiency, a simplified loss function is often employed, focusing directly on the accuracy of the noise prediction:

$$L_{\mathrm{simple}} = \mathbb{E}_{t,\,\mathbf{x}_0,\,\boldsymbol{\epsilon}}\!\left[\big\|\boldsymbol{\epsilon} - \epsilon_\theta\big(\sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon},\, t\big)\big\|^2\right].$$

Here, $\sqrt{\bar{\alpha}_t}$ and $\sqrt{1-\bar{\alpha}_t}$ are time-dependent factors derived from the forward process, jointly determining the proportion of the data signal retained and the proportion of noise added at each step. The optimization of the parameters $\theta$ is performed via backpropagation and stochastic gradient descent, with the aim of enabling the model to accurately reconstruct the data distribution from its diffused latent variables.
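A minimal sketch of this simplified objective follows, reusing the `q_sample` helper and schedule from the forward-process sketch above; `model` stands in for any noise-prediction network taking `(x_t, t)`.

```python
import torch
import torch.nn.functional as F

def ddpm_simple_loss(model, x0: torch.Tensor) -> torch.Tensor:
    """L_simple: regress the injected noise eps with a plain MSE.
    Assumes q_sample and T from the forward-process sketch are in scope."""
    t = torch.randint(0, T, (x0.shape[0],))
    x_t, eps = q_sample(x0, t)      # closed-form corruption of x_0 to step t
    eps_pred = model(x_t, t)        # network predicts the noise that was added
    return F.mse_loss(eps_pred, eps)
```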
2.2. Score-Based Generative Models (SGMs)
Score-Based Generative Models (SGMs) [5,8], also known as energy-based models in some contexts, offer a framework for generating complex high-dimensional data by learning the gradient field of log probabilities, commonly referred to as the score. At the core of SGMs lies the concept of the score function [58]:

$$s(\mathbf{x}) = \nabla_{\mathbf{x}} \log p(\mathbf{x}),$$

where $s(\mathbf{x})$ denotes the score function for the data $\mathbf{x}$ under the probability density function $p(\mathbf{x})$. This function points in the direction of steepest ascent in the probability density of the data, providing a method to refine synthetic data points towards higher probability regions.
In SGMs, a neural network is trained to approximate the score function across different levels of noise perturbation. Given a data point $\mathbf{x}$ from the data distribution $p(\mathbf{x})$, we perturb it with Gaussian noise:

$$\tilde{\mathbf{x}}_t = \mathbf{x} + \sigma_t\,\boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}),$$

to obtain a sequence of noisier data representations $\{\tilde{\mathbf{x}}_t\}_{t=1}^{T}$. The neural network, parameterized by $\theta$, is then trained to estimate the score function $s_\theta(\mathbf{x}, \sigma_t)$ at each noise level $\sigma_t$. Numerous established methodologies exist for inferring score functions from datasets, commonly referred to as score estimation, including score matching [58], denoising score matching [59,60,61], and sliced score matching [62]. Any of these approaches can be used to train noise-conditional score networks on the perturbed data points.
Sampling, which plays a crucial role in generating high-quality data from the learned distribution, proceeds by progressively reducing the noise level under the guidance of the learned score functions. Here, we introduce the initial sampling method for SGMs, called Annealed Langevin Dynamics (ALD) [8]. ALD is an iterative process that effectively navigates from a high-noise state to the original data distribution. It begins with an initial sample drawn from a Gaussian distribution representing the highest noise level. Then, it iteratively refines this sample through a series of steps, each corresponding to a lower noise level. The iterative update at each noise level $t$ is governed by the following Langevin dynamics equation:

$$\mathbf{x}_t^{i} = \mathbf{x}_t^{i-1} + \frac{\epsilon_t}{2}\, s_\theta\big(\mathbf{x}_t^{i-1}, \sigma_t\big) + \sqrt{\epsilon_t}\,\mathbf{z},$$

where $\mathbf{x}_t^{i}$ is the sample at iteration $i$ and noise level $t$; $\epsilon_t$ is the step size for noise level $t$; $s_\theta$ is the score function estimated by the trained model; and $\mathbf{z}$ is a random sample drawn from a standard normal distribution $\mathcal{N}(\mathbf{0}, \mathbf{I})$, representing the stochastic nature of the process.
The score function plays a pivotal role in guiding the sampling process. It provides the direction and magnitude of the update to the current sample, aiming to reduce the difference between the sample's distribution and the true data distribution at each noise level. Essentially, the score function captures the gradient of the log probability density of the data with respect to the sample, directing the sampling towards regions of higher probability density. This guidance is crucial for effectively navigating the complex landscape of the data distribution, enabling the model to generate samples that closely resemble the original data. This process is repeated for each noise level, from the highest to the lowest, thereby annealing the noise and refining the sample towards the true data distribution. The effectiveness of ALD lies in its ability to leverage the score function at each noise level, providing a guided path for sample refinement.
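The loop structure of ALD is compact enough to sketch directly; the per-level step-size rule below (proportional to $\sigma_t^2$) is a common heuristic, and, like the function names, an illustrative assumption.

```python
import torch

def annealed_langevin_dynamics(score_model, sigmas, shape,
                               n_steps_each: int = 100, step_lr: float = 2e-5):
    """ALD sketch: anneal from the highest noise level to the lowest,
    running Langevin updates x <- x + (eps/2) * score + sqrt(eps) * z
    at each level. Assumes score_model(x, sigma) estimates the score and
    sigmas is sorted from largest to smallest."""
    x = torch.randn(shape) * sigmas[0]               # start near pure noise
    for sigma in sigmas:
        eps = step_lr * (sigma / sigmas[-1]) ** 2    # common step-size heuristic
        for _ in range(n_steps_each):
            z = torch.randn_like(x)
            x = x + 0.5 * eps * score_model(x, sigma) + eps ** 0.5 * z
    return x
```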
2.3. Stochastic Differential Equations (Score SDEs)
Stochastic Differential Equations (SDEs) [5,23] form the backbone of continuous-time generalizations of Denoising Diffusion Probabilistic Models (DDPMs) and Score-Based Generative Models (SGMs), corresponding to the limit of infinitely many time steps or noise levels. This generalization, known as Score SDE [5], leverages SDEs for both noise perturbation and sample generation, with a significant focus on estimating score functions of noisy data distributions.

The diffusion process in Score SDEs is described by a specific SDE:

$$\mathrm{d}\mathbf{x} = f(\mathbf{x}, t)\,\mathrm{d}t + g(t)\,\mathrm{d}\mathbf{w},$$

where $f(\mathbf{x}, t)$ represents the drift function and $g(t)$ the diffusion coefficient of the SDE, and $\mathbf{w}$ is a standard Wiener process (or Brownian motion). This equation encapsulates the forward processes in both DDPMs and SGMs, which are essentially discretizations of this continuous-time model.
In the context of DDPMs, the corresponding SDE, as demonstrated in Song et al. [5], takes the following form:

$$\mathrm{d}\mathbf{x} = -\tfrac{1}{2}\beta(t)\,\mathbf{x}\,\mathrm{d}t + \sqrt{\beta(t)}\,\mathrm{d}\mathbf{w},$$

where $\beta(t/T) = T\beta_t$ as $T$ approaches infinity. For SGMs, the SDE is expressed as

$$\mathrm{d}\mathbf{x} = \sqrt{\frac{\mathrm{d}\,[\sigma^2(t)]}{\mathrm{d}t}}\,\mathrm{d}\mathbf{w},$$

where $\sigma(t/T) = \sigma_t$ as $T$ tends towards infinity.
Interestingly, any diffusion process of the form of Equation (15) can be reversed, as shown by Anderson [63], by solving the reverse-time SDE:

$$\mathrm{d}\mathbf{x} = \big[f(\mathbf{x}, t) - g(t)^2\,\nabla_{\mathbf{x}} \log p_t(\mathbf{x})\big]\,\mathrm{d}t + g(t)\,\mathrm{d}\bar{\mathbf{w}},$$

where $\bar{\mathbf{w}}$ is a standard Wiener process in reverse time and $\mathrm{d}t$ denotes an infinitesimal negative time step. The marginal densities of the solution trajectories of this reverse SDE mirror those of the forward SDE but evolve in the opposite time direction [5].
Furthermore, Song et al. [5] established the existence of a Probability Flow Ordinary Differential Equation (ODE), whose trajectories share the same marginals as the reverse-time SDE. This Probability Flow ODE is formulated as follows:

$$\mathrm{d}\mathbf{x} = \left[f(\mathbf{x}, t) - \frac{1}{2}\,g(t)^2\,\nabla_{\mathbf{x}} \log p_t(\mathbf{x})\right]\mathrm{d}t.$$
Both the reverse-time SDE and the Probability Flow ODE facilitate sampling from the same data distribution, affirming their equivalence in terms of marginal distributions. These mathematical formulations not only elucidate the underpinnings of continuous-time generative models but also highlight the intricate interplay between diffusion and reverse-diffusion processes in generating complex data distributions.
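As a concrete illustration, a simple Euler–Maruyama discretization of the reverse-time SDE looks as follows; the drift `f`, diffusion `g`, and `score_model` interfaces are assumptions for the sketch, and `g(t)` is taken to be scalar so that it broadcasts over the batch.

```python
import torch

@torch.no_grad()
def reverse_sde_sample(score_model, f, g, shape, n_steps: int = 1000):
    """Euler-Maruyama sketch of the reverse-time SDE
    dx = [f(x, t) - g(t)^2 * score(x, t)] dt + g(t) dw_bar,
    integrating from t = 1 down to t = 0."""
    dt = -1.0 / n_steps                       # negative time step
    x = torch.randn(shape)                    # sample from the terminal prior
    for i in range(n_steps, 0, -1):
        t = torch.full((shape[0],), i / n_steps)
        drift = f(x, t) - g(t) ** 2 * score_model(x, t)
        z = torch.randn_like(x)
        x = x + drift * dt + g(t) * (-dt) ** 0.5 * z   # stochastic update
    return x
```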
3. Advanced Diffusion Variants
Despite the remarkable achievements of diffusion models in generating high-quality samples, the original formulations face several constraints: inefficient sampling, weaker likelihood performance compared with other generative models, and difficulty in handling data with special structures. Numerous studies have therefore been proposed to address these issues, refining sampling efficiency, improving likelihood measures, and extending applicability to a wider array of structured data. These advancements underscore the dynamic nature of research in the field of generative models, highlighting the continuous pursuit of optimized performance and broader utility.
3.1. Efficient Sampling in Diffusion Models
Denoising Diffusion Implicit Models (DDIM) [2] stand out as an innovative approach that accelerates the sampling of diffusion models. The formulation of DDIM redefines the reverse diffusion process in a non-Markovian setting, which allows for deterministic sampling when the variance $\sigma_t$ is set to zero. This is a significant departure from Denoising Diffusion Probabilistic Models (DDPMs), where the process is inherently stochastic. The key to DDIM's accelerated sampling lies in this deterministic property, which can be described mathematically as follows.

For a single step of the reverse process in DDIM, the sample at the previous time step $\mathbf{x}_{t-1}$ can be derived from the current step $\mathbf{x}_t$ using the following equation:

$$\mathbf{x}_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\left(\frac{\mathbf{x}_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta(\mathbf{x}_t, t)}{\sqrt{\bar{\alpha}_t}}\right) + \sqrt{1-\bar{\alpha}_{t-1}-\sigma_t^2}\,\epsilon_\theta(\mathbf{x}_t, t) + \sigma_t \mathbf{z},$$

where $\bar{\alpha}_t$ represents the cumulative product of $\alpha_t$, the noise schedule of the forward process, and $\epsilon_\theta(\mathbf{x}_t, t)$ is the predicted noise at step $t$. By setting $\sigma_t = 0$, the DDIM update simplifies to

$$\mathbf{x}_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\left(\frac{\mathbf{x}_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta(\mathbf{x}_t, t)}{\sqrt{\bar{\alpha}_t}}\right) + \sqrt{1-\bar{\alpha}_{t-1}}\,\epsilon_\theta(\mathbf{x}_t, t).$$

The deterministic nature of this process eliminates the need for stochastic sampling, enabling the direct computation of $\mathbf{x}_{t-1}$ without additional noise terms. This facilitates faster sampling by using a sequence of deterministic steps, essentially 'skipping' through the diffusion process.
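The deterministic update above translates directly into a short sampling loop; in this sketch, the step subsequence and the `model(x, t)` noise-prediction interface are illustrative assumptions.

```python
import torch

@torch.no_grad()
def ddim_sample(model, shape, timesteps, alpha_bars):
    """Deterministic DDIM sampling (sigma_t = 0). `timesteps` is a
    decreasing subsequence of the training steps (e.g. 50 out of 1000),
    which is what makes DDIM fast; alpha_bars holds the cumulative products."""
    x = torch.randn(shape)
    for t, t_prev in zip(timesteps[:-1], timesteps[1:]):
        eps = model(x, torch.full((shape[0],), t))
        ab_t, ab_prev = alpha_bars[t], alpha_bars[t_prev]
        x0_pred = (x - (1 - ab_t).sqrt() * eps) / ab_t.sqrt()  # predicted x_0
        x = ab_prev.sqrt() * x0_pred + (1 - ab_prev).sqrt() * eps
    return x

# Usage (illustrative): 50 evenly spaced steps from T-1 down to 0.
# steps = list(range(999, -1, -20)) + [0]
# sample = ddim_sample(model, (8, 3, 32, 32), steps, alpha_bars)
```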
Watson et al. [24] made significant strides in the optimization of sampling procedures for Denoising Diffusion Probabilistic Models (DDPMs). They proposed an optimization strategy that leverages dynamic programming to effectively reduce the number of refinement steps needed in the DDPM sampling process. Recognizing the decomposition of the DDPM objective into individual KL divergence terms, their method identifies the optimal discretization scheme, thereby striking a balance between computational cost and model performance. This dynamic programming solution is particularly novel as it does not require additional hyperparameters or model retraining, making it an immediately applicable technique for enhancing the performance of pre-trained DDPMs.
Adversarial Diffusion Distillation (ADD) [25] offers a novel training framework that significantly enhances the sampling efficiency of foundational image diffusion models. This methodology incorporates score distillation from a pre-trained diffusion model, termed the teacher, to a student model that requires only one to four steps to generate images of high quality. Complementing the distillation, an adversarial loss implemented with a well-crafted discriminator aims to ensure the fidelity of the synthesized images. The student samples, once processed by the teacher's forward diffusion, become the targets for the distillation loss, as delineated in Section 3.3. The overarching objective combines the adversarial loss with the distillation loss, formalized as

$$\mathcal{L} = \mathcal{L}_{\mathrm{adv}} + \lambda\,\mathcal{L}_{\mathrm{distill}},$$

where $\lambda$ balances the two terms.
While applicable in pixel space, the method seamlessly adapts to Latent Diffusion Models (LDMs) [10] operating in a shared latent space between teacher and student, allowing the choice of computing the distillation loss in pixel or latent space. The method yields more stable gradients and superior results when applied to latent diffusion models.
3.2. Improved Likelihood in Diffusion Models
The training objective for diffusion models is framed as a variational lower bound (VLB) on the negative log-likelihood (Equation (9)). Nevertheless, this bound can frequently be loose, which may result in diffusion models yielding log-likelihoods that are not fully optimized.
Improved Denoising Diffusion Probabilistic Models (IDDPM) [1] represent a significant evolution in the generative modeling landscape by refining diffusion processes. IDDPM innovates upon the standard DDPM practice of using fixed variances, $\Sigma_\theta(\mathbf{x}_t, t) = \sigma_t^2\,\mathbf{I}$, with $\sigma_t^2$ being a constant, by introducing a learnable variance mechanism. This mechanism is mathematically formulated as

$$\Sigma_\theta(\mathbf{x}_t, t) = \exp\big(v \log \beta_t + (1 - v) \log \tilde{\beta}_t\big),$$

where $v$ denotes a vector that is trained to adapt the model's noise level dynamically, thus aiming to optimize the log-likelihood of the generated samples, and $\tilde{\beta}_t$ is the variance of the forward-process posterior.
In addition, IDDPM replaces the linear noise schedule with a cosine noise schedule for a more nuanced noise reduction that preserves information integrity throughout the diffusion process:

$$\bar{\alpha}_t = \frac{h(t)}{h(0)}, \qquad h(t) = \cos^2\!\left(\frac{t/T + s}{1 + s} \cdot \frac{\pi}{2}\right).$$

This cosine schedule is particularly effective in maintaining image fidelity in low-step regimes, where traditional schedules may falter.
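A compact sketch of this schedule in PyTorch is shown below; the offset $s = 0.008$ and the clipping of $\beta_t$ follow the values reported by Nichol and Dhariwal, while the function name is ours.

```python
import math
import torch

def cosine_schedule(T: int, s: float = 0.008):
    """Cosine schedule sketch: alpha_bar_t = h(t)/h(0) with
    h(t) = cos^2(((t/T + s) / (1 + s)) * pi/2), then betas recovered from
    the ratio of consecutive alpha_bars and clipped as in the paper."""
    t = torch.arange(T + 1) / T
    h = torch.cos((t + s) / (1 + s) * math.pi / 2) ** 2
    alpha_bars = h / h[0]
    betas = (1 - alpha_bars[1:] / alpha_bars[:-1]).clamp(max=0.999)
    return alpha_bars[1:], betas
```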
The primary objective of IDDPM is to enhance the likelihood of traditional DDPMs, facilitating a more accurate representation of data distributions. By leveraging advanced variance learning and noise scheduling techniques, IDDPM significantly improves generative performance and efficiency.
The work of Song et al. [26] demonstrated that score-based diffusion models, which synthesize samples by inverting a stochastic diffusion process, can be optimized more effectively by focusing on the likelihood of the data. Traditionally trained by minimizing a weighted combination of score matching losses, these models have not directly optimized log-likelihood. Song et al. [26] reveal that with a special weighting function, the training objective can upper bound the negative log-likelihood, enabling approximate maximum likelihood training of score-based diffusion models. This is formalized as follows:

$$D_{\mathrm{KL}}\big(p_0 \,\|\, p_\theta\big) \le \mathcal{J}_{\mathrm{SM}}\big(\theta;\, g(\cdot)^2\big) + D_{\mathrm{KL}}\big(p_T \,\|\, \pi\big),$$

where $\mathcal{J}_{\mathrm{SM}}$ is the score matching objective weighted by $g(\cdot)^2$ and the $D_{\mathrm{KL}}$ terms represent the Kullback–Leibler divergences at the beginning and end of the diffusion process, respectively. Empirical evidence shows that this methodology consistently enhances the likelihoods of diffusion models, indicating an improved learning of the data distribution.
3.3. Handling Data with Special Structures
Effective deployment of diffusion models often requires consideration of the unique structures inherent in critical data domains. To address such complexities, modifications to the diffusion models are necessary.
Diffusion models are inherently designed for continuous data, employing Gaussian noise perturbations. This poses challenges for discrete data applications. A novel solution, VQ-Diffusion [27], addresses this by introducing a random walk in the discrete data space, replacing Gaussian noise with a transition kernel suited for discrete structures. The transition kernel for the forward diffusion process is given by

$$q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathbf{v}^{\top}(\mathbf{x}_t)\,\mathbf{Q}_t\,\mathbf{v}(\mathbf{x}_{t-1}),$$

where $\mathbf{v}(\mathbf{x})$ is a one-hot vector representing the discrete data state and $\mathbf{Q}_t$ is the transition matrix of a lazy random walk at step $t$. This approach adapts the diffusion process to discrete data domains, enabling the generation of discrete data samples while respecting the intrinsic structure of the data.
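To illustrate what a lazy random walk over discrete states looks like in practice, here is a toy forward step in which each token stays put with probability $1-\beta_t$ and is otherwise resampled uniformly; this uniform-resampling kernel is a simplified stand-in for VQ-Diffusion's full transition matrix, and all names are illustrative.

```python
import torch

def discrete_forward_step(x: torch.Tensor, beta_t: float, K: int) -> torch.Tensor:
    """One lazy-random-walk step over K categories: each integer token in x
    is kept with probability 1 - beta_t and jumps to a uniform category
    otherwise (a simplified stand-in for the full transition matrix Q_t)."""
    jump = torch.rand(x.shape) < beta_t
    uniform = torch.randint(0, K, x.shape)
    return torch.where(jump, uniform, x)

# Usage: corrupt a batch of token sequences by one step.
tokens = torch.randint(0, 512, (4, 16))   # e.g. K = 512 codebook entries
noisy = discrete_forward_step(tokens, beta_t=0.05, K=512)
```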
For data characterized by invariant structures, numerous works have endowed diffusion models with the capability to interpret this type of data. For example, Xu et al. [28] introduced a pivotal approach that leverages Markov chains with an invariant prior, coupled with equivariant Markov kernels, to ensure the generated data respect the intrinsic invariance properties. Specifically, for a transformation $T$ representing a rotation or translation, they establish that

$$p(\mathbf{x}_T) = p\big(T(\mathbf{x}_T)\big), \qquad p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) = p_\theta\big(T(\mathbf{x}_{t-1}) \mid T(\mathbf{x}_t)\big),$$

guaranteeing the invariance of the sample distribution to $T$ such that

$$p_\theta(\mathbf{x}_0) = p_\theta\big(T(\mathbf{x}_0)\big).$$

This framework empowers diffusion models to generate data, like molecular conformations, that remain invariant under rotations and translations, ensuring the physical and chemical validity of the synthesized structures regardless of their spatial orientation.
Building upon the manifold hypothesis, which posits that natural data predominantly occupy manifolds with significantly lower intrinsic dimensionality, recent advancements have focused on leveraging learned manifolds for diffusion model training. This methodology involves using autoencoders to reduce the data to a manageable latent space, thus enabling diffusion models to operate more efficiently due to the reduced data complexity.
The Latent Diffusion Model (LDM) [10] and DALLE-2 [41] serve as prominent examples of this approach. LDM segregates the process into two distinct phases: initially employing an autoencoder for dimensionality reduction, then training a diffusion model to generate latent codes. Similarly, DALLE-2 trains a diffusion model on the CLIP image embedding space, with a subsequent phase dedicated to decoding these embeddings back into images. This strategy underscores the efficiency and potential of training diffusion models on learned manifolds, optimizing the generative process within a simplified data domain.
4. Relation to Other Generative Models
Common generative models include GANs [9], VAEs [29,30], and flow-based models [31,32,33,34]. A generative model typically operates by taking a series of random noise as input and transforming it through a probabilistic model to produce data with specific semantic information, such as images or text. Additionally, algorithms that integrate diffusion models with other generative models are summarized in Table 1.
4.1. Variational Autoencoders (VAEs)
Variational Autoencoders (VAEs) [29,30] are generative models that offer a probabilistic description of observations in latent space. Essentially, VAEs represent latent attributes as probability distributions. A typical autoencoder has two complementary networks: an encoder and a decoder. The encoder processes the input into a compressed representation, which the decoder then uses to reconstruct the original input. VAEs benefit from a continuous latent space, facilitating random sampling and interpolation. The encoder does not output an encoding vector directly but instead produces two equally sized vectors: one of means $\boldsymbol{\mu}$ and another of standard deviations $\boldsymbol{\sigma}$. Each hidden node is modeled as a Gaussian distribution, and sampling from these output vectors to feed into the decoder constitutes stochastic generation:

$$\mathbf{z} = \boldsymbol{\mu} + \boldsymbol{\sigma} \odot \boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}).$$

This means that, even with constant means and standard deviations, the encoding derived from the same input can differ across multiple forward passes due to the stochastic nature of the sampling process.
The training process involves minimizing the reconstruction loss, which reflects the similarity between the output and the input, and the latent loss, which measures how closely the hidden nodes adhere to a standard normal distribution. The latent loss is given by the Kullback–Leibler divergence:

$$D_{\mathrm{KL}}\big(\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\sigma}^2) \,\|\, \mathcal{N}(\mathbf{0}, \mathbf{I})\big) = -\frac{1}{2}\sum_{j}\left(1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2\right).$$
A balance must be struck between the latent loss, which when minimized reduces the amount of information that can be encoded, and the reconstruction loss. When latent loss is low, the generated images may bear excessive resemblance to the training images, affecting quality negatively. Conversely, with minimal reconstruction loss, while training reconstructions are accurate, newly generated images may vary substantially, necessitating the discovery of an optimal balance.
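The two-term objective and the sampling trick are easy to express in code; the following PyTorch sketch uses the closed-form KL for a diagonal Gaussian, and the `beta` weight that makes the trade-off explicit is an illustrative knob, not from the text.

```python
import torch
import torch.nn.functional as F

def reparameterize(mu: torch.Tensor, log_var: torch.Tensor) -> torch.Tensor:
    """z = mu + sigma * eps keeps sampling differentiable w.r.t. the encoder."""
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * log_var) * eps

def vae_loss(x, x_recon, mu, log_var, beta: float = 1.0) -> torch.Tensor:
    """Reconstruction loss plus the closed-form KL divergence between
    N(mu, sigma^2) and N(0, I); beta weights the trade-off discussed above."""
    recon = F.mse_loss(x_recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + beta * kl
```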
One of the main shortcomings of VAEs is the tendency to produce blurred outputs, a result of the method of data distribution recovery and loss function computation. Additionally, they are susceptible to mode collapse, a phenomenon where the model generates a limited variety of samples, thus failing to capture the full diversity of the data distribution. Furthermore, the KL divergence term in VAEs, integral for regularizing the latent space, can lead to an oversimplified latent representation. This simplification restricts the model’s capacity to encode complex data variations, potentially undermining the richness of the generated outputs.
The Denoising Diffusion Probabilistic Model (DDPM) can be regarded as a Markovian Variational Autoencoder (VAE) with a predefined encoding strategy. In essence, the forward process of DDPM acts as the encoder, conforming to a linear Gaussian model. This encoding sequence is mathematically formalized by Equation (5) within the original literature. Conversely, the reverse process of DDPM is analogous to the decoder, which operates iteratively across several decoding phases. Notably, all latent variables in the decoder's architecture are dimensionally equivalent to the input data, maintaining a consistent scale throughout the model's generative process. Compared to VAEs, diffusion models exhibit superior performance in both the quality and diversity of generated samples. However, this advantage comes at the cost of increased computational requirements and longer sampling durations, as diffusion models operate iteratively over multiple time steps. Additionally, the architecture of diffusion models is inherently more complex than that of VAEs, necessitating extensive fine-tuning to optimize performance.
To elaborate further, the strengths of VAEs lie in their architectural simplicity and efficiency in training, making them suitable for a wide range of applications with limited computational resources. VAEs are particularly adept at learning compact latent representations, facilitating tasks such as anomaly detection and latent space arithmetic. However, VAEs often struggle with producing high-resolution images and may exhibit a trade-off between sample quality and diversity due to the imposed KL divergence constraint in their objective function.
On the other hand, diffusion models, while computationally intensive, are celebrated for their ability to generate highly detailed and diverse samples, outperforming many generative models in tasks requiring high fidelity and variation. Their iterative refinement process allows for the generation of samples that are often indistinguishable from real data. Nevertheless, the iterative nature of diffusion models leads to longer generation times, making real-time applications challenging. Moreover, their complexity requires careful tuning of hyperparameters and a deep understanding of the underlying stochastic processes to achieve optimal results.
4.2. Generative Adversarial Networks (GANs)
Generative Adversarial Networks (GANs) [9] are a class of machine learning frameworks where two neural networks contest with each other in a game-theoretic setting. The system comprises a generator $G$ that creates samples from noise and a discriminator $D$ that evaluates them against real data.
The training of GANs is an adversarial process: starting from synthetic data, the generator and discriminator engage in a continuous dynamic of computation and backpropagation, aiming to produce synthetic outputs that are indistinguishable from authentic data. The generator G synthesizes fake data from a noise signal through a series of computations, while the discriminator D assesses both real and fake data against its judgment criteria.
During training, the generator strives to fabricate increasingly realistic data, whereas the discriminator endeavors to differentiate between real and fake samples. This process unfolds iteratively until the generator fabricates data of such verisimilitude that the discriminator can no longer reliably label it as fake. The overall training objective is formulated as a minimax game in which $D$ is maximized while $G$ is minimized, as per the following function:

$$\min_G \max_D V(D, G) = \mathbb{E}_{\mathbf{x} \sim p_{\mathrm{data}}(\mathbf{x})}\big[\log D(\mathbf{x})\big] + \mathbb{E}_{\mathbf{z} \sim p_{\mathbf{z}}(\mathbf{z})}\big[\log\big(1 - D(G(\mathbf{z}))\big)\big].$$

In this equation, $\mathbf{x}$ represents real data, $\mathbf{z}$ is a point sampled from the noise distribution $p_{\mathbf{z}}(\mathbf{z})$, and $G(\mathbf{z})$ denotes the fake data generated by $G$. The expectations are taken over the real data distribution $p_{\mathrm{data}}(\mathbf{x})$ and the noise distribution $p_{\mathbf{z}}(\mathbf{z})$, respectively. The discriminator $D$ aims to assign the correct label to both real and fake data, while the generator $G$ aims to produce data that are indistinguishable by $D$, culminating in a Nash equilibrium for the adversarial game.
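One alternating optimization step of this minimax game can be sketched as follows; the generator update uses the common non-saturating variant (maximizing $\log D(G(\mathbf{z}))$ rather than minimizing $\log(1 - D(G(\mathbf{z})))$), and all interfaces are illustrative.

```python
import torch
import torch.nn.functional as F

def gan_step(G, D, opt_G, opt_D, x_real, z_dim: int = 100):
    """One alternating update of the GAN minimax game; D outputs logits."""
    z = torch.randn(x_real.shape[0], z_dim)
    x_fake = G(z)

    # Discriminator: push real logits toward 1 and fake logits toward 0.
    real_logits, fake_logits = D(x_real), D(x_fake.detach())
    d_loss = (F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
              + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits)))
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()

    # Generator (non-saturating variant): push D(G(z)) toward 1.
    gen_logits = D(x_fake)
    g_loss = F.binary_cross_entropy_with_logits(gen_logits, torch.ones_like(gen_logits))
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()
    return d_loss.item(), g_loss.item()
```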
Generative Adversarial Networks (GANs) manifest a critical limitation wherein an overly proficient discriminator yields near-binary feedback, severely curtailing the generator’s gradient-based learning; conversely, an adept generator can exploit weaknesses in the discriminator, leading to gradient saturation. This necessitates a delicate balance in training dynamics to maintain parity in performance between the two networks, contributing to the widely acknowledged training challenges associated with GANs. However, GANs offer significant advantages over diffusion models in certain aspects. Notably, GANs provide precise control over the positioning and boundaries of objects within generated images, a crucial attribute for creative content generation tasks such as photo editing and patching, as well as for data augmentation in discriminative learning. Another key advantage of GAN models relative to diffusion models is their rapid inference speed. GANs require only a single forward pass to generate an image, whereas diffusion models necessitate multiple iterative denoising steps, leading to slower inference speeds that may impact the practical usability of the models.
In stark contrast, diffusion models benefit from a more transparent training loss function that facilitates model convergence and mitigates the mode collapse issue inherent to GANs. Moreover, diffusion models are adept at capturing the intricacies of complex, non-linear distributions that extend beyond the capabilities of GANs, which excel at generating homogeneous data, such as singular-class imagery. GANs often struggle with the multifaceted distributions present in diverse-class image datasets, a domain where diffusion models demonstrate superior modeling prowess, capturing a broader spectrum of intricate image distributions, as Dhariwal et al. [6] have demonstrated. Despite these advantages, the distinct strengths of GANs, particularly in terms of object placement precision and rapid image generation, underscore the importance of selecting the appropriate generative model framework based on the specific requirements and goals of the task at hand.
4.3. Flow-Based Generative Models
Flow-based generative models [31,32,33,34] are conceptually attractive due to the tractability of exact log-likelihood computation, the tractability of exact latent-variable inference, and the parallelizability of both training and synthesis. A flow-based generative model is an exact log-likelihood model: it applies a series of invertible transformations to samples from a prior so that the exact log-likelihood of an observation can be computed. Unlike the previous two model families, the flow-based model directly optimizes the log-likelihood, so the loss function is simply the negative log-likelihood. The flow model $f$ is constructed as an invertible transform that maps a high-dimensional random variable $\mathbf{x}$ to a standard Gaussian latent variable $\mathbf{z}$. This model can be represented by an arbitrary bijective function and can be formed by composing individual simple invertible transforms.
Assuming a model with a generator $G$, let $\mathbf{x}$ denote a data sample, $p_G(\mathbf{x})$ the distribution of $\mathbf{x}$ as predicted by $G$, and $p_{\mathrm{data}}(\mathbf{x})$ the true distribution of the data. The objective in adjusting the generator is to bring $p_G(\mathbf{x})$ as close as possible to $p_{\mathrm{data}}(\mathbf{x})$, that is, to minimize the divergence between the two distributions. In the context of the model, $\mathbf{x}^{(i)}$ represents a sample drawn from $p_{\mathrm{data}}(\mathbf{x})$. Therefore, solving for the generator $G$ is akin to maximum likelihood estimation, which is equivalent to maximizing the probability of observing each sampled data point, or equivalently, minimizing the Kullback–Leibler (KL) divergence between the two distributions. This is formally expressed as

$$G^{*} = \arg\max_{G} \sum_{i=1}^{m} \log p_G\big(\mathbf{x}^{(i)}\big) \approx \arg\min_{G} D_{\mathrm{KL}}\big(p_{\mathrm{data}} \,\|\, p_G\big).$$

Here, the KL divergence measures how one probability distribution diverges from a second, expected probability distribution.
In flow-based generative models, the objective function is crafted to facilitate the computation of complex probability densities through a series of invertible transformations. For a given data sample $\mathbf{x}^{(i)}$, the log-likelihood under the model's distribution $p_G$ is given by the transformation of $\mathbf{x}^{(i)}$ through the inverse of the generative model $G$ and the log-determinant of the Jacobian matrix of $G^{-1}$. The objective function can be expressed as

$$\log p_G\big(\mathbf{x}^{(i)}\big) = \log \pi\big(G^{-1}(\mathbf{x}^{(i)})\big) + \log\left|\det\!\left(\frac{\partial G^{-1}(\mathbf{x}^{(i)})}{\partial \mathbf{x}^{(i)}}\right)\right|,$$

where $\pi$ is the prior probability distribution in the latent space and $\partial G^{-1}/\partial \mathbf{x}$ is the Jacobian matrix of the inverse of $G$.
Maximizing this log-likelihood is equivalent to minimizing the Kullback–Leibler divergence between the data distribution $p_{\mathrm{data}}$ and the model's distribution $p_G$. However, computing the determinant of the Jacobian matrix can be highly time-consuming, especially for large matrices. Additionally, calculating the inverse of the generator requires that the dimensions of the input and output be identical. This necessitates an ingenious network architecture design that simplifies the computation of both the determinant of the Jacobian matrix and the inverse of the generator.
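As a worked toy example of the change-of-variables formula above, consider a single invertible elementwise affine flow; the parameters `s` and `b` and the standard-normal prior are illustrative assumptions.

```python
import math
import torch

s = torch.tensor(0.5)   # log-scale parameter of the affine flow x = exp(s) * z + b
b = torch.tensor(1.0)   # shift parameter

def log_likelihood(x: torch.Tensor) -> torch.Tensor:
    """Exact log p_G(x) = log pi(G^{-1}(x)) + log |det dG^{-1}/dx|
    for the elementwise affine flow; here dG^{-1}/dx = exp(-s)."""
    z = (x - b) * torch.exp(-s)                           # inverse transform G^{-1}
    log_prior = -0.5 * (z ** 2 + math.log(2 * math.pi))   # standard-normal density
    log_det = -s                                          # log |exp(-s)| = -s
    return log_prior + log_det

print(log_likelihood(torch.tensor([0.0, 1.0, 2.0])))      # exact per-element values
```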
In practical flow-based models, the architecture may involve multiple transformations. Since it is cumbersome to impose various constraints on a single transformation, these constraints are distributed across multiple mappings $G_k$. The transformations are then concatenated, and the overall objective function becomes

$$\log p_G\big(\mathbf{x}^{(i)}\big) = \log \pi\big(\mathbf{z}^{(i)}\big) + \sum_{k=1}^{K} \log\left|\det\!\left(\frac{\partial G_k^{-1}}{\partial \mathbf{z}_k}\right)\right|,$$

where each $G_k$ is an individual transformation within the sequence and $K$ is the total number of transformations. This formulation allows for the modular design of the network and eases the imposition of constraints across the multiple mappings.
The invertible nature of the transformations used in flow-based models ensures that sampling can be performed efficiently and that no information about the input data is lost, a significant advantage over models like VAEs and GANs. However, the requirement for invertibility also introduces limitations. It often leads to increased computational complexity, in terms of both memory and processing power, which can make flow-based models less scalable to very high-dimensional data or very complex distributions. In practice, the time required for flow-based models to generate images of a given resolution can be several times greater than that of diffusion models. In addition, designing transformations that are both invertible and expressive enough to capture complex data distributions is challenging, and the requirement for invertibility restricts the choice of neural network architectures that can be used. In summary, flow-based generative models offer unique advantages in terms of exact likelihood computation and efficient, invertible sampling, but their practical application is often limited by computational demands, the challenge of designing expressive yet invertible transformations, and architectural constraints.
6. Future Trends
Research into diffusion models stands at the precipice of exciting developments, with theoretical and practical advancements yet to be explored. As these models establish themselves as a powerful generative framework rivaling adversarial networks without the need for adversarial training, the focus turns to deepening our understanding of their operational efficacy across varied applications. Crucial to this endeavor is unraveling the characteristics that distinguish diffusion models from other generative mechanisms, such as variational autoencoders and flow-based models. This differentiation could illuminate their ability to generate high-quality samples while achieving competitive likelihoods.
Moreover, improving latent space representations remains an area ripe for innovation. Current diffusion models lack the efficacy of their generative counterparts in providing manipulable semantic data representations, often mirroring the data space’s dimensionality and thus impacting sampling efficiency.
The advent of Artificial-Intelligence-Generated Content (AIGC) and the proliferation of Diffusion Foundation Models signify a paradigm shift towards models that are pre-trained to produce content that resonates with human perception. The generative pre-training techniques utilized in models such as GPT [45] and Visual ChatGPT [82], which have shown emergent abilities and surprising generation performance, hold promise for the future of diffusion models. By integrating these techniques, diffusion models could potentially be transformed to perform generatively at scale, offering new avenues for AIGC and beyond.
Recent advancements in diffusion models have markedly improved the generation of text-to-video content, overcoming previous limitations in quality, resolution, and synthesis control. Gen-1 [21] has set a new standard by utilizing text-conditioned image models like DALL-E2 and Stable Diffusion to produce videos with unprecedented detail from text prompts. Notably, it incorporates monocular depth estimates and embeddings from pre-trained neural networks for enhanced structural and content authenticity. Following this, Gen-2 has further pushed the boundaries by introducing the capability to generate videos from scratch, supporting 4K ultra-realistic definitions and showcasing the potential of diffusion models in video synthesis.
The introduction of the Sora model, utilizing a Diffusion transformer (DiT) architecture, represents a leap forward, generating videos at 1920 × 1080 resolution for up to a minute, surpassing previous models in both quality and length. These developments signal a transformative era in video generation from text, hinting at a future where creating complex, high-quality video content from simple text descriptions could become commonplace. However, the journey does not end here. The field is ripe for exploration, especially in modeling long-term temporal dynamics and interactions, to further extend video duration and realism. Future research could unlock even more sophisticated applications, potentially revolutionizing storytelling, education, and entertainment by making detailed, lifelike video generation accessible to everyone.
With systematic exploration and innovative thinking, the trajectory of diffusion models is set to redefine the boundaries of generative AI, edging closer to the multifaceted goal of artificial general intelligence.
7. Conclusions
In this review, we have navigated the evolving landscape of diffusion models, elucidating the distinctions and synergies among three foundational formulations: DDPMs, SGMs, and Score SDEs. These models have showcased their prowess in image generation, setting a benchmark for quality and diversity. However, it is crucial to acknowledge that original diffusion models possess certain limitations, such as slower sampling speeds, often requiring thousands of evaluation steps to generate a single sample. They also lag in maximum likelihood estimation when compared with likelihood-based models and exhibit limited generalizability across various data types. Despite these challenges, recent strides have been made from practical and theoretical standpoints to mitigate these constraints, enhancing their applicability and analytical robustness.
Yang et al. [83] enhanced diffusion models by streamlining the architecture for greater efficiency, significantly lowering computational costs and making these models viable for limited-resource settings. They achieve this through reducing latent space dimensionality and employing a layer-pruning algorithm to maintain generative quality with fewer resources. Furthermore, they improve the diffusion process with an adaptive noise conditioning technique, tailoring the number of steps to the data complexity for effective image synthesis. Lee et al. [84] tackle spectral inconsistencies in diffusion model outputs with a spectrum translation method grounded in contrastive learning. This approach effectively aligns the frequency components of generated images with those of real images, thereby elevating the visual quality and enabling the generation of complex textures and details with improved fidelity.
Through a comparative lens, we have identified avenues for integrating diffusion models with other generative frameworks, suggesting a path towards enriched generative capabilities. Our exploration across four distinct domains underscores the versatility of diffusion models and their potential to revolutionize not only image generation but also extend into multimodal and interdisciplinary applications. While recognizing the innate constraints of diffusion models, such as computational intensity and the complexity of architecture that currently hinder their scalability and broad application, it is important to note that substantial efforts have been dedicated to overcoming these barriers. The horizon for future research within diffusion models is vast and vibrant, with many opportunities to refine their efficiency and expand their utility. The promise of diffusion models to advance generative AI tasks is undeniable, beckoning a concerted effort to unearth their full potential and chart new territories in generative modeling. With ongoing research and development, we anticipate diffusion models to not only evolve in terms of their theoretical foundations but also in their practical executions, marking a new epoch in the field of generative AI.