Article

Illumination and Shadows in Head Rotation: Experiments with Denoising Diffusion Models

Department of Informatics-Science and Engineering (DISI), University of Bologna, 40126 Bologna, Italy
* Author to whom correspondence should be addressed.
Electronics 2024, 13(15), 3091; https://doi.org/10.3390/electronics13153091
Submission received: 22 June 2024 / Revised: 28 July 2024 / Accepted: 31 July 2024 / Published: 5 August 2024
(This article belongs to the Special Issue Generative AI and Its Transformative Potential)

Abstract

Accurately modeling the effects of illumination and shadows during head rotation is critical in computer vision for enhancing image realism and reducing artifacts. This study delves into the latent space of denoising diffusion models to identify compelling trajectories that can express continuous head rotation under varying lighting conditions. A key contribution of our work is the generation of additional labels from the CelebA dataset, categorizing images into three groups based on prevalent illumination direction: left, center, and right. These labels play a crucial role in our approach, enabling more precise manipulations and improved handling of lighting variations. Leveraging a recent embedding technique for Denoising Diffusion Implicit Models (DDIM), our method achieves noteworthy manipulations, encompassing a wide rotation angle of ±30°, while preserving the individual's distinct characteristics even under challenging illumination conditions. Our methodology involves computing trajectories that approximate clouds of latent representations of dataset samples with different yaw rotations through linear regression. Specific trajectories are obtained by analyzing subsets of data that share significant attributes with the source image, including light direction. Notably, our approach does not require any specific training of the generative model for the task of rotation; we merely compute and follow specific trajectories in the latent space of a pre-trained face generation model. This article showcases the potential of our approach and its current limitations through a qualitative discussion of notable examples. This study contributes to the ongoing advancements in representation learning and the semantic investigation of the latent space of generative models.

1. Introduction

The possibility of manipulating images by acting on their latent representation, which is typical of generative models, has always exerted a particular fascination on researchers. Understanding the effect a tiny modification has on the encoding of a generated sample helps us to better understand the properties of the latent space and the disentanglement of the different features. This is strictly related to editing, since understanding semantically meaningful directions (such as color, pose, and shape) can be exploited to modify an image to include certain desired features.
The field of deep generative modeling has recently witnessed a significant shift with the emergence of Denoising Diffusion Models (DDM) [1], which are rapidly establishing themselves as the new state-of-the-art technology [2,3]. These models are likely poised to surpass the long-standing Generative Adversarial Networks (GANs) [4] by providing an excellent generative quality with high sample diversity, simple and stable training, and a solid probabilistic foundation. They have achieved impressive results in a wide range of diverse domains comprising, e.g., medical imaging [5], healthcare [6], protein synthesis [7], and weather forecasting [8].
While DDMs have shown remarkable capabilities in generating realistic samples, the exploration of latent space and the manipulation of generated samples to edit specific attributes remain a complex task. This is partly due to the high dimensionality of latent space, which poses challenges in navigating and understanding the underlying semantics, but also to the complexity of embedding data into the latent space, i.e., computing the internal encoding of a given sample. In the case of GANs, which have been the most explored so far, most of the known techniques for semantic editing [9,10,11,12] are in fact based on the preliminary definition of a “recoder” [13,14,15], inverting the generative process and essentially providing a functionality similar to encoders for Variational Autoencoders [16,17].
The embedding problem for the particular but important case of Denoising Diffusion Implicit Models [18] has been recently investigated in [19]. A crucial difference in the latent space of denoising models is that it appears to be organized as a foliation, with a different slice for each data point. These slices correspond to the set of all noisy points in the space that will collapse onto the given data point during the denoising process. Slices are typically very large, occupying significant portions of the input space. As a result, the embedding problem is inherently multimodal and underconstrained. Embedding techniques, such as the one described in [19], typically select a point in the slice based on criteria that are difficult to control and decipher. Consequently, there is no evidence that we can organize the latent points extracted from the embedding network along meaningful trajectories. This is precisely the problem we aim to address in this work.
Unlike many works in the literature that focus on one-step modifications of the input (e.g., changing a color or adding or removing elements), we are interested in continuous modifications of the input image. The case of head rotation is particularly appealing for our study for several reasons. Firstly, face generation is a well-investigated domain, and the rotation problem is recognized as one of the most complex and intriguing editing operations. The challenge with head rotation is that it requires preserving the distinctive features of the person while applying significant transformations that cannot be defined in terms of texture, color, shapes, or other similar information associated with segmentation areas. Another significant point for considering rotation is the availability of good open-source libraries that can automatically measure the pose of the head. These can be used both to guide and to test the effectiveness of the operation.
By employing the DDIM embedding technique, we have been able to achieve remarkable manipulations in head orientation, spanning a large rotation angle of ±30° along the yaw direction. Some examples are given in Figure 1.
This seems to testify to the fact that compelling trajectories can be defined in the latent space of diffusion models in spite of the intrinsically multimodal nature of the embedding function.
Our methodology involves utilizing a pre-trained generative latent model for face generation and computing, in its latent space, trajectories composed of rectilinear segments, thus simulating the rotation effect. The direction of each segment is computed by linear regression, fitting clouds of latent representations of dataset samples with varying yaw rotations. Each segment is then translated to the correct and known source location. To obtain trajectories tailored to a specific source image, we restrict the analysis to subsets of data sharing significant attributes with it; this is usually sufficient to ensure that the essential characteristics of the face are preserved throughout the manipulation process. We tested several attributes, and the most significant ones appear to be gender, expression (smiling/not smiling), age (young/old), and illumination source (left/center/right). This last attribute is not a traditional attribute of the CelebA dataset; we created such labeling over recent years through the collaboration of many students, following a methodology briefly described in Section 5.
The analysis and comparison of the attributes, highlighting the importance of considering the source of illumination to achieve good rotation effects, is the main contribution of our work in the specific domain of face editing.
In this article, we present our methodology and showcase some preliminary, experimental results of the manipulations performed using the DDIM embedding technique. We do not yet have a quantitative evaluation of our work, due to the difficulty in identifying proper metrics; this is left as a subject for further investigation. Nevertheless, our findings demonstrate the potential of DDMs in enabling intricate editing operations while maintaining the fidelity of generated samples. The insights gained from this research contribute to advancing the field of deep generative modeling and provide a valuable foundation for future developments in latent space exploration and attribute manipulation.
The article is structured in the following way. In Section 2 we discuss related works and clarify the scope of our research, which aims to understand the dynamics of head movement in the latent space of Diffusion Models. Section 3 briefly presents the theory of this class of generative models; this section does not contain original material and can be skipped by readers knowledgeable in the area. In Section 4, we discuss the architecture of the neural models used for our work. Section 5 introduces the CelebA dataset, its attributes, and our original labeling relative to the illumination source. In Section 6, we explain our methodology. Preprocessing operations (cropping and background removal) and postprocessing ones (super-resolution and color correction) are discussed in Section 7. Section 8 analyzes the slopes of the trajectories obtained with different attribute selections. Numerous examples are given in Section 9 and in Appendix A. Finally, concluding remarks and ideas for future research directions are given in Section 10.

2. Related Works

The task of head rotation holds significant importance in computer vision, finding extensive applications in various domains like security, entertainment, and healthcare.
Before the rise of deep learning, facial rotation methods primarily revolved around applying the traits of an input face image onto a 3D face model and then rotating it to create the desired rotated version. Examples of this approach can be found in [20,21]. In [22], the rotation problem was tackled using a 3D transformation matrix, which mapped each point from a 2D face image to its corresponding point on a 3D face model. Although these techniques could generate rotated face images, they were constrained by distortion and blurring issues that arose during the conversion of 2D images into 3D models.
The progress of deep learning has significantly expedited advances in facial rotation techniques, especially those leveraging generative adversarial networks (GANs). A typical application is face frontalization, aiming to improve face recognition accuracy by synthesizing a frontal face image from a side-view facial image.
Popular techniques in this category include DR-GAN [23], TP-GAN [24], CAPG-GAN [25], and FNM [26]. DR-GAN isolates the input image’s features and angle to generate a frontal image, whereas TP-GAN separately learns the overall outline features and detailed features to synthesize the frontal face. CAPG-GAN utilizes a heat map to frontalize an input face and FNM leverages both labeled and unlabeled data to improve learning efficiency.
All these methods face challenges in producing convincing results for input images taken from near-side angles or angles that are not from frontal views.
Several 3D geometry-based approaches have been devised to tackle head rotation challenges by combining traditional techniques with GANs. Relevant methods in this domain include FF-GAN [27], UV-GAN [28], HF-PIM [29], and Rotate-and-Render [30]. These techniques leverage the strengths of both 3D modeling and GANs to achieve more realistic and accurate rotations, overcoming some limitations of purely 2D or GAN-only approaches.
In contrast to reconstruction-based techniques, 3D geometry-based methods produce more realistic results for side-facing images. However, the need to handle detailed geometrical data, perform extensive rendering calculations, and integrate multiple complex processes makes 3D geometry-based methods more resource-intensive compared to other generative techniques.
Neural Radiance Fields (NeRF) [31] is an advanced method for representing intricate 3D scenes by means of neural networks. NeRF models the radiance and volume density of a scene as a continuous function. This function is parameterized by a neural network that receives a 3D coordinate and a viewing direction as inputs. The scene’s appearance is rendered by integrating the radiance along each camera ray. In FENeRF [32], the authors utilize NeRF to forecast a 3D representation of a given face with a particular rotation. This representation can then be further manipulated to edit the facial attributes. All these approaches are substantially different from our work, since they aim to train conditional models, taking into account geometric or textural information constraining the generation. In our case, we simply start from an unconstrained generative model, already containing in its latent space the source and target image, and try to identify the path leading from the source to the target. The purpose of this research is to investigate the structure of the latent space of generative models to better understand the learned representation and the properties of the encodings.
Several works have gone in this direction in the case of GANs, aiming to manipulate and govern the attributes of generated faces through a latent space-based approach. These techniques enable control over various attributes, including the age, eyeglasses, gender, expression, and rotation angles of the synthesized faces. Different methods have been developed, including PCA analysis to extract important latent directions [10], semantic analysis to control various attributes [9], and composing a new latent vector to control multiple attributes [33]. The most recent research has mostly focused on text-guided image editing [34,35], frequently exploiting segmentation masks to drive generation [36].
All these methods address rotation as a single-shot operation, failing to provide evidence of a smooth and continuous modification of the source along a given trajectory.
Similar work has been done in the case of Variational Autoencoders (VAEs). In this context, due to the Gaussian-like shape of the latent space induced by the Kullback–Leibler regularization, more principled approaches to the computation of trajectories can be considered, for instance geodesic paths [37,38,39]. In the case of DDIM, the source space is indeed also Gaussian, but this is just the initial noisy space, which rapidly collapses, after a few iterations of the denoising process, towards the actual manifold of the data. So, there is no evidence that following a geodesic path would be beneficial, and our investigation seems to suggest that this is not the case (see Section 8).
In the specific case of Diffusion Models, there are several recent investigations on text-guided generation [40,41,42,43], but we are aware of no work focused on trajectories for continuous transformations such as the ones addressed by our research.

3. Denoising Diffusion Models

This section provides a fairly self-contained theoretical introduction to diffusion models. It has been added for the sake of completeness and to introduce the terminology. It does not contain original material and can be skipped by readers already knowledgeable in the domain. We refer the reader to the excellent textbooks [44,45] for additional information.

3.1. Diffusion and Reverse Diffusion

We assume data distributed according to some probability distribution, $x_0 \sim q(x_0)$. We then consider a forward process which gradually adds noise to the data, producing noised samples $x_1, \dots, x_T$, for some time horizon $T > 0$. Specifically, the diffusion model $q(x_{0:T})$ is supposed to be a Markov chain with the following shape:
$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t \,\middle|\, \sqrt{\tfrac{\alpha_t}{\alpha_{t-1}}}\, x_{t-1};\ \left(1 - \tfrac{\alpha_t}{\alpha_{t-1}}\right) \cdot I\right)$$
with $\{\alpha_t\}_{t \in [0,T]}$ being a decreasing sequence in the interval $[0,1]$.
Considering the fact that the composition of Gaussian distributions is still Gaussian, in order to sample $x_t \sim q(x_t \mid x_0)$ we do not need to go through an iterative process. If we define $\alpha_t = 1 - \beta_t$ and $\bar\alpha_t = \prod_{s=0}^{t} \alpha_s$, then
$$q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar\alpha_t}\, x_0,\ (1-\bar\alpha_t)\, I\right), \qquad x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$$
for $\epsilon \sim \mathcal{N}(0, I)$. In these equations, $1-\bar\alpha_t$ is the variance of the noise at an arbitrary time step $t$, and $\bar\alpha_t$ could equivalently be used instead of $\beta_t$ to define the schedule of the noising process.
The idea behind denoising generative models is to reverse the above process, addressing the distribution $q(x_{t-1} \mid x_t)$. If we know how to sample from $q(x_{t-1} \mid x_t)$, then we can generate a sample starting from a Gaussian noise input $x_T \sim \mathcal{N}(0, I)$. In general, the distribution $q(x_{t-1} \mid x_t)$ cannot be expressed in closed form and it will be approximated using a neural network. In [46] it was observed that $q(x_{t-1} \mid x_t)$ approaches a diagonal Gaussian distribution when $T$ is large and $\beta_t \to 0$, so in order to learn the distribution it suffices to train a neural network predicting the mean $\mu_\theta$ and the diagonal covariance matrix $\Sigma_\theta$:
$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)$$
The whole reverse process is thus
$$p_\theta(x_{0:T}) = p_\theta(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t)$$
where $p_\theta(x_T) = \mathcal{N}(0, I)$.
For training, we can use a variational lower bound on the negative log likelihood:
$$\begin{aligned} -\log p_\theta(x_0) &\le -\log p_\theta(x_0) + D_{\mathrm{KL}}\!\left(q(x_{1:T} \mid x_0)\,\|\,p_\theta(x_{1:T} \mid x_0)\right) \\ &= -\log p_\theta(x_0) + \mathbb{E}_{x_{1:T} \sim q(x_{1:T} \mid x_0)}\!\left[\log \frac{q(x_{1:T} \mid x_0)}{p_\theta(x_{0:T}) / p_\theta(x_0)}\right] \\ &= -\log p_\theta(x_0) + \mathbb{E}_q\!\left[\log \frac{q(x_{1:T} \mid x_0)}{p_\theta(x_{0:T})} + \log p_\theta(x_0)\right] \\ &= \mathbb{E}_q\!\left[\log \frac{q(x_{1:T} \mid x_0)}{p_\theta(x_{0:T})}\right] \\ &= \mathbb{E}_q\!\left[-\log p(x_T) - \sum_{t \ge 1} \log \frac{p_\theta(x_{t-1} \mid x_t)}{q(x_t \mid x_{t-1})}\right] = L(\theta) \end{aligned}$$
This can be further refined by expressing $L(\theta)$ as the sum of the following terms [46]:
$$L(\theta) = L_T + L_{T-1} + \dots + L_0$$
where
$$\begin{aligned} L_T &= D_{\mathrm{KL}}\!\left(q(x_T \mid x_0)\,\|\,p_\theta(x_T)\right) \\ L_t &= D_{\mathrm{KL}}\!\left(q(x_t \mid x_{t+1}, x_0)\,\|\,p_\theta(x_t \mid x_{t+1})\right) \quad \text{for } 1 \le t \le T-1 \\ L_0 &= -\log p_\theta(x_0 \mid x_1) \end{aligned}$$
The advantage of this formulation is that the forward process posterior $q(x_{t-1} \mid x_t, x_0)$ becomes tractable when conditioned on $x_0$ and assumes a Gaussian distribution:
$$q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\!\left(x_{t-1} \mid \tilde\mu(x_t, x_0);\ \tilde\beta_t I\right)$$
where
$$\tilde\mu_t(x_t, x_0) = \frac{\sqrt{\alpha_t}\,(1-\bar\alpha_{t-1})}{1-\bar\alpha_t}\, x_t + \frac{\sqrt{\bar\alpha_{t-1}}\,\beta_t}{1-\bar\alpha_t}\, x_0$$
and
$$\tilde\beta_t = \frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t}\cdot\beta_t.$$
As a consequence, the KL divergences in Equation (3) are comparisons between Gaussians, and they can be calculated in a Rao–Blackwellized fashion with closed-form expressions.
After a few manipulations, we get
$$L(\theta) = \sum_{t=1}^{T} \gamma_t\, \mathbb{E}_{q(x_t \mid x_0)}\!\left\|\mu_\theta(x_t, \alpha_t) - \tilde\mu(x_t, x_0)\right\|_2^2,$$
which is just a weighted mean squared error between the mean predicted by $p_\theta(x_{t-1} \mid x_t)$ and the true mean of the reverse diffusion posterior $q(x_{t-1} \mid x_t, x_0)$ for each time step $t$.
In [18], a slightly different approach is used, based on predicting the noise $\epsilon_\theta(x_t, t)$ contained in a given image $x_t$ instead of directly denoising it.
Recall that the purpose of the trained network is to approximate the conditional probability distributions of the reverse diffusion process:
$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)$$
Our goal is to train the network to predict $\tilde\mu$ of Equation (5). Since $x_0 = \frac{1}{\sqrt{\bar\alpha_t}}\left(x_t - \sqrt{1-\bar\alpha_t}\,\epsilon_t\right)$, we have
$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon_\theta(x_t, t)\right)$$
and
$$x_{t-1} \sim \mathcal{N}\!\left(x_{t-1};\ \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon_\theta(x_t, t)\right),\ \Sigma_\theta(x_t, t)\right)$$
The network can simply be trained to minimize the quadratic distance between the actual and the predicted noise. Ignoring weighting terms, which appear to be irrelevant if not harmful in practice, the loss is
$$L_t^{\mathrm{simple}} = \mathbb{E}_{t \sim [1,T],\, x_0,\, \epsilon_t}\!\left[\left\|\epsilon_t - \epsilon_\theta(x_t, t)\right\|^2\right]$$
$L_t^{\mathrm{simple}}$ does not give any learning signal for $\Sigma_\theta(x_t, t)$. In [18], the authors preferred to fix it to a constant, testing both $\beta_t I$ and $\tilde\beta_t I$, with no appreciable difference between the two alternatives.

3.2. Pseudocode

With the above setting, the algorithms for training and sampling are very simple. The network $\epsilon_\theta(x_t, t)$ takes as input a noisy image $x_t$ and a time step $t$, and is supposed to return the noise contained in the image. Suppose we have a given noise scheduling $(\alpha_T, \dots, \alpha_1)$. We can train the network in a supervised way by sampling a true image $x_0$, creating a noisy version of it $x_t = \sqrt{\alpha_t}\, x_0 + \sqrt{1-\alpha_t}\,\epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$, and instructing the network to guess $\epsilon$. Note that we only have a single network, which is parametric in the time step $t$ (or, equivalently, in $\alpha_t$). The procedure is schematically described in Algorithm 1.
Algorithm 1: Training
1: Fix a noise scheduling $(\alpha_T, \dots, \alpha_1)$
2: repeat
3:   $x_0 \sim P_{\mathrm{DATA}}$
4:   $t \sim \mathrm{Uniform}(1, \dots, T)$
5:   $\epsilon \sim \mathcal{N}(0, I)$
6:   $x_t = \sqrt{\alpha_t}\, x_0 + \sqrt{1-\alpha_t}\,\epsilon$
7:   Take a gradient descent step on $\|\epsilon - \epsilon_\theta(x_t, \alpha_t)\|^2$
8: until converged
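For concreteness, here is a minimal PyTorch-style sketch of a single training iteration of Algorithm 1. The denoiser `eps_model`, the optimizer, and the tensor `alpha_bar` of cumulative signal rates are assumed to be defined elsewhere; names and shapes are illustrative, not the actual code of our repository.

```python
import torch

def training_step(eps_model, optimizer, x0, alpha_bar):
    """One supervised denoising step (Algorithm 1).
    Assumes eps_model(x_t, t) predicts the noise added to x_t and
    alpha_bar is a 1D tensor of cumulative signal rates."""
    b = x0.shape[0]
    t = torch.randint(0, len(alpha_bar), (b,), device=x0.device)   # random time steps
    a = alpha_bar.to(x0.device)[t].view(b, 1, 1, 1)                # signal rate per sample
    eps = torch.randn_like(x0)                                     # target noise
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps                     # noisy image
    loss = ((eps - eps_model(x_t, t)) ** 2).mean()                 # simple MSE loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```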
Generative sampling is performed through an iterative loop: we start with a purely noisy image $x_T$ and progressively remove noise by means of the denoising network (see Algorithm 2). The denoised version of the image at time step $t$ is obtained using Equation (9).
Algorithm 2: Sampling
1: $x_T \sim \mathcal{N}(0, I)$
2: for $t = T, \dots, 1$ do
3:   $z \sim \mathcal{N}(0, I)$ if $t > 1$, else $z = 0$
4:   $x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon_\theta(x_t, t)\right) + \sigma_t z$
5: end for
Several improvements can be made to this technique.
An important primary point concerns the noise scheduling $\{\alpha_t\}_{t=1}^{T}$. In [1], the authors used linear or quadratic schedules. These typically result in a very steep decrease during the initial time steps, which can be problematic for generation. To address this issue, alternative scheduling functions incorporating a more gradual decrease, such as the “cosine” or “continuous cosine” schedule, have been proposed in the literature [47]. The precise choice of the scheduling function does not seem to matter much, provided it shows a nearly linear behaviour in the middle of the generative process and smoother changes around the beginning and the end of the schedule.
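As an illustration, here is a minimal sketch of the cosine schedule of [47]; the offset $s = 0.008$ and the clipping of the betas are the standard choices from that work, not parameters of our model.

```python
import numpy as np

def cosine_alpha_bar(T, s=0.008):
    """Cumulative signal rates alpha_bar(t) = f(t)/f(0),
    with f(t) = cos(((t/T + s)/(1 + s)) * pi/2)^2 (Nichol & Dhariwal)."""
    t = np.arange(T + 1)
    f = np.cos(((t / T + s) / (1 + s)) * np.pi / 2) ** 2
    alpha_bar = f / f[0]
    # Per-step betas, clipped to avoid singularities near t = T.
    betas = np.clip(1 - alpha_bar[1:] / alpha_bar[:-1], 0.0, 0.999)
    return alpha_bar[1:], betas
```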
Another major issue concerns the speedup of the sampling process, which in the original approach required up to one or a few thousand steps. Since the generative model approximates the reverse of the inference process, in order to reduce the number of iterations required by the generative model, it can be worth rethinking the inference process itself. This investigation motivated the definition of Denoising Diffusion Implicit Models, explained in the following section.

3.3. Denoising Diffusion Implicit Models

Denoising Diffusion Implicit Models (DDIMs) [18] are a variation of the previous approach that exploits a non-Markovian noising process with the same forward marginals as DDPM, but allows for better tuning of the variance of the reverse noise.
We start by defining $q(x_{t-1} \mid x_t, x_0)$ parametrically with respect to a desired standard deviation $\sigma_t$:
$$\begin{aligned} x_{t-1} &= \sqrt{\bar\alpha_{t-1}}\, x_0 + \sqrt{1-\bar\alpha_{t-1}}\,\epsilon_{t-1} \\ &= \sqrt{\bar\alpha_{t-1}}\, x_0 + \sqrt{1-\bar\alpha_{t-1}-\sigma_t^2}\,\epsilon_t + \sigma_t\,\epsilon \\ &= \sqrt{\bar\alpha_{t-1}}\, x_0 + \sqrt{1-\bar\alpha_{t-1}-\sigma_t^2}\;\frac{x_t - \sqrt{\bar\alpha_t}\, x_0}{\sqrt{1-\bar\alpha_t}} + \sigma_t\,\epsilon \end{aligned}$$
So,
$$q_\sigma(x_{t-1} \mid x_t, x_0) = \mathcal{N}\!\left(x_{t-1};\ \mu_{\sigma_t}(x_0, \alpha_{t-1}),\ \sigma_t^2 I\right)$$
with
$$\mu_{\sigma_t}(x_0, \alpha_{t-1}) = \sqrt{\bar\alpha_{t-1}}\, x_0 + \sqrt{1-\bar\alpha_{t-1}-\sigma_t^2}\;\frac{x_t - \sqrt{\bar\alpha_t}\, x_0}{\sqrt{1-\bar\alpha_t}}$$
According to this approach, the forward process is no longer Markovian, but depends both on the starting point $x_0$ and on $x_{t-1}$. However, it can easily be proved that the marginal distribution $q_\sigma(x_t \mid x_0) = \mathcal{N}\!\left(x_t \mid \sqrt{\bar\alpha_t}\, x_0;\ (1-\bar\alpha_t) \cdot I\right)$ recovers the same marginals as in DDPM. As a result, $x_t$ can be diffused from $x_0$ and $\alpha_t$ by generating a realization of normally distributed noise $\epsilon_t \sim \mathcal{N}(\epsilon_t \mid 0; I)$.
We can set $\sigma_t^2 = \eta \cdot \tilde\beta_t$, where $\eta$ is a control parameter that can be used to tune the sampling stochasticity. In the special case $\eta = 0$, the sampling process becomes completely deterministic.
The sampling procedure of DDIM is slightly different from that of DDPM. In order to sample $x_{t-1}$ according to Equation (14), we need $x_0$, which is obviously unknown at generation time. However, since we are guessing the amount of noise $\epsilon_\theta(x_t, t)$ in $x_t$ at each time step, we can compute a denoised observation $\tilde x_0$, which is a prediction of $x_0$ given $x_t$:
$$\tilde x_0 = \frac{x_t - \sqrt{1-\bar\alpha_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\bar\alpha_t}}$$
The full generative procedure is summarized in the pseudocode of Algorithm 3:
Algorithm 3: Sampling (DDIM)
1: $x_T \sim \mathcal{N}(0, I)$
2: for $t = T, \dots, 1$ do
3:   $\epsilon = \epsilon_\theta(x_t, \bar\alpha_t)$
4:   $\tilde x_0 = \frac{1}{\sqrt{\bar\alpha_t}}\left(x_t - \sqrt{1-\bar\alpha_t}\,\epsilon\right)$
5:   $x_{t-1} = \sqrt{\bar\alpha_{t-1}}\,\tilde x_0 + \sqrt{1-\bar\alpha_{t-1}}\,\epsilon$
6: end for
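A minimal PyTorch-style sketch of the deterministic sampler of Algorithm 3 (i.e., $\eta = 0$); `eps_model` and `alpha_bar` are the same illustrative objects assumed in the training sketch of Section 3.2.

```python
import torch

@torch.no_grad()
def ddim_sample(eps_model, alpha_bar, shape, device="cpu"):
    """Deterministic DDIM sampling: at each step predict the noise,
    estimate x0, and re-diffuse the estimate to the previous noise level."""
    x = torch.randn(shape, device=device)                      # x_T ~ N(0, I)
    alpha_bar = alpha_bar.to(device)
    for t in reversed(range(len(alpha_bar))):
        a_t = alpha_bar[t]
        a_prev = alpha_bar[t - 1] if t > 0 else alpha_bar.new_tensor(1.0)
        t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)
        eps = eps_model(x, t_batch)                             # predicted noise
        x0_hat = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()      # predicted clean image
        x = a_prev.sqrt() * x0_hat + (1 - a_prev).sqrt() * eps  # step to t-1
    return x
```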
An interesting aspect of DDIM, frequently exploited in the literature, is that due to its deterministic nature it also defines an implicit latent space, which opens the way for interesting operations such as latent space interpolation or the exploration of meaningful trajectories for editing purposes. The latent space can be obtained by integrating an ODE in the forward direction and then reversing the process to get the latent encodings that produce a given real image [4]. In [19], it was shown that a deep neural network can also be trained to perform this embedding operation, considerably reducing its cost. We shall provide details on the embedding network in Section 4.1.

4. Model Architecture

As made clear in Section 3, the main component of a diffusion model is a denoising network, which takes as input the noise variance (expressed through $\bar\alpha_t$) and an image $x_t$ corrupted with the corresponding amount of noise, and tries to guess the actual noise $\epsilon_\theta(x_t, \bar\alpha_t)$ present in the image. Starting from an image $x_0$ of the data distribution, we can generate a random noise $\epsilon \sim \mathcal{N}(0, I)$ and produce a corrupted version $x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$. Then, the network is trained to minimize the distance between the actual noise $\epsilon$ and the predicted noise $\epsilon_\theta(x_t, \bar\alpha_t)$:
$$\mathit{Loss} = \left\|\epsilon - \epsilon_\theta(x_t, \bar\alpha_t)\right\|^2$$
The architecture of this network is traditionally based on the U-Net [48], a well-known convolutional neural network originally introduced for image segmentation tasks.
The U-net (see Figure 2) features an encoder–decoder structure, incorporating skip connections between layers of the encoder and decoder with corresponding spatial dimensions. We worked on images with an initial resolution of 64 × 64 [19].
The input relative to the noise variance $\alpha_t$ is typically embedded using sinusoidal position embeddings; this information is then vectorized and concatenated with the image features. The detailed structure of the various modules is given in Figure 3.
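For reference, here is a minimal sketch of such a sinusoidal embedding of the noise level; the dimensionality and frequency range are illustrative choices, not the exact values used in our network.

```python
import math
import torch

def sinusoidal_embedding(noise_level, dim=64, min_freq=1.0, max_freq=1000.0):
    """Embed a batch of scalar noise levels (e.g. alpha_bar_t) into `dim`-dimensional
    vectors of sines and cosines at geometrically spaced frequencies."""
    freqs = torch.exp(torch.linspace(math.log(min_freq), math.log(max_freq), dim // 2))
    angles = noise_level.view(-1, 1) * freqs.view(1, -1) * 2.0 * math.pi
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)  # (batch, dim)
```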

4.1. The Embedding Network

In DDIM, the sampling process is deterministic given the initial noise $x_T$, so it is natural to try to address the reverse problem, computing $x_T$ from $x_0$. The operation is not obvious, however, since the problem is clearly under-constrained, and many different points in the latent space may generate the same output.
The embedding problem for denoising models can be addressed in several different ways: by gradient ascent; by integrating the ordinary differential equation (ODE) defining the forward direction and then running the process in reverse to get the latent representations; or by directly training a network to approximate the embedding task [19]. The advantage of the latter approach is that, once training is completed, the computation of the latent encoding is particularly fast.
The embedding network $\mathrm{Emb}(x)$ takes an image as input and tries to compute its embedding. It is trained in a completely supervised way: given some noise $x_T$, we generate a sample $x_0$ and train the network to synthesize $x_T$ from $x_0$, using as the loss the distance between $x_T$ and $\mathrm{Emb}(x_0)$. Modern neural computation environments enable us to effortlessly backpropagate gradients through the iterative loop of the reverse diffusion process.
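A minimal sketch of the supervised scheme just described, under the assumption that `ddim_generate(eps_model, alpha_bar, x_T)` is a variant of the sampler of Section 3.3 that starts from a given noise $x_T$; `emb_net` is a hypothetical U-Net-like module. This illustrates the idea, not the exact training recipe of [19].

```python
import torch

def embedder_step(emb_net, eps_model, alpha_bar, optimizer, batch_size=16,
                  img_shape=(3, 64, 64), device="cpu"):
    """Train Emb to recover the initial noise x_T from the image x_0 it generates."""
    x_T = torch.randn(batch_size, *img_shape, device=device)
    with torch.no_grad():                               # the generator is frozen
        x_0 = ddim_generate(eps_model, alpha_bar, x_T)  # assumed DDIM generation from x_T
    loss = ((emb_net(x_0) - x_T) ** 2).mean()           # distance between Emb(x_0) and x_T
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```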
Many different models for the embedding network were compared in [19]; the best results were obtained with a U-Net essentially identical to the denoising network. The big difference with respect to generation is that the latent representation is computed in a single pass through the network. In the case of CelebA, reconstruction has an MSE of around 0.0012, which is definitely good.

5. CelebFaces Attributes Dataset

The CelebFaces Attributes dataset (CelebA) [49] consists of over 200,000 celebrity images, with a rich set of annotations. In the past, it has been widely used in the fields of computer vision and machine learning for tasks like attribute prediction, facial recognition, and generation. Each image is equipped with 40 binary attributes, covering characteristics like gender, age, hair color, and so on. Additionally, bounding box annotations for the faces are provided.
CelebA exhibits a wide diversity in image quality, resolution, and sources, capturing a broad spectrum of ethnicities, age groups, and genders. To facilitate the development of deep learning models, an aligned version of the dataset is offered, where faces are centered in a common coordinate system, ensuring consistent size and orientation.
CelebAMask-HQ [50] is a recent derivative of CelebA. It contains 30,000 high-resolution images, manually annotated with segmentation masks relative to 19 different facial components and accessories. This dataset serves as a valuable resource for training and evaluating face parsing, recognition, generation, and editing models.

5.1. Analysis of the Dataset

Since a generative model is designed to model the distribution of data, it is always good practice to begin the study of a model with an analysis of the underlying dataset. Any bias in the data, such as an imbalance between different classes, will likely be reflected in the model, potentially leading to unexpected generative behaviors. For example, in the CelebA dataset, there is a noticeable gender imbalance, with a majority of female images over male images. Consequently, when editing faces, this bias can cause the model to more frequently transform male faces into female ones.
We compared and selected the relevant attributes for our investigation using the methodology described in Section 8. This approach aimed to assess the impact of each attribute on the direction of the trajectory. Ultimately, we focused on a subset of attributes that includes gender, age, smiling, and “mouth slightly open”.
In this section, we investigate the distribution of these attributes in CelebA; results are summarized in Figure 4.
With respect to some of the attributes, the dataset is highly unbalanced: approximately 58% of the images in the dataset depict females and around 77% feature young people. This imbalance could adversely affect the generative process, particularly when dealing with male and older people as compared to female and younger ones.
In the following sections, we provide further information relative to face orientation and illumination source not covered by the CelebA attributes.

5.2. Face Orientation

To guide the DDIM generation process in producing faces with different orientations, information about head orientation from the CelebA dataset is needed. This information is not included in the standard annotations of the CelebA-aligned dataset. However, Head Pose Estimation is a well-investigated topic [51], and a large number of libraries are available for this purpose. In its usual formulation, the task includes expressing a person’s head orientation in a three-dimensional space by calculating three rotation angles called yaw, pitch, and roll. Specifically, yaw denotes the rotation around the vertical axis, pitch is the rotation around the horizontal axis, and roll is the rotation around an axis perpendicular to the other two (see Figure 5a).
For the straightforward task of recognizing the orientation of nearly frontal faces, as is the case with the majority of faces in CelebA, there are open-source libraries that perform excellently. Specifically, we used the cv2.solvePnP() function from the OpenCV library [52], employing the same technique described in [53], which is readily accessible in the public repository. In the same repository, we also provide direct access to pre-computed angles for all images in the CelebA dataset.
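As an illustration of this standard technique, here is a hedged sketch of yaw/pitch/roll estimation with cv2.solvePnP. The generic 3D reference points, the rough camera intrinsics, and the assumption that six 2D landmarks are already available from a landmark detector are illustrative choices, not the exact configuration used to compute our annotations; the mapping of the recovered angles to yaw/pitch/roll also depends on the chosen axis convention.

```python
import cv2
import numpy as np

# Generic 3D reference points of a face model:
# nose tip, chin, left/right eye corner, left/right mouth corner.
MODEL_3D = np.array([(0.0, 0.0, 0.0), (0.0, -330.0, -65.0),
                     (-225.0, 170.0, -135.0), (225.0, 170.0, -135.0),
                     (-150.0, -150.0, -125.0), (150.0, -150.0, -125.0)])

def head_pose(landmarks_2d, img_w, img_h):
    """Estimate (yaw, pitch, roll) in degrees from six 2D landmarks
    given in the same order as MODEL_3D."""
    focal = img_w                                   # rough focal length approximation
    camera = np.array([[focal, 0, img_w / 2],
                       [0, focal, img_h / 2],
                       [0, 0, 1]], dtype=np.float64)
    _, rvec, _ = cv2.solvePnP(MODEL_3D, landmarks_2d.astype(np.float64),
                              camera, np.zeros((4, 1)))   # no lens distortion
    R, _ = cv2.Rodrigues(rvec)                      # rotation vector -> rotation matrix
    sy = np.sqrt(R[0, 0] ** 2 + R[1, 0] ** 2)
    pitch = np.degrees(np.arctan2(R[2, 1], R[2, 2]))
    yaw = np.degrees(np.arctan2(-R[2, 0], sy))
    roll = np.degrees(np.arctan2(R[1, 0], R[0, 0]))
    return yaw, pitch, roll
```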
An example of the kind of obtainable annotations is given in Figure 5.
More interestingly, we can examine the distribution of CelebA images with respect to orientation, particularly yaw, as it represents the most significant rotation in the dataset: see Figure 6. CelebA is an aligned dataset: as expected, over 40% of the images have yaw within the [−10°, +10°] range. Moreover, only 4.48% of the images have yaw outside the [−40°, +40°] range. The limited number of images with high yaw values restricts the generative power of the model. Typically, rotations need to be confined within a region of the data with statistical significance, such as yaw in the [−30°, +30°] range.

5.3. Light Direction Analysis

When rotating a face, it is crucial to preserve the right shadows produced by the lighting conditions. Unfortunately, no attributes are available relative to the source of illumination in the case of CelebA, and to our knowledge there is no open source software able to correctly identify lighting directions in an automatic way.
An important byproduct of our work is the provision of labels for CelebA, categorizing images into three major groups based on their main source of illumination: left, center, and right. The labeling process was carried out in a semi-supervised way over the last few years with the collaboration of many students. The basic procedure involved manually annotating a large portion of the data, developing and training classification models, cross-validating the data using different models, manually revising the classifications, and repeating the process until no further critical issues emerged.
Nevertheless, this labeling process has proven to be a valuable tool for our methodology, and we hope it can serve as a significant asset for further investigations into face processing tasks. The labeling can be freely accessed through the code in the GitHub repository. We also provide pre-computed yaw, pitch, and roll angles for each CelebA image.
In Figure 7, we summarize the outcome of our labeling and the complex interplay between illumination and orientation by showing the mean faces corresponding to different light sources and poses.
In the picture, we also show the variance in each class (mean of the variances of the pixels) and the squared Euclidean distance (MSE) between class centroids. We can observe the following points: (1) the different provenance of the light is still clearly recognizable in the mean faces, implicitly testifying to the quality of our labeling; (2) from the point of view of the position and shape of the shadows over the face, the illumination and pose are strictly interconnected; and (3) the variance of each class is an order of magnitude larger than the distance between their centers, hinting at the complexity of the classification problem.
The second point is particularly important for our work. Investigations into the most relevant variables in the latent representation of images, including faces, have revealed that much of the information is conveyed by variables that explain macroattributes of the source image, such as colors and intensities of large regions (e.g., light/dark backgrounds, light/dark hair) [54]. The intensity and positions of dark/light regions on a face are strongly influenced by the source of illumination. Therefore, it is natural to expect that this information significantly impacts the latent encoding, and indeed, as is testified by this work, it does.

6. Methodology

The problem we address is that of finding trajectories in the latent space corresponding to left/right rotations of the head.
Our starting point is a large dataset of head images enriched with information related to the rotation of the head and additional attributes such as lighting source, gender, age, and expression (smiling/not smiling). The selection of these attributes is discussed in Section 8. Images are preprocessed to remove the background as described in Section 7.
We also assume that there is a pre-trained generative model for the above dataset, along with an embedding tool capable of mapping an arbitrary sample of the dataset to its internal representation in the latent space of the generative model.
The other input is the image of the head to be rotated; let us call it $X$. Let $\Theta_X$ be its current rotation and $Z_X$ its latent representation. This image can be one of the images in the dataset or a completely new one. In the latter case, its current rotation and its other attributes must be pre-computed and passed to the rotation model.
The methodology has the following steps:
  • Filtering. We restrict the investigation to a subset of the dataset sharing the selected attributes with $X$. So, if $X$ is a young, smiling, blond man with frontal illumination, we shall restrict the analysis to images sharing the same attributes.
  • Clustering. Starting from $\Theta_X$, we create clusters of images with a rotation around $\Theta_X + \Delta$ for increasing values of $\Delta$, encompassing an overall rotation of ±30°.
  • Embedding and centroids computation. We embed the clusters in the latent space and compute their centroids. Each centroid conceptually corresponds to the latent representation of a “generic” person with the given attributes and rotations.
  • Rotation trajectories. We define rotation trajectories by fitting lines through the centroids using linear regression. We experimented with different spline segmentations, but ultimately obtained the best results by splitting the problem into two directions: one for the right rotation and one for the left rotation.
  • Re-sourcing. The final step involves applying the trajectory vector corresponding to the rotation starting from $Z_X$. We sample points along this trajectory and generate the corresponding images in the visible space.
In order to improve the quality of the final image we post-process it for super resolution and color correction.
For a fixed rotation movement (left or right), the approach is schematically described in Figure 8. Our attempts to split the rotation into a larger number of linear steps have been hindered by the progressive loss of the key facial characteristics of the source image.
The clustering phase is not an essential part of the algorithm, since we could directly apply regression on the cloud of embedded points. We compute centroids mostly for debugging purposes to visualize and compare “generic” faces for a given set of attributes and rotations.
The number of angles relative to the centroids and their intervals can easily be customized by the user. Using a step size that is too small typically reduces the number of images retrieved from the dataset that match that specific orientation, thereby diminishing the statistical significance of the cluster. We usually work with a step size of 10°.
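To make the pipeline concrete, the following sketch covers the clustering, centroid, and regression steps under simplifying assumptions: `latents` are the embedded dataset images already filtered by the shared attributes, `yaws` their pre-computed yaw angles, and the 10° step is the default mentioned above. Names are illustrative, not the actual code of our repository.

```python
import numpy as np

def rotation_direction(latents, yaws, theta_src, delta_max=40, step=10):
    """Fit a line (by least squares) through the centroids of latent clusters with
    increasing yaw and return the unit direction of the rotation trajectory."""
    centroids, angles = [], []
    for d in range(0, delta_max + 1, step):
        lo, hi = theta_src + d - step / 2, theta_src + d + step / 2
        sel = (yaws >= lo) & (yaws < hi)
        if sel.sum() > 0:
            centroids.append(latents[sel].reshape(sel.sum(), -1).mean(axis=0))
            angles.append(theta_src + d)
    C = np.stack(centroids)                 # (n_clusters, latent_dim)
    a = np.array(angles, dtype=np.float64)
    slope = np.polyfit(a, C, 1)[0]          # latent displacement per degree of yaw
    return slope / np.linalg.norm(slope)

# Re-sourcing sketch: translate the direction to the latent code of the source image
# and sample points along it:  z_k = z_src + k * direction.
```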
In the next sections, we provide additional details on some of the main steps of our methodology.

Filtering CelebA Images

In our first attempts, we selected images from the dataset using only the rotation. However, it is important to select images that have at least a rough similarity with the source image we want to act on. To this aim, we use a basic set of attributes comprising lighting source, gender, age, and expression (smiling/not smiling). In Figure 9 we show the mean faces of CelebA data relative to different configurations of some of these attributes.
In order to obtain a sufficiently representative number of candidates, we typically enlarge the dataset via a flipping operation, consistently inverting the relevant attributes (yaw and lighting source). For example, if we are looking for images with a yaw of +30° and a light direction of “RIGHT”, we can also take into account images with a yaw of −30° and a light direction of “LEFT”, provided we flip them. We aim to retrieve sets composed of at least 1K images, starting from a relatively narrow tolerance interval $[\Theta - \Delta, \Theta + \Delta]$ around the desired angle $\Theta$ and possibly enlarging $\Delta$ if required.
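A hedged pandas-style sketch of this retrieval-with-flipping step; the column names (`yaw`, `light`) and thresholds are illustrative, not the actual schema of our annotation files.

```python
import pandas as pd

FLIP_LIGHT = {"LEFT": "RIGHT", "RIGHT": "LEFT", "CENTER": "CENTER"}

def retrieve_candidates(df, target_yaw, light, delta=5, min_images=1000, max_delta=20):
    """Select images whose yaw falls in [target_yaw - delta, target_yaw + delta] and whose
    light direction matches; mirrored images (yaw and light inverted) are also eligible.
    The tolerance is progressively enlarged until enough candidates are found."""
    while True:
        direct = df[df.yaw.between(target_yaw - delta, target_yaw + delta) &
                    (df.light == light)]
        flipped = df[df.yaw.between(-target_yaw - delta, -target_yaw + delta) &
                     (df.light == FLIP_LIGHT[light])]
        if len(direct) + len(flipped) >= min_images or delta >= max_delta:
            return direct, flipped      # images in `flipped` must be mirrored before use
        delta += 5                      # enlarge the tolerance interval if required
```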
The relevance of exploiting the attributes is exemplified in Figure 10, where we compare rotations obtained selecting clouds of images according to rotations (first row), with the case where we refine the selection with relevant attributes (second row).
In Section 8, we provide a more technical comparison of the different trajectories in terms of their cosine similarity. We also used this metric as a way to select the most relevant attributes. There is a delicate balance between the specificity provided by attributes and the statistical relevance of the images retrieved from the dataset, which is essential for the regression phase. More details can be found in [55].

7. Preprocessing and Postprocessing

The deployment of the previous technique requires a few preprocessing and postprocessing steps, discussed in this section. Preprocessing is aimed at preparing inputs in a format suitable for the DDIM embedder, while post-processing is devoted to enhancing the quality of the result.

7.1. Preprocessing

In this article, we restrict the input to aligned CelebA images. We were able to generalize the approach to an arbitrary image provided by the user, as we did in [53], but the scientific added value is negligible.
Since the input image is already aligned, we work with a central crop of dimension 128 × 128, which is frequently used in the literature [56], and then resize it to a dimension of 64 × 64.
The main step of the preprocessing phase is background removal, since we experimentally observed that this operation facilitates rotation. To this aim, we trained a U-Net model on the CelebAMask-HQ dataset, which includes high-quality, manually annotated face masks. All masks were combined and treated as a binary segmentation problem, focusing on background/foreground separation (see Figure 11).
This approach allowed us to obtain a fairly precise segmentation of the facial region, with a precision of 96.78% and a recall of 97.60%.
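A minimal sketch of how the resulting binary mask can be applied at inference time; `seg_model` stands for the trained U-Net, and the 0.5 threshold and the constant background fill are illustrative choices.

```python
import torch

@torch.no_grad()
def remove_background(seg_model, image, threshold=0.5):
    """image: float tensor of shape (3, H, W) in [0, 1].
    Returns the image with the predicted background replaced by a constant."""
    logits = seg_model(image.unsqueeze(0))            # (1, 1, H, W) foreground logits
    mask = torch.sigmoid(logits)[0, 0] > threshold    # boolean foreground mask
    return image * mask + (~mask).float()             # white background
```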

7.2. Postprocessing

To enhance the final results, we crafted a postprocessing pipeline featuring two additional steps, super-resolution and color correction, which yield sharper details and more accurate colors.

7.2.1. Super-Resolution

The initial output, generated at a resolution of 64 × 64, is enhanced to 256 × 256 using CodeFormer [57], a recently introduced model known for its proficiency in Super-Resolution and Blind Face Restoration. CodeFormer amalgamates the strengths of transformers and codebooks to achieve remarkable results. Transformers have gained widespread popularity and application in natural language processing and computer vision tasks. On the other hand, codebooks serve as a method to quantize and represent data more efficiently in a compact form. The codebook is learned via self-reconstruction of the HQ faces using a vector-quantized autoencoder, which embeds the image into a representation capturing the rich HQ details required for face restoration.
The key advantage of employing Codebook Lookup Transformers for face restoration lies in their ability to capture and exploit the structural and semantic characteristics of facial images. By employing a pre-defined codebook that encapsulates facial features, the model proficiently restores high-quality face images from low-quality or degraded inputs, effectively handling various types of noise, artifacts, and occlusions.

7.2.2. Color Correction

The final step of the post-processing phase involves applying a color correction technique to reduce color discrepancies between the generated faces and their corresponding source images. This technique is essential for enhancing the overall visual coherence of the final result.
The color correction process leverages the Lab color space to match the color statistics of the two images. It begins by converting both images to the Lab color space. Then, the Lab channels of the target image are adjusted by normalizing them according to the mean and standard deviation of the source image. Finally, the target image is converted back to the RGB color space, ensuring that the colors of the generated face closely match those of the original source image.
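A minimal OpenCV sketch of this Lab-statistics matching (in the spirit of Reinhard-style color transfer); function and variable names are illustrative, and both inputs are assumed to be 8-bit BGR images.

```python
import cv2
import numpy as np

def match_colors(generated_bgr, source_bgr):
    """Adjust the Lab statistics of the generated image to those of the source image."""
    gen = cv2.cvtColor(generated_bgr, cv2.COLOR_BGR2LAB).astype(np.float32)
    src = cv2.cvtColor(source_bgr, cv2.COLOR_BGR2LAB).astype(np.float32)
    for c in range(3):                                   # L, a, b channels
        g_mean, g_std = gen[..., c].mean(), gen[..., c].std() + 1e-6
        s_mean, s_std = src[..., c].mean(), src[..., c].std()
        gen[..., c] = (gen[..., c] - g_mean) / g_std * s_std + s_mean
    gen = np.clip(gen, 0, 255).astype(np.uint8)
    return cv2.cvtColor(gen, cv2.COLOR_LAB2BGR)          # back to the BGR color space
```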
Once the trajectory is identified, we move along it for a specified number of steps, checking the rotation after each iteration. If the generated image does not show the expected rotation, we try to dynamically increase the number of steps.
In the case of images with a large initial yaw, we also apply a preliminary face frontalization phase.

8. Analysis of the Trajectory Slopes

In this section, we contrast the trajectory slopes within the latent space of Diffusion Models acquired through distinct attribute selections. We utilize cosine similarity as a synthetic metric to gauge the correlation between these trajectories.
We recall that we approximate trajectories using linear steps derived from linear regression conducted over the centroids of different clusters of data point embeddings. The selection of data points is based on yaw angles and various attributes, including the source of illumination.
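For reference, this is how such a pairwise comparison can be computed, reusing the `rotation_direction` sketch from Section 6; the attribute labels are illustrative.

```python
import numpy as np

def slope_similarity(directions):
    """Pairwise cosine similarity between trajectory slopes.
    `directions` maps a label, e.g. ('M', 'LEFT'), to its (unit) direction vector."""
    labels = list(directions)
    sim = np.zeros((len(labels), len(labels)))
    for i, li in enumerate(labels):
        for j, lj in enumerate(labels):
            u, v = directions[li], directions[lj]
            sim[i, j] = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return labels, sim   # `sim` can be rendered as a heatmap, as in Figures 12-15
```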
Figure 12 provides a visual representation of the variation of the trajectory slopes across varying ranges of rotation degrees. A heatmap is employed to graphically portray the level of resemblance between the trajectories, where distinct colors denote the magnitude of similarity.
As depicted in the figure, altering the degree ranges employed for cluster creation by 10° leads to a consistently diminishing cosine similarity between the slopes. The most significant disparity is observable between the intervals [−40°, 0°] and [0°, +40°]. This means that the direction of the trajectory required to turn a face in the range (−40°, 0°) is very different from the direction required to turn it in the range (0°, +40°).
Based on the preceding analysis, it might appear that an incremental rotation approach involving frequent slope recalculations holds an advantage. Nevertheless, after each slope computation, it becomes necessary to translate the trajectory from the centroids to the latent representation of the current image and move away from it. With each step, there is usually a gradual erosion of individual facial attributes. Thus, a trade-off arises: fewer steps could result in relatively less precise rotations but better preservation of identity traits, while a greater number of steps could yield the opposite outcome.
According to our experimental findings, we obtained the best outcomes by using just two trajectories: the rightward trajectory [0°, −40°] and the leftward trajectory [0°, +40°]. This is typically preceded by a frontalization step when required.
The remainder of this section is dedicated to evaluating the influence of auxiliary attributes on trajectory definition: does rotating a male head yield the same results as rotating a female one? What implications arise from factors such as age or face illumination?
To this aim, we fix an initial central pose as well as two fixed trajectories, rightward and leftward, and compare the slopes obtained selecting data points for centroids according to different attributes. Specifically, in Figure 13 we focus on gender and illumination, in Figure 14 we focus on illumination and age, and finally in Figure 15 we focus on gender, age, and smile. In all the Figures, the individual faces are merely meant to be representatives of their corresponding class of attributes and have no specific influence on the slope of the associated trajectory.
Figure 13 presents a heatmap that illustrates the cosine similarities between the left and right slopes computed on clusters of images, taking into account only two attributes: light direction (center, left, or right) and gender (male or female). Gender is the most important factor of variation, more relevant than the illumination source. Still, images with different light directions result in noticeably different trajectories.
These results suggest that both light direction and gender are important attributes for enhancing accuracy in trajectory calculation.
Figure 14 parallels the previous concept, differing solely in the replacement of gender with age (young or not young) as the attribute under consideration.
As depicted in the figure, a consistent trend emerges: similarity values diminish when the light directions differ, and the same applies to the age parameter. Nevertheless, it is noteworthy that age has a relatively lower impact on trajectory definition.
Figure 15 embarks on a more intricate exploration of trajectory attributes. In this context, we delve into a refined set of attributes: gender (M or F), youthfulness (Y or NY), and smiling (S or NS).
Similar to previous instances, the lowest similarity emerges when all attributes stand distinct from each other. For instance, a comparison between a female who is not young and not smiling ([F, NY, NS]) and a male who is young and not smiling ([M, Y, NS]) represents the scenario with the least similarity. Furthermore, upon closer examination of the image, it becomes apparent that gender exerts the most significant impact on similarity, closely trailed by the “smiling” attribute.
In our work, we consistently employed the aforementioned technique to methodically select the most pertinent attributes for delineating trajectories within the latent space.

9. Results and Troubleshooting

Measuring the quality of generative systems is a notoriously difficult task due to the lack of a ground truth to use for comparison. The task is particularly hard in the case of the rotation operation, where we must assess both the model's capacity to obtain the desired orientation and the fidelity of the target to the source sample. Traditional metrics used in the field of generative modeling, like the Fréchet Inception Distance (FID), cannot be used in this context, since they are designed to compare distributions of data, not individual samples. In our case, the rotation measured on the generated sample is a parameter used to control the short iterative loop governing the computation of the trajectory; so, apart from a few cases where the algorithm fails to achieve the desired rotation and is forcibly stopped, the rotation of the result is the one expected.
The difficult task is to quantify the similarity of the individual features of the target with those of the source. We are currently experimenting with the Feature Similarity Index (FSIM) [58], the Identity Preservation Metric [59], ArcFace's Additive Angular Margin Loss [60], and the Learned Perceptual Image Patch Similarity (LPIPS) [61]. All of them are valuable, but they also suffer from well-known limitations: FSIM and similar metrics like SSIM are based on local patterns and luminance but may not adequately capture global context or the perceptual importance of different image regions; the Identity Preservation Metric heavily depends on the facial recognition or feature extraction model used; and LPIPS can be significantly influenced by the diversity and representativeness of the training dataset used for the neural networks that underpin this metric. We shall report on these quantitative analyses in a forthcoming paper.
Also, a qualitative comparison with similar GAN-based architectures is problematic. As observed in [54], state-of-the-art GANs, especially when trained on CelebA-HQ, seem to have serious generative deficiencies: many images from CelebA seem to lie outside their generative range. This means that it is not always possible to embed a generic face in the latent space and reconstruct an image with sufficient similarities.
In this preliminary report, we shall just showcase the promising potential of our approach through some examples; the reader is also invited to test the system, freely available on GitHub at https://github.com/asperti/Head-Rotation.
Some examples of rotations are given in Figure 16. More examples are given in Appendix A. In general, the technique still suffers from a few notable problems, and we devote the remaining part of this section to their discussion.

9.1. Loss of Individual’s Features

Preserving facial features while rotating the image by a large angle poses a significant challenge. This becomes especially problematic when employing a latent-based approach, where the essential traits of an individual are solely captured by the coordinates of the source point. Consequently, there is a risk of losing these key characteristics while following a given path, often resulting in more generic and less distinct features. Figure 17 illustrates this phenomenon. While the right rotation (from the observer's point of view) appears reasonably accurate, the left rotation exhibits a gradual loss of the individual's features.

9.2. High Pitch and Roll

In the presence of head poses with high pitch or roll (not very frequent in CelebA), the technique can encounter serious troubles, as exemplified in Figure 18.
Usually, the method tries to either correct the anomalous angles, as seen along the left rotation, or simply gets lost, as seen along the right rotation.

9.3. Hats and Other Artifacts

The generative model does not seem to have enough semantic information to handle situations involving the presence of artifacts such as microphones, hats, or any kind of headgear (see Figure 19).
Glasses worn on top of the head are also usually a problem, as exemplified in Figure 20.
In the same figure, you may also observe the progressive loss of identity and the change in expression along the left rotation. Glasses over the eyes may sometimes be lost during rotation, but are otherwise handled correctly. Several examples are given in Appendix A.

9.4. Deformation and Loss of Contours

Rotation may sometimes introduce anomalous deformations in the shape of the head; additionally, the technique is frequently unable to define precise contours for the face under extreme yaw angles. Both phenomena are evident in Figure 21.

9.5. Difficulty in Rotating Neck and Ears

The technique sometimes has trouble correctly rotating the neck or ears of subjects. They may get detached from the actual figure, remaining in the “background”. This is illustrated in Figure 22.
Sometimes, a similar situation also happens with hair.

10. Conclusions

This work contributes to the investigation of trajectories in the latent space of generative models, with particular attention to editing operations that are not easily expressible in terms of the texture, color, or shapes of well-identified segmentation areas and that therefore require a holistic manipulation of the image. Head rotation, especially intended as a continuous transformation, is a typical example of this kind of manipulation. Our investigation suggests that identifying compelling trajectories relies on recognizing relevant attributes of the source image that can guide the statistical search in latent space. Among these relevant attributes, in the case of head rotation, the direction of the illumination plays a crucial role, creating complex shadowing effects on the face that are difficult to manage during rotation. Emphasizing the importance of lighting conditions for achieving realistic generative results in head rotation is one of the contributions of this work.
As a side result of our research, we created additional labels for the CelebA dataset, categorizing images into three groups based on the prevalent illumination direction: left, right, or center. Since CelebA-HQ is a well-known subset of CelebA, our labeling directly carries over to that dataset as well.
Our work is at a preliminary stage, and many aspects deserve further investigation.
Firstly, the current version lacks a robust quantitative evaluation and a thorough comparison with alternative techniques. Secondly, continuous movements can be better investigated in a video setting, a research field that has undergone remarkable achievements in recent years, mostly thanks to stable diffusion techniques [62,63,64]. Specifically, exploiting the spatio-temporal coherence of adjacent frames could help in understanding the global structure and 3D perspective, which become particularly useful when dealing with artifacts such as hats, earrings, or eyeglasses.
In the context of video generation, our work could contribute to extracting a dataset of difficult cases, especially in terms of light conditions that could pose interesting challenges for generative models. This dataset would be valuable for testing and improving the robustness of generative models in handling complex scenarios, thereby advancing the field of video-based generative modeling.
Furthermore, our preliminary findings indicate that incorporating detailed lighting information into the generative process significantly enhances the realism of generated images. Future work should focus on developing more sophisticated methods for capturing and utilizing lighting attributes in the latent space. This includes exploring the use of advanced neural network architectures and loss functions specifically designed to preserve lighting consistency during image manipulation.
Additionally, we plan to extend our investigation to other types of holistic image manipulations beyond head rotation, such as changing facial expressions or body poses, which also require careful consideration of lighting and other contextual factors. By addressing these challenges, we aim to provide a comprehensive framework for holistic image manipulation in generative models.

Author Contributions

Conceptualization, A.A.; Methodology, A.A., G.C. and A.G.; Software, G.C. and A.G.; Validation, G.C.; Formal analysis, A.G.; Data curation, G.C.; Writing—original draft, A.A., G.C. and A.G.; Supervision, A.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research was partially funded by the Future AI Research (FAIR) project of the National Recovery and Resilience Plan (NRRP), Mission 4 Component 2 Investment 1.3, funded by the European Union NextGenerationEU.

Data Availability Statement

The application described in this paper is open source. The software can be downloaded from the following GitHub repository: https://github.com/asperti/Head-Rotation.

Acknowledgments

We would like to thank the many students who helped in the annotation of CelebA for illumination orientation, and in particular L. Bugo, D. Filippini and A. Rossolino.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Additional Rotation Examples

In this appendix, we provide a short selection of additional examples of rotations obtained with our model.
Figure A1. Examples of rotations.
Figure A2. Examples of rotations.

References

  1. Ho, J.; Jain, A.; Abbeel, P. Denoising Diffusion Probabilistic Models. Adv. Neural Inf. Process. Syst. 2020, 33, 6840–6851. [Google Scholar]
  2. Choi, J.; Kim, S.; Jeong, Y.; Gwon, Y.; Yoon, S. ILVR: Conditioning Method for Denoising Diffusion Probabilistic Models. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, 10–17 October 2021; pp. 14347–14356. [Google Scholar] [CrossRef]
  3. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10684–10695. [Google Scholar]
  4. Dhariwal, P.; Nichol, A.Q. Diffusion Models Beat GANs on Image Synthesis. Adv. Neural Inf. Process. Syst. 2021, 34, 8780–8794. [Google Scholar]
  5. Eschweiler, D.; Yilmaz, R.; Baumann, M.; Laube, I.; Roy, R.; Jose, A.; Brückner, D.; Stegmaier, J. Denoising diffusion probabilistic models for generation of realistic fully-annotated microscopy image datasets. PLoS Comput. Biol. 2024, 20, e1011890. [Google Scholar] [CrossRef]
  6. Shokrollahi, Y.; Yarmohammadtoosky, S.; Nikahd, M.M.; Dong, P.; Li, X.; Gu, L. A Comprehensive Review of Generative AI in Healthcare. arXiv 2023, arXiv:2310.00795. [Google Scholar]
  7. Trippe, B.L.; Yim, J.; Tischer, D.; Baker, D.; Broderick, T.; Barzilay, R.; Jaakkola, T.S. Diffusion Probabilistic Modeling of Protein Backbones in 3D for the motif-scaffolding problem. In Proceedings of the Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  8. Zhao, Z.; Dong, X.; Wang, Y.; Hu, C. Advancing Realistic Precipitation Nowcasting With a Spatiotemporal Transformer-Based Denoising Diffusion Model. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–15. [Google Scholar] [CrossRef]
  9. Shen, Y.; Gu, J.; Tang, X.; Zhou, B. Interpreting the Latent Space of GANs for Semantic Face Editing. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, 13–19 June 2020; Computer Vision Foundation/IEEE: Piscataway, NJ, USA, 2020; pp. 9240–9249. [Google Scholar] [CrossRef]
  10. Härkönen, E.; Hertzmann, A.; Lehtinen, J.; Paris, S. GANSpace: Discovering Interpretable GAN Controls. Adv. Neural Inf. Process. Syst. 2020, 33, 9841–9850. [Google Scholar]
  11. Li, Z.; Tao, R.; Wang, J.; Li, F.; Niu, H.; Yue, M.; Li, B. Interpreting the Latent Space of GANs via Measuring Decoupling. IEEE Trans. Artif. Intell. 2021, 2, 58–70. [Google Scholar] [CrossRef]
  12. Shen, Y.; Yang, C.; Tang, X.; Zhou, B. InterFaceGAN: Interpreting the Disentangled Face Representation Learned by GANs. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 2004–2018. [Google Scholar] [CrossRef] [PubMed]
  13. Creswell, A.; Bharath, A.A. Inverting the Generator of a Generative Adversarial Network. IEEE Trans. Neural Netw. Learn. Syst. 2019, 30, 1967–1974. [Google Scholar] [CrossRef] [PubMed]
  14. Alaluf, Y.; Tov, O.; Mokady, R.; Gal, R.; Bermano, A. Hyperstyle: Stylegan inversion with hypernetworks for real image editing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 18511–18521. [Google Scholar]
  15. Xia, W.; Zhang, Y.; Yang, Y.; Xue, J.H.; Zhou, B.; Yang, M.H. Gan inversion: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 3121–3138. [Google Scholar] [CrossRef]
  16. Kingma, D.P.; Welling, M. An Introduction to Variational Autoencoders. Found. Trends Mach. Learn. 2019, 12, 307–392. [Google Scholar] [CrossRef]
  17. Asperti, A.; Evangelista, D.; Piccolomini, E.L. A Survey on Variational Autoencoders from a Green AI Perspective. SN Comput. Sci. 2021, 2, 301. [Google Scholar] [CrossRef]
  18. Song, J.; Meng, C.; Ermon, S. Denoising Diffusion Implicit Models. In Proceedings of the 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, 3–7 May 2021. [Google Scholar]
  19. Asperti, A.; Evangelista, D.; Marro, S.; Merizzi, F. Image Embedding for Denoising Generative Models. Artif. Intell. Rev. 2023, 56, 14511–14533. [Google Scholar] [CrossRef]
  20. Hassner, T.; Harel, S.; Paz, E.; Enbar, R. Effective face frontalization in unconstrained images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 4295–4304. [Google Scholar]
  21. Zhu, X.; Lei, Z.; Yan, J.; Yi, D.; Li, S.Z. High-fidelity pose and expression normalization for face recognition in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 787–796. [Google Scholar]
  22. Moniz, J.R.A.; Beckham, C.; Rajotte, S.; Honari, S.; Pal, C. Unsupervised depth estimation, 3d face rotation and replacement. In Proceedings of the Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, Montréal, QC, Canada, 3–8 December 2018; pp. 9759–9769. [Google Scholar]
  23. Tran, L.; Yin, X.; Liu, X. Disentangled representation learning gan for pose-invariant face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–27 July 2017; pp. 1415–1424. [Google Scholar]
  24. Huang, R.; Zhang, S.; Li, T.; He, R. Beyond face rotation: Global and local perception gan for photorealistic and identity preserving frontal view synthesis. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2439–2448. [Google Scholar]
  25. Hu, Y.; Wu, X.; Yu, B.; He, R.; Sun, Z. Pose-guided photorealistic face rotation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8398–8406. [Google Scholar]
  26. Qian, Y.; Deng, W.; Hu, J. Unsupervised face normalization with extreme pose and expression in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 9851–9858. [Google Scholar]
  27. Yin, X.; Yu, X.; Sohn, K.; Liu, X.; Chandraker, M. Towards large-pose face frontalization in the wild. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 3990–3999. [Google Scholar]
  28. Deng, J.; Cheng, S.; Xue, N.; Zhou, Y.; Zafeiriou, S. Uv-gan: Adversarial facial uv map completion for pose-invariant face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7093–7102. [Google Scholar]
  29. Cao, J.; Hu, Y.; Zhang, H.; He, R.; Sun, Z. Learning a high fidelity pose invariant model for high-resolution face frontalization. Adv. Neural Inf. Process. Syst. 2018, 31, 2872–2882. [Google Scholar]
  30. Zhou, H.; Liu, J.; Liu, Z.; Liu, Y.; Wang, X. Rotate-and-render: Unsupervised photorealistic face rotation from single-view images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 5911–5920. [Google Scholar]
  31. Mildenhall, B.; Srinivasan, P.P.; Tancik, M.; Barron, J.T.; Ramamoorthi, R.; Ng, R. NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. In Proceedings of the ECCV, Online, 23–28 August 2020. [Google Scholar]
  32. Sun, J.; Wang, X.; Zhang, Y.; Li, X.; Zhang, Q.; Liu, Y.; Wang, J. FENeRF: Face Editing in Neural Radiance Fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, 18–24 June 2022; pp. 7662–7672. [Google Scholar]
  33. Abdal, R.; Zhu, P.; Mitra, N.J.; Wonka, P. Styleflow: Attribute-conditioned exploration of stylegan-generated images using conditional continuous normalizing flows. ACM Trans. Graph. (ToG) 2021, 40, 1–21. [Google Scholar] [CrossRef]
  34. Gal, R.; Alaluf, Y.; Atzmon, Y.; Patashnik, O.; Bermano, A.H.; Chechik, G.; Cohen-Or, D. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv 2022, arXiv:2208.01618. [Google Scholar]
  35. Gal, R.; Patashnik, O.; Maron, H.; Bermano, A.H.; Chechik, G.; Cohen-Or, D. StyleGAN-NADA: CLIP-guided domain adaptation of image generators. ACM Trans. Graph. 2022, 41, 141:1–141:13. [Google Scholar] [CrossRef]
  36. Morita, R.; Zhang, Z.; Ho, M.M.; Zhou, J. Interactive Image Manipulation with Complex Text Instructions. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2023, Waikoloa, HI, USA, 2–7 January 2023; pp. 1053–1062. [Google Scholar] [CrossRef]
  37. Kalatzis, D.; Eklund, D.; Arvanitidis, G.; Hauberg, S. Variational autoencoders with riemannian brownian motion priors. arXiv 2020, arXiv:2002.05227. [Google Scholar]
  38. Chadebec, C.; Allassonnière, S. A geometric perspective on variational autoencoders. Adv. Neural Inf. Process. Syst. 2022, 35, 19618–19630. [Google Scholar]
  39. Shamsolmoali, P.; Zareapoor, M.; Zhou, H.; Tao, D.; Li, X. Vtae: Variational transformer autoencoder with manifolds learning. IEEE Trans. Image Process. 2023, 32, 4486–4500. [Google Scholar] [CrossRef]
  40. Hertz, A.; Mokady, R.; Tenenbaum, J.; Aberman, K.; Pritch, Y.; Cohen-Or, D. Prompt-to-Prompt Image Editing with Cross-Attention Control. In Proceedings of the Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  41. Zhang, Z.; Han, L.; Ghosh, A.; Metaxas, D.N.; Ren, J. SINE: SINgle Image Editing with Text-to-Image Diffusion Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, 17–24 June 2023; pp. 6027–6037. [Google Scholar] [CrossRef]
  42. Kawar, B.; Zada, S.; Lang, O.; Tov, O.; Chang, H.; Dekel, T.; Mosseri, I.; Irani, M. Imagic: Text-Based Real Image Editing with Diffusion Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, 17–24 June 2023; pp. 6007–6017. [Google Scholar] [CrossRef]
  43. Couairon, G.; Verbeek, J.; Schwenk, H.; Cord, M. DiffEdit: Diffusion-based semantic image editing with mask guidance. In Proceedings of the The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  44. Sanseviero, O.; Cuenca, P.; Passos, A.; Whitaker, J. Hands-On Generative AI with Transformers and Diffusion Models; O’Reilly Media, Inc.: Sebastopol, CA, USA, 2024. [Google Scholar]
  45. Bishop, C.M.; Bishop, H. Diffusion Models. In Deep Learning: Foundations and Concepts; Springer: Berlin/Heidelberg, Germany, 2023; pp. 581–607. [Google Scholar]
  46. Sohl-Dickstein, J.; Weiss, E.A.; Maheswaranathan, N.; Ganguli, S. Deep Unsupervised Learning using Nonequilibrium Thermodynamics. In JMLR Workshop and Conference Proceedings, Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6–11 July 2015; JMLR: New York, NY, USA, 2015; Volume 37, pp. 2256–2265. [Google Scholar]
  47. Nichol, A.Q.; Dhariwal, P. Improved denoising diffusion probabilistic models. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; PMLR: New York, NY, USA, 2021; pp. 8162–8171. [Google Scholar]
  48. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 5–9 October 2015; Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241. [Google Scholar]
  49. Liu, Z.; Luo, P.; Wang, X.; Tang, X. Deep Learning Face Attributes in the Wild. In Proceedings of the International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 3730–3738. [Google Scholar]
  50. Lee, C.H.; Liu, Z.; Wu, L.; Luo, P. Maskgan: Towards diverse and interactive facial image manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 5549–5558. [Google Scholar]
  51. Asperti, A.; Filippini, D. Deep Learning for Head Pose Estimation: A Survey. SN Comput. Sci. 2023, 4, 349. [Google Scholar] [CrossRef]
  52. Bradski, G. The OpenCV Library. Dr. Dobb’s J. Softw. Tools 2000. [Google Scholar]
  53. Asperti, A.; Colasuonno, G.; Guerra, A. Portrait Reification with Generative Diffusion Models. Appl. Sci. 2023, 13, 6487. [Google Scholar] [CrossRef]
  54. Asperti, A.; Tonelli, V. Comparing the latent space of generative models. Neural Comput. Appl. 2023, 35, 3155–3172. [Google Scholar] [CrossRef]
  55. Guerra, A. Exploring Latent Embeddings in Diffusion Models for Face Orientation Conditioning. Master’s Thesis, University of Bologna, Bologna, Italy, 2023. [Google Scholar]
  56. Dai, B.; Wipf, D.P. Diagnosing and enhancing VAE models. In Proceedings of the Seventh International Conference on Learning Representations (ICLR 2019), New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
  57. Zhou, S.; Chan, K.C.K.; Li, C.; Loy, C.C. Towards Robust Blind Face Restoration with Codebook Lookup Transformer. In Proceedings of the NeurIPS, New Orleans, LA, USA, 28 November–9 December 2022. [Google Scholar]
  58. Zhang, L.; Zhang, L.; Mou, X.; Zhang, D. FSIM: A Feature Similarity Index for Image Quality Assessment. IEEE Trans. Image Process. 2011, 20, 2378–2386. [Google Scholar] [CrossRef] [PubMed]
  59. Taigman, Y.; Yang, M.; Ranzato, M.; Wolf, L. DeepFace: Closing the Gap to Human-Level Performance in Face Verification. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2014, Columbus, OH, USA, 23–28 June 2014; IEEE Computer Society: Piscataway, NJ, USA, 2014; pp. 1701–1708. [Google Scholar] [CrossRef]
  60. Deng, J.; Guo, J.; Xue, N.; Zafeiriou, S. ArcFace: Additive Angular Margin Loss for Deep Face Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, 16–20 June 2019; pp. 4690–4699. [Google Scholar] [CrossRef]
  61. Zhang, R.; Isola, P.; Efros, A.A.; Shechtman, E.; Wang, O. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, 18–22 June 2018; Computer Vision Foundation/IEEE Computer Society: Piscataway, NJ, USA, 2018; pp. 586–595. [Google Scholar] [CrossRef]
  62. Ho, J.; Salimans, T.; Gritsenko, A.A.; Chan, W.; Norouzi, M.; Fleet, D.J. Video Diffusion Models. In Proceedings of the NeurIPS, New Orleans, LA, USA, 28 November–9 December 2022. [Google Scholar]
  63. Ho, J.; Chan, W.; Saharia, C.; Whang, J.; Gao, R.; Gritsenko, A.A.; Kingma, D.P.; Poole, B.; Norouzi, M.; Fleet, D.J.; et al. Imagen Video: High Definition Video Generation with Diffusion Models. arXiv 2022, arXiv:2210.02303. [Google Scholar]
  64. Brooks, T.; Peebles, B.; Holmes, C.; DePue, W.; Guo, Y.; Jing, L.; Schnurr, D.; Taylor, J.; Luhman, T.; Luhman, E.; et al. Video Generation Models as World Simulators. OpenAI 2024. [Google Scholar]
Figure 1. Rotation examples. The sources are images no. 114, no. 16,399, and no. 98,018 of CelebA (central image).
Figure 2. The U-net architecture of our denoising model.
Figure 3. Main architectural modules, including the residual block, down block, and up block.
Figure 4. Distribution of CelebA attributes. In CelebA, each attribute is annotated with either −1 or 1. For example, for gender, “male = 1” stands for male, and “male = −1” stands for female.
Figure 5. (a) Yaw, Pitch, and Roll angles in HPE. (b) Examples of head pose estimation for CelebA images: yaw is in green, pitch in blue, and roll in red.
Figure 6. Yaw distribution in the CelebA dataset.
Figure 7. Illumination–Pose centroids. The different figures visualize the mean faces relative to different light sources and poses, considering three major orientation classes. We also report the variance in each class (mean of variances of the pixels) and the mean square error (MSE) between class centers.
Figure 8. Overall methodology. All pictures refer to the latent space of the generative model, schematically represented in two dimensions. We also suppose that the images were pre-filtered along the relevant attributes. (a) We use the embedder to compute clusters of latent points corresponding to specific rotation angles, expressed by different colors; (b) we compute the centroids of the clusters; (c) we fit a line through the centroids to compute a rotation trajectory; and (d) we move along this direction starting from the specific embedding of the source images we want to rotate.
Figure 9. Mean images for specified yaw angles for females (top row) and males (bottom row).
Figure 10. Relevance of attributes. For all images (a–d), the rotation in the first row corresponds to a trajectory computed considering angles, while the second one is relative to a trajectory taking attributes into account. These are examples of complex rotations due to the strong shadows over the face.
Figure 11. (a) Examples of CelebaMask-HQ segmentations and (b) cropped version with unified masks used to train the model on background removal tasks.
Figure 12. Cosine similarities between trajectory slopes. The different trajectories are obtained from data whose rotation yaw is comprised in the specified range.
Figure 13. Heatmap of cosine similarities between left (a) and right (b) slopes, computed using only light direction (CENTER, LEFT, or RIGHT) and gender (M or F) as attributes. Each subset of attributes is represented by a corresponding sample from CelebA.
Figure 14. Heatmap of cosine similarities between left (a) and right (b) slopes, computed using only light direction (CENTER, LEFT, or RIGHT) and youthfulness (Y or NY) as attributes.
Figure 15. Heatmap of cosine similarities between left (a) and right (b) slopes, computed with light direction fixed to center and using only gender (M or F), youthfulness (Y or NY), and smiling (S or NS) as attributes.
Figure 16. Examples of rotations.
Figure 17. Troubleshooting: difficulty in preserving facial features.
Figure 18. Troubleshooting: problems with high pitch or roll.
Figure 19. Troubleshooting: problems with hats and other artifacts.
Figure 20. Troubleshooting: cannot rotate glasses over the head.
Figure 21. Troubleshooting: deformation and loss of contours.
Figure 22. Troubleshooting: problems with neck and ears.
