1. Introduction
Facial style transfer, a humorous art form, captivates audiences with its unique interpretations of human features [1]. Artists use abstract strokes, fluid lines, and exaggerated expressions to create stylized portraits, revealing personality and emotion in ways traditional portraiture might miss [2]. However, this process is complex and heavily dependent on the artist’s creative inspiration, requiring skill, time, and multiple iterations [3]. In the era of self-media, where content creation is ubiquitous, this traditional approach struggles to meet the demand for instant, personalized content [4].
With advancements in deep learning and image-to-image translation, AI-powered facial style transfer has gained attention. These technologies promise near-instantaneous stylization, democratizing access to this art form and allowing users to experiment with various styles effortlessly [5]. The main challenge lies in balancing style transformation with identity preservation. Algorithms must accurately replicate artistic styles while maintaining the subject’s core facial features. This requires sophisticated techniques to capture style essences and apply them without compromising recognizability [6].
The main difficulty in facial style transfer lies in both the stylistic transformation of the input portrait features and maintaining the identity consistency of the generated image with the input photo [7,8]. Before the emergence of deep learning, the facial style transfer problem was primarily addressed with traditional methods such as interactive techniques, rule-based facial-feature contour deformation, example-based learning, and database-based style matching. For instance, [9] depicts the face shape and facial-feature contours by locating key facial points, correcting the contours with geometric template detection to automatically generate a line drawing. Similarly, [10] divides the image into overlapping blocks of the same size with a sliding window and employs a Markov network model to find the stylized image block that best matches each face image block, synthesizing these blocks into a complete stylized face image. However, most of these methods cannot achieve a fully automated facial stylization process and suffer from complex rules, monotonous generation results, and a lack of creativity. With the development of deep learning, these problems are being resolved one by one [11,12,13].
Deep learning-based methods for facial style transfer can be categorized into two main types: neural style transfer (NST) and generative adversarial network (GAN)-based methods [14,15]. Ref. [16] pioneered the NST approach, using convolutional neural networks (CNNs) to separate and reorganize image content and style. With NST, an image’s style can be modified while preserving its content, achieving artistic re-creation. However, numerous experiments have shown that CNN-based methods tend to focus primarily on the transfer of image texture, color, and pixel-level style features. Additionally, CNNs are often slow to train, insensitive to fine details, and struggle to achieve comprehensive stylization of face images [17].
In recent years, generative adversarial networks (GANs) have garnered widespread attention due to their powerful image generation capabilities [18,19]. GAN is an unsupervised learning framework proposed by Goodfellow et al. in 2014, consisting of a generator G and a discriminator D. The generator converts random noise into target images, while the discriminator judges whether an image is real or generated by the generator. When the generator and discriminator reach a balance through adversarial training, the generator can produce realistic fake images. Among GAN-based methods, CycleGAN [20] has become a research hotspot in the field of portrait style transfer due to its convenient data acquisition, ability to achieve diverse transformations, and excellent generalization ability.
Research based on CycleGAN addresses the difficulty of acquiring paired datasets and implements a transformation process from portrait photographs to stylized images [21]. However, these methods encounter several issues. Many current approaches rely on overly uniform normalization methods that fail to adapt to varying learning rates. Moreover, their regularization techniques often inadequately control model complexity, resulting in problems such as overfitting. Consequently, generated images may lack texture details, exhibit unevenly colored patches in facial highlights, introduce artifacts from noisy input data, or appear unnaturally saturated with color [22,23,24], as shown in Figure 1. Compared to ordinary images, the face structure is more delicate and requires intensive transformation during the style transfer process, making existing methods unsuitable for generating stylized portraits. To address this issue, and considering the difficulty of obtaining paired datasets, this paper builds an unpaired-dataset approach on CycleGAN. It utilizes an improved generative network and discriminative network for adversarial training and balancing, while adopting semantic constraints to ensure similarity between the generated image and the real photo. This approach retains rich detail information and preserves identity details, resulting in more lifelike and richly detailed stylized portraits. The work carried out includes the following three points:
In this paper, we propose an improved CycleGAN using AdaLIN, which can select appropriate Instance Normalization (IN) and Layer Normalization (LN) parameter weights during the training process. This approach balances the contradiction between style and content, enhances image details, and preserves the detailed features of the original picture, particularly excelling at relocating facial highlights.
We introduce a Laplacian regular module for denoising the image. This approach helps to prevent noise features from affecting the color transfer of the image and reduces the color artifacts caused by noise.
Compared to other methods, the Improved Detail-Enhancement CycleGAN using AdaLIN proposed in this paper yields superior results in image detail enhancement and artifact removal. This advancement significantly enhances the quality of the generated face images.
The rest of this paper is structured as follows: The related works are introduced in
Section 2.
Section 3 describes the methods proposed in this paper in detail.
Section 4 shows the experiment results and evaluations.
Section 5 concludes this paper and describes our future work.
3. The Proposed Method
CycleGAN is an unsupervised generative adversarial network that works by training two pairs of generator–discriminator models to facilitate image transformations between different domains. The key technique of CycleGAN is cycle consistency, introduced in [47]: when the two generators are applied sequentially, the resulting image should be very similar to the original image, which is enforced through an L1 loss. The cycle-consistency loss is thus crucial to prevent the generator from mapping the image into a domain completely unrelated to the original image.
The model comprises two mapping functions, G: X → Y and F: Y → X, along with corresponding adversarial discriminators D_Y and D_X. D_Y encourages G to translate images from domain X into the style of domain Y, and vice versa. Additionally, two cycle-consistency loss functions are introduced to ensure that the translated style can be reverted to its original state after the inverse translation, thereby regularizing the mapping.
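The cycle-consistency objective described above can be sketched in PyTorch (the paper's framework). The identity "generators" below are placeholders used only to exercise the shape of the loss; in the actual model, G and F are the trained translation networks:

```python
import torch
import torch.nn as nn

# Placeholder generators standing in for trained G: X -> Y and F: Y -> X.
G = nn.Identity()
F = nn.Identity()

l1 = nn.L1Loss()

def cycle_consistency_loss(x, y, lam=10.0):
    """L_cyc = E[||F(G(x)) - x||_1] + E[||G(F(y)) - y||_1], weighted by lambda."""
    loss_x = l1(F(G(x)), x)   # forward cycle:  x -> G(x) -> F(G(x)) should recover x
    loss_y = l1(G(F(y)), y)   # backward cycle: y -> F(y) -> G(F(y)) should recover y
    return lam * (loss_x + loss_y)

x = torch.rand(1, 3, 256, 256)
y = torch.rand(1, 3, 256, 256)
print(cycle_consistency_loss(x, y).item())  # 0.0 for identity generators
```

The weighting factor `lam` (commonly 10 in CycleGAN implementations) controls how strongly the cycle term dominates the adversarial terms.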
3.1. Model Structure
The framework of this paper is based on CycleGAN, whose specific structure is shown in Figure 2; in this network, the source domain is a realistic image and the target domain is a stylized image. D_X is the source domain discriminator, D_Y is the target domain discriminator, G is the source-to-target generator, and F is the target-to-source generator. G(x) is the image generated from the source domain into the target domain, and F(G(x)) is the source-domain image generated from G(x) through F. y is the target domain image of the previous round, which serves as the input image (i.e., as the source domain image of this part) in the symmetric loop. F(y) is the image generated from y through F, and G(F(y)) is the target domain image generated by the loop. The network is composed of two symmetric loops: in the upper loop, the input image x is passed as the source domain through the generator G to obtain the target domain image G(x); this newly generated G(x) then produces the source domain image F(G(x)) through F. The distance between F(G(x)) and the original image x is computed to define the mapping relationship that the unpaired dataset did not originally have.
The two inputs x and y are the real image and the image generated by the generator, respectively. The discriminators D_X and D_Y score each input image in [0, 1] to differentiate whether it is a realistic image or an image generated by the generator.
3.2. Stylization Module
The structure of the facial style conversion module is shown in Figure 3, comprising a generator and a discriminator. This module extends the basic cycle-consistent generative adversarial network by integrating three components into its generator: a content encoder, a style encoder, and a decoder. Figure 4 provides a detailed illustration of the generator’s specific structure.
The content encoder within the generator compresses the image and extracts feature maps using four downsampling convolutional layers, followed by two ResNet blocks for further processing, mapping the input content image to its content encoding. Each convolutional layer is followed by Instance Normalization (IN). The style encoder comprises five convolutional layers and uses Average Pooling to vectorize the style image, producing the style encoding. The decoder then processes the content encoding through a series of AdaLIN residual blocks and ultimately reconstructs the output image using several upsampling convolutional layers.
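A structural sketch of this generator is given below, following the layer counts stated above. The channel widths and the style-code dimension are illustrative assumptions, and the AdaLIN residual blocks of the paper's decoder are replaced by plain IN residual blocks here for brevity:

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, 1, 1), nn.InstanceNorm2d(ch), nn.ReLU(True),
            nn.Conv2d(ch, ch, 3, 1, 1), nn.InstanceNorm2d(ch))
    def forward(self, x):
        return x + self.body(x)

class ContentEncoder(nn.Module):
    """Four strided convs (each followed by IN) + two ResNet blocks."""
    def __init__(self, in_ch=3, base=64):
        super().__init__()
        layers, ch = [], in_ch
        for out in (base, base * 2, base * 4, base * 4):  # channel plan is illustrative
            layers += [nn.Conv2d(ch, out, 4, 2, 1), nn.InstanceNorm2d(out), nn.ReLU(True)]
            ch = out
        layers += [ResBlock(ch), ResBlock(ch)]
        self.net = nn.Sequential(*layers)
    def forward(self, x):
        return self.net(x)

class StyleEncoder(nn.Module):
    """Five convs + global average pooling -> style vector."""
    def __init__(self, in_ch=3, base=64, style_dim=8):
        super().__init__()
        layers, ch = [], in_ch
        for out in (base, base * 2, base * 4, base * 4, base * 4):
            layers += [nn.Conv2d(ch, out, 4, 2, 1), nn.ReLU(True)]
            ch = out
        self.net = nn.Sequential(*layers)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Conv2d(ch, style_dim, 1)
    def forward(self, x):
        return self.fc(self.pool(self.net(x)))  # (N, style_dim, 1, 1)

class Decoder(nn.Module):
    """Residual blocks (AdaLIN in the paper; plain IN here) + upsampling convs."""
    def __init__(self, ch=256, out_ch=3, n_up=4):
        super().__init__()
        layers = [ResBlock(ch), ResBlock(ch)]
        for _ in range(n_up):
            nxt = max(ch // 2, 64)
            layers += [nn.Upsample(scale_factor=2),
                       nn.Conv2d(ch, nxt, 3, 1, 1), nn.InstanceNorm2d(nxt), nn.ReLU(True)]
            ch = nxt
        layers += [nn.Conv2d(ch, out_ch, 7, 1, 3), nn.Tanh()]
        self.net = nn.Sequential(*layers)
    def forward(self, c):
        return self.net(c)

x = torch.rand(1, 3, 256, 256)
content = ContentEncoder()(x)  # (1, 256, 16, 16) content encoding
style = StyleEncoder()(x)      # (1, 8, 1, 1) style encoding
out = Decoder()(content)       # (1, 3, 256, 256) reconstructed image
```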
AdaLIN dynamically adjusts the weighting of Instance Normalization (IN) and Layer Normalization (LN) parameters during training, leveraging the advantages of both techniques to selectively retain or alter content information. This capability enables the model to flexibly control the degree of shape and texture changes, thereby enhancing overall robustness. Affine transformation parameters γ and β are derived from the style encoding through a Multilayer Perceptron (MLP), while the learnable parameter ρ is used to fine-tune the ratio of Instance Normalization to Layer Normalization. Specifically, γ and β are obtained via the method illustrated in Figure 5.
As can be seen in Figure 5, the calculation of specific parameters in AdaLIN is shown in Equation (1):

AdaLIN(a, γ, β) = γ · (ρ · a_I + (1 − ρ) · a_L) + β    (1)

where

a_I = (a − μ_I)/√(σ_I² + ε),  a_L = (a − μ_L)/√(σ_L² + ε),

μ_I and σ_I² are the mean and variance of the feature map computed per channel (Instance Normalization statistics), μ_L and σ_L² are the mean and variance computed per layer (Layer Normalization statistics), ε is a small constant, and ρ is constrained to the range [0, 1] during training.
With AdaLIN, the core concept is to integrate style information into the content domain, facilitating more effective style transfer and yielding superior results.
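The AdaLIN computation above can be sketched as a PyTorch module. This follows the published AdaLIN formulation; the initial value of ρ and the tensor shapes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class AdaLIN(nn.Module):
    """Adaptive Layer-Instance Normalization (sketch).

    rho interpolates between Instance Norm (per-channel statistics) and
    Layer Norm (per-sample statistics); gamma and beta come from the style MLP.
    """
    def __init__(self, num_features, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.rho = nn.Parameter(torch.full((1, num_features, 1, 1), 0.9))

    def forward(self, x, gamma, beta):
        # Instance Norm statistics: over H, W for each channel.
        mu_i = x.mean(dim=(2, 3), keepdim=True)
        var_i = x.var(dim=(2, 3), keepdim=True, unbiased=False)
        # Layer Norm statistics: over C, H, W for each sample.
        mu_l = x.mean(dim=(1, 2, 3), keepdim=True)
        var_l = x.var(dim=(1, 2, 3), keepdim=True, unbiased=False)
        x_i = (x - mu_i) / torch.sqrt(var_i + self.eps)
        x_l = (x - mu_l) / torch.sqrt(var_l + self.eps)
        rho = self.rho.clamp(0.0, 1.0)  # keep the mixing weight in [0, 1]
        out = rho * x_i + (1 - rho) * x_l
        # gamma, beta: (N, C) affine parameters produced from the style code.
        return out * gamma.unsqueeze(2).unsqueeze(3) + beta.unsqueeze(2).unsqueeze(3)

x = torch.rand(2, 64, 32, 32)
gamma = torch.ones(2, 64)
beta = torch.zeros(2, 64)
y = AdaLIN(64)(x, gamma, beta)  # same shape as x, normalized then re-styled
```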
Conversely, the discriminator employs PatchGAN to provide feedback to the generator. The specific parameters used are detailed in
Figure 6.
In the traditional GAN model, the output of the discriminator D is a scalar between 0 and 1, representing the probability that the input is a real image. The PatchGAN output is no longer a scalar but an n × n matrix X. Each element represents a patch corresponding to a receptive field of the image, and the mean value over all patch scores represents the probability that the whole picture is real; such training makes the model pay more attention to the details of the image. In this paper, we use PatchGAN in place of the original whole-image discriminator and set the patch size to 70, so that 70 × 70 image blocks are discriminated. Except for the last convolutional layer, Instance Normalization and the Leaky ReLU activation function are used after every convolutional layer, with the slope α set to 0.2. Unlike ReLU, Leaky ReLU takes the value αx when the input is negative, so that it still has a weak activation effect for negative inputs.
This mechanism integrates local image features and overall image characteristics, discerning differences through each patch to extract and characterize local image features. This approach is conducive to achieving higher-resolution image production. Additionally, averaging the final classified feature maps enables comparison between real and generated images. This computational process is akin to weighted summation averaging of the entire image, allowing the discriminator in this paper to represent loss more reasonably than traditional discriminator networks, particularly for local image features with significant differences.
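A minimal sketch of such a patch-based discriminator is shown below. The layer widths follow the common 70 × 70 PatchGAN configuration and are assumptions, not the exact parameters of Figure 6:

```python
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    """70x70 PatchGAN sketch: each output element scores one receptive-field patch.

    As in the text, Instance Norm + LeakyReLU(0.2) follow every conv except the last.
    """
    def __init__(self, in_ch=3, base=64):
        super().__init__()
        layers, ch = [], in_ch
        for i, out in enumerate((base, base * 2, base * 4, base * 8)):
            stride = 2 if i < 3 else 1
            layers += [nn.Conv2d(ch, out, 4, stride, 1),
                       nn.InstanceNorm2d(out),
                       nn.LeakyReLU(0.2, True)]
            ch = out
        layers += [nn.Conv2d(ch, 1, 4, 1, 1)]  # final conv: 1-channel patch score map
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        patch_scores = self.net(x)                            # (N, 1, H', W') score map
        return patch_scores, patch_scores.mean(dim=(1, 2, 3))  # image-level mean score

D = PatchDiscriminator()
scores, image_score = D(torch.rand(1, 3, 256, 256))
print(scores.shape)  # torch.Size([1, 1, 30, 30]) for a 256x256 input
```

Note that a 256 × 256 input yields a 30 × 30 score map; the 70 × 70 figure refers to the receptive field of each score, not the size of the output matrix.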
3.3. Loss Function
The loss function is carefully crafted to preserve the original character identity features while ensuring high image quality. It comprises three key components: generative adversarial loss, image reconstruction loss, and cycle consistency loss.
3.3.1. Generative Adversarial Loss
The generative adversarial loss of CycleGAN is shown in Equation (5):

L_GAN(G, D_Y, X, Y) = E_{y∼p_data(y)}[log D_Y(y)] + E_{x∼p_data(x)}[log(1 − D_Y(G(x)))]    (5)

where y denotes a sample in domain Y and x denotes a sample in domain X. D_Y(y) denotes the score of the real sample y under the discriminator D_Y; the closer it is to 1, the more real the discriminator considers this sample. G(x) is the sample generated by the generator from x in the same distribution as Y. D_Y(G(x)) is the score of the discriminator on the generated sample; if D_Y considers the generated sample false, then the closer D_Y(G(x)) is to 0, the closer 1 − D_Y(G(x)) is to 1. The stronger the discriminator, the better it distinguishes the real y from the G(x) generated from x, and the larger this loss value will be.
3.3.2. Image Reconstruction Loss
In this paper, to enhance the accuracy of CycleGAN, the image reconstruction loss is incorporated into the generator’s loss function alongside the generative adversarial loss. This loss component measures the disparity between the generator input image and its reconstructed counterpart. To quantify this similarity, the Structural Similarity Index Method (SSIM) is introduced. The structural similarity between the input image x and its reconstructed image x̂ is defined as shown in Equation (6):

SSIM(x, x̂) = ((2 μ_x μ_x̂ + c_1)(2 σ_{x x̂} + c_2)) / ((μ_x² + μ_x̂² + c_1)(σ_x² + σ_x̂² + c_2))    (6)

where μ_x and μ_x̂ denote the mean values of x and x̂, respectively, σ_x² and σ_x̂² denote the variances of x and x̂, respectively, and σ_{x x̂} denotes the covariance of x and x̂, where c_1 = (k_1 L)² and c_2 = (k_2 L)² are small constants that stabilize the division, with L the dynamic range of the pixel values.
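A single-window version of Equation (6) can be checked numerically. This global variant (one window over the whole image) omits the sliding window used in practice and is only a sanity check:

```python
import numpy as np

def ssim(x, y, data_range=1.0, k1=0.01, k2=0.03):
    """Global SSIM per Equation (6), computed over the whole image at once."""
    c1, c2 = (k1 * data_range) ** 2, (k2 * data_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))

rng = np.random.default_rng(0)
x = rng.random((64, 64))                                          # stand-in input image
x_rec = np.clip(x + 0.05 * rng.standard_normal(x.shape), 0.0, 1.0)  # noisy reconstruction

print(ssim(x, x))      # 1.0 for identical images
print(ssim(x, x_rec))  # below 1.0 for an imperfect reconstruction
# A reconstruction loss can then be defined as 1 - SSIM.
```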
To address artifact issues, additional constraints are introduced into the loss function. The objective function of the stylization module incorporates a Laplacian regular term. This term aims to retain the clarity and fidelity of details and textures in the generated image, preserving the nuanced features of the original content image. It serves to smooth noise while maintaining these details, contributing to high-quality image results. The Laplacian regular term is illustrated in Equation (8):

R(X) = tr(XᵀLX) = (1/2) Σ_{i,j} w_{ij}(x_i − x_j)²,  L = D − W    (8)

where X and Xᵀ denote the matrix and its transpose, respectively, W denotes the symmetric weight matrix (with D its diagonal degree matrix), and x_i and x_j denote the elements in the matrix X.
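The equivalence between the trace form and the pairwise form of the Laplacian regular term can be verified on a toy example (the matrices below are illustrative, not taken from the paper):

```python
import numpy as np

def laplacian_regularizer(X, W):
    """tr(X^T L X) with L = D - W; equals 0.5 * sum_ij w_ij * ||x_i - x_j||^2."""
    D = np.diag(W.sum(axis=1))  # degree matrix of the symmetric weight matrix W
    L = D - W
    return np.trace(X.T @ L @ X)

# Three "pixels" with two features each and a symmetric weight matrix.
X = np.array([[0.0, 1.0],
              [0.0, 1.0],
              [1.0, 0.0]])
W = np.array([[0.0, 1.0, 0.5],
              [1.0, 0.0, 0.5],
              [0.5, 0.5, 0.0]])

# Pairwise form, for cross-checking the trace identity.
pairwise = 0.5 * sum(W[i, j] * np.sum((X[i] - X[j]) ** 2)
                     for i in range(3) for j in range(3))
print(laplacian_regularizer(X, W), pairwise)  # both 2.0
```

Because similar pixels (large w_ij) are penalized for differing, minimizing this term smooths noise while leaving dissimilar regions (small w_ij) free to differ, which matches its artifact-suppression role described above.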
3.4. Algorithm
Based on the above model and loss function, the algorithm flow of this paper is shown in Algorithm 1.
Algorithm 1 CycleGAN Algorithm Flow
Input: Gallery X, gallery Y, generator G with parameters θ_G, discriminator D_Y with parameters θ_{D_Y}, generator F with parameters θ_F, discriminator D_X with parameters θ_{D_X}, weight factor λ, maximum number of iterations T, number of rounds after which the learning rate starts to decay T_d, current iteration number t, generator initial learning rate η_G, and discriminator initial learning rate η_D.
Output: Stylized photo.
step 1: Initialization: initialize θ_G, θ_F, θ_{D_X}, and θ_{D_Y}.
step 2: Updating: t ← t + 1.
step 3: Sample images from X and Y, generate corresponding outputs, and label real samples as 1 and generated samples as 0.
step 4: Input generated samples to generators G and F to obtain reconstructed images and calculate reconstruction losses.
step 5: Input generated and real samples to discriminators D_X and D_Y. Compute and minimize the discriminator objective function. Update discriminator weights using back-propagation and improved Adam optimization.
step 6: Minimize the generator objective function and update the weights of generators G and F using back-propagation and improved Adam optimization.
step 7: Adjust the learning rate based on the current iteration count t and T_d.
step 8: Repeat steps 3–7 until the maximum number of iterations T is reached.
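Steps 3–7 of the algorithm can be condensed into a single PyTorch training step. This is a sketch under the assumption of least-squares adversarial targets (real → 1, generated → 0); the paper's full objective additionally includes the SSIM reconstruction and Laplacian terms:

```python
import torch
import torch.nn as nn
import torch.nn.functional as Fn

def adv_loss(pred, is_real):
    """Adversarial target: 1 for real samples, 0 for generated samples."""
    target = torch.ones_like(pred) if is_real else torch.zeros_like(pred)
    return Fn.mse_loss(pred, target)

def train_step(x, y, G, F, D_X, D_Y, opt_g, opt_d, lam=10.0):
    fake_y, fake_x = G(x), F(y)

    # step 5: update discriminators on real vs. detached generated samples
    opt_d.zero_grad()
    d_loss = (adv_loss(D_Y(y), True) + adv_loss(D_Y(fake_y.detach()), False) +
              adv_loss(D_X(x), True) + adv_loss(D_X(fake_x.detach()), False))
    d_loss.backward()
    opt_d.step()

    # steps 4 and 6: adversarial + cycle-reconstruction terms for the generators
    opt_g.zero_grad()
    g_loss = (adv_loss(D_Y(fake_y), True) + adv_loss(D_X(fake_x), True) +
              lam * (Fn.l1_loss(F(fake_y), x) + Fn.l1_loss(G(fake_x), y)))
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()

# Smoke test with 1x1-conv stand-ins for the real networks.
g, f = nn.Conv2d(3, 3, 1), nn.Conv2d(3, 3, 1)
d_x, d_y = nn.Conv2d(3, 1, 1), nn.Conv2d(3, 1, 1)
opt_g = torch.optim.Adam(list(g.parameters()) + list(f.parameters()), lr=1e-4)
opt_d = torch.optim.Adam(list(d_x.parameters()) + list(d_y.parameters()), lr=1e-4)
d_loss, g_loss = train_step(torch.rand(1, 3, 8, 8), torch.rand(1, 3, 8, 8),
                            g, f, d_x, d_y, opt_g, opt_d)
```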
4. Experiment
To validate the effectiveness of the proposed method, several steps are undertaken. First, the method is implemented in the PyTorch framework. The main training and testing phases are then conducted on the WebCaricature cartoon face dataset, and the convergence of the training loss is analyzed. The experimental evaluation is divided into two parts. First, objective indicators are used to evaluate the proposed model and compare it against other network models, demonstrating the effectiveness of image stylization. Second, subjective evaluations are conducted: the proposed method is compared with the existing CycleGAN method, the comparison results are analyzed, and volunteers are recruited for a random survey whose results are analyzed accordingly. Finally, the experimental data are summarized and analyzed comprehensively to assess both the advantages and shortcomings of the proposed method.
4.1. Dataset
The dataset utilized in this study is the publicly available WebCaricature dataset, comprising 6042 caricature images and 5974 photo images from 252 individuals. The dataset includes images of varying resolutions, some in grayscale and others in RGB, and encompasses a range of artistic styles. Data preprocessing involves several steps such as face detection (key point detection), face alignment, and face normalization. During the training phase, data augmentation techniques like random flipping of images are employed to enhance the dataset.
For experimentation purposes, the dataset is randomly split into a training set of 202 identities (totaling 4804 photos and 4773 caricatures) and a test set of 50 identities (comprising 1170 photos and 1269 caricatures). The experimental results presented in this paper are based on the test-set identities, ensuring that the model’s performance is evaluated on unseen data. The facial keypoints used in the experiments are the 17 keypoints provided with the WebCaricature dataset. The experiments are conducted on a computer running Windows 10, equipped with an RTX 2080 Ti GPU and CUDA 12.0. The neural network is built with the PyTorch deep learning library, an open-source project developed by Facebook.
4.2. Implementation
The models in this research are trained and evaluated using the public WebCaricature dataset. Since the focus is on facial conversion, images in the dataset that include regions below the shoulders can introduce unwanted complexity during style transfer, so the facial images are preprocessed by cropping to isolate the facial region. During preprocessing, the image is first rotated using the line connecting the subject’s eyes as the horizontal reference line. The facial region containing the hair, ears, and chin is then cropped based on the facial keypoints. Specifically, an initial region box is created using the keypoints of the two ears, the top of the forehead, and the chin; this region box is expanded outward by 1.5 times to form the final cropping region. The resulting cropped images are resized to a uniform size, and new facial keypoints are computed accordingly.
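The alignment-and-crop procedure can be sketched with Pillow. The keypoint names and the exact box construction below are illustrative assumptions; the paper uses the 17 WebCaricature keypoints:

```python
import math
from PIL import Image

def align_and_crop(img, left_eye, right_eye, box_pts, expand=1.5):
    """Rotate so the eye line is horizontal, then crop an expanded keypoint box.

    left_eye/right_eye are (x, y) keypoints; box_pts are the keypoints (two
    ears, top of forehead, chin) that define the initial region box.
    """
    # Angle of the eye line relative to the horizontal, in degrees.
    dx, dy = right_eye[0] - left_eye[0], right_eye[1] - left_eye[1]
    angle = math.degrees(math.atan2(dy, dx))
    center = ((left_eye[0] + right_eye[0]) / 2, (left_eye[1] + right_eye[1]) / 2)
    img = img.rotate(angle, center=center, resample=Image.BILINEAR)

    # Initial region box from the keypoints, expanded outward by `expand`.
    xs, ys = [p[0] for p in box_pts], [p[1] for p in box_pts]
    cx, cy = (min(xs) + max(xs)) / 2, (min(ys) + max(ys)) / 2
    half_w = (max(xs) - min(xs)) / 2 * expand
    half_h = (max(ys) - min(ys)) / 2 * expand
    return img.crop((int(cx - half_w), int(cy - half_h),
                     int(cx + half_w), int(cy + half_h)))

face = Image.new("RGB", (400, 400))  # placeholder image with made-up keypoints
crop = align_and_crop(face, (150, 180), (250, 190),
                      [(100, 150), (300, 150), (200, 80), (200, 320)])
```

After this crop, the image would be resized to the network's input resolution and the keypoint coordinates remapped accordingly.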
The weights assigned to different components of the loss function play a crucial role in determining the quality of the generated images. In this algorithm, based on the CycleGAN framework, the cyclic consistency loss is prioritized to preserve texture features, typically being 5 to 12 times greater than the adversarial loss in the image stylization task. The image reconstruction loss, complementing the cyclic consistency loss, is usually smaller than the adversarial loss.
Through extensive experimentation and comparisons, the optimal weight settings for the loss components are determined as presented in
Table 1. The model training process involves a batch size of 4 for 30,000 steps, with an initial learning rate of 0.0001 that is halved every 5000 training steps.
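The stated schedule (initial learning rate 0.0001, halved every 5000 training steps) maps directly onto PyTorch's StepLR; the single dummy parameter below is only there to give the optimizer something to hold:

```python
import torch

params = [torch.nn.Parameter(torch.zeros(1))]  # dummy parameter for the sketch
opt = torch.optim.Adam(params, lr=1e-4)
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=5000, gamma=0.5)

for step in range(1, 30001):
    opt.step()    # the parameter update would happen here
    sched.step()  # decay is applied every 5000 steps
    if step % 5000 == 0:
        print(step, opt.param_groups[0]["lr"])
# lr halves at each boundary: 5e-05 at step 5000, down to 1.5625e-06 at step 30000
```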
4.3. Results
The development of generative adversarial networks has faced challenges in establishing objective evaluation metrics for assessing picture quality, making it difficult to define standardized evaluation criteria. As a result, evaluating the outcomes of GAN experiments primarily relies on subjective comparisons, supported by objective assessments when possible.
4.3.1. Subjective Evaluation
Subjective evaluation involves visually assessing an image and assigning a subjective quality rating based on personal perception. This method relies on statistical analysis and requires multiple observers for reliable results. While subjective evaluation aligns with common human perceptions and is applicable to various image types, it is labor-intensive, costly, and susceptible to human biases, leading to potential result variations. Typically, it is suitable for evaluating a small number of images, while objective methods are preferred for larger datasets.
Many mainstream algorithms, as seen in references [36,37,38,39], employ absolute evaluation to assess generated images. Absolute evaluation compares the generated image against a reference (the real source image), using methods like the Double Stimulus Continuous Quality Scale (DSCQS). In DSCQS, observers alternate between the source and evaluated images, noting differences and scoring based on predefined criteria, such as the full superiority scale shown in Table 2. The Mean Opinion Score (MOS) is then calculated to compare the performance of different methods.
The experimental results of this paper were compared with both traditional CycleGAN algorithms and excellent methods that have emerged in the past five years. The results are shown in
Figure 7. When performing style transfer on general facial images (such as the first and second columns), the method proposed in this paper achieves very natural color transitions and preserves the original texture information without producing uneven color patches (for example, rows 2, 4, 6 in the first column, and rows 2, 3, 4, 5 in the second column), and does not ignore facial details (for example, row 3 in the first column). When processing style transfer for facial images with more prominent highlights (such as the third and fourth columns), our method can preserve highlight details while not losing other facial details, which other algorithms fail to achieve. When processing facial images that may produce artifacts during style transfer (such as the fifth and sixth columns), the method proposed in this paper can eliminate artifact phenomena at the edges of portraits caused by lighting conditions and the transfer process, while fully retaining the edge information of the portraits. Other methods are notably inferior in this aspect.
The above comparative experiments demonstrate the advanced nature of this paper in addressing portrait style transfer problems.
To evaluate visual quality based on MOS criteria (refer to
Table 2), 30 individuals with diverse professional backgrounds were chosen, assessing 50 generated images from both traditional CycleGAN and the proposed method. Image quality evaluation focuses on detecting color shifts, texture distortions, high-frequency noise, or other defects. During testing, subjects viewed and rated images from both methods, and mean and standard deviation scores were calculated for comparison (see
Table 3).
The research data indicate that the Improved Detail-Enhancement CycleGAN method proposed in this paper consistently achieves higher scores in both mean and standard deviation, signifying its superiority in imaging quality over other methods in terms of subjective vision. It effectively addresses issues such as incomplete style transfer, unnatural color transitions, and image artifacts prevalent in other methods. These experimental findings strongly validate the effectiveness of the proposed method outlined in this paper.
4.3.2. Objective Evaluation
In this paper, to comprehensively analyze the generation effect of face images, three objective evaluation metrics are employed. These metrics include the Frechet Inception Distance score (FID), Structural Similarity Index (SSIM), and the multi-scale image quality assessment SSIM method (MS-SSIM). These metrics are utilized to provide a thorough and quantitative assessment of the generated face images’ quality, ensuring a comprehensive evaluation of the model’s performance.
(a) FID
FID, a commonly used evaluation metric for generative adversarial networks (GANs), models the feature distributions of real and generated images as Gaussian distributions and measures the distance between them from their means and covariance matrices, as illustrated in Equation (9):

FID(x, g) = ||μ_x − μ_g||² + Tr(Σ_x + Σ_g − 2(Σ_x Σ_g)^{1/2})    (9)

where x denotes the real images, g denotes the generated images, μ denotes the mean, Σ denotes the covariance matrix, and Tr(·) denotes the trace of the matrix.
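Equation (9) can be computed directly from the feature statistics. In practice μ and Σ come from Inception features of the real and generated images; the toy statistics below are illustrative:

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(mu_x, sigma_x, mu_g, sigma_g):
    """FID = ||mu_x - mu_g||^2 + Tr(Sigma_x + Sigma_g - 2 (Sigma_x Sigma_g)^{1/2})."""
    covmean = sqrtm(sigma_x @ sigma_g)
    if np.iscomplexobj(covmean):  # discard tiny imaginary parts from sqrtm
        covmean = covmean.real
    diff = mu_x - mu_g
    return diff @ diff + np.trace(sigma_x + sigma_g - 2.0 * covmean)

# Identical feature statistics give FID = 0; shifting the mean raises it.
mu = np.zeros(4)
sigma = np.eye(4)
print(fid(mu, sigma, mu, sigma))        # 0.0
print(fid(mu, sigma, mu + 1.0, sigma))  # 4.0
```

Lower FID therefore indicates that the generated-image distribution lies closer to the real-image distribution, which is how the scores in Table 4 should be read.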
(b) SSIM
In real visual scenes, images exhibit a strong correlation, and there is a significant relationship between pixels within the same image. The structural similarity metric quantifies this similarity by measuring various information within the image, such as grayscale range, to determine the degree of resemblance between two images.
(c) MS-SSIM
In addition, the visibility of image details is influenced by factors such as image signal acquisition density, distance from the image plane to the observer, and the perceptual ability of the observer’s visual system. As these factors vary, the evaluation of an image also changes. Therefore, to complement the SSIM metric, this paper introduces another image quality assessment method, MS-SSIM (Multi-Scale Structural Similarity Index Method). MS-SSIM combines measurements at different scales to obtain an overall evaluation, as shown in Equation (10):

MS-SSIM(x, y) = [l_M(x, y)]^{α_M} · ∏_{j=1}^{M} [c_j(x, y)]^{β_j} · [s_j(x, y)]^{γ_j}    (10)

where function l represents the luminance comparison of the image, function s represents the structure comparison of the image, and function c represents the contrast comparison of the image; the parameters α_M, β_j, and γ_j are used to adjust the relative importance of the different components, and M is the number of scales.
For the experimental results mentioned above, FID was calculated separately, and the results are presented in
Table 4. From the computational results, it is evident that the method proposed in this paper achieves better scores in terms of FID. Particularly when compared to the traditional CycleGAN, the algorithm proposed in this paper shows a 26% improvement in FID scores. This is sufficient to illustrate the advantage of the proposed algorithm in preserving the integrity of facial features in portraits.
For the experimental results mentioned above, SSIM and MS-SSIM were calculated separately, and the results are presented in
Table 5, providing a detailed quantitative analysis of our method’s performance. As can be clearly observed from the data, the method proposed in this paper consistently achieves superior scores in both SSIM and MS-SSIM metrics compared to other state-of-the-art approaches. This superiority is evident across various test cases and style transfer scenarios. The higher scores indicate that our method more effectively preserves the structural information and perceptual quality of the original images during the style transfer process. Furthermore, these results strongly suggest that the images generated by our method exhibit a distribution that closely aligns with that of real images, demonstrating the effectiveness of our approach in producing realistic and high-quality style-transferred facial images.
4.4. Ablation Study
To analyze the importance and role of each module, ablation experiments were conducted in this paper. The result of the ablation study is shown in
Figure 8.
In
Figure 8, we analyze our model using different variables. It can be observed that in the structure using only AdaIN (the first row in
Figure 8), although most of the content is preserved, there are issues like extremely uneven patches in sample 1 (the first column) and missing details in the eyebrows. Sample 3 (the third column) exhibits extremely uneven skin color, with visible artifacts at the face edges, and sample 4 (the fourth column) shows a noticeable loss of eye details and inconsistent hair color trends. The structure with AdaLIN (the second row in
Figure 8) shows improvements in certain aspects, such as sample 2 (the second column). However, artifacts around the face are evident in sample 1, affecting the overall image quality. This issue is also visible in sample 3 and sample 6 (the sixth column), and sample 5 (the fifth column) shows poorly generated style. Our proposed method using AdaLIN+Laplacian (the last row in
Figure 8) further improves detail and image generation quality. Artifacts at the face edges in sample 1 and sample 6 are eliminated, and the eyebrow details in sample 1 are well presented. Sample 2 preserves eye details perfectly with rich colors, aligning with human aesthetics. Sample 3 and sample 5 exhibit more natural skin tone transitions, clearly displaying facial wrinkle information.
The scores of objective and subjective evaluations for each part of the ablation experiment are shown in
Table 6 and
Table 7, respectively.
By analyzing the results of the ablation experiments and comparing the second and third rows of
Figure 8, we can understand the roles of the AdaLIN module and the Laplacian module. When the original image has noticeable highlights on the face, the model with the AdaLIN module preserves the details of these highlights during style transfer, resulting in a more realistic and delicate color representation in the generated image. Similarly, comparing the second and fourth rows of
Figure 8, we observe that when there are artifacts from noise in the original image, the model with the Laplacian module handles the color transition at the portrait edges better during style transfer. This improves the overall quality by avoiding artifacts and unnatural color transitions, leading to a more refined generated image. From an objective evaluation perspective, the FID scores of the proposed method in this paper are improved by 20.88% and 17.47%, respectively, highlighting the significant impact of both the AdaLIN module and the Laplacian module on the imaging effect.
4.5. Discussion
Comprehensive empirical evaluations demonstrate that the proposed methodology has yielded efficacious outcomes in style transfer for a preponderance of facial images, particularly in addressing intricate details within facial highlights and mitigating image edge artifacts. Nevertheless, our experimental investigations have revealed certain limitations in specific scenarios, primarily encompassing the following aspects:
While our approach performs robustly on images of Caucasian and Asian subjects, it tends to lose detail when processing images of individuals with darker skin tones. This can be attributed primarily to biases in the training data: many facial style transfer models are trained predominantly on datasets of Caucasian or Asian subjects, resulting in suboptimal recognition and processing of facial features characteristic of darker-skinned individuals. Furthermore, darker skin exhibits more nuanced tonal and textural variations, which makes accurately capturing and transforming these subtle details more challenging.
Our method achieves commendable results in style transfer for adult facial images; however, its performance is suboptimal when applied to children’s facial images. This discrepancy can be attributed to several factors: firstly, the significant morphological differences between children’s and adults’ facial features, such as more rounded face shapes, proportionally larger eyes, and distinct forehead-to-face ratios, which may lead to inadequate performance of models trained exclusively on adult data. Secondly, the inherently smoother texture of children’s skin, lacking the fine lines and textural complexities present in adult faces, may result in either loss or excessive addition of texture during processing. Lastly, the style transfer process may inadvertently introduce age-inappropriate features, effectively “aging” children’s faces and diminishing age-specific characteristics.
Additionally, our current methodology exhibits limitations in terms of personalization and user control. The existing framework autonomously learns and selects styles corresponding to facial images through the model, without providing users the opportunity to actively participate in style selection during the transformation process. This constraint may not adequately address the diverse preferences and requirements of all users.
These identified limitations provide valuable insights for future research directions and potential enhancements to our proposed method.
5. Conclusions
In this study, we present an Improved Detail-Enhancement CycleGAN methodology designed to address the complexities of style transfer in facial images. Our approach integrates the AdaLIN technique to mitigate detail attenuation caused by facial highlights, while incorporating Laplacian regularization to minimize post-transfer artifacts induced by image noise. To rigorously assess the efficacy of the proposed method, we conducted a comprehensive comparative analysis against state-of-the-art approaches published within the past three years. Empirical findings demonstrate that the facial images generated by our method significantly outperform those produced by existing methods, as evidenced by superior scores in both subjective and objective evaluation metrics. These advancements represent a substantial contribution to the field of facial image style transfer. However, we acknowledge that our method has limitations that warrant further investigation in future studies. One research direction is to enhance style prominence while maintaining facial integrity, which may involve exploring adaptive style-weight mechanisms or multi-scale style transfer approaches. Another is to increase personalization and user control, allowing users to adjust style intensity or select specific facial features for style emphasis, which may require developing interactive interfaces. By pursuing these directions, we aim not only to overcome the current limitations of our method but also to push the boundaries of facial style transfer technology, opening up new possibilities in digital art, entertainment, and human–computer interaction.