1. Introduction
Facial style transfer, a humorous art form, captivates audiences with its unique interpretations of human features [1]. Artists use abstract strokes, fluid lines, and exaggerated expressions to create stylized portraits, revealing personality and emotion in ways traditional portraiture might miss [2]. However, this process is complex and heavily dependent on the artist’s creative inspiration, requiring skill, time, and multiple iterations [3]. In the era of self-media, where content creation is ubiquitous, this traditional approach struggles to meet the demand for instant, personalized content [4].
With advancements in deep learning and image-to-image translation, AI-powered facial style transfer has gained attention. These technologies promise near-instantaneous stylization, democratizing access to this art form and allowing users to experiment with various styles effortlessly [5]. The main challenge lies in balancing style transformation with identity preservation. Algorithms must accurately replicate artistic styles while maintaining the subject’s core facial features. This requires sophisticated techniques to capture style essences and apply them without compromising recognizability [6].
The main difficulty in facial style transfer lies in both the stylistic transformation of the input portrait features and maintaining the identity consistency of the generated image with the input photo [7,8]. Before the emergence of deep learning, the facial style transfer problem was primarily addressed with traditional methods such as interactive techniques, rule-based facial-feature contour deformation, example-based learning, and database-based style matching. For instance, [9] depicts the face shape and facial-feature contours by locating key facial points, correcting the contours with geometric template detection to automatically generate a line drawing. Similarly, [10] divides the image into overlapping blocks of the same size with a sliding window and employs a Markov network model to find the stylized image block that best matches each face image block, synthesizing these blocks into a complete stylized face image. However, most of these methods cannot achieve a fully automated facial stylization process and suffer from complex rules, monotonous generation results, and a lack of creativity. With the development of deep learning, these problems are being resolved one by one [11,12,13].
Deep learning-based methods for facial style transfer can be categorized into two main types: neural style transfer (NST) and generative adversarial network (GAN)-based methods [14,15]. Ref. [16] pioneered the NST approach, using convolutional neural networks (CNNs) to separate and reorganize image content and style. With NST, an image’s style can be modified while preserving its content, achieving artistic re-creation. However, numerous experiments have shown that CNN-based methods tend to focus primarily on the transfer of image texture, color, and pixel-level style features. Additionally, CNNs are often slow to train, insensitive to fine details, and struggle to achieve comprehensive stylization of face images [17].
In recent years, generative adversarial networks (GANs) have garnered widespread attention due to their powerful image generation capabilities [18,19]. GAN is an unsupervised learning framework proposed by Goodfellow et al. in 2014, consisting of a generator G and a discriminator D. The generator converts random noise into target images, while the discriminator judges whether an image is real or generated by the generator. When the generator and discriminator reach a balance through adversarial training, the generator can produce realistic fake images. Among GAN-based methods, CycleGAN [20] has become a research hotspot in the field of portrait style transfer due to its convenient data acquisition, ability to achieve diverse transformations, and excellent generalization ability.
Research based on CycleGAN addresses the difficulty of acquiring paired datasets and implements a transformation process from portrait photographs to stylized images [21]. However, these methods encounter several issues. Many current approaches rely on overly uniform normalization methods that fail to adapt to varying learning rates. Moreover, their regularization techniques often inadequately control model complexity, resulting in problems such as overfitting. Consequently, generated images may lack texture details, exhibit unevenly colored patches in facial highlights, introduce artifacts from noisy input data, or appear unnaturally saturated with color [22,23,24], as shown in Figure 1. Compared to ordinary images, the face structure is more delicate and requires intensive transformation during the style transfer process, making existing methods unsuitable for generating stylized portraits. To address this issue, and considering the difficulty of obtaining paired datasets, this paper builds an unpaired-dataset approach on CycleGAN. It utilizes an improved generative network and discriminative network for adversarial training and balancing, while adopting semantic constraints to ensure similarity between the generated image and the real photo. This approach retains rich detail information and preserves identity details, resulting in more lifelike and richly detailed stylized portraits. The work carried out includes the following three points:
In this paper, we propose an improved CycleGAN using AdaLIN, which can select appropriate Instance Normalization (IN) and Layer Normalization (LN) parameter weights during the training process. This approach balances the contradiction between style and content, enhances image details, and preserves the detailed features of the original picture, particularly excelling at relocating facial highlights.
We introduce a Laplacian regular module for denoising the image. This approach helps to prevent noise features from affecting the color transfer of the image and reduces the color artifacts caused by noise.
Compared to other methods, the Improved Detail-Enhancement CycleGAN using AdaLIN proposed in this paper yields superior results in image detail enhancement and artifact removal. This advancement significantly enhances the quality of the generated face images.
The rest of this paper is structured as follows: The related works are introduced in
Section 2.
Section 3 describes the methods proposed in this paper in detail.
Section 4 shows the experiment results and evaluations.
Section 5 concludes this paper and describes our future work.
3. The Proposed Method
CycleGAN is an unsupervised generative adversarial network that works by training two pairs of generator–discriminator models to facilitate image transformations between different domains. The key technique of CycleGAN is cycle consistency, introduced in [47]: when the two generators are applied sequentially, the resulting image should be very similar to the original image, which is enforced through an L1 loss. The cycle-consistency loss is thus crucial to prevent the generator from mapping the image into a domain completely unrelated to the original image.
The model comprises two mapping functions, G: X → Y and F: Y → X, along with corresponding adversarial discriminators D_Y and D_X. D_Y encourages G to translate images from domain X into the style of domain Y, and vice versa. Additionally, two cycle-consistency loss functions are introduced to ensure that the translated style can be reverted to its original state after the inverse translation, thereby regularizing the mapping.
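The cycle-consistency objective described above can be sketched in PyTorch (the paper's framework). The identity "generators" below are placeholders used only to exercise the shape of the loss; in the actual model, G and F are the trained translation networks:

```python
import torch
import torch.nn as nn

# Placeholder generators standing in for trained G: X -> Y and F: Y -> X.
G = nn.Identity()
F = nn.Identity()

l1 = nn.L1Loss()

def cycle_consistency_loss(x, y, lam=10.0):
    """L_cyc = E[||F(G(x)) - x||_1] + E[||G(F(y)) - y||_1], weighted by lambda."""
    loss_x = l1(F(G(x)), x)   # forward cycle:  x -> G(x) -> F(G(x)) should recover x
    loss_y = l1(G(F(y)), y)   # backward cycle: y -> F(y) -> G(F(y)) should recover y
    return lam * (loss_x + loss_y)

x = torch.rand(1, 3, 256, 256)
y = torch.rand(1, 3, 256, 256)
print(cycle_consistency_loss(x, y).item())  # 0.0 for identity generators
```

The weighting factor `lam` (commonly 10 in CycleGAN implementations) controls how strongly the cycle term dominates the adversarial terms.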
3.1. Model Structure
The framework of this paper is based on CycleGAN, whose specific structure is shown in Figure 2; in this network, the source domain is a realistic image and the target domain is a stylized image. D_X is the source domain discriminator, D_Y is the target domain discriminator, G is the source-to-target generator, and F is the target-to-source generator. G(x) is the image generated from the source domain into the target domain, and F(G(x)) is the source-domain image generated from G(x) through F. y is the target domain image of the previous round, which serves as the input image (i.e., as the source domain image of this part) in the symmetric loop. F(y) is the image generated from y through F, and G(F(y)) is the target domain image generated by the loop. The network is composed of two symmetric loops: in the upper loop, the input image x is passed as the source domain through the generator G to obtain the target domain image G(x); this newly generated G(x) then produces the source domain image F(G(x)) through F. The distance between F(G(x)) and the original image x is computed to define the mapping relationship that the unpaired dataset did not originally have.
The two inputs x and y are the real image and the image generated by the generator, respectively. The discriminators D_X and D_Y score each input image in [0, 1] to differentiate whether it is a realistic image or an image generated by the generator.
3.2. Stylization Module
The structure of the facial style conversion module is shown in Figure 3, comprising a generator and a discriminator. This module extends the basic cycle-consistent generative adversarial network by integrating three components into its generator: a content encoder, a style encoder, and a decoder. Figure 4 provides a detailed illustration of the generator’s specific structure.
The content encoder within the generator compresses the image and extracts feature maps using four downsampling convolutional layers, followed by two ResNet blocks for further processing, mapping the input content image to its content encoding. Each convolutional layer is followed by Instance Normalization (IN). The style encoder comprises five convolutional layers and uses Average Pooling to vectorize the style image, producing the style encoding. The decoder then processes the content encoding through a series of AdaLIN residual blocks and ultimately reconstructs the output image using several upsampling convolutional layers.
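A structural sketch of this generator is given below, following the layer counts stated above. The channel widths and the style-code dimension are illustrative assumptions, and the AdaLIN residual blocks of the paper's decoder are replaced by plain IN residual blocks here for brevity:

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, 1, 1), nn.InstanceNorm2d(ch), nn.ReLU(True),
            nn.Conv2d(ch, ch, 3, 1, 1), nn.InstanceNorm2d(ch))
    def forward(self, x):
        return x + self.body(x)

class ContentEncoder(nn.Module):
    """Four strided convs (each followed by IN) + two ResNet blocks."""
    def __init__(self, in_ch=3, base=64):
        super().__init__()
        layers, ch = [], in_ch
        for out in (base, base * 2, base * 4, base * 4):  # channel plan is illustrative
            layers += [nn.Conv2d(ch, out, 4, 2, 1), nn.InstanceNorm2d(out), nn.ReLU(True)]
            ch = out
        layers += [ResBlock(ch), ResBlock(ch)]
        self.net = nn.Sequential(*layers)
    def forward(self, x):
        return self.net(x)

class StyleEncoder(nn.Module):
    """Five convs + global average pooling -> style vector."""
    def __init__(self, in_ch=3, base=64, style_dim=8):
        super().__init__()
        layers, ch = [], in_ch
        for out in (base, base * 2, base * 4, base * 4, base * 4):
            layers += [nn.Conv2d(ch, out, 4, 2, 1), nn.ReLU(True)]
            ch = out
        self.net = nn.Sequential(*layers)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Conv2d(ch, style_dim, 1)
    def forward(self, x):
        return self.fc(self.pool(self.net(x)))  # (N, style_dim, 1, 1)

class Decoder(nn.Module):
    """Residual blocks (AdaLIN in the paper; plain IN here) + upsampling convs."""
    def __init__(self, ch=256, out_ch=3, n_up=4):
        super().__init__()
        layers = [ResBlock(ch), ResBlock(ch)]
        for _ in range(n_up):
            nxt = max(ch // 2, 64)
            layers += [nn.Upsample(scale_factor=2),
                       nn.Conv2d(ch, nxt, 3, 1, 1), nn.InstanceNorm2d(nxt), nn.ReLU(True)]
            ch = nxt
        layers += [nn.Conv2d(ch, out_ch, 7, 1, 3), nn.Tanh()]
        self.net = nn.Sequential(*layers)
    def forward(self, c):
        return self.net(c)

x = torch.rand(1, 3, 256, 256)
content = ContentEncoder()(x)  # (1, 256, 16, 16) content encoding
style = StyleEncoder()(x)      # (1, 8, 1, 1) style encoding
out = Decoder()(content)       # (1, 3, 256, 256) reconstructed image
```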
AdaLIN dynamically adjusts the weighting of Instance Normalization (IN) and Layer Normalization (LN) parameters during training, leveraging the advantages of both techniques to selectively retain or alter content information. This capability enables the model to flexibly control the degree of shape and texture changes, thereby enhancing overall robustness. Affine transformation parameters γ and β are derived from the style encoding through a Multilayer Perceptron (MLP), while the learnable parameter ρ is used to fine-tune the ratio of Instance Normalization to Layer Normalization. Specifically, γ and β are obtained via the method illustrated in Figure 5.
As can be seen in Figure 5, the calculation of specific parameters in AdaLIN is shown in Equation (1):

AdaLIN(a, γ, β) = γ · (ρ · a_I + (1 − ρ) · a_L) + β    (1)

where

a_I = (a − μ_I)/√(σ_I² + ε),  a_L = (a − μ_L)/√(σ_L² + ε),

μ_I and σ_I² are the mean and variance of the feature map computed per channel (Instance Normalization statistics), μ_L and σ_L² are the mean and variance computed per layer (Layer Normalization statistics), ε is a small constant, and ρ is constrained to the range [0, 1] during training.
With AdaLIN, the core concept is to integrate style information into the content domain, facilitating more effective style transfer and yielding superior results.
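The AdaLIN computation above can be sketched as a PyTorch module. This follows the published AdaLIN formulation; the initial value of ρ and the tensor shapes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class AdaLIN(nn.Module):
    """Adaptive Layer-Instance Normalization (sketch).

    rho interpolates between Instance Norm (per-channel statistics) and
    Layer Norm (per-sample statistics); gamma and beta come from the style MLP.
    """
    def __init__(self, num_features, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.rho = nn.Parameter(torch.full((1, num_features, 1, 1), 0.9))

    def forward(self, x, gamma, beta):
        # Instance Norm statistics: over H, W for each channel.
        mu_i = x.mean(dim=(2, 3), keepdim=True)
        var_i = x.var(dim=(2, 3), keepdim=True, unbiased=False)
        # Layer Norm statistics: over C, H, W for each sample.
        mu_l = x.mean(dim=(1, 2, 3), keepdim=True)
        var_l = x.var(dim=(1, 2, 3), keepdim=True, unbiased=False)
        x_i = (x - mu_i) / torch.sqrt(var_i + self.eps)
        x_l = (x - mu_l) / torch.sqrt(var_l + self.eps)
        rho = self.rho.clamp(0.0, 1.0)  # keep the mixing weight in [0, 1]
        out = rho * x_i + (1 - rho) * x_l
        # gamma, beta: (N, C) affine parameters produced from the style code.
        return out * gamma.unsqueeze(2).unsqueeze(3) + beta.unsqueeze(2).unsqueeze(3)

x = torch.rand(2, 64, 32, 32)
gamma = torch.ones(2, 64)
beta = torch.zeros(2, 64)
y = AdaLIN(64)(x, gamma, beta)  # same shape as x, normalized then re-styled
```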
Conversely, the discriminator employs PatchGAN to provide feedback to the generator. The specific parameters used are detailed in
Figure 6.
In the traditional GAN model, the output of the discriminator D is a scalar between 0 and 1, representing the probability that the input is a real image. The PatchGAN output is no longer a scalar but an n × n matrix X. Each element represents a patch corresponding to a receptive field of the image, and the mean value over all patch scores represents the probability that the whole picture is real; such training makes the model pay more attention to the details of the image. In this paper, we use PatchGAN in place of the original whole-image discriminator and set the patch size to 70, so that 70 × 70 image blocks are discriminated. Except for the last convolutional layer, Instance Normalization and the Leaky ReLU activation function are used after every convolutional layer, with the slope α set to 0.2. Unlike ReLU, Leaky ReLU takes the value αx when the input is negative, so that it still has a weak activation effect for negative inputs.
This mechanism integrates local image features and overall image characteristics, discerning differences through each patch to extract and characterize local image features. This approach is conducive to achieving higher-resolution image production. Additionally, averaging the final classified feature maps enables comparison between real and generated images. This computational process is akin to weighted summation averaging of the entire image, allowing the discriminator in this paper to represent loss more reasonably than traditional discriminator networks, particularly for local image features with significant differences.
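A minimal sketch of such a patch-based discriminator is shown below. The layer widths follow the common 70 × 70 PatchGAN configuration and are assumptions, not the exact parameters of Figure 6:

```python
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    """70x70 PatchGAN sketch: each output element scores one receptive-field patch.

    As in the text, Instance Norm + LeakyReLU(0.2) follow every conv except the last.
    """
    def __init__(self, in_ch=3, base=64):
        super().__init__()
        layers, ch = [], in_ch
        for i, out in enumerate((base, base * 2, base * 4, base * 8)):
            stride = 2 if i < 3 else 1
            layers += [nn.Conv2d(ch, out, 4, stride, 1),
                       nn.InstanceNorm2d(out),
                       nn.LeakyReLU(0.2, True)]
            ch = out
        layers += [nn.Conv2d(ch, 1, 4, 1, 1)]  # final conv: 1-channel patch score map
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        patch_scores = self.net(x)                            # (N, 1, H', W') score map
        return patch_scores, patch_scores.mean(dim=(1, 2, 3))  # image-level mean score

D = PatchDiscriminator()
scores, image_score = D(torch.rand(1, 3, 256, 256))
print(scores.shape)  # torch.Size([1, 1, 30, 30]) for a 256x256 input
```

Note that a 256 × 256 input yields a 30 × 30 score map; the 70 × 70 figure refers to the receptive field of each score, not the size of the output matrix.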
3.3. Loss Function
The loss function is carefully crafted to preserve the original character identity features while ensuring high image quality. It comprises three key components: generative adversarial loss, image reconstruction loss, and cycle consistency loss.
3.3.1. Generative Adversarial Loss
The generative adversarial loss of CycleGAN is shown in Equation (5):

L_GAN(G, D_Y, X, Y) = E_{y∼p_data(y)}[log D_Y(y)] + E_{x∼p_data(x)}[log(1 − D_Y(G(x)))]    (5)

where y denotes a sample in domain Y and x denotes a sample in domain X. D_Y(y) denotes the score of the real sample y under the discriminator D_Y; the closer it is to 1, the more real the discriminator considers this sample. G(x) is the sample generated by the generator from x in the same distribution as Y. D_Y(G(x)) is the score of the discriminator on the generated sample; if D_Y considers the generated sample false, then the closer D_Y(G(x)) is to 0, the closer 1 − D_Y(G(x)) is to 1. The stronger the discriminator, the better it distinguishes the real y from the G(x) generated from x, and the larger this loss value will be.
3.3.2. Image Reconstruction Loss
In this paper, to enhance the accuracy of CycleGAN, the image reconstruction loss is incorporated into the generator’s loss function alongside the generative adversarial loss. This loss component measures the disparity between the generator input image and its reconstructed counterpart. To quantify this similarity, the Structural Similarity Index Method (SSIM) is introduced. The structural similarity between the input image x and its reconstructed image x̂ is defined as shown in Equation (6):

SSIM(x, x̂) = ((2 μ_x μ_x̂ + c_1)(2 σ_{x x̂} + c_2)) / ((μ_x² + μ_x̂² + c_1)(σ_x² + σ_x̂² + c_2))    (6)

where μ_x and μ_x̂ denote the mean values of x and x̂, respectively, σ_x² and σ_x̂² denote the variances of x and x̂, respectively, and σ_{x x̂} denotes the covariance of x and x̂, where c_1 = (k_1 L)² and c_2 = (k_2 L)² are small constants that stabilize the division, with L the dynamic range of the pixel values.
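A single-window version of Equation (6) can be checked numerically. This global variant (one window over the whole image) omits the sliding window used in practice and is only a sanity check:

```python
import numpy as np

def ssim(x, y, data_range=1.0, k1=0.01, k2=0.03):
    """Global SSIM per Equation (6), computed over the whole image at once."""
    c1, c2 = (k1 * data_range) ** 2, (k2 * data_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))

rng = np.random.default_rng(0)
x = rng.random((64, 64))                                          # stand-in input image
x_rec = np.clip(x + 0.05 * rng.standard_normal(x.shape), 0.0, 1.0)  # noisy reconstruction

print(ssim(x, x))      # 1.0 for identical images
print(ssim(x, x_rec))  # below 1.0 for an imperfect reconstruction
# A reconstruction loss can then be defined as 1 - SSIM.
```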
To address artifact issues, additional constraints are introduced into the loss function. The objective function of the stylization module incorporates a Laplacian regular term. This term aims to retain the clarity and fidelity of details and textures in the generated image, preserving the nuanced features of the original content image. It serves to smooth noise while maintaining these details, contributing to high-quality image results. The Laplacian regular term is illustrated in Equation (8):

R(X) = tr(XᵀLX) = (1/2) Σ_{i,j} w_{ij}(x_i − x_j)²,  L = D − W    (8)

where X and Xᵀ denote the matrix and its transpose, respectively, W denotes the symmetric weight matrix (with D its diagonal degree matrix), and x_i and x_j denote the elements in the matrix X.
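The equivalence between the trace form and the pairwise form of the Laplacian regular term can be verified on a toy example (the matrices below are illustrative, not taken from the paper):

```python
import numpy as np

def laplacian_regularizer(X, W):
    """tr(X^T L X) with L = D - W; equals 0.5 * sum_ij w_ij * ||x_i - x_j||^2."""
    D = np.diag(W.sum(axis=1))  # degree matrix of the symmetric weight matrix W
    L = D - W
    return np.trace(X.T @ L @ X)

# Three "pixels" with two features each and a symmetric weight matrix.
X = np.array([[0.0, 1.0],
              [0.0, 1.0],
              [1.0, 0.0]])
W = np.array([[0.0, 1.0, 0.5],
              [1.0, 0.0, 0.5],
              [0.5, 0.5, 0.0]])

# Pairwise form, for cross-checking the trace identity.
pairwise = 0.5 * sum(W[i, j] * np.sum((X[i] - X[j]) ** 2)
                     for i in range(3) for j in range(3))
print(laplacian_regularizer(X, W), pairwise)  # both 2.0
```

Because similar pixels (large w_ij) are penalized for differing, minimizing this term smooths noise while leaving dissimilar regions (small w_ij) free to differ, which matches its artifact-suppression role described above.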
3.4. Algorithm
Based on the above model and loss function, the algorithm flow of this paper is shown in Algorithm 1.
Algorithm 1 CycleGAN Algorithm Flow
Input: Gallery X, gallery Y, generator G with parameters θ_G, discriminator D_Y with parameters θ_{D_Y}, generator F with parameters θ_F, discriminator D_X with parameters θ_{D_X}, weight factor λ, maximum number of iterations T, number of rounds after which the learning rate starts to decay T_d, current iteration number t, generator initial learning rate η_G, and discriminator initial learning rate η_D.
Output: Stylized photo.
step 1: Initialization: initialize θ_G, θ_F, θ_{D_X}, and θ_{D_Y}.
step 2: Updating: t ← t + 1.
step 3: Sample images from X and Y, generate corresponding outputs, and label real samples as 1 and generated samples as 0.
step 4: Input generated samples to generators G and F to obtain reconstructed images and calculate reconstruction losses.
step 5: Input generated and real samples to discriminators D_X and D_Y. Compute and minimize the discriminator objective function. Update discriminator weights using back-propagation and improved Adam optimization.
step 6: Minimize the generator objective function and update the weights of generators G and F using back-propagation and improved Adam optimization.
step 7: Adjust the learning rate based on the current iteration count t and T_d.
step 8: Repeat steps 3–7 until the maximum number of iterations T is reached.
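Steps 3–7 of the algorithm can be condensed into a single PyTorch training step. This is a sketch under the assumption of least-squares adversarial targets (real → 1, generated → 0); the paper's full objective additionally includes the SSIM reconstruction and Laplacian terms:

```python
import torch
import torch.nn as nn
import torch.nn.functional as Fn

def adv_loss(pred, is_real):
    """Adversarial target: 1 for real samples, 0 for generated samples."""
    target = torch.ones_like(pred) if is_real else torch.zeros_like(pred)
    return Fn.mse_loss(pred, target)

def train_step(x, y, G, F, D_X, D_Y, opt_g, opt_d, lam=10.0):
    fake_y, fake_x = G(x), F(y)

    # step 5: update discriminators on real vs. detached generated samples
    opt_d.zero_grad()
    d_loss = (adv_loss(D_Y(y), True) + adv_loss(D_Y(fake_y.detach()), False) +
              adv_loss(D_X(x), True) + adv_loss(D_X(fake_x.detach()), False))
    d_loss.backward()
    opt_d.step()

    # steps 4 and 6: adversarial + cycle-reconstruction terms for the generators
    opt_g.zero_grad()
    g_loss = (adv_loss(D_Y(fake_y), True) + adv_loss(D_X(fake_x), True) +
              lam * (Fn.l1_loss(F(fake_y), x) + Fn.l1_loss(G(fake_x), y)))
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()

# Smoke test with 1x1-conv stand-ins for the real networks.
g, f = nn.Conv2d(3, 3, 1), nn.Conv2d(3, 3, 1)
d_x, d_y = nn.Conv2d(3, 1, 1), nn.Conv2d(3, 1, 1)
opt_g = torch.optim.Adam(list(g.parameters()) + list(f.parameters()), lr=1e-4)
opt_d = torch.optim.Adam(list(d_x.parameters()) + list(d_y.parameters()), lr=1e-4)
d_loss, g_loss = train_step(torch.rand(1, 3, 8, 8), torch.rand(1, 3, 8, 8),
                            g, f, d_x, d_y, opt_g, opt_d)
```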
4. Experiment
To validate the effectiveness of the proposed method, several steps are undertaken. First, the method is implemented in the PyTorch framework. The main training and testing phases are then conducted on the WebCaricature cartoon face dataset, and the convergence of the training loss is analyzed. The experimental evaluation is divided into two parts. First, objective indicators are used to evaluate the proposed model and compare it against other network models, demonstrating the effectiveness of image stylization. Second, subjective evaluations are conducted: the proposed method is compared with the existing CycleGAN method, the comparison results are analyzed, and volunteers are recruited for a random survey whose results are analyzed accordingly. Finally, the experimental data are summarized and analyzed comprehensively to assess both the advantages and shortcomings of the proposed method.
4.1. Dataset
The dataset utilized in this study is the publicly available WebCaricature dataset, comprising 6042 caricature images and 5974 photo images from 252 individuals. The dataset includes images of varying resolutions, some in grayscale and others in RGB, and encompasses a range of artistic styles. Data preprocessing involves several steps such as face detection (key point detection), face alignment, and face normalization. During the training phase, data augmentation techniques like random flipping of images are employed to enhance the dataset.
For experimentation purposes, the dataset is randomly split into a training set of 202 identities (totaling 4804 photos and 4773 caricatures) and a test set of 50 identities (comprising 1170 photos and 1269 caricatures). The experimental results presented in this paper are based on the test-set identities, ensuring that the model’s performance is evaluated on unseen data. The facial keypoints used in the experiments are the 17 keypoints provided with the WebCaricature dataset. The experiments are conducted on a computer running Windows 10, equipped with an RTX 2080 Ti GPU and CUDA 12.0. The neural network is built with the PyTorch deep learning library, an open-source project developed by Facebook.
4.2. Implementation
The models in this research are trained and evaluated using the public WebCaricature dataset. Since the focus is on facial conversion, images in the dataset that include regions below the shoulders can introduce unwanted complexity during style transfer, so the facial images are preprocessed by cropping to isolate the facial region. During preprocessing, the image is first rotated using the line connecting the subject’s eyes as the horizontal reference line. The facial region containing the hair, ears, and chin is then cropped based on the facial keypoints. Specifically, an initial region box is created using the keypoints of the two ears, the top of the forehead, and the chin; this region box is expanded outward by 1.5 times to form the final cropping region. The resulting cropped images are resized to a uniform size, and new facial keypoints are computed accordingly.
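The alignment-and-crop procedure can be sketched with Pillow. The keypoint names and the exact box construction below are illustrative assumptions; the paper uses the 17 WebCaricature keypoints:

```python
import math
from PIL import Image

def align_and_crop(img, left_eye, right_eye, box_pts, expand=1.5):
    """Rotate so the eye line is horizontal, then crop an expanded keypoint box.

    left_eye/right_eye are (x, y) keypoints; box_pts are the keypoints (two
    ears, top of forehead, chin) that define the initial region box.
    """
    # Angle of the eye line relative to the horizontal, in degrees.
    dx, dy = right_eye[0] - left_eye[0], right_eye[1] - left_eye[1]
    angle = math.degrees(math.atan2(dy, dx))
    center = ((left_eye[0] + right_eye[0]) / 2, (left_eye[1] + right_eye[1]) / 2)
    img = img.rotate(angle, center=center, resample=Image.BILINEAR)

    # Initial region box from the keypoints, expanded outward by `expand`.
    xs, ys = [p[0] for p in box_pts], [p[1] for p in box_pts]
    cx, cy = (min(xs) + max(xs)) / 2, (min(ys) + max(ys)) / 2
    half_w = (max(xs) - min(xs)) / 2 * expand
    half_h = (max(ys) - min(ys)) / 2 * expand
    return img.crop((int(cx - half_w), int(cy - half_h),
                     int(cx + half_w), int(cy + half_h)))

face = Image.new("RGB", (400, 400))  # placeholder image with made-up keypoints
crop = align_and_crop(face, (150, 180), (250, 190),
                      [(100, 150), (300, 150), (200, 80), (200, 320)])
```

After this crop, the image would be resized to the network's input resolution and the keypoint coordinates remapped accordingly.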
The weights assigned to different components of the loss function play a crucial role in determining the quality of the generated images. In this algorithm, based on the CycleGAN framework, the cyclic consistency loss is prioritized to preserve texture features, typically being 5 to 12 times greater than the adversarial loss in the image stylization task. The image reconstruction loss, complementing the cyclic consistency loss, is usually smaller than the adversarial loss.
Through extensive experimentation and comparisons, the optimal weight settings for the loss components are determined as presented in
Table 1. The model training process involves a batch size of 4 for 30,000 steps, with an initial learning rate of 0.0001 that is halved every 5000 training steps.
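The stated schedule (initial learning rate 0.0001, halved every 5000 training steps) maps directly onto PyTorch's StepLR; the single dummy parameter below is only there to give the optimizer something to hold:

```python
import torch

params = [torch.nn.Parameter(torch.zeros(1))]  # dummy parameter for the sketch
opt = torch.optim.Adam(params, lr=1e-4)
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=5000, gamma=0.5)

for step in range(1, 30001):
    opt.step()    # the parameter update would happen here
    sched.step()  # decay is applied every 5000 steps
    if step % 5000 == 0:
        print(step, opt.param_groups[0]["lr"])
# lr halves at each boundary: 5e-05 at step 5000, down to 1.5625e-06 at step 30000
```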
4.3. Results
The development of generative adversarial networks has faced challenges in establishing objective evaluation metrics for assessing picture quality, making it difficult to define standardized evaluation criteria. As a result, evaluating the outcomes of GAN experiments primarily relies on subjective comparisons, supported by objective assessments when possible.
4.3.1. Subjective Evaluation
Subjective evaluation involves visually assessing an image and assigning a subjective quality rating based on personal perception. This method relies on statistical analysis and requires multiple observers for reliable results. While subjective evaluation aligns with common human perceptions and is applicable to various image types, it is labor-intensive, costly, and susceptible to human biases, leading to potential result variations. Typically, it is suitable for evaluating a small number of images, while objective methods are preferred for larger datasets.
Many mainstream algorithms, as seen in references [36,37,38,39], employ absolute evaluation to assess generated images. Absolute evaluation compares the generated image against a reference (the real source image), using methods like the Double Stimulus Continuous Quality Scale (DSCQS). In DSCQS, observers alternate between the source and evaluated images, noting differences and scoring based on predefined criteria, such as the full superiority scale shown in Table 2. The Mean Opinion Score (MOS) is then calculated to compare the performance of different methods.
The experimental results of this paper were compared with both traditional CycleGAN algorithms and excellent methods that have emerged in the past five years. The results are shown in
Figure 7. When performing style transfer on general facial images (such as the first and second columns), the method proposed in this paper achieves very natural color transitions and preserves the original texture information without producing uneven color patches (for example, rows 2, 4, 6 in the first column, and rows 2, 3, 4, 5 in the second column), and does not ignore facial details (for example, row 3 in the first column). When processing style transfer for facial images with more prominent highlights (such as the third and fourth columns), our method can preserve highlight details while not losing other facial details, which other algorithms fail to achieve. When processing facial images that may produce artifacts during style transfer (such as the fifth and sixth columns), the method proposed in this paper can eliminate artifact phenomena at the edges of portraits caused by lighting conditions and the transfer process, while fully retaining the edge information of the portraits. Other methods are notably inferior in this aspect.
The above comparative experiments demonstrate the advanced nature of this paper in addressing portrait style transfer problems.
To evaluate visual quality based on MOS criteria (refer to
Table 2), 30 individuals with diverse professional backgrounds were chosen, assessing 50 generated images from both traditional CycleGAN and the proposed method. Image quality evaluation focuses on detecting color shifts, texture distortions, high-frequency noise, or other defects. During testing, subjects viewed and rated images from both methods, and mean and standard deviation scores were calculated for comparison (see
Table 3).
The research data indicate that the Improved Detail-Enhancement CycleGAN method proposed in this paper consistently achieves higher scores in both mean and standard deviation, signifying its superiority in imaging quality over other methods in terms of subjective vision. It effectively addresses issues such as incomplete style transfer, unnatural color transitions, and image artifacts prevalent in other methods. These experimental findings strongly validate the effectiveness of the proposed method outlined in this paper.
4.3.2. Objective Evaluation
In this paper, to comprehensively analyze the generation effect of face images, three objective evaluation metrics are employed. These metrics include the Frechet Inception Distance score (FID), Structural Similarity Index (SSIM), and the multi-scale image quality assessment SSIM method (MS-SSIM). These metrics are utilized to provide a thorough and quantitative assessment of the generated face images’ quality, ensuring a comprehensive evaluation of the model’s performance.
(a) FID
FID, a commonly used evaluation metric for generative adversarial networks (GANs), models the feature distributions of real and generated images as Gaussian distributions and measures the distance between them from their means and covariance matrices, as illustrated in Equation (9):

FID(x, g) = ||μ_x − μ_g||² + Tr(Σ_x + Σ_g − 2(Σ_x Σ_g)^{1/2})    (9)

where x denotes the real images, g denotes the generated images, μ denotes the mean, Σ denotes the covariance matrix, and Tr(·) denotes the trace of the matrix.
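Equation (9) can be computed directly from the feature statistics. In practice μ and Σ come from Inception features of the real and generated images; the toy statistics below are illustrative:

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(mu_x, sigma_x, mu_g, sigma_g):
    """FID = ||mu_x - mu_g||^2 + Tr(Sigma_x + Sigma_g - 2 (Sigma_x Sigma_g)^{1/2})."""
    covmean = sqrtm(sigma_x @ sigma_g)
    if np.iscomplexobj(covmean):  # discard tiny imaginary parts from sqrtm
        covmean = covmean.real
    diff = mu_x - mu_g
    return diff @ diff + np.trace(sigma_x + sigma_g - 2.0 * covmean)

# Identical feature statistics give FID = 0; shifting the mean raises it.
mu = np.zeros(4)
sigma = np.eye(4)
print(fid(mu, sigma, mu, sigma))        # 0.0
print(fid(mu, sigma, mu + 1.0, sigma))  # 4.0
```

Lower FID therefore indicates that the generated-image distribution lies closer to the real-image distribution, which is how the scores in Table 4 should be read.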
(b) SSIM
In real visual scenes, images exhibit a strong correlation, and there is a significant relationship between pixels within the same image. The structural similarity metric quantifies this similarity by measuring various information within the image, such as grayscale range, to determine the degree of resemblance between two images.
(c) MS-SSIM
In addition, the visibility of image details is influenced by factors such as image signal acquisition density, distance from the image plane to the observer, and the perceptual ability of the observer’s visual system. As these factors vary, the evaluation of an image also changes. Therefore, to complement the SSIM metric, this paper introduces another image quality assessment method, MS-SSIM (Multi-Scale Structural Similarity Index Method). MS-SSIM combines measurements at different scales to obtain an overall evaluation, as shown in Equation (10):

MS-SSIM(x, y) = [l_M(x, y)]^{α_M} · ∏_{j=1}^{M} [c_j(x, y)]^{β_j} · [s_j(x, y)]^{γ_j}    (10)

where function l represents the luminance comparison of the image, function s represents the structure comparison of the image, and function c represents the contrast comparison of the image; the parameters α_M, β_j, and γ_j are used to adjust the relative importance of the different components, and M is the number of scales.
For the experimental results mentioned above, FID was calculated separately, and the results are presented in
Table 4. From the computational results, it is evident that the method proposed in this paper achieves better scores in terms of FID. Particularly when compared to the traditional CycleGAN, the algorithm proposed in this paper shows a 26% improvement in FID scores. This is sufficient to illustrate the advantage of the proposed algorithm in preserving the integrity of facial features in portraits.
For the experimental results mentioned above, SSIM and MS-SSIM were calculated separately, and the results are presented in
Table 5, providing a detailed quantitative analysis of our method’s performance. As can be clearly observed from the data, the method proposed in this paper consistently achieves superior scores in both SSIM and MS-SSIM metrics compared to other state-of-the-art approaches. This superiority is evident across various test cases and style transfer scenarios. The higher scores indicate that our method more effectively preserves the structural information and perceptual quality of the original images during the style transfer process. Furthermore, these results strongly suggest that the images generated by our method exhibit a distribution that closely aligns with that of real images, demonstrating the effectiveness of our approach in producing realistic and high-quality style-transferred facial images.
4.4. Ablation Study
To analyze the importance and role of each module, ablation experiments were conducted in this paper. The result of the ablation study is shown in
Figure 8.
In
Figure 8, we analyze our model using different variables. It can be observed that in the structure using only AdaIN (the first row in
Figure 8), although most of the content is preserved, there are issues like extremely uneven patches in sample 1 (the first column) and missing details in the eyebrows. Sample 3 (the third column) exhibits extremely uneven skin color, with visible artifacts at the face edges, and sample 4 (the fourth column) shows a noticeable loss of eye details and inconsistent hair color trends. The structure with AdaLIN (the second row in
Figure 8) shows improvements in certain aspects, such as sample 2 (the second column). However, artifacts around the face are evident in sample 1, affecting the overall image quality. This issue is also visible in sample 3 and sample 6 (the sixth column), and sample 5 (the fifth column) shows poorly generated style. Our proposed method using AdaLIN+Laplacian (the last row in
Figure 8) further improves detail and image generation quality. Artifacts at the face edges in sample 1 and sample 6 are eliminated, and the eyebrow details in sample 1 are well presented. Sample 2 preserves eye details perfectly with rich colors, aligning with human aesthetics. Sample 3 and sample 5 exhibit more natural skin tone transitions, clearly displaying facial wrinkle information.
The scores of objective and subjective evaluations for each part of the ablation experiment are shown in
Table 6 and
Table 7, respectively.
By analyzing the results of the ablation experiments and comparing the second and third rows of
Figure 8, we can understand the roles of the AdaLIN module and the Laplacian module. When the original image has noticeable highlights on the face, the model with the AdaLIN module preserves the details of these highlights during style transfer, resulting in a more realistic and delicate color representation in the generated image. Similarly, comparing the second and fourth rows of
Figure 8, we observe that when there are artifacts from noise in the original image, the model with the Laplacian module handles the color transition at the portrait edges better during style transfer. This improves the overall quality by avoiding artifacts and unnatural color transitions, leading to a more refined generated image. From an objective evaluation perspective, the FID scores of the proposed method in this paper are improved by 20.88% and 17.47%, respectively, highlighting the significant impact of both the AdaLIN module and the Laplacian module on the imaging effect.
4.5. Discussion
Comprehensive empirical evaluations demonstrate that the proposed methodology has yielded efficacious outcomes in style transfer for a preponderance of facial images, particularly in addressing intricate details within facial highlights and mitigating image edge artifacts. Nevertheless, our experimental investigations have revealed certain limitations in specific scenarios, primarily encompassing the following aspects:
While our approach performs robustly on images of Caucasian and Asian subjects, it tends to lose detail when processing images of individuals with darker skin tones. This can be attributed primarily to biases in the training data: many facial style transfer models are trained predominantly on datasets of Caucasian or Asian subjects, resulting in suboptimal recognition and processing of facial features characteristic of darker-skinned individuals. Furthermore, darker skin exhibits more nuanced tonal and textural variations, which makes accurately capturing and transforming these subtle details more challenging.
Our method achieves commendable results in style transfer for adult facial images; however, its performance is suboptimal when applied to children’s facial images. This discrepancy can be attributed to several factors: firstly, the significant morphological differences between children’s and adults’ facial features, such as more rounded face shapes, proportionally larger eyes, and distinct forehead-to-face ratios, which may lead to inadequate performance of models trained exclusively on adult data. Secondly, the inherently smoother texture of children’s skin, lacking the fine lines and textural complexities present in adult faces, may result in either loss or excessive addition of texture during processing. Lastly, the style transfer process may inadvertently introduce age-inappropriate features, effectively “aging” children’s faces and diminishing age-specific characteristics.
Additionally, our current methodology exhibits limitations in terms of personalization and user control. The existing framework autonomously learns and selects styles corresponding to facial images through the model, without providing users the opportunity to actively participate in style selection during the transformation process. This constraint may not adequately address the diverse preferences and requirements of all users.
These identified limitations provide valuable insights for future research directions and potential enhancements to our proposed method.
5. Conclusions
In this study, we present an Improved Detail-Enhancement CycleGAN methodology designed to address the complexities of style transfer in facial images. Our approach integrates the AdaLIN technique to mitigate detail attenuation caused by facial highlights, while incorporating Laplacian regularization to minimize post-transfer artifacts induced by image noise. To rigorously assess the efficacy of the proposed method, we conducted a comprehensive comparative analysis against state-of-the-art approaches published within the past three years. Empirical findings demonstrate that the facial images generated by our method significantly outperform those produced by existing methods, as evidenced by superior scores in both subjective and objective evaluation metrics. These advancements represent a substantial contribution to the field of facial image style transfer. However, we acknowledge that our method has limitations that warrant further investigation in future studies. One research direction is to enhance style prominence while maintaining facial integrity, which may involve exploring adaptive style-weight mechanisms or multi-scale style transfer approaches. Another is to increase personalization and user control, allowing users to adjust style intensity or select specific facial features for style emphasis, which may require developing interactive interfaces. By pursuing these directions, we aim not only to overcome the current limitations of our method but also to push the boundaries of facial style transfer technology, opening up new possibilities in digital art, entertainment, and human–computer interaction.