Article

Feature Weighted Cycle Generative Adversarial Network with Facial Landmark Recognition and Perceptual Color Distance for Enhanced Face Animation Generation

1 Department of Computer Science and Information Engineering, National Central University, Taoyuan 320, Taiwan
2 Department of Information and Computer Engineering, Chun-Yuan Christian University, Taoyuan 320, Taiwan
* Authors to whom correspondence should be addressed.
Electronics 2024, 13(23), 4761; https://doi.org/10.3390/electronics13234761
Submission received: 23 October 2024 / Revised: 24 November 2024 / Accepted: 30 November 2024 / Published: 2 December 2024
(This article belongs to the Special Issue Applications and Challenges of Image Processing in Smart Environment)

Abstract

We propose an anime style transfer model that generates anime faces from human face images. We improve the model by modifying the normalization function to retain more feature information. To make the facial feature positions of the anime face similar to those of the human face, we propose a facial landmark loss that calculates the error between the generated image and the real human face image. To avoid obvious color deviation in the generated images, we introduce a perceptual color loss into the loss function. In addition, because reasonable metrics for evaluating the quality of anime images are lacking, we propose the Fréchet anime inception distance (FAID), which calculates the distance between the distributions of generated and real anime images in a high-dimensional feature space. In our user survey, 74.46% of users judged the images produced by the proposed method to be the best among the compared models, and the proposed method achieves a FAID score of 126.05. Our model performs the best in both the user study and FAID, showing better performance in human visual perception and model distribution. According to the experimental results and user feedback, our proposed method generates results of better quality than existing methods.

1. Introduction

Automatic anime image generation holds several important applications, contributing to both creative and practical domains. Anime images are widely used in various media forms, such as comics, animations, video games, and visual novels. Automatic generation can aid in producing vast amounts of diverse content, reducing production time and costs for these media industries. Anime-style avatars, characters, and illustrations have gained popularity in personal profiles, social media, and online interactions. On YouTube, a new industry called VTuber has emerged, which captures a person’s facial expressions and interacts with the audience through the appearance of an animated character. However, in the production of VTuber characters, it is necessary to manually draw the character first and then use Live2D [1] to set each part of the character, such as the head angle and eye movement position. Then, FaceRig [2] is used to capture the person’s facial features in order to generate corresponding facial animations for the animated character. The purpose of this paper is to use a generative adversarial network in deep learning to achieve anime facial style transfer, thereby reducing the time and cost of creating character models for VTubers. Also, the proposed method allows individuals to customize their digital personas while maintaining a consistent visual style.
In earlier image style transfer methods, Winnemöller et al. [3] converted images into cartoon styles with classical image processing, detecting edges and setting pixel threshold values manually. Although simple, this method can only render simple textures and requires different thresholds to be tuned manually for each image. Efros and Leung [4] proposed non-parametric sampling for texture synthesis. Building on this, Hertzmann et al. [5] introduced image analogies, which learn the relationship between a source image and its stylized counterpart and apply it to new images. Although the style can be specified, such methods still cannot reproduce more complex textures well.
With the development of deep learning, researchers began applying neural networks to image style transfer. Gatys et al. [6] first proposed minimizing the Gram-matrix difference between the target image and the style image over VGG16 [7] feature maps. However, this method requires re-optimization for every new input image. Johnson et al. [8] also used VGG16 for feature extraction, but optimized a feed-forward model instead of the target image, achieving real-time transfer for a single style without retraining on new images.
Isola et al. [9] used generative adversarial networks to achieve better image quality, but their method requires paired style and target images, so data collection must be carefully managed. Zhu et al. [10] proposed the cycle-consistency loss, which requires that an image translated to the target data distribution can be translated back close to the original input, allowing style transfer to learn the mapping between two data distributions without paired image datasets. U-GAT-IT [11], based on CycleGAN [10], added the class activation map proposed by Zhou et al. [12], weighting the feature map with an auxiliary classifier to learn the regions important for style transfer. Li et al. [13] proposed adaptive point-wise layer instance normalization to mitigate the loss of feature information that occurs when U-GAT-IT's adaptive layer-instance normalization combines instance normalization with layer normalization. Zeng et al. [14] used conditional generative adversarial networks to generate high-quality virtual facial animations with facial expressions. The work in [15] also emphasized controlling the expressions of the generated animated faces using a lip-sync landmark animation generation network, an emotional landmark animation generation network, and a landmark-to-animation translation network. In [16], the authors translated real-world face images into different artistic styles at reduced computational cost. Yang and Qiu [17] supported multiple target domains in image conversion, generating images of diverse styles across domains. In [18], cycle consistency was used to facilitate efficient learning and domain adaptation, promoting greater variety and flexibility in expression translation across datasets.
We use U-GAT-IT [11] as the main architecture and enhance its performance in this research. We add adaptive point-wise layer instance normalization to the framework. We propose a facial landmark loss based on convolutional pose machines [19], which helps the generated images preserve the positions of the original facial features. We also use the perceptual color distance proposed by Zhao et al. [20] to make the generated images more consistent with the original images in color. For the evaluation of anime images, we improve the Fréchet inception distance proposed by Heusel et al. [21] and propose a Fréchet anime inception distance to make the evaluation of generated anime images more reasonable. The main contributions of this work are as follows. First, we combine the advantages of different normalization methods through a convolution operation to improve generation quality. Second, we use a facial landmark loss to capture and penalize the error between the key-point positions of real human faces and generated animated character faces. Third, we use a perceptual color loss based on the CIEDE2000 color difference formula to obtain animated images whose colors appear similar to the original image to human eyes. Fourth, we propose a Fréchet anime inception distance that improves on the traditional Fréchet inception distance for evaluating animated characters.

2. Methodology

The proposed face animation generation framework is illustrated in Figure 1. The input image passes through an encoder consisting of down-sampling blocks and residual blocks. After the down-sampling blocks encode the input image, four ResNet blocks enrich the feature information while avoiding vanishing gradients. The auxiliary classifier takes the feature map output by the encoder as input, obtains feature vectors through global average pooling and global max pooling, and learns the weight of each channel with a fully connected layer. These weights are multiplied with the encoder output to obtain the attention feature map, which serves as the input to the decoder. The decoder consists of residual blocks and up-sampling blocks. To learn the normalization parameters, the attention feature map is passed through a fully connected layer to obtain γ and β. The normalization method used in the residual blocks is adaptive point-wise layer instance normalization (AdaPoLIN) [13]. Finally, the animation image is generated through the up-sampling blocks.
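The attention step described above can be sketched in NumPy as follows. This is an illustrative simplification, not the actual implementation: the stand-in matrix `fc` replaces the learned fully connected layer, and the class-activation weighting is reduced to a per-channel multiplication.

```python
import numpy as np

def channel_weights(feat, fc):
    """Per-channel importances from global average + max pooling,
    passed through a stand-in fully connected layer `fc` (C, C)."""
    gap = feat.mean(axis=(1, 2))     # global average pooling, (C,)
    gmp = feat.max(axis=(1, 2))      # global max pooling, (C,)
    return fc @ (gap + gmp)          # (C,)

def cam_attention(feat, w):
    """Reweight each channel of `feat` (C, H, W) by the importance
    weights `w` (C,) to obtain the attention feature map."""
    return feat * w[:, None, None]   # broadcast over spatial dims

rng = np.random.default_rng(0)
feat = rng.standard_normal((4, 8, 8))
fc = np.eye(4)                       # identity stand-in for the learned layer
out = cam_attention(feat, channel_weights(feat, fc))
print(out.shape)  # (4, 8, 8)
```

The attention feature map `out` would then be fed to the decoder's residual blocks.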

2.1. Network Architecture

We use U-GAT-IT [11] as our base network model. In the U-GAT-IT experiment, some generated images had obvious differences in color tone compared to the original images, and others produced blurry results. This paper aims to address these two issues.
To reduce blurry results, we propose the facial landmark loss, which utilizes a pre-trained convolutional pose machine [19] model to help the generator learn the facial landmark positions for face transformation. To address the color tone differences in the generated images, we add the perceptual color distance [20] based on CIEDE2000 [22] as a loss function to obtain images more consistent with human perception. Finally, we use adaptive point-wise layer instance normalization (AdaPoLIN) [13] to generate images with better color and texture quality.
We adopt the PatchGAN discriminator [9]: two discriminators with five and seven convolutional layers, respectively, output matrices of different sizes to judge authenticity under different receptive fields. The full loss function is formulated in Equation (1).
$L_{total} = \lambda_{LSGAN} L_{LSGAN}^{s \to t} + \lambda_{Facial\_Landmark} L_{Facial\_Landmark} + \lambda_{perc} L_{perc} + \lambda_{cycle} L_{cycle} + \lambda_{identity} L_{Identity} + \lambda_{CAM} L_{CAM}$ (1)
In Equation (1), L L S G A N , L c y c l e , L I d e n t i t y , and L C A M represent the adversarial loss, cycle loss, identity loss, and class activation map loss, respectively. These losses are described in Equations (2)–(5).
$L_{LSGAN}^{s \to t} = \mathbb{E}_{x \sim X_t}\big[(D_t^5(x))^2\big] + \mathbb{E}_{x \sim X_s}\big[(1 - D_t^5(G_{s \to t}(x)))^2\big] + \mathbb{E}_{x \sim X_t}\big[(D_t^7(x))^2\big] + \mathbb{E}_{x \sim X_s}\big[(1 - D_t^7(G_{s \to t}(x)))^2\big]$ (2)
$L_{cycle}^{s \to t} = \mathbb{E}_{x \sim X_s}\big[\lVert x - G_{t \to s}(G_{s \to t}(x)) \rVert_1\big]$ (3)
$L_{identity}^{s \to t} = \mathbb{E}_{x \sim X_t}\big[\lVert x - G_{s \to t}(x) \rVert_1\big]$ (4)
$L_{CAM} = \mathbb{E}_{x \sim X_t}\big[(\eta_{D_t^5}(x))^2\big] + \mathbb{E}_{x \sim X_s}\big[(1 - \eta_{D_t^5}(G_{s \to t}(x)))^2\big] + \mathbb{E}_{x \sim X_t}\big[(\eta_{D_t^7}(x))^2\big] + \mathbb{E}_{x \sim X_s}\big[(1 - \eta_{D_t^7}(G_{s \to t}(x)))^2\big] + \mathbb{E}_{x \sim X_s}\big[\log \eta_s(x)\big] + \mathbb{E}_{x \sim X_t}\big[\log(1 - \eta_s(x))\big]$ (5)
where $X_t$ represents the target domain and $X_s$ the source domain. $D_t^5$ and $D_t^7$ denote the PatchGAN discriminators with five and seven layers, respectively. $G_{s \to t}$ denotes the generator from the source to the target domain. In Equation (5), $\eta_s$ represents the auxiliary classifier of the generator, while $\eta_{D_t^5}$ and $\eta_{D_t^7}$ represent the auxiliary classifiers of the five- and seven-layer discriminators, respectively.
In Equation (1), we set $\lambda_{LSGAN} = 1$, $\lambda_{cycle} = 10$, $\lambda_{identity} = 10$, and $\lambda_{CAM} = 1000$, following [11]. The parameters $\lambda_{Facial\_Landmark} = 5$ and $\lambda_{perc} = 8$ were chosen empirically after testing several combinations of weights to balance the influence of the facial landmark loss, the perceptual color loss, and the other terms in the full loss function.
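As a concrete illustration, the weighted sum in Equation (1) can be sketched as follows. The weight values are those stated above; the per-term loss values are hypothetical and only show how each weight scales its term.

```python
# Loss weights of Equation (1): the first four follow U-GAT-IT [11],
# the last two are the empirically chosen values from the text.
weights = {"lsgan": 1.0, "cycle": 10.0, "identity": 10.0,
           "cam": 1000.0, "landmark": 5.0, "perc": 8.0}

def total_loss(losses, weights):
    """Weighted sum of the individual loss terms in Equation (1)."""
    return sum(weights[k] * losses[k] for k in weights)

# Hypothetical per-term values, for illustration only.
losses = {"lsgan": 0.5, "cycle": 0.1, "identity": 0.05,
          "cam": 0.001, "landmark": 0.2, "perc": 0.3}
print(total_loss(losses, weights))  # ≈ 6.4
```

Note how the large $\lambda_{CAM}$ compensates for the typically tiny magnitude of the CAM loss term.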

2.2. Facial Landmark Loss

To make the positions of facial features match before and after conversion, we propose using two pre-trained models to obtain the facial landmarks of the generated and original images. By minimizing the distance between the two, we help the generator learn the position of the facial features.
For real human faces, we use the facial landmarks detected by convolutional pose machines pre-trained on the AFLW [23] dataset. For animated faces, we use convolutional pose machines pre-trained on an animated face dataset. Both models are open-sourced on GitHub [24,25].
Since the facial features of animated faces differ structurally from those of real human faces, we retain only four coordinates from the output of the convolutional pose machines, corresponding to the left eye center, right eye center, nose center, and mouth center, to avoid generating facial features that do not fit the data distribution, such as eye corners or lips. After obtaining the facial landmarks of the generated and original images, we apply the Smooth L1 loss to the coordinates of the corresponding facial features, making the positions of the facial features in the generated animated image close to those in the original image, as shown in Figure 2. The facial landmark loss is formulated in Equation (6), with the $threshold$ set to 1. In Equation (6), $L(G_{s \to t}(x))$ denotes the key-point locations of the generated image and $L(x)$ those of the original image.
$L_{Facial\_Landmark} = \begin{cases} 0.5 \,\big(L(G_{s \to t}(x)) - L(x)\big)^2 / threshold, & \big|L(G_{s \to t}(x)) - L(x)\big| < threshold \\ \big|L(G_{s \to t}(x)) - L(x)\big| - 0.5 \cdot threshold, & \text{otherwise} \end{cases}$ (6)
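A minimal NumPy sketch of the Smooth L1 computation in Equation (6), averaged over the four retained landmarks; the example coordinates are hypothetical, for illustration only.

```python
import numpy as np

def smooth_l1(pred, target, threshold=1.0):
    """Smooth L1 loss of Equation (6): quadratic below `threshold`,
    linear above it, averaged over all coordinates."""
    diff = np.abs(pred - target)
    loss = np.where(diff < threshold,
                    0.5 * diff**2 / threshold,
                    diff - 0.5 * threshold)
    return loss.mean()

# Four retained landmarks: left eye, right eye, nose, mouth (x, y each);
# the coordinates below are made up for this example.
real = np.array([[30.0, 40.0], [70.0, 40.0], [50.0, 60.0], [50.0, 80.0]])
gen  = np.array([[30.5, 40.0], [70.0, 42.0], [50.0, 60.0], [53.0, 80.0]])
print(smooth_l1(gen, real))  # 0.515625
```

The quadratic region keeps gradients small for nearly aligned landmarks, while the linear region avoids exploding gradients for large displacements.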

2.3. Perceptual Color Loss

When calculating the difference between colors, the most common method is the Euclidean distance between two points in RGB space. However, human perception of color differences is not uniform; the eye is more sensitive to changes in some colors than in others, as shown in Figure 3. In Figure 3, the three colors in RGB space are (0, 0, 255), (50, 50, 255), and (50, 0, 205). The Euclidean distance between (0, 0, 255) and (50, 50, 255) equals the distance between (50, 50, 255) and (50, 0, 205). Although the two distances in the RGB color space are the same, human eyes perceive (50, 50, 255) as closer to (0, 0, 255) than to (50, 0, 205). The Euclidean method assumes that variations in the three channels affect human perception equally, which does not match human vision. In anime face style transfer, we want the generated image to keep the color distribution of the original image, such as skin tone, hair color, and background color, while changing the style. However, U-GAT-IT generates images with significant color differences from the original images, as shown in Figure 4. To address this issue, we use the CIEDE2000 [22] color difference formula to obtain generated images that better match human visual perception of color.
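The equal-distance observation from Figure 3 can be checked directly; the small NumPy snippet below uses the three colors listed above.

```python
import numpy as np

def rgb_euclidean(c1, c2):
    """Plain Euclidean distance between two RGB colors."""
    return float(np.linalg.norm(np.asarray(c1, float) - np.asarray(c2, float)))

a, b, c = (0, 0, 255), (50, 50, 255), (50, 0, 205)
d_ab = rgb_euclidean(a, b)   # sqrt(50^2 + 50^2) ≈ 70.71
d_bc = rgb_euclidean(b, c)   # sqrt(50^2 + 50^2) ≈ 70.71
print(d_ab, d_bc)  # identical, yet (a, b) look far more alike to the eye
```

Because the RGB metric cannot capture this perceptual asymmetry, the perceptual color loss below replaces it with CIEDE2000.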
CIEDE2000 was proposed by the International Commission on Illumination. It first converts colors from the RGB color space to the CIELAB color space and then to the $L'C'h'$ representation, computing the difference with Equation (7), as follows:
$CIEDE2000\big((L_1^*, a_1^*, b_1^*), (L_2^*, a_2^*, b_2^*)\big) = \sqrt{\left(\dfrac{\Delta L'}{k_L S_L}\right)^2 + \left(\dfrac{\Delta C'}{k_C S_C}\right)^2 + \left(\dfrac{\Delta H'}{k_H S_H}\right)^2 + R_T \dfrac{\Delta C'}{k_C S_C} \dfrac{\Delta H'}{k_H S_H}}$ (7)
where $(L_i^*, a_i^*, b_i^*)$, $i = 1, 2$, are the representations of the two colors in the CIELAB color space, and $\Delta L'$, $\Delta C'$, and $\Delta H'$ are the differences between the two colors in lightness, chroma, and hue, respectively. The rotation term $R_T$ corrects inaccurate calculations in the blue and violet regions. Details of the weighting factors $S_L$, $S_C$, and $S_H$ and the parametric factors $k_L$, $k_C$, and $k_H$ can be found in [22].
To improve the color deviation issue of U-GAT-IT, we introduce the differentiable CIEDE2000 formula proposed in [20] as the loss function to calculate the color difference between source images and generated images, making them more similar in human perception. The L p e r c can be formulated as Equation (8).
$L_{perc} = \dfrac{1}{100 \cdot H W} \sum_{y=1}^{H} \sum_{x=1}^{W} CIEDE2000\big(s, \, G_{s \to t}(s)\big)$ (8)

2.4. Adaptive Point-Wise Layer Instance Normalization (AdaPoLIN)

In Figure 5a, batch normalization [26] calculates the mean and variance of the same channel across all feature maps in a batch. Although it captures the data distribution within the batch, it loses the features of individual images, and bias arises when the batch size is too small or the dataset's distribution is non-uniform. Both problems affect the generated image and conflict with the goal of style transfer. In Figure 5b, layer normalization [27] calculates the mean and variance over all channels of a single input image, reducing batch normalization's dependence on batch sampling. In Figure 5c, instance normalization [28] normalizes a single channel of the feature map and retains more image detail than batch normalization, so it is widely used in style transfer tasks.
U-GAT-IT proposes to combine instance normalization and layer normalization by training a ρ between zero and one to obtain the advantages of both methods. However, this combination does not consider that each channel represents different content, such as color, shape, and texture, resulting in information loss due to multiplication by ρ.
AniGAN [13] proposes adaptive point-wise layer instance normalization (AdaPoLIN), which concatenates instance normalization and layer normalization by channel before convolution. The parameters γ and β are learned by the fully connected layer. This method obtains feature maps with the same dimensions without reducing feature information. The AdaPoLIN can be formulated as Equation (9).
$AdaPoLIN(z, \gamma, \beta) = \gamma \cdot Conv\left(\left[\dfrac{z - \mu_I(z)}{\sigma_I(z)}, \; \dfrac{z - \mu_L(z)}{\sigma_L(z)}\right]\right) + \beta$ (9)
In Equation (9), $z$ represents the input feature map. The terms $\mu_I(z)$ and $\sigma_I(z)$ are the mean and standard deviation used in instance normalization, while $\mu_L(z)$ and $\sigma_L(z)$ are those used in layer normalization.
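Equation (9) can be sketched for a single image as follows. This is a simplified NumPy illustration: the point-wise (1×1) convolution is represented by a plain weight matrix `W`, and γ, β are supplied directly rather than predicted by the fully connected layer as in the real model.

```python
import numpy as np

def adapolin(z, W, gamma, beta, eps=1e-5):
    """Sketch of Equation (9) for one image. z: (C, H, W) feature map;
    W: (C, 2C) weights of a 1x1 convolution over the concatenated
    instance- and layer-normalized maps; gamma, beta: (C,) affine params."""
    # Instance normalization: per-channel statistics.
    mu_i = z.mean(axis=(1, 2), keepdims=True)
    sd_i = z.std(axis=(1, 2), keepdims=True)
    z_in = (z - mu_i) / (sd_i + eps)
    # Layer normalization: statistics over all channels of the image.
    mu_l, sd_l = z.mean(), z.std()
    z_ln = (z - mu_l) / (sd_l + eps)
    # Channel-wise concatenation, then a point-wise (1x1) convolution,
    # which for one pixel is just a matrix product over channels.
    cat = np.concatenate([z_in, z_ln], axis=0)      # (2C, H, W)
    conv = np.einsum('ck,khw->chw', W, cat)         # (C, H, W)
    return gamma[:, None, None] * conv + beta[:, None, None]

rng = np.random.default_rng(0)
z = rng.standard_normal((4, 8, 8))
W = rng.standard_normal((4, 8)) / 8
out = adapolin(z, W, np.ones(4), np.zeros(4))
print(out.shape)  # (4, 8, 8)
```

Because both normalized maps survive into the concatenation, the convolution can mix them freely per channel instead of collapsing them through a single scalar ρ as in U-GAT-IT.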

3. Experimental Results

We trained and evaluated our model on the Selfie2Anime dataset [11]. For better evaluation of anime images, we fine-tuned the InceptionV3 of FID [21] using the Caltech-256 Object Category dataset [29] and subsets of the Danbooru dataset [30] as training images for new categories. The hardware specification and software environment for the experiments are listed in Table 1 and Table 2, respectively. The optimizer used in the experiments is the Adam optimizer. The initial learning rate was set as 0.0001 for both the generator and discriminator.
In the field of image generation, quality is often measured with the Fréchet inception distance (FID) [21]. This metric computes the distance between the means and covariance matrices of the feature maps of generated and real images at an intermediate layer of the InceptionV3 network to measure their similarity. However, the metric relies on the feature extraction model (InceptionV3) trained on ImageNet [31] for high-level features. Although ImageNet has thousands of categories, if the input images do not belong to any of them, the model cannot represent them well. ImageNet contains no anime categories, so the original FID cannot reasonably serve as an evaluation metric for our task.
To address this issue, and to reduce computing resources while adding anime categories, we fine-tuned the pre-trained InceptionV3 model on the Caltech-256 Object Category dataset with subsets of Danbooru added. The resulting dataset contains 258 categories, with 27,726 training and 3081 testing images. Fine-tuning ran for 16 epochs and achieved 87.8611% accuracy on the testing set. We then applied the FID formula to the feature maps of this fine-tuned model and named the resulting metric the Fréchet anime inception distance (FAID). To validate its effectiveness, we added real human facial images to a dataset of 100 animated images, starting from 10 and increasing the number by 5 each time. If a metric can distinguish anime faces from real human faces, its value should increase steadily to reflect the growing mixture of categories. As shown in Figure 6, the FAID increased steadily as the number of real human faces grew, while the original FID decreased when 10 or 15 images were added and was smaller at 25 images than at 20. This shows that our proposed FAID distinguishes animated from real human faces more reasonably than FID.
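The Fréchet distance underlying both FID and FAID is computed between two Gaussians fitted to feature activations; it can be sketched in NumPy as follows (the means and covariances in the example are toy values, not real model statistics).

```python
import numpy as np

def frechet_distance(mu1, cov1, mu2, cov2):
    """Fréchet distance between two Gaussians, the formula behind FID/FAID:
    ||mu1 - mu2||^2 + Tr(cov1 + cov2 - 2 (cov1^{1/2} cov2 cov1^{1/2})^{1/2})."""
    diff = mu1 - mu2
    # Symmetric PSD square root of cov1 via eigendecomposition.
    vals, vecs = np.linalg.eigh(cov1)
    s1 = vecs @ np.diag(np.sqrt(np.clip(vals, 0, None))) @ vecs.T
    # sqrtm(cov1 @ cov2) has the same trace as sqrtm(s1 @ cov2 @ s1),
    # which is symmetric PSD and therefore safe to take the root of.
    m = s1 @ cov2 @ s1
    mv = np.linalg.eigvalsh(m)
    tr_sqrt = np.sum(np.sqrt(np.clip(mv, 0, None)))
    return float(diff @ diff + np.trace(cov1) + np.trace(cov2) - 2 * tr_sqrt)

mu, cov = np.zeros(3), np.eye(3)
print(frechet_distance(mu, cov, mu, cov))        # 0.0 for identical Gaussians
print(frechet_distance(mu, cov, mu + 1.0, cov))  # 3.0: shift of 1 per dimension
```

FAID differs from FID only in which network produces the feature statistics: the fine-tuned InceptionV3 described above rather than the ImageNet-trained original.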
For face style transfer, in order to ensure that the color distribution of the transformed image is similar to that of the original image, we used the perceptual color distance (PCD) in Equation (8) as the metric between the original image and the generated image. The smaller the value of this metric, the smaller the color difference between the two images, indicating that the generated image is closer to the original image in terms of human perception.
In order to evaluate whether the generative model simply converts real human faces to animated faces without generating animated faces with facial features in the same position, we used the facial landmark loss in Equation (6) as our evaluation metric, calculating the difference in facial landmarks using L1 loss. The smaller the difference, the closer the facial feature positions are before and after the conversion, indicating that the generated animated faces are more similar to the original real human faces.
Our survey compared images generated by four models: the proposed model, U-GAT-IT [11], ACLGAN [32], and Council-GAN [33], with each model generating 50 images. Thirteen people participated in the survey, yielding 650 votes. We calculated the percentage of total votes for each option to evaluate users' preferences for the images generated by the different models; this served as the main criterion for comparing our model with the others.
Figure 7 shows the qualitative comparison results of our model with other models. It can be seen that in terms of generated quality, our model and U-GAT-IT can both generate realistic animated faces, while ACLGAN and Council-GAN have obvious primitive facial lines and features, meaning that our proposed model has better performance in style transfer. As for the color deviation issue, it can be seen that our model generates images that are closer in color to the original image, while U-GAT-IT has obvious color deviation problems. Similarly, it can also be seen that our model is closer to the original image in terms of facial features compared to U-GAT-IT, indicating that our proposed method can effectively improve the generated quality of U-GAT-IT.
Table 3 shows the quantitative results of the images generated by the different models. All four models in Table 3 were trained on the Selfie2Anime dataset with an image size of 128 × 128. U-GAT-IT and ACLGAN were trained with the parameters provided in their papers, while Council-GAN used the model weights publicly available in the reference.
Our model performed the best in both the user study and FAID, showing better performance in human visual perception and model distribution. In color distance (PCD) and facial landmark distance (Land), we ranked second, as shown in Table 3. However, these two metrics only measure the similarity between the generated anime images and the original input face images; they cannot always accurately reflect the quality of the generated anime images. Because ACLGAN largely preserves the original facial features, especially the eyes, the skin around the eye sockets remains unchanged, whereas the output of our model substantially reshapes the eyes; this leads to a lower score on the color difference metric in that area. Nevertheless, the quality of the images generated by our method is clearly better, as shown in Figure 7. We therefore believe these two metrics are best used to evaluate improvements to models that already generate images of decent quality, such as U-GAT-IT and the proposed method. As Table 3 shows, we make significant progress on these scores compared to U-GAT-IT.
Table 4 presents the ablation study that evaluated the impact of each component, including facial landmark loss, perceptual color loss, and AdaPoLIN normalization. Through the experimental results listed in Table 4, we can observe the specific contribution and improvement of each individual component. The original method sometimes failed to correctly generate the facial features of some animated faces. However, this problem was alleviated after adding the facial landmark loss. Our method can indeed help the model learn the positions of facial features in real-person images and generate corresponding facial features in animated images. After adding the perceptual color loss, the color deviation problem of the original method was greatly improved, especially for skin color, hair color, and other areas related to characters. We can see that although adding AdaPoLIN did not help in color preservation and facial feature position, it retained more features than the original method, such as hairstyle and facial expressions. The FAID value was also higher, and the generation quality was better than the original method.

4. Conclusions

We propose a facial image style transfer model based on U-GAT-IT. By combining the advantages of instance normalization and layer normalization through a convolution operation, the model improves the quality of the generated images. The facial landmark loss extracts the facial landmarks of the real human face and of the generated animated character and penalizes the error between them, helping the model generate facial features that better match the real human face. For image color, the perceptual color loss based on the CIEDE2000 color difference formula yields animated images that are visually similar to the original human images. To better evaluate animated characters, we propose the Fréchet anime inception distance, which addresses the shortcomings of the Fréchet inception distance for this task.
According to the qualitative results, our proposed model can improve the color deviation and inaccurate facial feature positioning issues in U-GAT-IT and generate images that are significantly better than those generated by ACLGAN and Council-GAN. In quantitative experiments, we conducted user surveys to obtain subjective feedback and found that the generated images of our model were more popular among the audience. In addition, the model performs better than our baseline model, U-GAT-IT, in both perceptual color distance and facial landmark distance. However, the proposed method still has some limitations. One of the limitations is that it relies on pre-trained models, such as convolutional pose machines, for facial landmark detection. If the model has biases or inaccuracies in detecting landmarks, the generated image is affected. The second limitation is that there are very few male characters in the dataset for the current system. Therefore, the generation results of male characters are not as good as those of female characters. Adding images with more male characters into the training data could help to resolve this problem.
For future studies, we suggest the following directions: First, since the painting styles of animated characters are very diverse, we believe that after collecting more relevant information, we can improve the ability of the model to identify images of different painting styles to make the value of Fréchet anime inception distance more reasonable. Second, by adding facial semantic segmentation features during model training, more diverse and realistic results can be obtained in the generated images. Third, we believe that by adding datasets with labels, such as glasses, hair color, and other characteristics, and modifying the generation space, we can control the characteristics of the generated characters and achieve the effect of image editing.

Author Contributions

Conceptualization, S.-L.L.; Methodology, S.-L.L. and H.-Y.C.; Software, S.-L.L.; Validation, C.-C.Y.; Formal analysis, H.-Y.C.; Investigation, C.-C.Y.; Resources, H.-Y.C.; Data curation, S.-L.L.; Writing—original draft, S.-L.L.; Writing—review & editing, H.-Y.C. and C.-C.Y.; Project administration, H.-Y.C.; Funding acquisition, H.-Y.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Science and Technology Council (NSTC), Taiwan, under grant number 112-2221-E-008-069-MY3.

Data Availability Statement

The data that support the findings of this study are openly available in [11,29,30].

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Nakajo, T. Live2D. Available online: https://www.live2d.com/en/ (accessed on 8 December 2023).
  2. Animaze by Facerig | Custom Avatars | Create Your Own Avatar. Holotech Studios, Inc. Available online: https://www.animaze.us/ (accessed on 18 December 2023).
  3. Winnemöller, H.; Olsen, S.C.; Gooch, B. Real-time video abstraction. ACM Trans. Graph. 2006, 25, 1221–1226.
  4. Efros, A.A.; Leung, T.K. Texture Synthesis by Non-Parametric Sampling. In Proceedings of the International Conference on Computer Vision (ICCV), Corfu, Greece, 20–27 September 1999; p. 1033.
  5. Hertzmann, A.; Jacobs, C.E.; Oliver, N.; Curless, B.; Salesin, D.H. Image analogies. In Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques, New York, NY, USA, 12–17 August 2001; pp. 327–340.
  6. Gatys, L.A.; Ecker, A.S.; Bethge, M. Image Style Transfer Using Convolutional Neural Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 2414–2423.
  7. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. In Proceedings of the International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015; pp. 1–14.
  8. Johnson, J.; Alahi, A.; Li, F.-F. Perceptual Losses for Real-Time Style Transfer and Super-Resolution. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; pp. 694–711.
  9. Isola, P.; Zhu, J.-Y.; Zhou, T.; Efros, A.A. Image-To-Image Translation with Conditional Adversarial Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 5967–5976.
  10. Zhu, J.-Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2242–2251.
  11. Kim, J.; Kim, M.; Kang, H.; Lee, K.H. U-GAT-IT: Unsupervised Generative Attentional Networks with Adaptive Layer-Instance Normalization for Image-to-Image Translation. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual, 26–30 April 2020; pp. 1–19.
  12. Zhou, B.; Khosla, A.; Lapedriza, A.; Oliva, A.; Torralba, A. Learning Deep Features for Discriminative Localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 2921–2929.
  13. Li, B.; Zhu, Y.; Wang, Y.; Lin, C.-W.; Ghanem, B.; Shen, L. AniGAN: Style-Guided Generative Adversarial Networks for Unsupervised Anime Face Generation. IEEE Trans. Multimed. 2021, 24, 4077–4091.
  14. Zeng, J.; He, X.; Li, S.; Wu, L.; Wang, J. Virtual Face Animation Generation Based on Conditional Generative Adversarial Networks. In Proceedings of the 2022 International Conference on Image Processing, Computer Vision and Machine Learning (ICICML), Xi’an, China, 28–30 October 2022; pp. 580–583.
  15. Zhao, Z.; Zhang, Y.; Wu, T.; Guo, H.; Li, Y. Emotionally Controllable Talking Face Generation from an Arbitrary Emotional Portrait. Appl. Sci. 2022, 12, 12852.
  16. Komatsu, R.; Gonsalves, T. Multi-CartoonGAN with Conditional Adaptive Instance-Layer Normalization for Conditional Artistic Face Translation. AI 2022, 3, 37–52.
  17. Yang, Z.; Qiu, Z. An Image Style Diversified Synthesis Method Based on Generative Adversarial Networks. Electronics 2022, 11, 2235.
  18. Kong, C. Research on Animation Character Expression Generation Based on Attention Conditioned CycleGAN. J. Cases Inf. Technol. (JCIT) 2024, 26, 1–22.
  19. Wei, S.-E.; Ramakrishna, V.; Kanade, T.; Sheikh, Y. Convolutional Pose Machines. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 4724–4732.
  20. Zhao, Z.; Liu, Z.; Larson, M. Towards Large yet Imperceptible Adversarial Image Perturbations with Perceptual Color Distance. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 1036–1045.
  21. Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Hochreiter, S. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. In Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017.
  22. Sharma, G.; Wu, W.; Dalal, E.N. The CIEDE2000 color-difference formula: Implementation notes, supplementary test data, and mathematical observations. Color Res. Appl. 2005, 30, 21–30.
  23. Köstinger, M.; Wohlhart, P.; Roth, P.M.; Bischof, H. Annotated Facial Landmarks in the Wild: A large-scale, real-world database for facial landmark localization. In Proceedings of the IEEE International Conference on Computer Vision Workshops (ICCV Workshops), Barcelona, Spain, 6–13 November 2011; pp. 2144–2151.
  24. Nicehuster. cpm-facial-landmarks. GitHub. Available online: https://github.com/nicehuster/cpm-facial-landmarks (accessed on 11 October 2023).
  25. Kanosawa. anime_face_landmark_detection. GitHub. Available online: https://github.com/kanosawa/anime_face_landmark_detection (accessed on 20 October 2023).
  26. Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In Proceedings of the 32nd International Conference on Machine Learning (ICML), Lille, France, 6–11 July 2015; pp. 448–456.
  27. Ba, J.L.; Kiros, J.R.; Hinton, G.E. Layer Normalization. arXiv 2016, arXiv:1607.06450.
  28. Ulyanov, D.; Vedaldi, A.; Lempitsky, V. Instance Normalization: The Missing Ingredient for Fast Stylization. arXiv 2016, arXiv:1607.08022.
  29. Griffin, G.; Holub, A.; Perona, P. Caltech-256 Object Category Dataset; California Institute of Technology: Pasadena, CA, USA, 2007. Available online: https://resolver.caltech.edu/CaltechAUTHORS:CNS-TR-2007-001 (accessed on 2 November 2023).
  30. Branwen, G.; Arfafax; Presser, S.; Anonymous; Danbooru Community. Anime Crop Datasets: Faces, Figures, & Hands. 2020. Available online: https://gwern.net/crop#danbooru2019-figures (accessed on 5 November 2023).
  31. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. ImageNet Large Scale Visual Recognition Challenge. Int. J. Comput. Vis. 2015, 115, 211–252.
  32. Zhao, Y.; Wu, R.; Dong, H. Unpaired Image-to-Image Translation using Adversarial Consistency Loss. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; pp. 800–815.
  33. Nizan, O.; Tal, A. Breaking the Cycle—Colleagues Are All You Need. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 7857–7866.
Figure 1. Network architecture for face animation generation.
Figure 2. Example of facial landmark loss.
Figure 3. Example of mismatch between the RGB color model and human eyes.
Figure 4. Color deviation example of U-GAT-IT.
Figure 5. (a–c) Different types of normalization methods.
Figure 6. Results of FAID and FID on the anime dataset with different amounts of human data.
Figure 7. Qualitative comparison results on the Selfie2Anime testing dataset with U-GAT-IT, ACLGAN, Council-GAN, and our method.
Table 1. Hardware specification for experiments.

Device      | Specification
CPU         | Intel® Core™ i9-12900K @ 3.20 GHz
GPU         | NVIDIA GeForce RTX 3090
CPU memory  | 64 GB
GPU memory  | 24 GB
Table 2. Software environment for experiments.

Software       | Version
Python         | 3.8.13
PyTorch        | 1.13.0
opencv-python  | 3.4.11
scikit-learn   | 0.23.2
OS             | Ubuntu 18.04 LTS
Table 3. Quantitative comparison results on the Selfie2Anime testing dataset with U-GAT-IT, Council-GAN, ACLGAN, and our method (↓ lower is better; ↑ higher is better).

Metric  | U-GAT-IT | Council-GAN | ACLGAN | Ours
User ↑  | 11.85%   | 4.77%       | 8.92%  | 74.46%
FAID ↓  | 150.65   | 126.55      | 208.40 | 126.05
PCD ↓   | 22.74    | 22.78       | 15.82  | 16.45
Land ↓  | 6.93     | 6.57        | 6.03   | 6.28
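The FAID scores reported in Table 3 follow the Fréchet distance of Heusel et al. [21], computed between Gaussian statistics of deep features of real and generated anime images. As a minimal sketch (not the authors' implementation), assuming the feature means and covariances have already been extracted with an anime-domain feature network, the distance can be computed with NumPy alone; the function name `frechet_distance` is illustrative:

```python
import numpy as np

def frechet_distance(mu1, cov1, mu2, cov2):
    """Fréchet distance between Gaussians N(mu1, cov1) and N(mu2, cov2).

    Uses Tr((cov1 @ cov2)^(1/2)) = sum of square roots of the eigenvalues
    of cov1 @ cov2, which are real and non-negative when both covariances
    are symmetric positive semi-definite.
    """
    diff = np.asarray(mu1) - np.asarray(mu2)
    eigvals = np.linalg.eigvals(np.asarray(cov1) @ np.asarray(cov2))
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(diff @ diff + np.trace(cov1) + np.trace(cov2) - 2.0 * tr_sqrt)
```

In practice, `mu` and `cov` would be the mean and covariance of feature vectors over each image set; identical distributions yield a distance of zero.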
Table 4. Quantitative results of the ablation study (↓ lower is better; ↑ higher is better).

Facial Landmark Loss | Perceptual Color Loss | AdaPoLIN | FAID ↓ | PCD ↓ | Land ↓
                     |                       |          | 150.65 | 22.74 | 6.93
✓                    |                       |          | 136.72 | 22.37 | 6.41
                     | ✓                     |          | 132.85 | 18.16 | 6.67
                     |                       | ✓        | 130.73 | 22.58 | 6.92
✓                    | ✓                     |          | 132.19 | 16.93 | 6.36
✓                    | ✓                     | ✓        | 126.05 | 16.45 | 6.28
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Lo, S.-L.; Cheng, H.-Y.; Yu, C.-C. Feature Weighted Cycle Generative Adversarial Network with Facial Landmark Recognition and Perceptual Color Distance for Enhanced Face Animation Generation. Electronics 2024, 13, 4761. https://doi.org/10.3390/electronics13234761


