5.1. Qualitative Results
We used videos from the publicly available FaceForensics++ dataset as our source material. This dataset includes 977 face videos downloaded from YouTube and 977 face-swapped videos generated by DeepFakes. We demonstrated results for 12 pairs of videos: 12 target face videos and 12 source face videos. Six pairs served as our unoccluded comparative experiments, and the other six as our occluded comparative experiments. All of these videos were excluded from our training set. We performed face swapping on all source-target pairs.
In unoccluded scenarios, we compared our method with mainstream SOTA methods: FaceSwap, DeepFaceLab, SimSwap, Inswapper, and BlendSwap, as shown in
Figure 7. FaceSwap is a popular open-source project, but when there are significant differences in resolution, lighting conditions, or angles between the source and the target images, the generated face-swapped images may appear blurred or unnatural, with unclear edges or noticeable stitching marks. Additionally, for images with rich facial details such as beards or eyebrows, these features may be inaccurately fitted or distorted. DeepFaceLab can maintain the same facial shape as the target face, but still suffers from poor fitting or blurred details when dealing with images with significant differences in lighting, angles, and skin tones. SimSwap performs well in lighting processing but cannot maintain the target face’s pose well. The model fails to correctly reconstruct textures in some areas, resulting in artifacts in some face-swapped results. Inswapper and BlendSwap perform well in detail fusion and handling facial occlusions, but still fall short in lighting and skin tone consistency. Our AmazingFS method handles these issues better, generating believable face-swapping results while retaining target attributes more faithfully. It also demonstrates strong identity preservation and provides a framework for better image restoration. STAM is an attention mechanism specifically designed to counteract the loss of facial details by focusing on regions such as the eyes and facial edges of the target person. AdaIN+ reduces the adverse effects of style mismatches, enhancing the identity similarity of the resulting faces. Overall, our AmazingFS faithfully preserves lighting and facial styles while capturing the target face’s pose well, generating high-quality face-swapped results that retain the source face’s identity.
In occlusion scenarios, we compared AmazingFS with FaceSwap, DeepFaceLab, SimSwap, Inswapper and BlendSwap. The comparison results in
Figure 8 show that each SOTA method has shortcomings in its face-swapped images. FaceSwap produces blurred images and unnatural facial blending. The facial details and contours are noticeably blurred, and the alignment and fusion of facial features lack harmony. DeepFaceLab suffers from loss of facial details and significant skin tone differences. The details of facial features, such as eyes and mouths, are not well restored, leading to a lack of vivid facial expressions. When dealing with noticeable skin tone differences, DeepFaceLab tends to produce unnatural transitions, causing obvious image artifacts. SimSwap generates unnatural facial features and loses detail. The facial features, especially expressions and contours, are not natural enough. SimSwap’s performance is unsatisfactory in handling rich facial details, leading to detail loss and uneven edges between the face and the background. Inswapper has some occlusion resistance but performs poorly in processing facial details and expressions. The generated images tend to have poor fusion in complex backgrounds. BlendSwap, although having good occlusion resistance, still exhibits unnatural facial detail restoration and skin tone transitions. It tends to lose details when handling complex facial expressions, resulting in less clear outcomes. In contrast, AmazingFS demonstrates superior performance. It excels in aligning and fusing facial features, generating natural and realistic images. AmazingFS accurately restores facial details and handles expressions and features well. It achieves natural skin tone transitions with a harmonious overall effect. Most importantly, AmazingFS exhibits excellent occlusion resistance. When faced with occlusions like hair, glasses, and microphones, the face-swapping effect remains natural and comfortable, with hardly any noticeable flaws.
In the experimental results of AmazingFS, an “incomplete face swap” phenomenon appears. This happens because AmazingFS retains the source face features too strongly, making certain features too prominent and overshadowing the target face features. When AmazingFS keeps distinct features like jawlines, eye shapes, or cheekbones, these elements can dominate the swapped image and make the swap seem less complete. AmazingFS excels in handling lighting and skin tone consistency. This ensures that the lighting and skin tone of the swapped face blend seamlessly with the rest of the image and reduce visual discrepancies. However, this strength can also make the swapped face resemble the source face too closely and create the impression that the faces have not been completely swapped. For example, if the source face has a particular skin texture or lighting condition that is well preserved, the target face may inherit these characteristics too faithfully, making the distinction between the two faces less apparent. The attention mechanism designed to focus on key features might overly emphasize prominent aspects of the source face like unique eye shapes or mouth curvature. Similarly, the style fusion module, which integrates stylistic elements of both faces, might blend the features in a way that favors the source face’s distinctive traits. As a result, the final generated image can seem “incomplete” in terms of the swap, appearing to retain more of the source face than intended. This illusion of incomplete face swapping is particularly noticeable when swapping faces within the same ethnicity. For instance, swapping Asian faces with other Asian faces or Caucasian faces with other Caucasian faces might result in subtler differences in facial features due to inherent similarities within the same ethnic group. 
The minimal variation in features like skin tone, facial structure, and eye shape means that even slight retention of source features can make the swap seem less significant. Consequently, AmazingFS’s ability to maintain lighting and skin tone consistency, while generally advantageous, can exacerbate the perception of an incomplete swap in these scenarios.
We downloaded face videos from YouTube to perform frame-by-frame face swapping and demonstrated five sets of results, extracting the swapped face results from the 1st, 50th, 100th, 150th and 200th frames. AmazingFS exhibits significant advantages in face swapping across different frames, primarily due to its use of attention mechanisms, AdaIN+ and AmazingSeg technologies. The attention mechanism enables AmazingFS to more accurately capture and process facial features and details. This ensures the generated face-swapped images maintain high quality and a natural, realistic appearance under various complex backgrounds and lighting conditions. As shown in
Figure 9, with an increasing number of frames, facial feature alignment and fusion remain precise, and detail processing remains accurate. AdaIN+ allows AmazingFS to better adapt to and integrate the different styles and textures of source images. This plays a crucial role in maintaining facial consistency and detail restoration. Despite significant style differences between the source and target images, AdaIN+ helps achieve a natural face-swapping effect. AmazingSeg segmentation technology improves the segmentation accuracy of facial regions. This makes the face-swapping effect more natural and delicate. When dealing with hair, accessories, glasses, and microphones of different sizes, positions, and angles, AmazingSeg can accurately segment facial regions, avoiding common issues of unsmooth edges or improper occlusion handling, ensuring that the face-swapping effect remains natural and comfortable. By combining attention mechanisms, AdaIN+, and AmazingSeg technologies, AmazingFS significantly enhances the quality of face-swapped images. Its performance is stable across different frames, and it handles various complex facial features and backgrounds, producing natural, realistic, and detailed face-swapping effects of movie-level quality.
5.2. Quantitative Results
We perform a quantitative comparison on the FaceForensics++ video dataset using SSIM, ID preservation, pose error, expression error, and face shape error to further demonstrate the effectiveness of our AmazingFS. For FaceSwap, DeepFaceLab, and SimSwap, we uniformly sample ten frames from each video to form a 10K test set.
The Structural Similarity Index (SSIM) measures the similarity between two images by comparing their luminance, contrast, and structural information. Specifically, SSIM divides the images into small blocks and calculates the mean, variance, and covariance of these blocks. These statistics are then used to measure the luminance similarity, contrast similarity, and structural similarity between the images, resulting in an SSIM value between 0 and 1. A value closer to 1 indicates higher similarity and better quality of the face-swapped image.
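As a rough sketch of this computation, the NumPy snippet below implements SSIM over whole-image statistics; the block-wise windowing described above is omitted for brevity (libraries such as scikit-image provide the full windowed version), and the function name `ssim_global` is ours.

```python
import numpy as np

def ssim_global(x, y, data_range=1.0):
    """Minimal SSIM using whole-image statistics (production code
    normally slides an 11x11 window and averages the local scores)."""
    c1 = (0.01 * data_range) ** 2  # stabilizes the luminance term
    c2 = (0.03 * data_range) ** 2  # stabilizes the contrast/structure term
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    lum = (2 * mu_x * mu_y + c1) / (mu_x**2 + mu_y**2 + c1)
    cs = (2 * cov_xy + c2) / (var_x + var_y + c2)
    return lum * cs

rng = np.random.default_rng(0)
img = rng.random((64, 64))
print(ssim_global(img, img))        # identical images -> ~1.0
print(ssim_global(img, 1.0 - img))  # inverted image -> much lower
```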
The ID preservation metric evaluates identity consistency by comparing the original face image with the face-swapped image. We extract feature vectors from both images and calculate the similarity between these feature vectors. A higher similarity indicates that the face-swapped image retains the identity features of the original face, verifying the authenticity and quality of the face swap. We use the pre-trained face recognition model ArcFace [
34] to ensure accurate feature extraction.
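The comparison itself reduces to a cosine similarity between embedding vectors. The sketch below assumes the 512-D identity embeddings have already been extracted (in this setup they would come from the pretrained ArcFace model); the random vectors here are stand-ins.

```python
import numpy as np

def identity_similarity(feat_src, feat_swap):
    """Cosine similarity between L2-normalized identity embeddings.
    Higher values mean the swapped face better preserves the source
    identity. The embeddings here are stand-ins for ArcFace features."""
    a = feat_src / np.linalg.norm(feat_src)
    b = feat_swap / np.linalg.norm(feat_swap)
    return float(a @ b)

rng = np.random.default_rng(1)
v = rng.standard_normal(512)              # ArcFace embeddings are 512-D
noisy = v + 0.1 * rng.standard_normal(512)
print(identity_similarity(v, v))          # same identity -> ~1.0
print(identity_similarity(v, noisy))      # slightly perturbed -> close to 1
```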
The pose error metric determines the accuracy of the facial pose by analyzing the positions and angles of facial keypoints. The algorithm identifies keypoints (e.g., eyes, nose, mouth) on both the source and target faces and solves the perspective-n-point problem. This involves estimating the camera’s (viewpoint’s) pose using a set of 3D model points and their 2D image correspondences. The resulting rotation matrix can be converted into Euler angles (pitch, yaw, roll). The differences in these Euler angles are calculated using cosine similarity and angle differences to determine the 3D facial pose. The pose metric evaluates the naturalness and realism of the face swap by comparing the facial pose in the swapped image with the target face’s pose. Smaller differences indicate better alignment with the original image’s head pose, resulting in a higher evaluation.
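The rotation-matrix-to-Euler conversion and the angle-difference step can be sketched as follows. This assumes a ZYX Euler convention and a rotation matrix as produced by, e.g., OpenCV's `solvePnP` followed by `Rodrigues`; the helper names are ours.

```python
import numpy as np

def rotation_to_euler(R):
    """Rotation matrix -> Euler angles (pitch, yaw, roll) in degrees,
    using the common ZYX (yaw-pitch-roll) convention."""
    sy = np.hypot(R[0, 0], R[1, 0])
    if sy > 1e-6:  # non-degenerate case
        pitch = np.arctan2(-R[2, 0], sy)
        yaw = np.arctan2(R[1, 0], R[0, 0])
        roll = np.arctan2(R[2, 1], R[2, 2])
    else:          # gimbal lock
        pitch = np.arctan2(-R[2, 0], sy)
        yaw = 0.0
        roll = np.arctan2(-R[1, 2], R[1, 1])
    return np.degrees([pitch, yaw, roll])

def pose_error(R_swap, R_target):
    """Mean absolute Euler-angle difference between two head poses."""
    return float(np.mean(np.abs(rotation_to_euler(R_swap)
                                - rotation_to_euler(R_target))))

def yaw_matrix(deg):
    """Pure yaw rotation, used here only to build test inputs."""
    t = np.radians(deg)
    return np.array([[np.cos(t), -np.sin(t), 0],
                     [np.sin(t),  np.cos(t), 0],
                     [0,          0,         1]])

# ~1.0: the 3-degree yaw difference averaged over three angles
print(pose_error(yaw_matrix(10.0), yaw_matrix(13.0)))
```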
Expression error evaluation is based on analyzing and comparing facial expressions in the images. By comparing the differences in facial expressions between two images, we measure their similarity. We use facial feature extraction techniques such as facial keypoint detection and facial expression classification, and then calculate the Euclidean distance between the 2D landmarks of the target face and the swapped face. Smaller distances indicate better expression retention.
The core idea of the shape error metric is to measure the geometric similarity between the result face and the target face. By detecting keypoints on both the source and target faces, we obtain the positions of these keypoints and then calculate the differences between the two sets of keypoints. We use the Euclidean distance to quantify these differences, with smaller values indicating higher matching degrees in shape.
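Both the expression and shape errors thus reduce to a mean Euclidean distance between corresponding 2D landmark sets, which can be sketched as:

```python
import numpy as np

def landmark_distance(lms_a, lms_b):
    """Mean Euclidean distance between two (N, 2) landmark arrays.
    Serves both the expression error (target vs. swapped landmarks)
    and the shape error (keypoint geometry); lower is better."""
    lms_a, lms_b = np.asarray(lms_a, float), np.asarray(lms_b, float)
    return float(np.linalg.norm(lms_a - lms_b, axis=1).mean())

# Toy 5-point landmarks (eyes, nose, mouth corners) in pixel coordinates.
target = np.array([[30, 40], [70, 40], [50, 60], [40, 80], [60, 80]])
swapped = target + np.array([3, 4])        # every landmark shifted by (3, 4)
print(landmark_distance(target, swapped))  # -> 5.0
print(landmark_distance(target, target))   # -> 0.0
```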
Table 1 presents a quantitative comparison of our AmazingFS against five other face-swapping algorithms (FaceSwap, DeepFaceLab, SimSwap, Inswapper, and BlendSwap) across several key metrics, including SSIM, ID preservation, pose error, expression error, and shape error. The results indicate that AmazingFS outperforms the others in all metrics. Specifically, AmazingFS achieved the highest SSIM score of 0.82, indicating superior image quality. In terms of ID preservation, AmazingFS leads with a score of 95.12, demonstrating its excellence in maintaining identity consistency. In contrast, the other face-swapping methods fall short on these key metrics. For pose and expression, AmazingFS attained relatively small errors of 2.35 and 23.45, respectively, highlighting its superior performance in preserving pose and expression. Finally, for face shape, AmazingFS achieved the lowest error, 44.61, indicating high accuracy in geometric shape matching. Therefore, AmazingFS shows significant advantages in the realism and quality of face-swapping effects.
At the same time, we designed an evaluation system comprising five subjective evaluation metrics to assess the effectiveness of face-swapping algorithms. These metrics are identity consistency, attribute retention, anti-occlusion capability, detail preservation, and overall fidelity. Each metric is rated on a scale from 0 to 5, with higher scores indicating better performance.
Identity primarily measures whether the face-swapped image retains the original identity features. Higher scores indicate a greater similarity between the face-swapped image and the original face in terms of identity features.
Attribute evaluates whether the face-swapped image retains the original face’s attribute features, such as age, gender, and emotional expression. Higher scores indicate better attribute retention.
Anti-occlusion assesses the performance of the face-swapping algorithm in handling partial occlusions, such as wearing glasses or hats. Higher scores indicate that the face-swapped image maintains good recognition performance, even under occlusion.
Details primarily measures whether the face-swapped image retains the original face’s detailed features, such as skin texture and hair. Higher scores indicate better detail preservation.
Fidelity evaluates the overall visual realism of the face-swapped image. Higher scores indicate more realistic face-swapping effects.
To ensure the objectivity and reliability of the evaluation results, we invited 50 volunteers to subjectively score the face-swapping results. Each volunteer viewed the face-swapped images and rated them based on the five metrics mentioned above. Finally, as shown in
Table 2, we performed statistical analysis on the scores for each metric to evaluate the performance of the face-swapping algorithms across different aspects.
5.3. Ablation Studies
We conducted four different types of ablation experiments to demonstrate the effectiveness of our designed STAM, AdaIN+, and AmazingSeg modules.
5.3.1. CAM+ Upscaling Study
We performed an in-depth exploration of the optimal dimensions for the channel attention mechanism (CAM) upscaling module in the STAM module of AmazingFS. By employing various upscaling strategies and using the structural similarity index (SSIM), pose, and expression as evaluation metrics, we systematically analyzed the impact of different upscaling strategies on model performance.
Table 3 details the results of six experimental groups.
First, Group A adopted the strategy of directly reducing the channel dimensions to 1 × 1 × C/16. The results showed that this method performed the worst, with an SSIM of 0.74, a pose error of 6.29, and an expression error of 28.15. This indicates that simple dimensionality reduction leads to significant information loss, greatly diminishing the model’s performance in facial feature extraction and reconstruction. Specifically, this strategy failed to effectively preserve the details of facial landmarks and contours, resulting in poor overall quality of the face-swapped images. Next, Groups B and C increased the channel dimensions to 1 × 1 × 2C and 1 × 1 × 3C, respectively, and the results showed improved model performance. Group B achieved an SSIM of 0.75, a pose error of 5.26, and an expression error of 26.56, while Group C further improved to an SSIM of 0.79, a pose error of 3.54, and an expression error of 24.78. This demonstrates that increasing the channel dimensions can better capture fine-grained facial details, such as the eyes, nose, and mouth, thereby reducing information loss and enhancing the model’s performance in facial feature generation and alignment. Group D adopted the strategy of increasing the channel dimensions to 1 × 1 × 4C, and the results showed the best performance. This suggests that moderately increasing the channel dimensions can significantly enhance the model’s feature representation capability, making it more effective in handling facial expression details and complex facial poses. However, when the channel dimensions were further increased to 1 × 1 × 5C (Group E) and 1 × 1 × 10C (Group F), the model’s performance did not improve further, and even declined in some metrics. Group E achieved an SSIM of 0.81, a pose error of 3.92, and an expression error of 25.29, while Group F had an SSIM of 0.76, a pose error of 4.18, and an expression error of 26.11. 
This indicates that excessive increase in channel dimensions can lead to feature redundancy, negatively impacting the model’s performance. Specifically, these strategies may fail to effectively focus on critical facial keypoints, resulting in redundant and unstable feature representations. Ultimately, we set the channel dimension upscaling factor to 4 in the channel attention module. This strategy effectively preserves the original facial features and reduces information loss while significantly enhancing the network’s feature extraction capabilities, thereby improving the authenticity and visual consistency of the face-swapped images.
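A minimal sketch of the chosen configuration follows, with random illustrative weights standing in for the learned projections: global average pooling, expansion from C to 4C with ReLU, projection back to C, and a sigmoid gate over the channels.

```python
import numpy as np

def cam_plus(feat, w_up, w_down):
    """Channel attention with an expanded hidden layer (C -> 4C -> C),
    mirroring the 1 x 1 x 4C setting chosen for Group D.
    feat: (C, H, W) feature map; w_up: (4C, C); w_down: (C, 4C)."""
    pooled = feat.mean(axis=(1, 2))                  # global average pool -> (C,)
    hidden = np.maximum(w_up @ pooled, 0.0)          # expand to 4C, ReLU
    gate = 1.0 / (1.0 + np.exp(-(w_down @ hidden)))  # per-channel gate in (0, 1)
    return feat * gate[:, None, None]                # recalibrate channels

rng = np.random.default_rng(2)
C, H, W = 8, 16, 16
x = rng.standard_normal((C, H, W))
w_up = rng.standard_normal((4 * C, C)) * 0.1    # illustrative random weights
w_down = rng.standard_normal((C, 4 * C)) * 0.1
out = cam_plus(x, w_up, w_down)
print(out.shape)  # (8, 16, 16) -- same shape, per-channel rescaled
```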
5.3.2. Attention Mechanisms in Face-Swapping
In this ablation study, we analyzed the impact of different attention mechanisms and feature processing methods adopted in the STAM module on the model’s performance.
Table 4 presents the performance of various experimental groups when applying different attention mechanisms and feature processing methods. The results indicate that Group G, which employs an improved channel attention mechanism (CAM+), spatial attention mechanism (SAM), and feature recalibration method, performed the best with an SSIM of 0.82, a pose error of 2.35, and an expression error of 23.45. In contrast, Group A, which did not use any attention mechanisms or feature processing methods, performed the worst, with an SSIM of 0.72, a pose error of 7.99, and an expression error of 34.02. This demonstrates that attention mechanisms and feature processing methods play a crucial role in enhancing the model’s performance. Although Group F performed well in terms of expression retention, it was behind Group G in pose retention and overall image quality. A comparison between Groups D and E revealed that the parallel use of attention modules outperformed their serial use. This is likely because the parallel structure can more effectively capture and process both local facial details and global information, thereby improving the model’s performance in facial feature extraction and reconstruction. Additionally, Groups D and E, which did not use the feature recalibration module, performed worse than Groups F and G, which did. This further underscores the importance of feature recalibration in enhancing the model’s sensitivity to facial details and improving image reconstruction quality. The feature recalibration module adjusts the weights of feature maps, allowing the model to focus more on key facial features, thus enhancing the realism and visual consistency of the final generated images.
This ablation study demonstrates that employing multiple attention mechanisms and feature processing methods can significantly enhance the performance of face-swapping models, particularly in facial feature generation, alignment, and reconstruction. Specifically, CAM+ and SAM improve the model’s sensitivity and accuracy to facial features by capturing both facial details and global information. Meanwhile, the feature recalibration module further enhances the model’s feature expression capability and focus on critical features, thereby improving the overall quality of face-swapped images. This finding is significant for the further optimization and development of advanced face-swapping technology. By deeply understanding and applying these attention mechanisms and feature processing methods, we can achieve better results in facial feature generation, alignment, and reconstruction, resulting in more natural and realistic face-swapped images.
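The parallel arrangement can be illustrated as follows. This is a simplified reading rather than the exact CAM+/SAM architecture: both gates are derived from the same input (rather than chained serially) and combined, with a residual step standing in for the feature recalibration module.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def parallel_attention(feat):
    """Parallel (not serial) channel + spatial gating: both gates are
    computed from the same input feature map and applied jointly,
    followed by a simple residual recalibration. Illustrative only;
    the actual CAM+/SAM modules use learned convolutions."""
    # channel gate from global average pooling -> (C, 1, 1)
    ch_gate = sigmoid(feat.mean(axis=(1, 2)))[:, None, None]
    # spatial gate from the channel-wise mean map -> (1, H, W)
    sp_gate = sigmoid(feat.mean(axis=0))[None, :, :]
    attended = feat * ch_gate * sp_gate
    return feat + attended  # residual step standing in for recalibration

rng = np.random.default_rng(3)
x = rng.standard_normal((4, 8, 8))
y = parallel_attention(x)
print(y.shape)  # (4, 8, 8)
```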
5.3.3. Enhanced AdaIN Strategies for Face-Swapping
We improved the AdaIN module to enhance the performance of the face-swapping model, with results shown in
Table 5. The results indicate that Group 5, which employed AdaIN, a dynamic parameter adjustment mechanism, multi-scale fusion, and regularization strategies, performed the best. These strategies collectively optimized the model’s performance. Specifically, AdaIN normalizes the source image features and then recalibrates them using the mean and standard deviation of the target image, resulting in face-swapped images that retain the identity features of the source image while adopting the style features of the target image. The dynamic parameter adjustment mechanism enhanced the model’s adaptability, allowing it to better handle different facial features and expression changes. Multi-scale fusion captured details at various scales, ensuring the delicacy and fineness of the image. The regularization strategy maintained the naturalness and consistency of the image, preventing overfitting.
The combination of these strategies resulted in the best performance for Group 5 in terms of quality, naturalness, and consistency. Specifically, Group 5 achieved an identity preservation (ID) score of 95.12, a pose error of 2.35, and an expression error of 23.45. This demonstrates that the comprehensive use of these advanced feature processing and adjustment strategies can significantly enhance the realism and visual consistency of face-swapped images. In contrast, Group 1, which did not use any improvement strategies, performed the worst, with an identity preservation score of 82.18, a pose error of 6.06, and an expression error of 30.26. This further proves the critical role of AdaIN and its improvement strategies in enhancing model performance. Group 2, which used only AdaIN, showed significant improvement, with an identity preservation score increasing to 89.76 and reductions in pose and expression errors. However, it was only after the introduction of the dynamic parameter adjustment mechanism that Group 3’s performance further improved, with an identity preservation score of 92.54 and pose and expression errors reduced to 4.25 and 27.54, respectively. When the multi-scale fusion strategy was added, Group 4’s performance significantly increased, with an identity preservation score reaching 94.08 and further reductions in pose and expression errors to 2.81 and 24.66, respectively. Finally, with the addition of the regularization strategy in Group 5, the model performance reached its optimal level.
This study demonstrates that the improved AdaIN module, combined with dynamic parameter adjustment, multi-scale fusion, and regularization strategies, can significantly enhance the performance of face-swapping models, particularly in facial feature generation and style transfer. By working synergistically, these strategies improve the model’s adaptability and stability in handling complex facial features, resulting in more natural and realistic face-swapped images.
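The core AdaIN step described above (normalize the source features, then recalibrate with the target's statistics) can be sketched as below; AdaIN+'s dynamic parameter adjustment, multi-scale fusion, and regularization strategies are omitted from this sketch.

```python
import numpy as np

def adain(src_feat, tgt_feat, eps=1e-5):
    """Core AdaIN: normalize the source features per channel, then
    rescale/shift them with the target's channel-wise std and mean.
    Features are (C, H, W) maps."""
    mu_s = src_feat.mean(axis=(1, 2), keepdims=True)
    std_s = src_feat.std(axis=(1, 2), keepdims=True)
    mu_t = tgt_feat.mean(axis=(1, 2), keepdims=True)
    std_t = tgt_feat.std(axis=(1, 2), keepdims=True)
    return std_t * (src_feat - mu_s) / (std_s + eps) + mu_t

rng = np.random.default_rng(4)
src = rng.standard_normal((3, 8, 8)) * 2.0 + 5.0   # source-style statistics
tgt = rng.standard_normal((3, 8, 8)) * 0.5 - 1.0   # target-style statistics
out = adain(src, tgt)
# The output inherits the target's channel statistics:
print(np.allclose(out.mean(axis=(1, 2)), tgt.mean(axis=(1, 2)), atol=1e-3))  # True
```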
5.3.4. Comprehensive Strategies for Enhanced Face-Swapping
In this research, we conducted a detailed analysis of the impact of different combinations of STAM, AdaIN+, AmazingSeg, and multi-scale strategies on the performance of face-swapping models. The results, shown in
Table 6, revealed that Group 9, which employed all of these strategies comprehensively, performed the best, achieving an SSIM of 0.82, an identity preservation (ID) score of 95.12, and a shape error reduced to 44.61. This indicates that AmazingFS, through the combined use of these improved strategies, significantly enhanced the model’s structural similarity, identity preservation, and shape fidelity.
The STAM module improves the quality and naturalness of face-swapped images by enhancing the model’s ability to capture both details and global information. AdaIN+ effectively transfers style features by dynamically adjusting feature parameters, retaining the identity features of the source image while incorporating the style features of the target image. The AmazingSeg module excels in handling facial occlusions and detail restoration, ensuring consistency and authenticity in generated images even in complex scenarios. The multi-scale strategies capture and fuse features at different scales, enhancing the model’s adaptability and stability when processing images with varying resolutions and detail levels.
The experimental results showed that while the use of a single strategy offered some improvement, the effects were limited. For instance, using STAM, AdaIN+, AmazingSeg, or the multi-scale strategy alone resulted in certain improvements in SSIM and ID, but not to the optimal level. Significant performance enhancement was only achieved when multiple strategies were combined. For example, the group that combined STAM, AdaIN+, and the multi-scale strategy performed well, but not as well as Group 9, which utilized all strategies. These results indicate that the comprehensive application of STAM, AdaIN+, AmazingSeg, and multi-scale strategies maximizes the overall performance of face-swapping models. This not only significantly enhances structural similarity, identity preservation, and shape fidelity, but also improves the naturalness and consistency of the generated images. By deeply understanding and applying these advanced feature processing and adjustment strategies, we can achieve better results in facial feature generation, alignment, and reconstruction, resulting in more natural and realistic face-swapped images.
5.3.5. Analyzing the Role of AmazingSeg
As shown in
Figure 10, the comparison results of face-swapping technology with and without AmazingSeg demonstrate its effectiveness in segmenting the face area more accurately and handling occlusions such as hair, glasses, and microphones. AmazingSeg focuses on learning the segmentation of the face area, making the face-swapping effect more natural and realistic. In the first row of comparison results, without AmazingSeg, there is a noticeable color difference and contour inconsistency between the target face and the source face, especially at the edges of the face and hair area, making the face-swapping effect look unnatural. After using AmazingSeg, the fusion effect of the face significantly improves, with more natural skin tone transitions and more harmonious contours, particularly with finer handling of the edges between the face and hair. The second row shows similar improvements. Without AmazingSeg, the integration of the target face’s eyeglasses and facial features is inconsistent, leading to distortions in the source face’s features and an unrealistic overall effect. After using AmazingSeg, the fusion of the glasses and face appears more natural, and the features of the source face are better preserved, making the face-swapping effect more realistic and natural. The third and fourth rows further illustrate the advantages of AmazingSeg. Without AmazingSeg, there are noticeable traces at the edges of the fusion between the source face and the target face. After using AmazingSeg, the edge transitions are more natural, and the overall effect looks more harmonious. The comparison results in the last row also significantly demonstrate the role of AmazingSeg in improving the face-swapping effect. Without AmazingSeg, the features of the source face are distorted during the face-swapping process, especially in the facial contours and makeup, making the face-swapping effect appear unrealistic. 
After using AmazingSeg, the features of the source face are well preserved, with natural integration of facial contours and makeup, resulting in a more realistic overall effect.
5.3.6. Analyzing the Role of GAN
In
Figure 11, a comparative analysis of face-swapping effects with and without the use of Generative Adversarial Networks (GAN) is presented. GAN, an advanced deep learning technology, utilizes two neural networks—a generator and a discriminator—that work antagonistically to produce highly realistic images. This technique significantly enhances realism and detail retention in face-swapping tasks. Results without GAN exhibit unnatural facial textures and lighting, along with poor integration. Faces with glasses show poorly fused eyes and facial features, appearing very unnatural. Faces with beards display poor integration of the beard with facial features, and the overall appearance looks mismatched and unrealistic. In contrast, using GAN markedly improves integration, with more harmonious lighting and better fusion of glasses and facial features, resulting in more natural textures overall. GAN-enhanced results show well-aligned facial features and present a more realistic effect even in complex scenarios involving glasses and beards. This experimental comparison demonstrates that GAN methods provide better facial feature fidelity and adaptability in face-swapping tasks. In AmazingFS, employing a GAN framework for face-swapping achieves higher quality and more natural effects.
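The antagonistic objective can be illustrated with the standard (non-saturating) GAN losses; the discriminator probabilities below are illustrative numbers, not values from the experiments.

```python
import numpy as np

def d_loss(d_real, d_fake, eps=1e-8):
    """Discriminator objective: push D(real) -> 1 and D(fake) -> 0."""
    return float(-np.mean(np.log(d_real + eps) + np.log(1.0 - d_fake + eps)))

def g_loss(d_fake, eps=1e-8):
    """Generator objective (non-saturating form): push D(fake) -> 1."""
    return float(-np.mean(np.log(d_fake + eps)))

# Hypothetical discriminator outputs for a batch of real/fake faces.
d_real = np.array([0.9, 0.8, 0.95])
d_fake = np.array([0.1, 0.2, 0.05])
print(d_loss(d_real, d_fake))  # low: D currently separates real from fake well
print(g_loss(d_fake))          # high: G is not yet fooling D
```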
5.4. Discussion and Prospects
We propose the AmazingFS method, which integrates STAM, AdaIN+, AmazingSeg, and multi-scale strategies to achieve significant performance improvements. AmazingFS offers several key advantages. It excels in image quality, with the combination of the STAM module and multi-scale strategies ensuring that the generated face-swapped images capture both fine details and global information, thereby enhancing the overall naturalness of the images. The AdaIN+ module dynamically adjusts feature parameters, effectively transferring identity features from the source image while incorporating style features from the target image, significantly improving the identity preservation and style consistency of the face-swapped images. Additionally, the AmazingSeg module demonstrates exceptional performance in handling facial occlusions and detail restoration, ensuring consistency and authenticity in generated images even in complex scenarios, and enhancing the model’s adaptability and stability when dealing with complex facial features. As shown in
Figure 12, AmazingFS effectively manages occlusions and meticulously integrates fine details such as the eyes and mouth, demonstrating the method’s robustness and precision. This detailed comparison further validates the superior capability of AmazingFS in maintaining realistic and consistent facial features.
Despite the superior performance of AmazingFS in generating realistic and consistent images, its high computational complexity currently limits real-time face-swapping. Producing face-swapping videos requires several preprocessing and post-processing steps, including face alignment, face detection, face cropping, face blending, and sharpening. These steps are computationally intensive and time-consuming. In the preprocessing phase, face alignment and detection require precise localization and correction of facial features to ensure consistent key points between the source and target images, while face cropping requires accurately extracting the facial region from the image. In the post-processing phase, face blending ensures the seamless integration of the face-swapped image with the background, and sharpening enhances the clarity and detail of the final image. The substantial computational load of these processes limits the feasibility of real-time face-swapping. Therefore, future research should focus on optimizing computational efficiency to achieve faster and real-time face-swapping capabilities.