Article

AmazingFT: A Transformer and GAN-Based Framework for Realistic Face Swapping

School of Computer and Information Engineering, Xiamen University of Technology, Xiamen 361024, China
*
Author to whom correspondence should be addressed.
Electronics 2024, 13(18), 3589; https://doi.org/10.3390/electronics13183589
Submission received: 28 July 2024 / Revised: 19 August 2024 / Accepted: 27 August 2024 / Published: 10 September 2024
(This article belongs to the Section Artificial Intelligence)

Abstract

Current face-swapping methods often suffer from issues of detail blurriness and artifacts in generating high-quality images due to the inherent complexity in detail processing and feature mapping. To overcome these challenges, this paper introduces the Amazing Face Transformer (AmazingFT), an advanced face-swapping model built upon Generative Adversarial Networks (GANs) and Transformers. The model is composed of three key modules: the Face Parsing Module, which segments facial regions and generates semantic masks; the Amazing Face Feature Transformation Module (ATM), which leverages Transformers to extract and transform features from both source and target faces; and the Amazing Face Generation Module (AGM), which utilizes GANs to produce high-quality swapped face images. Experimental results demonstrate that AmazingFT outperforms existing state-of-the-art (SOTA) methods, significantly enhancing detail fidelity and occlusion handling, ultimately achieving movie-grade face-swapping results.

1. Introduction

Face-swapping technology has found widespread application not only in entertainment and social media but also in virtual reality, security, and film production, demonstrating significant potential across these fields. Although Generative Adversarial Networks (GANs) [1] and 3D face reconstruction [2] were once the go-to methods for face-swapping, they are increasingly inadequate for meeting higher technical demands. Their limitations in achieving realistic and natural face-swapping have become more apparent over time.
Three-dimensional face reconstruction technologies such as 3D-AwareFS [3] and DiffSwap [4] extract three-dimensional geometric information from two-dimensional images to reconstruct the three-dimensional structure of the face. This process involves capturing the shape, pose, and lighting of the face to generate a 3D model. However, these methods often struggle to achieve precise reconstruction, which can result in subtle inaccuracies, artifacts, and an overall lack of realism in the final swapped face images. Generative Adversarial Networks comprise a generator and a discriminator that collaborate through adversarial training to generate realistic face-swapping images: the generator learns the features of the input image to produce results resembling the target image, while the discriminator aids in reducing artifacts and enhancing the clarity and naturalness of the generated images. FaceSwap-GAN, DeepFaceLab, SimSwap, Inswapper, and BlendSwap [5,6,7,8,9] have demonstrated remarkable face-swapping performance. Yet even GAN-based technologies, while capable of generating high-fidelity images, still struggle to fully retain the subtle identity details of the source face, such as skin tone, facial features, and makeup, together with the structural attributes of the target face, including face shape and facial expressions.
To achieve realistic and natural face swapping, this paper introduces AmazingFT, an advanced model built on the GAN framework whose core architecture integrates GAN and Transformer components [10]. It consists of three key modules: the Face Parsing Module, the Amazing Face Feature Transformation Module (ATM), and the Amazing Face Generation Module (AGM), which work together to deliver high-fidelity, natural face-swapping results. The Face Parsing Module is an off-the-shelf component responsible for predicting masks that separate the inner face from the background and for extracting facial semantics to guide the Transformer’s training in the ATM. By accurately delineating facial regions and extracting relevant semantic information, this module establishes a solid foundation for subsequent feature transformation and generation. The ATM receives the source inner face, target inner face, and target face semantics as inputs. It first learns the semantic correspondence between the source and target faces and then maps the identity features of the source face to the corresponding regions of the target face, achieving high-fidelity transfer of the source face features while maintaining the semantic consistency of the target face. The AGM slices and transforms the multi-scale facial features generated by the ATM to produce high-fidelity swapped face images. By comprehensively utilizing multi-scale features, the AGM ensures that the generated face-swapping images perform well in both detail and overall effect.
Our goal is to generate film-grade face-swapping results based on the features and semantic information of the source and target faces. To achieve this, we apply a multi-scale feature fusion strategy within the overall GAN framework, enabling the generator to capture complex relationships among facial features at four different scales, ensuring high fidelity and rich detail in the images. The discriminator compares real and generated images to identify unrealistic features and provides feedback that helps the generator refine the results, effectively reducing visual artifacts such as blurring, ghosting, and unnatural edges. The face-swapping effect of AmazingFT is shown in Figure 1.
The main contributions of this paper are as follows:
  • We propose an innovative face-swapping model, AmazingFT, combining GAN and Transformer concepts by introducing Transformer architecture into the GAN framework.
  • The model incorporates multi-scale feature transformation technology, integrating facial features at different scales to achieve more refined and comprehensive feature mapping.
  • We conducted extensive qualitative and quantitative experiments on public datasets (CelebA-HQ [11] and FaceForensics++ [12]), demonstrating that the model outperforms other SOTA methods by achieving more natural face blending, finer detail preservation and higher-quality face-swapping image generation.

2. Related Work

In recent years, deep-learning-based face-swapping technology has become increasingly popular. Our AmazingFT model draws on ideas from GAN-based face-swapping methods and from Transformer architectures, both of which are reviewed below.

2.1. GAN-Based Face Swapping

In the development of face-swapping technology, GAN-based research has made significant progress, though each method has its flaws. FSGAN [13] achieves high-quality face-swapping but its multi-stage processing is complex and computationally expensive, with inadequate detail in extreme conditions. FSGANv2 [14] improves detail and naturalness but at the cost of increased computational complexity and longer training times, with stability issues when handling extreme expressions. E4S [15] ensures high-fidelity transfer through regional GAN inversion but may suffer from detail loss in complex scenes or under challenging lighting conditions. FaceSwap-GAN [5] uses denoising autoencoders and attention mechanisms for more realistic images but still needs improvement in high-resolution and dynamic video scenarios. DeepFaceLab [6] generates realistic face-swapping images using autoencoders and Poisson image editing techniques but struggles with detail preservation and complex backgrounds. SimSwap [7] has a simple structure and efficient training but faces challenges in handling detail and expression consistency in complex scenes. Inswapper [8] focuses on detail and expression consistency yet may experience detail loss under rapidly changing expressions and complex lighting. BlendFace [9] eliminates attribute bias to generate high-fidelity face-swapping images but may encounter attribute recognition issues or over-smoothing in extreme scenarios. These methods consistently exhibit the limitation of producing face-swapping results with varying levels of detail blurriness and artifacts. Our AmazingFT approach not only mitigates these challenges but also enhances detail fidelity and naturalness, resulting in more realistic and convincing face-swapping outcomes.

2.2. Applications of Transformers

Transformer models have made significant progress in deep learning, with applications in natural language processing (NLP), computer vision (CV), and multimodal learning. In NLP, Google’s BERT [16] uses a bidirectional Transformer encoder for pre-training and fine-tuning, significantly improving performance in tasks like question answering and language understanding. OpenAI’s GPT [17] series uses a Transformer decoder to generate coherent natural language text, demonstrating excellent performance. In computer vision, Google’s Vision Transformer [18] divides images into patches and processes them as input sequences, achieving end-to-end image classification with outstanding results on datasets like ImageNet [19]. Facebook’s DETR [20] transforms object detection into a set prediction task, capturing global context information in images and simplifying traditional detection post-processing. In multimodal learning, VideoBERT [21] combines visual and textual information, greatly enhancing the performance of multimodal pre-training models and excelling in video understanding and description generation tasks. These studies underscore the significant impact of Transformer models across various domains. Building on this foundation, our research extends the application of Transformer technology to the face-swapping domain, where it is instrumental in enhancing the realism and fidelity of generated images. By harnessing the advanced capabilities of Transformers, we seek to redefine the potential of face-swapping, achieving results that are more seamless and natural than previously possible.

3. Methods

As shown in Figure 2, the network architecture of AmazingFT consists of three main steps: face parsing, feature transformation and image generation. First, the Face Parsing Module parses the source and target images, generating the target background, target semantics, target inner face and source inner face. These parsing results provide the foundational information for subsequent feature extraction and transformation. Next, in the feature transformation step, the model uses the Amazing Transformation Module to extract important features from the target and source faces using VGG [22] feature extractors and semantic feature extractors. These features are processed through a multi-head attention mechanism and combined via matrix multiplication to obtain the final feature representation, ensuring that the source face features can be effectively mapped onto the target face.
The image generation step is crucial in the AmazingFT architecture. The Amazing Generation Module fuses the multi-scale features transformed by the Amazing Transformation Module (ATM) with the multi-scale features of the target background (B1, B2, B3, B4). This is carried out using a multi-scale generator to produce images of different resolutions (256 × 256, 128 × 128, 64 × 64, 32 × 32). Each resolution’s generated image is validated by a discriminator through Amazing GAN to ensure the authenticity and quality of the generated images. Finally, the output high-resolution swapped face image is compared with the real image, minimizing feature loss. Through this process, the model can efficiently achieve high-quality face swapping, generating realistic and natural face-swapping images.
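To make the data flow concrete, the following minimal sketch outlines the three-stage pipeline described above. The callables and dictionary keys (parser, atm, agm, 'inner_face', 'background', 'semantics') are illustrative placeholders under our assumptions, not the released implementation.

```python
def amazingft_forward(source_img, target_img, parser, atm, agm):
    """Sketch of the AmazingFT pipeline: parse -> transform -> generate.
    `parser`, `atm`, and `agm` are placeholder callables standing in for the
    Face Parsing Module, the ATM, and the AGM, respectively."""
    # 1. Face parsing: split each image into inner face / background / semantics.
    src = parser(source_img)   # expected to yield at least {'inner_face': ...}
    tgt = parser(target_img)   # {'inner_face': ..., 'background': ..., 'semantics': ...}

    # 2. Feature transformation: map source identity features onto target regions.
    fused_feats = atm(src['inner_face'], tgt['inner_face'], tgt['semantics'])

    # 3. Image generation: fuse with multi-scale background features and decode
    #    images at 32/64/128/256 px; return the highest-resolution result.
    outputs = agm(fused_feats, tgt['background'])
    return outputs[-1]
```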

3.1. Amazing Transformation Module

The main function of the Amazing Transformation Module is to precisely transform and map the source face features onto the target face. Using VGG feature extractors and semantic feature extractors, it extracts detailed and semantic features from both the target and source faces. These features are processed through a multi-head attention mechanism, generating weighted feature representations. These weighted features are integrated through matrix multiplication and fused with the target face’s features, ensuring that the source face’s identity features are accurately mapped to the corresponding areas of the target face, resulting in a high-quality composite feature representation. These feature representations are used in the image generation module to produce multi-scale, high-resolution swapped face images.
  • Feature Extraction
In this step, the model uses VGG feature extractors and semantic feature extractors to extract important features from the target and source faces. The VGG feature extractor is a pre-trained convolutional neural network that effectively captures both low-level and high-level features of an image. These features include edges, textures, and more complex semantic information such as facial contours and expressions. Simultaneously, the semantic feature extractor specifically extracts semantic features of the target and source faces, including the positions and shapes of key facial regions such as the eyes, nose, and mouth. These features provide the foundation for subsequent feature transformation and fusion.
  • Multi-Head Attention Mechanism
The extracted features are fed into the multi-head attention mechanism, where each input feature is mapped to a query ($Q$), key ($K$), and value ($V$) vector, and the compatibility between these vectors is calculated to generate weighted feature representations. Each attention head $i$ applies distinct learnable weight matrices $W_i^Q$ and $W_i^K$ to transform the query and key. The dot product of the projected query $QW_i^Q$ and the projected key $KW_i^K$ gives the compatibility score for each query–key pair, which measures the correlation between the input features. These dot products are passed through a softmax function to obtain a probability distribution that reflects the matching probability between each query and key, thereby generating the attention weights. The attention weights are then used to perform a weighted sum over the value vectors $V$, producing the output of each attention head. The attention computation over the query $Q$ and key $K$ is as follows:
$$\mathrm{head}_i = \mathrm{softmax}\!\left(\frac{(QW_i^Q)(KW_i^K)^{T}}{\lVert QW_i^Q\rVert\,\lVert KW_i^K\rVert}\right) \quad (1)$$
The multi-head attention mechanism enables the model to simultaneously focus on different parts of the input features, thereby capturing both local and global feature details. Each attention head operates in a different subspace, allowing the model to consider diverse semantic information and contextual relationships, which enhances its understanding of the input features. The final concatenation and linear transformation steps integrate the information from all attention heads, resulting in a richer and more accurate feature representation.
Finally, the outputs of all attention heads are concatenated and integrated using a linear transformation matrix $W_0$ to produce the final multi-head attention output $Q_M$. Specifically, the attention heads ($\mathrm{head}_1, \ldots, \mathrm{head}_i$) independently compute their attention values, their outputs are concatenated into a matrix $Q_w$, and this matrix is then processed through the linear transformation matrix $W_0$. This allows the model to focus on different aspects of the input features across the various heads, capturing more details and semantic relationships, and it improves computational efficiency while enhancing the model’s ability to capture both global and local features.
  • Matrix Multiplication and Feature Fusion
After the multi-head attention computation, the generated weighted feature representations are integrated through matrix multiplication. This step ensures that the features of the source face can be accurately mapped to the corresponding areas of the target face. The construction of the semantic-aware correspondence matrix begins by extracting the source face feature matrix $Q_M$ and the target face feature matrix $K_M$ from the multi-head attention mechanism; these matrices contain the feature vectors derived from the source and target faces, respectively. Next, the dot product between $Q_M$ and the transpose $K_M^T$ is calculated to assess the similarity between each source and target feature. To keep the similarity calculation on a consistent scale, the dot products are normalized by the $L_2$ norms of $Q_M$ and $K_M$ (i.e., $\lVert Q_M\rVert$ and $\lVert K_M\rVert$). This normalization mitigates the influence of the feature vector lengths and improves the accuracy of the similarity computation. The normalized dot products are then passed through a softmax function, which converts the raw values into a probability distribution, so that the similarity between each source and target feature is expressed probabilistically. The resulting matrix $C$, known as the semantic-aware correspondence matrix, reflects the relationship between the source and target facial features and plays a crucial role in the subsequent feature fusion and image generation processes. The calculation formula is as follows:
$$C = \mathrm{softmax}\!\left(\frac{Q_M K_M^{T}}{\lVert Q_M\rVert\,\lVert K_M\rVert}\right) \quad (2)$$
By leveraging matrix multiplication and softmax operations, Equation (2) produces a semantic-aware correspondence matrix C, which delineates the matching relationship between the source and target facial features. This process ensures precise feature alignment, enabling the identity features of the source face to be effectively mapped onto the corresponding regions of the target face. Through such fusion, the model can maintain the identity characteristics of the source face while ensuring that the generated face-swapped image is highly consistent with the target face in both geometry and semantics, ultimately resulting in a more natural and realistic image.
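As a concrete illustration of Equations (1) and (2), the sketch below computes one cosine-normalized attention head and the semantic-aware correspondence matrix $C$ in PyTorch, and then uses $C$ to gather target-aligned features. The final warping step reflects our reading of the fusion description and is an assumption rather than the exact implementation.

```python
import torch
import torch.nn.functional as F

def cosine_attention_head(Q, K, W_q, W_k, eps=1e-8):
    """One attention head as in Equation (1): softmax over the cosine-normalized
    dot products between projected queries and keys."""
    q = Q @ W_q                                    # (N_src, d)
    k = K @ W_k                                    # (N_tgt, d)
    logits = (q @ k.T) / (q.norm(dim=-1, keepdim=True)
                          * k.norm(dim=-1, keepdim=True).T + eps)
    return F.softmax(logits, dim=-1)               # (N_src, N_tgt) attention weights

def correspondence_and_warp(Q_M, K_M, V_tgt, eps=1e-8):
    """Equation (2): semantic-aware correspondence matrix C, followed by a
    warping step that gathers target features for each source position."""
    C = F.softmax((Q_M @ K_M.T) / (Q_M.norm(dim=-1, keepdim=True)
                                   * K_M.norm(dim=-1, keepdim=True).T + eps), dim=-1)
    warped = C @ V_tgt                             # target features aligned to the source
    return C, warped
```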

3.2. Amazing Generation Module

The Amazing Generation Module (AGM) is responsible for converting the composite feature representations generated by the Amazing Transformation Module into high-quality swapped face images, as shown in Figure 3. This module ensures the authenticity and detail preservation of the generated images through the collaborative work of a series of residual blocks, generation blocks, and feature transfer nodes.
  • Residual Blocks and Generation Blocks
AGM contains multiple residual blocks (B1, B2, B3, B4). These residual blocks use skip connections to transmit information, ensuring gradient stability in deep networks and preventing feature loss and information decay. Each residual block depends on the output features of the previous residual block, transferring these features to the next residual block through feature transfer nodes (T1, T2, T3, T4), forming a complete feature processing chain. During the feature processing and generation process, features at different scales are fused through concatenation operations. These features include upsampled and downsampled features, ensuring high-quality image generation at different resolutions through a multi-scale generation strategy.
  • Multi-Scale Generation and Feature Transfer
AGM employs a multi-scale generation strategy to ensure high-quality image generation at different resolutions. Features are processed through multiple upsampling and downsampling operations at different scales. Upsampling gradually increases the resolution of the feature maps, while downsampling is used for feature fusion and refinement. Multi-scale generation ensures that both the details and global structure of the generated images are effectively processed and preserved at different levels. Feature transfer nodes (T1, T2, T3, T4) are responsible for transmitting feature information between different residual blocks. These nodes act as bridges, aggregating and transferring features from different levels, ensuring the model can utilize feature information from all levels when generating images.
  • Output and Optimization
The features processed by each residual block are passed to the generation blocks, which are responsible for generating the final image output. Figure 3 shows the generation blocks, each producing an image at a specific resolution (256 × 256, 128 × 128, 64 × 64, or 32 × 32). The generation blocks create high-resolution image outputs through convolution operations and feature concatenation, integrating features from different scales to ensure the generated images are visually natural and realistic.
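The sketch below illustrates this multi-scale decoding idea. The block layout, channel counts, and the additive background fusion are simplifying assumptions (the paper fuses features by concatenation), so it should be read as a schematic rather than the actual AGM.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Residual block with a skip connection, analogous to B1-B4 in the AGM chain."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1))

    def forward(self, x):
        return x + self.body(x)

class MultiScaleDecoder(nn.Module):
    """Residual blocks feed per-scale generation heads that emit 32/64/128/256 px images."""
    def __init__(self, ch=256):
        super().__init__()
        self.blocks = nn.ModuleList([ResBlock(ch) for _ in range(4)])
        self.heads = nn.ModuleList([nn.Conv2d(ch, 3, 3, padding=1) for _ in range(4)])
        self.up = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False)

    def forward(self, feat_32, background_feats):
        # feat_32: fused ATM features at 32x32; background_feats: target-background
        # feature maps at 32/64/128/256 with the same channel count (assumed here).
        x, outputs = feat_32, []
        for block, head, bg in zip(self.blocks, self.heads, background_feats):
            x = block(x + bg)                       # fuse background features (simplified)
            outputs.append(torch.tanh(head(x)))     # image at the current scale
            x = self.up(x)                          # move to the next resolution
        return outputs                              # [32x32, 64x64, 128x128, 256x256]
```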

3.3. Amazing GAN

The key steps of Amazing GAN in this paper involve the collaborative work of multi-scale generators and discriminators to ensure that the generated face-swapping images achieve optimal detail and overall effect at all levels. The multi-scale generators produce images at different resolutions (256 × 256, 128 × 128, 64 × 64, 32 × 32). The generators use progressive upsampling and convolution operations to generate high-resolution face images from low resolution. At each resolution level, the generators perform multiple convolution and upsampling operations to capture and preserve the details of the images. The advantage of multi-scale generation is that it can process image features at different scales, ensuring that the generated images are effectively handled in terms of both detail and overall structure. Lower-resolution generators can focus on the overall structure of the face image, while higher-resolution generators can capture fine textures and details, resulting in detailed and realistic face images.
The discriminator validates the generated image at each resolution. Its task is to evaluate the realism of the generated images and provide feedback to optimize the generator’s output. The discriminator extracts features from the input image through a series of convolutional layers and outputs a probability value indicating the realism of the face image. The generator and discriminator continually improve each other through adversarial training: the generator attempts to produce more realistic face images to deceive the discriminator, while the discriminator continually enhances its ability to detect fake face images. Through this adversarial mechanism, the generator gradually improves the quality of its generated face images, making them more realistic and natural.
During the face image generation process, the generator continually adjusts its parameters to reduce the difference between the generated images and the real images. This process is achieved by minimizing feature loss; the smaller the difference between the generated image and the real image, the lower the feature loss. Ultimately, the high-resolution face-swapping images generated by the generator will be highly similar to real images visually. Through this method of multi-scale generation and adversarial training, the Amazing GAN module can produce high-quality face-swapping images that are not only rich in detail but also very natural in overall visual effect, ensuring the realism and consistency of the face-swapping effect.
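The following sketch shows one generic adversarial update with a discriminator per output scale. It uses a standard non-saturating GAN loss purely to illustrate the adversarial mechanism; the exact losses, architectures, and schedule of Amazing GAN are not specified here.

```python
import torch
import torch.nn.functional as F

def gan_step(generator, discriminators, opt_g, opt_d, src, tgt, real_imgs):
    """One adversarial update with per-scale discriminators (generic sketch)."""
    fakes = generator(src, tgt)                    # list of images at 4 scales

    # Discriminator update: push real images toward 1 and fakes toward 0 at every scale.
    opt_d.zero_grad()
    d_loss = 0.0
    for D, real, fake in zip(discriminators, real_imgs, fakes):
        real_logit, fake_logit = D(real), D(fake.detach())
        d_loss = d_loss \
            + F.binary_cross_entropy_with_logits(real_logit, torch.ones_like(real_logit)) \
            + F.binary_cross_entropy_with_logits(fake_logit, torch.zeros_like(fake_logit))
    d_loss.backward()
    opt_d.step()

    # Generator update: try to make every scale look real to its discriminator.
    opt_g.zero_grad()
    g_loss = sum(F.binary_cross_entropy_with_logits(D(f), torch.ones_like(D(f)))
                 for D, f in zip(discriminators, fakes))
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```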
Figure 3. Amazing Generation Module network framework.

3.4. Loss Function

In AmazingFT, we combine several loss functions to train the model to achieve high-quality face swapping. Below are the four main loss functions used in AmazingFT and their formulas, followed by the combined total loss:
  • L1 Loss
$$\mathcal{L}_{1} = \lVert I_R - I_T \rVert_{1} \quad (3)$$
where $I_R$ represents the source face image and $I_T$ represents the target face image. The L1 loss function plays an important role in preserving feature semantic information and reducing the blurriness of the generated image, ensuring that the final swapped face image meets the expected quality in both detail and overall effect.
  • Adversarial Loss
$$\mathcal{L}_{adv} = -\,\mathbb{E}\big[\log D(G(z))\big] \quad (4)$$
where E represents the mathematical expectation, D represents the discriminator, and G represents the generator. z is a random noise vector. D(G(z)) represents the discriminator’s judgment of the images generated by the generator. The adversarial loss function consists of the generator and the discriminator. The generator’s goal is to generate realistic swapped face images that are as close as possible to real images, while the discriminator’s goal is to distinguish between real and generated images. The generator and discriminator compete through adversarial training, continually improving each other’s performance. The generator aims to maximize the probability that the discriminator will judge its generated images as real, while the discriminator aims to maximize its ability to correctly distinguish between real and generated images.
  • Perceptual Loss
$$\mathcal{L}_{per} = \sum_{i=0}^{n} \big\lVert \phi_i(I_R) - \phi_i(I_T) \big\rVert \quad (5)$$
The basic idea of the perceptual loss function is to use a pre-trained convolutional neural network (a VGG network) to extract high-level features of images and then calculate the difference between the generated image and the target image in these feature spaces. Compared to traditional pixel-level loss functions (such as mean squared error), the perceptual loss function better captures the semantic information and structural features of images. Here, $\phi_i$ denotes the features extracted by the pre-trained network (e.g., VGG) at the $i$-th layer.
  • Contextual Loss
$$\mathcal{L}_{context} = \sum_{l} -\log\Big(CX\big(\phi_l(I_T \cdot m_{tgt}),\ \phi_l(I_R \cdot m_{src})\big)\Big) \quad (6)$$
where $m_{tgt}$ and $m_{src}$ are the masks of the target face and source face generated by the face parsing network, and $CX$ represents the contextual similarity score. The contextual loss function is used to align the identity features of the swapped face and the source face in terms of color, texture, etc. This loss function ensures the consistency of visual features between the generated image and the source image by calculating similarity in the feature space.
The total loss function $\mathcal{L}_{total}$ can be expressed as follows:
$$\mathcal{L}_{total} = \alpha \mathcal{L}_{1} + \beta \mathcal{L}_{adv} + \gamma \mathcal{L}_{per} + \delta \mathcal{L}_{context} \quad (7)$$
The parameters α, β, γ and δ critically influence the experimental outcomes. Increasing α enhances the model’s effectiveness on the primary task L1, though it may cause the model to deprioritize other loss functions. A higher β value can bolster the model’s robustness, yet an excessive β might undermine the main task’s performance. Adjusting γ can refine feature extraction capabilities, but an overly high γ risks leading to overfitting. Similarly, while increasing δ aids in leveraging contextual information, an excessive δ may induce undesirable dependencies. Consequently, determining the optimal values for α, β, γ and δ necessitates a careful balance among these loss components to achieve superior model performance across diverse scenarios. Based on the configuration detailed in [23], we set β = 0.2, γ = 0.02 and δ = 0.01, subsequently adjusting α to 1.2 through empirical tuning.
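A minimal sketch of how these terms combine under the weights stated above (α = 1.2, β = 0.2, γ = 0.02, δ = 0.01) follows. The function signature and the use of an L1 distance for the perceptual term are our assumptions for illustration; the contextual term is assumed to be computed separately.

```python
import torch
import torch.nn.functional as F

ALPHA, BETA, GAMMA, DELTA = 1.2, 0.2, 0.02, 0.01   # weights from Equation (7)

def total_loss(fake, target, d_fake_logits, vgg_feats_fake, vgg_feats_target,
               contextual_term):
    """Combine the four loss terms of Equation (7); the contextual term is passed in
    precomputed because the CX similarity needs its own feature-matching routine."""
    l1 = F.l1_loss(fake, target)                                   # reconstruction (L1)
    adv = F.binary_cross_entropy_with_logits(                      # adversarial loss
        d_fake_logits, torch.ones_like(d_fake_logits))
    per = sum(F.l1_loss(f, t)                                      # perceptual loss
              for f, t in zip(vgg_feats_fake, vgg_feats_target))
    return ALPHA * l1 + BETA * adv + GAMMA * per + DELTA * contextual_term
```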

4. Experiments

In this section, we present our datasets, training details, qualitative results, quantitative results, and ablation studies.

4.1. Datasets and Processing

We selected the publicly available CelebA-HQ dataset for training FaceSwap, DeepFaceLab, SimSwap, Inswapper, BlendSwap, and AmazingFT, and compared their results with those of AmazingFT. CelebA-HQ is a high-quality facial image dataset containing 30,000 images with a resolution of 1024 × 1024. These images were carefully selected and post-processed from the original CelebA [11] dataset. During data preprocessing, we performed face detection to ensure each image contained a clearly visible face. Faces were aligned by detecting facial key points to standardize facial poses, ensuring that faces in all images were in the same position and angle. Finally, the images were scaled and cropped to a specified resolution, usually 256 × 256 or 512 × 512, to meet different model input requirements. These preprocessing steps ensured the consistency and quality of the training data, thereby improving model performance and face-swapping effects.
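As an illustration of the detection, cropping, and resizing steps described above, the sketch below uses OpenCV and dlib's frontal face detector. The margin value is an assumption, and the landmark-based alignment step is omitted for brevity.

```python
import cv2
import dlib

detector = dlib.get_frontal_face_detector()

def preprocess(path, out_size=256, margin=0.25):
    """Detect the face, crop it with a margin, and resize to the model input size."""
    img = cv2.imread(path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    faces = detector(gray, 1)
    if not faces:
        return None                               # skip images without a visible face
    f = faces[0]
    # Expand the detected box by a margin so the whole face region is kept.
    w, h = f.right() - f.left(), f.bottom() - f.top()
    x0 = max(0, int(f.left() - margin * w))
    y0 = max(0, int(f.top() - margin * h))
    x1 = min(img.shape[1], int(f.right() + margin * w))
    y1 = min(img.shape[0], int(f.bottom() + margin * h))
    crop = img[y0:y1, x0:x1]
    return cv2.resize(crop, (out_size, out_size), interpolation=cv2.INTER_AREA)
```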

4.2. Training Details

We trained the Amazing generator from scratch, with weights randomly initialized using a normal distribution. We employed the Adam optimizer (β1 = 0.5, β2 = 0.999) with a learning rate of 0.0005. The learning rate decayed exponentially by a factor of 0.97 every 100 K steps. All networks were trained on NVIDIA GeForce RTX 4090 GPUs and Intel Core i7-13700HX processors. The batch size was set to 4, and we used a progressive multi-scale approach for training, starting at a resolution of 32 × 32 and ending at 256 × 256, with a total of 200 K iterations. The target ($X_t$) and source ($X_s$) images had their brightness, contrast, and saturation randomly adjusted. Each configuration in the ablation experiments was trained for a total of 300 K steps. For comparative experiments with other recent methods, we maintained the same batch size and number of training iterations.
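In PyTorch terms, the stated optimizer and learning-rate schedule correspond to the following sketch; the one-layer generator is a stand-in for the actual network, and the loss computation is elided.

```python
import torch

generator = torch.nn.Sequential(torch.nn.Conv2d(3, 3, 3, padding=1))  # placeholder model
optimizer = torch.optim.Adam(generator.parameters(), lr=5e-4, betas=(0.5, 0.999))
# Exponential decay by a factor of 0.97 every 100 K steps.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=100_000, gamma=0.97)

for step in range(200_000):                 # 200 K iterations, batch size 4
    # ... forward pass on a batch, compute the total loss, loss.backward() ...
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()                        # advances the schedule once per iteration
```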

4.3. Qualitative Results

In Figure 4, we compare AmazingFT with five other state-of-the-art (SOTA) face-swapping methods (FaceSwap-Gan, DeepFaceLab, SimSwap, Inswapper, BlendSwap). We use FaceForensics++ as the test set, which is a large-scale dataset for facial forgery detection, containing 1000 original video sequences sourced from 977 YouTube videos. These videos feature trackable, primarily frontal faces without occlusions, allowing automatic tampering methods to generate realistic fake videos. In our experiments, we sampled 10 frames from each video for comparison. The results demonstrate that AmazingFT excels in face-swapping performance. The face-swapped images generated by AmazingFT show superior naturalness, consistency, detail preservation, expression conveyance, and lighting matching.
The images generated by AmazingFT are very natural, with no noticeable artifacts or incoherent areas, appearing as if they are part of the original image. The entire face maintains consistency in lighting and skin tone. Features of the source face, such as the eyes, nose, and mouth, are accurately mapped onto the target face and maintain high consistency. The positioning and proportions of facial features blend naturally with the target face, with no noticeable misalignment or distortion. In terms of detail preservation, AmazingFT performs exceptionally well, retaining skin texture, hair, and other fine features. Even at high resolution, the face-swapped images remain delicate and realistic. Furthermore, the images accurately convey the expressions of the source face. Expression changes, whether smiling or frowning, are natural and coherent, with facial expression details well presented. In lighting matching, AmazingFT also performs excellently, seamlessly integrating lighting conditions. The transitions between highlights and shadows are very natural, avoiding noticeable lighting inconsistencies.
In comparison, while BlendSwap performs well, it falls slightly short of AmazingFT in terms of naturalness, detail preservation, and lighting matching. FaceSwap exhibits noticeable artifacts and incoherent areas, DeepFaceLab has deficiencies in expression consistency and lighting matching, SimSwap performs poorly in facial feature consistency and detail handling, and although Inswapper does well in detail preservation, it has shortcomings in expression conveyance and lighting matching, leaving room for significant improvement. These comparisons clearly showcase the outstanding performance of AmazingFT in face-swapping tasks.
We downloaded face videos from YouTube to perform frame-by-frame face swapping, showcasing five sets of results by extracting the face-swapped outputs of the 1st, 25th, 50th, 75th, and 100th frames. As shown in Figure 5, AmazingFT demonstrates powerful performance in the field of face swapping. AmazingFT not only generates high-quality, realistic, and natural face-swapping effects but also effectively handles partially occluded faces. By combining Transformer and GAN technologies, AmazingFT excels in both detail processing and overall visual effects. The frame-by-frame results clearly illustrate its stability and consistency, with facial expressions, lighting conditions, and facial details all handled very naturally.
The Transformer, through its multi-head attention mechanism, captures and maintains the complex semantic relationships between the source face and the target face. Observing the frame-by-frame images, the expression and pose changes in each frame are very natural. For instance, in the first column from the 25th frame to the 100th frame, as the expression changes, the features on the target face also change synchronously, demonstrating the model’s strong capability in capturing dynamic features. Through feature fusion, the Transformer ensures that features at different scales are effectively processed and preserved. In each frame, details such as skin texture, hair, and facial shadows are well preserved, making the face-swapping results more realistic.
From the 1st frame to the 100th frame, the generated face-swapped images maintain high quality at different resolutions. The GAN generator uses progressive upsampling and convolution operations to generate high-resolution images from low resolution, ensuring that details at each layer are effectively processed and preserved. For example, observing the second column at the 50th and 75th frames, details such as skin texture and lighting effects are very clear. The discriminator continually optimizes the generator during training, enabling it to produce more realistic and natural images. The frame-by-frame results show that the target face and the source face maintain consistency and realism in the overall structure and detailed features. For instance, in the fourth column at the 25th and 100th frames, the facial features of the target face are highly consistent with the source face, with no noticeable artifacts or unnatural transitions.

4.4. Quantitative Results

Next, we conducted a quantitative comparison on the FaceForensics++ video dataset using the following metrics: SSIM, FID, Pose Error, Expression Error, and Face Shape Error, to further validate the effectiveness of our AmazingFT. For FaceSwap, DeepFaceLab, and SimSwap, we evenly sampled 10 frames from each video, creating a 10 K test set.
  • SSIM
SSIM measures the similarity between the generated image and the reference image by comparing luminance, contrast, and structural information. It aligns more closely with the human visual system’s perception, reflecting image quality better than simple pixel differences. SSIM calculation involves a local window within the image, computing the similarity in terms of luminance, contrast, and structure for each window, and then averaging the similarity across the entire image. In this paper, SSIM is implemented using the scikit-image library in Python. The generated and reference images are converted to grayscale, and the SSIM function computes the SSIM value between the two images. A higher SSIM value indicates better image quality.
Figure 5. Face-swapping results of AmazingFT with different frame counts in videos.
  • FID
FID evaluates the quality of generated images by calculating the distribution differences between generated images and real images in the feature space of a pre-trained Inception [24] network. It considers the mean and covariance of the feature distributions; a lower FID value indicates closer distributions and higher image quality. In this paper, FID is implemented using a pre-trained Inception network. The generated and real images are fed into the Inception network to extract features. Then, the mean and covariance of these features are computed, and the FID value is calculated using these statistics. A lower FID value indicates better image quality.
  • Pose
The Pose metric assesses the consistency of facial pose between the generated image and the target pose by comparing facial orientation, angle, and position features. Higher pose consistency indicates better performance in maintaining the target pose. In this paper, pose evaluation is implemented using Dlib [25] for facial pose estimation. Key facial points of the generated and target images are detected, and their pose parameters (e.g., pitch, yaw, and roll angles) are calculated and compared. A lower Pose value indicates better image quality.
  • Expression
The Expression metric measures the consistency of facial expressions between the generated image and the source face by comparing facial expression features such as mouth corner lifting and eyebrow raising. In this paper, expression evaluation is implemented using OpenFace [26] for facial expression recognition. Key facial points of the generated and source images are detected, and their expression vectors are calculated and compared. A lower Expression value indicates better image quality.
  • Shape
The Shape metric evaluates the consistency of facial shape between the generated image and the target shape by comparing facial contours and the positions of key points. Higher shape consistency indicates better performance in maintaining facial shape. In this paper, shape evaluation is typically implemented using facial key point detection algorithms, and we use Dlib for this purpose. Key facial points of the generated and target images are detected, and their Euclidean distances and similarity metrics are calculated to assess shape consistency. A lower Shape value indicates better image quality.
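To make the SSIM and FID computations described above concrete, the sketch below uses scikit-image for SSIM and NumPy/SciPy for the Fréchet distance over precomputed Inception features; the feature extraction itself (a pre-trained Inception network) is assumed to happen upstream.

```python
import numpy as np
from scipy import linalg
from skimage.metrics import structural_similarity

def ssim_score(gen_gray, ref_gray):
    """SSIM between two uint8 grayscale images (higher is better)."""
    return structural_similarity(gen_gray, ref_gray)

def fid_score(feats_real, feats_fake):
    """FID from Inception features of real and generated sets, shape (N, D);
    lower is better."""
    mu_r, mu_f = feats_real.mean(0), feats_fake.mean(0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(covmean):        # numerical noise can add tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))
```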
As shown in Table 1, the performance of six different face-swapping methods across multiple evaluation metrics is presented, including FaceSwap, DeepFaceLab, SimSwap, Inswapper, BlendSwap, and AmazingFT (ours). From the table, it is evident that AmazingFT outperforms all other methods in every evaluation metric. It has the highest SSIM value (0.84), indicating the best structural similarity between the generated images and the real images. It also has the lowest FID value (28.45), suggesting the highest quality of generated images with minimal distribution differences from the real images. Additionally, it has the lowest Pose, Expression, and Shape values, which are 2.95, 25.44, and 40.32, respectively, demonstrating superior consistency in pose, expression, and shape of the generated images. In comparison, DeepFaceLab, SimSwap, FaceSwap, Inswapper, and BlendSwap show varying strengths and weaknesses. Overall, AmazingFT leads significantly across all metrics, showcasing its exceptional performance in face-swapping tasks.
In this study, we designed an evaluation system comprising five subjective evaluation metrics to assess the effectiveness of face-swapping algorithms. The five metrics are as follows:
  • Naturalness: This metric evaluates the degree of naturalness in the face-swapping effect, determining whether the generated face-swapped image appears to be a real face. Images with high naturalness should have no obvious artificial traces, artifacts, or incoherent areas, presenting a smooth and realistic overall visual effect.
  • Consistency: This metric assesses whether the features of the source face are consistent with the target face in the swapped image, including the position, proportion, and overall structure of facial features. High consistency should make it difficult for viewers to detect unnatural changes.
  • Detail preservation: This metric evaluates the degree of detail retention in the swapped images, including whether fine features such as skin texture, hair, and wrinkles are preserved. High detail preservation means that the images should appear real and detailed, even at high resolution.
  • Expression conveyance: This metric measures whether the facial expressions of the source face are accurately conveyed in the swapped images. Good expression conveyance should maintain the expression changes in the source face and present the same emotions and expression details on the target face.
  • Lighting consistency: This metric assesses the consistency of lighting in the swapped images, particularly the integration of lighting conditions between the source face and the target face. Good lighting consistency ensures that shadows, highlights, and color tones appear natural and harmonious across different lighting environments.
To ensure the objectivity and reliability of the evaluation results, we invited 60 volunteers to subjectively score the face-swapping outcomes. Each volunteer reviewed the results and rated each image based on the five aforementioned metrics, with scores ranging from 1 to 5. We then conducted statistical analyses on the scores for each metric to assess the performance of the face-swapping algorithms across various aspects. A higher score indicates better performance in the respective metric, while a lower score signifies poorer performance.
Table 2 shows the performance of six different face-swapping methods across five subjective evaluation metrics, including FaceSwap, DeepFaceLab, SimSwap, Inswapper, BlendSwap, and AmazingFT (ours). The scores for each method on Identity, Attribute, Anti-Occlusion, Details, and Fidelity metrics illustrate their performance in various aspects. It is evident from the table that AmazingFT excels in all evaluation metrics, with scores of 4.12 for Identity, 4.08 for Attribute, 4.06 for Anti-Occlusion, 4.33 for Details, and 4.07 for Fidelity. This demonstrates its superior performance in identity recognition, attribute retention, anti-occlusion, detail handling, and fidelity. In contrast, other methods like DeepFaceLab, FaceSwap, SimSwap, Inswapper, and BlendSwap generally score below 4 in these metrics. Overall, AmazingFT leads significantly in comprehensive performance, outperforming other mainstream networks in face-swapping tasks.

4.5. Ablation Study

We conducted three different types of ablation experiments to demonstrate the effectiveness of our designed network architecture.
We evaluated the performance of three different face-swapping model configurations across multiple evaluation metrics: a model without GAN (W/O GAN), a model without Transformer (W/O Transformer), and the complete AmazingFT model, as shown in Table 3. From the table, it is evident that the complete AmazingFT model performs the best across all metrics. It has the highest SSIM value (0.84), indicating the best structural similarity of the generated images; the lowest FID value (28.45), indicating the highest quality of generated images with minimal distribution differences from the real images; and the lowest Pose, Expression, and Shape values, which are 2.95, 25.44, and 40.32, respectively, demonstrating superior performance in pose, expression, and shape consistency of the generated images.
The model without GAN (W/O GAN) performs close to AmazingFT in SSIM and Pose but shows poorer results in FID, Expression, and Shape, indicating that GAN plays a crucial role in enhancing image quality and feature consistency. The model without Transformer (W/O Transformer) has metrics similar to the W/O GAN model but still falls short of the complete AmazingFT model, indicating that the Transformer plays a significant role in feature transformation. The AmazingFT model exhibits the best overall performance, proving the critical roles of GAN and Transformer in improving face-swapping effects.
Table 4 presents the performance of two different face-swapping model configurations across multiple evaluation metrics: a model without multi-scale (No Multi-Scale) and a model with multi-scale (Multi-Scale). The table shows that the Multi-Scale model performs the best across all metrics, with the highest SSIM value (0.84) and the lowest FID value (28.45). The Pose, Expression, and Shape values are also the lowest, at 2.95, 25.44, and 40.32, respectively. In contrast, the No Multi-Scale model performs slightly less well across all metrics. Although its SSIM value is also high (0.83), close to that of the Multi-Scale model, it falls short in other metrics, particularly in FID (29.14) and Expression (26.11), indicating a slight deficiency in image quality and expression consistency. This suggests that the multi-scale design plays a crucial role in enhancing feature extraction, resulting in superior visual effects and consistency in the generated face-swapped images. Overall, the Multi-Scale model demonstrates better comprehensive performance, proving the significant impact of the multi-scale strategy in improving face-swapping effects.
Table 5 shows the performance of different groups in the face-swapping task when using varying numbers of multi-scale models. By comparing the performance of different groups, it is evident that Group C, which uses 4 multi-scale models, performs the best in SSIM, Pose, and Expression metrics, proving the correctness of our choice of 4 multi-scale models. Specifically, Group C has the highest SSIM value, the lowest Pose value, and the lowest Expression value, indicating the best expression retention. In contrast, Groups A and B, which use fewer multi-scale models, improve image quality to some extent but do not achieve the same results as Group C. Groups D and E, while further increasing the number of multi-scale models, do not show performance improvements as significant as those of Group C. In fact, their SSIM and Pose metrics even decrease, possibly due to overfitting or increased computational complexity from too many multi-scale models. This ablation experiment verifies that choosing 4 multi-scale models can maintain high image quality while maximizing pose and expression consistency, demonstrating the superiority and rationality of this configuration.

4.6. Analysis of the Roles of GAN and Transformer

AmazingFT demonstrates exceptional performance in the face-swapping task, particularly in detail handling and feature fusion. From the results of the ablation experiments, it is evident that the face-swapping effect significantly deteriorates when either GAN or Transformer is removed, as shown in Figure 6.
In the absence of GAN, the generated images exhibit noticeable distortion and blurriness in both details and overall structure. The face-swapping results appear flat and lack detail, with facial features not being sharp enough. The generated images lack texture and realism in detail, and the overall visual effect is inferior to that of the complete AmazingFT model. Without the Transformer, the generated images show issues in feature mapping and fusion, where the features of the source face are not accurately mapped onto the target face, resulting in unnatural and inconsistent facial features. The subtle differences in facial expressions and structures are not well preserved and reproduced, highlighting the critical role of the Transformer in capturing and transforming features.
Using the complete AmazingFT model, the face-swapping results achieve optimal levels in facial detail, feature mapping, and overall visual effect. The details of the source face, such as the eyes and mouth, are accurately preserved and naturally integrated into the target face. The features of the source face are well-presented and fused on the target face, resulting in highly realistic and natural-looking generated images.

4.7. Facial Detail Performance

The synthesized results of the target face and source face appear very natural and exhibit detailed refinement, as shown in Figure 7. The images demonstrate that the swapped faces maintain the overall facial structure and pose of the target face while successfully incorporating the features of the source face. For instance, in the first set of images, the facial features of the source face, such as the eyes and mouth, are well preserved and presented in the result image, naturally blending with the facial structure of the target face to form a realistic new face.
Close-up details of the eyes and mouth show remarkable refinement. The eye details, such as the eyelids, pupils, and subtle differences in the sclera, are handled very naturally without noticeable edge artifacts. The mouth details are equally well processed, with the shape and texture of the lips being well preserved and seamlessly integrated with the skin tone and texture of the surrounding facial areas. These details highlight the high precision and quality of AmazingFT in handling facial features.
By observing different sets of images, it is evident that regardless of the differences in facial features between the target face and the source face, AmazingFT consistently achieves high-quality face-swapping results while maintaining naturalness and realism. This demonstrates the model’s strong capabilities in feature extraction, feature fusion, and detail processing, enabling it to adapt to various facial features and lighting conditions, thereby generating highly realistic and natural-looking face-swapped images.

4.8. Swapping with the Same Source Faces

By analyzing the face-swapping results of the same source face with different target faces in Figure 8, it is evident that regardless of facial expressions, lighting conditions, or different target faces, the results consistently maintain the primary features of the source face while naturally blending with the facial structure of the target face. In each image, the main features of the source face (such as the eyes, nose, and mouth) are accurately mapped onto the target face, and the expressions and poses appear very natural, demonstrating the model’s strong capabilities in capturing dynamic features and detail processing.
For example, in the first set of results, the eye and mouth features of the source face are reproduced very realistically on the target face. The second set of results maintains the identity features of the source face while ensuring consistent detail processing across different target faces. This indicates that the model can retain the main features of the source face on various target faces while adapting to different facial structures and expressions.
The third set of results showcases the face-swapping effect of the source face under different lighting conditions. The features of the source face are well-preserved and presented on the target face, particularly with natural handling of lighting and shadows. The fourth set of results demonstrates the adaptability of the source face on different target faces, showing that regardless of the target face’s gender, hairstyle, or facial features, the primary features of the source face naturally integrate with the target face.

5. Future Work

Despite the multiple innovative contributions made in this paper, including the long-term feature relationship representation prediction module and the separate prediction modules for video and audio streams, these innovations have been somewhat diminished by the simplicity of the audio stream design. Specifically, the model currently relies on MFCC features, commonly used in speech detection, which may constrain the effectiveness of the predictions and lead to suboptimal results. Future research should focus on improving the feature extraction methods for audio streams, exploring more advanced feature representation technologies to enhance the model’s ability to capture audio features. This would help increase the accuracy and naturalness of face-swapping results, further improving overall performance. Additionally, integrating more sophisticated audio-processing modules could address the limitations in handling complex speech features and compensate for the current method’s shortcomings in audio stream processing.

6. Conclusions

This paper presents an advanced face-swapping model based on Generative Adversarial Networks (GANs) and Transformers—AmazingFT. Through detailed experiments and multiple evaluation metrics, AmazingFT has demonstrated outstanding performance in face-swapping tasks. AmazingFT excels in naturalness, consistency, detail preservation, expression conveyance, and lighting consistency, producing images that are more realistic and natural. We trained the model on the high-quality CelebA-HQ dataset and incorporated meticulous steps for face detection, face alignment, and image preprocessing to ensure the model’s efficiency and stability. By leveraging the collaborative work of multi-scale generators and discriminators, AmazingFT successfully generates high-resolution face-swapping images, showcasing its immense potential and broad prospects for practical applications.

Author Contributions

Conceptualization, D.T. and L.L.; Methodology, D.T. and W.S.; Writing—Original Draft, D.T. and W.S.; Writing—Review and Editing, L.L. and D.T.; Visualization, D.T. and W.S.; Supervision, L.L., D.T., W.S. and Z.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original data presented in the study are openly available in the CelebA-HQ dataset at https://www.kaggle.com/datasets/lamsimon/celebahq (accessed on 22 May 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Creswell, A.; White, T.; Dumoulin, V.; Arulkumaran, K.; Sengupta, B.; Bharath, A.A. Generative adversarial networks: An overview. IEEE Signal Process. Mag. 2018, 35, 53–65.
  2. Nirkin, Y.; Masi, I.; Tuan, A.T.; Hassner, T.; Medioni, G. On face segmentation, face swapping, and face perception. In Proceedings of the 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), Xi’an, China, 15–19 May 2018; pp. 98–105.
  3. Li, Y.; Ma, C.; Yan, Y.; Zhu, W.; Yang, X. 3D-aware face swapping. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 12705–12714.
  4. Zhao, W.; Rao, Y.; Shi, W.; Liu, Z.; Zhou, J.; Lu, J. DiffSwap: High-fidelity and controllable face swapping via 3D-aware masked diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 8568–8577.
  5. Korshunova, I.; Shi, W.; Dambre, J.; Theis, L. Fast face-swap using convolutional neural networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 3677–3685.
  6. Liu, K.; Perov, I.; Gao, D.; Chervoniy, N.; Zhou, W.; Zhang, W. DeepFaceLab: Integrated, flexible and extensible face-swapping framework. Pattern Recognit. 2023, 141, 109628.
  7. Chen, R.; Chen, X.; Ni, B.; Ge, Y. SimSwap: An efficient framework for high fidelity face swapping. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 2003–2011.
  8. Frick, R.A.; Steinebach, M. One Detector to Rule Them All? On the Robustness and Generalizability of Current State-of-the-Art Deepfake Detection Methods. Electron. Imaging 2024, 36, 1–6.
  9. Shiohara, K.; Yang, X.; Taketomi, T. BlendFace: Re-designing identity encoders for face-swapping. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 7634–7644.
  10. Han, K.; Xiao, A.; Wu, E.; Guo, J.; Xu, C.; Wang, Y. Transformer in transformer. Adv. Neural Inf. Process. Syst. 2021, 34, 15908–15919.
  11. Liu, Z.; Luo, P.; Wang, X.; Tang, X. Large-scale CelebFaces Attributes (CelebA) dataset. Retrieved August 2018, 15, 11.
  12. Rossler, A.; Cozzolino, D.; Verdoliva, L.; Riess, C.; Thies, J.; Nießner, M. FaceForensics++: Learning to detect manipulated facial images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1–11.
  13. Nirkin, Y.; Keller, Y.; Hassner, T. FSGAN: Subject agnostic face swapping and reenactment. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 7184–7193.
  14. Nirkin, Y.; Keller, Y.; Hassner, T. FSGANv2: Improved Subject Agnostic Face Swapping and Reenactment. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 560–575.
  15. Liu, Z.; Li, M.; Zhang, Y.; Wang, C.; Zhang, Q.; Wang, J.; Nie, Y. Fine-grained face swapping via regional GAN inversion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 8578–8587.
  16. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805.
  17. Liu, X.; Zheng, Y.; Du, Z.; Ding, M.; Qian, Y.; Yang, Z.; Tang, J. GPT Understands, Too. Available online: https://www.sciencedirect.com/science/article/pii/S2666651023000141?via%3Dihub (accessed on 2 July 2024).
  18. Han, K.; Wang, Y.; Chen, H.; Chen, X.; Guo, J.; Liu, Z.; Tao, D. A survey on vision transformer. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 87–110.
  19. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255.
  20. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable transformers for end-to-end object detection. arXiv 2020, arXiv:2010.04159.
  21. Sun, C.; Myers, A.; Vondrick, C.; Murphy, K.; Schmid, C. VideoBERT: A joint model for video and language representation learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 7464–7473.
  22. Sengupta, A.; Ye, Y.; Wang, R.; Liu, C.; Roy, K. Going deeper in spiking neural networks: VGG and residual architectures. Front. Neurosci. 2019, 13, 95.
  23. Zhang, P.; Zhang, B.; Chen, D.; Yuan, L.; Wen, F. Cross-domain correspondence learning for exemplar-based image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 5143–5153.
  24. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826.
  25. King, D.E. Dlib-ml: A machine learning toolkit. J. Mach. Learn. Res. 2009, 10, 1755–1758.
  26. Baltrušaitis, T.; Robinson, P.; Morency, L.P. OpenFace: An open source facial behavior analysis toolkit. In Proceedings of the 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Placid, NY, USA, 7–10 March 2016; pp. 1–10.
Figure 1. The face-swapping results generated by AmazingFT. The swapped face results replace the face in the target image with the face from the source image.
Figure 2. Overall architecture of our network.
Figure 4. Comparison experiments of AmazingFT with other SOTA methods.
Figure 6. Comparison experiments of models with and without GAN and Transformer.
Figure 7. Detail rendering of eyes and mouth.
Figure 8. Comparison experiments of the same source face with different target faces.
Table 1. Quantitative experiments of different SOTA methods.

Method             SSIM ↑        FID ↓        Pose ↓       Expression ↓   Shape ↓
FaceSwap           0.73 ± 0.04   35.61 ± 3    4.28 ± 0.3   35.33 ± 1.0    42.61 ± 2
DeepFaceLab        0.75 ± 0.02   32.43 ± 4    4.31 ± 0.2   34.23 ± 0.5    43.67 ± 1
SimSwap            0.79 ± 0.01   33.48 ± 2    4.54 ± 0.1   32.45 ± 1.2    43.78 ± 2
Inswapper          0.74 ± 0.01   31.95 ± 3    3.23 ± 0.3   34.32 ± 2.0    42.92 ± 3
BlendSwap          0.80 ± 0.02   30.28 ± 3    3.04 ± 0.1   25.56 ± 1.2    41.55 ± 1
AmazingFT (ours)   0.84 ± 0.01   28.45 ± 1    2.95 ± 0.1   25.44 ± 0.7    40.32 ± 1
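As a point of reference for the SSIM and FID columns in Table 1, the sketch below shows one common way to compute these two metrics with skimage and torchmetrics; the exact evaluation settings used for the paper (image resolution, sample count, Inception feature layer) are assumptions here and may differ.

```python
# Illustrative SSIM/FID computation for results like those in Table 1.
# The evaluation settings below are assumptions, not the paper's exact protocol.
import torch
from skimage.metrics import structural_similarity as ssim
from torchmetrics.image.fid import FrechetInceptionDistance  # requires torchmetrics[image]

def mean_ssim(target_images, swapped_images):
    """Paired lists of HxWx3 uint8 numpy arrays; returns the average SSIM."""
    scores = [ssim(t, s, channel_axis=-1, data_range=255)
              for t, s in zip(target_images, swapped_images)]
    return sum(scores) / len(scores)

def fid_score(real_batch: torch.Tensor, fake_batch: torch.Tensor) -> float:
    """uint8 tensors of shape (N, 3, H, W); returns the Fréchet Inception Distance."""
    fid = FrechetInceptionDistance(feature=2048)   # 2048-dim Inception features (assumed)
    fid.update(real_batch, real=True)
    fid.update(fake_batch, real=False)
    return fid.compute().item()
```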
Table 2. Subjective ratings.

Method             Naturalness ↑   Consistency ↑   Detail Preservation ↑   Expression Conveyance ↑   Lighting Consistency ↑
FaceSwap           2.60            3.58            3.83                    3.63                      2.73
DeepFaceLab        2.79            3.05            4.03                    2.65                      3.76
SimSwap            3.64            3.96            3.20                    3.97                      3.41
Inswapper          3.61            3.56            3.92                    4.08                      2.94
BlendSwap          3.68            3.98            3.82                    4.16                      3.94
AmazingFT (ours)   4.12            4.08            4.06                    4.33                      4.07
Table 3. Ablation study of models with and without GAN and Transformer.

Method            SSIM ↑        FID ↓        Pose ↓        Expression ↓   Shape ↓
W/O GAN           0.81 ± 0.01   30.63 ± 1    2.98 ± 0.08   26.42 ± 1.2    42.57 ± 2
W/O Transformer   0.80 ± 0.01   31.22 ± 2    3.01 ± 0.10   27.33 ± 1.6    41.39 ± 1
AmazingFT         0.84 ± 0.01   28.45 ± 1    2.95 ± 0.10   25.44 ± 0.7    40.32 ± 1
Table 4. Ablation study of models with and without multi-scale.

Method           SSIM ↑        FID ↓        Pose ↓       Expression ↓   Shape ↓
No Multi-Scale   0.83 ± 0.01   29.14 ± 2    2.99 ± 0.1   26.11 ± 0.5    40.95 ± 1
Multi-Scale      0.84 ± 0.01   28.45 ± 1    2.95 ± 0.1   25.44 ± 0.7    40.32 ± 1
Table 5. Ablation study on the number of multi-scales.

Group   Number of Multi-Scales   SSIM ↑        Pose ↓       Expression ↓
A       2                        0.78 ± 0.03   6.29 ± 0.1   28.69 ± 0.9
B       3                        0.78 ± 0.02   5.26 ± 0.2   26.73 ± 0.6
C       4                        0.84 ± 0.01   2.95 ± 0.1   25.44 ± 0.7
D       5                        0.82 ± 0.01   3.58 ± 0.1   26.45 ± 0.8
E       6                        0.79 ± 0.01   4.18 ± 0.2   26.11 ± 0.7
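To make the "number of multi-scales" hyperparameter in Table 5 concrete, the sketch below shows a pix2pixHD-style multi-scale patch discriminator in which the number of scales is a constructor argument; this is an illustrative design under that assumption, not the actual AmazingFT discriminator.

```python
# Sketch of a pix2pixHD-style multi-scale discriminator, shown only to illustrate
# how the number of scales (cf. Table 5) can be exposed as a hyperparameter.
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    """A small convolutional discriminator that outputs a patch-level real/fake map."""
    def __init__(self, in_ch: int = 3, base_ch: int = 64):
        super().__init__()
        layers, ch = [nn.Conv2d(in_ch, base_ch, 4, 2, 1), nn.LeakyReLU(0.2, True)], base_ch
        for _ in range(3):                               # three downsampling blocks
            layers += [nn.Conv2d(ch, ch * 2, 4, 2, 1),
                       nn.InstanceNorm2d(ch * 2),
                       nn.LeakyReLU(0.2, True)]
            ch *= 2
        layers += [nn.Conv2d(ch, 1, 4, 1, 1)]            # per-patch score map
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

class MultiScaleDiscriminator(nn.Module):
    """Runs one PatchDiscriminator per scale on progressively downsampled inputs."""
    def __init__(self, n_scales: int = 4, in_ch: int = 3):
        super().__init__()
        self.discs = nn.ModuleList([PatchDiscriminator(in_ch) for _ in range(n_scales)])
        self.down = nn.AvgPool2d(3, stride=2, padding=1)

    def forward(self, x):
        outputs = []
        for disc in self.discs:
            outputs.append(disc(x))
            x = self.down(x)                             # next discriminator sees a coarser view
        return outputs
```

In designs of this kind, the adversarial losses from all scales are typically combined, so coarse discriminators encourage globally consistent structure while fine ones penalize local artifacts.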
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
