Article

Mutual Effects of Face-Swap Deepfakes and Digital Watermarking—A Region-Aware Study

by
Tomasz Walczyna
* and
Zbigniew Piotrowski
Electronics and Telecommunications Faculty, Military University of Technology, 00-908 Warsaw, Poland
*
Author to whom correspondence should be addressed.
Sensors 2025, 25(19), 6015; https://doi.org/10.3390/s25196015
Submission received: 2 September 2025 / Revised: 29 September 2025 / Accepted: 29 September 2025 / Published: 30 September 2025
(This article belongs to the Special Issue Digital Image Processing and Sensing Technologies—Second Edition)

Highlights

What are the main findings?
  • Region-aware evaluation across visible and invisible watermarks with tunable strength and six face-swap families shows that edits are non-local and non-monotonic—background changes introduced by generators even degrade watermarks that are far from the face, and retention does not vary linearly with embedding strength.
  • A locality-preserving baseline bounds the minimal impact—architectures that better confine edits to the facial region, typically those with segmentation-weighted objectives, preserve background watermark signal more reliably than globally trained GAN pipelines.
What are the implications of the main findings?
  • Classical robustness tests for watermarking are not sufficient on their own—evaluation should include generator-induced transformations from face swap and report region-wise metrics for face and background.
  • Watermark strength and placement should be selected in an architecture-aware manner—in our sweeps, appropriately tuned invisible marks achieved higher background correlation under manipulation than visible overlays at comparable perceptual distortion.

Abstract

Face swapping is commonly assumed to act locally on the face region, which motivates placing watermarks away from the face to preserve the integrity of the mark. We demonstrate that this assumption is violated in practice. Using a region-aware protocol with tunable-strength visible and invisible watermarks and six face-swap families, we quantify both identity transfer and watermark retention on the VGGFace2 dataset. First, edits are non-local—generators alter background statistics and degrade watermarks even far from the face, as measured by background-only PSNR and Pearson correlation relative to a locality-preserving baseline. Second, dependencies between watermark strength, identity transfer, and retention are non-monotonic and architecture-dependent. Methods that better confine edits to the face—typically those employing segmentation-weighted objectives—preserve background signal more reliably than globally trained GAN pipelines. At comparable perceptual distortion, invisible marks tuned to the background retain higher correlation with the background than visible overlays. These findings indicate that classical robustness tests are insufficient alone—watermark evaluation should report region-wise metrics and be strength- and architecture-aware.

1. Introduction

Digital image processing and computer vision increasingly underpin sensing pipelines in which images are captured, processed, and authenticated at scale. In this context, multimedia security—including watermarking—intersects with CV tasks such as detection and generative manipulation [1,2,3,4,5,6]. We conduct a controlled, region-aware study of two watermark types across six face-swap families, including a locality-preserving baseline, to quantify both identity transfer and watermark retention.
Beyond achieving this objective, four specific contributions are provided to the multimedia security community:
  • A two-sided, region-aware evaluation protocol that quantifies both identity transfer and watermark retention;
  • Empirical evidence that generator edits are non-local and that the relationship between watermark strength, identity transfer, and retention is non-monotonic, challenging the common assumption that placing a mark away from the face suffices;
  • An architecture-aware analysis showing that methods which better confine edits to the facial region—typically those leveraging segmentation-weighted objectives—preserve background watermark signal more reliably than globally trained GAN pipelines;
  • Practical guidance for robustness evaluation in sensing workflows, indicating when tuned invisible marks retain more background correlation than visible overlays at comparable perceptual impact.
Classical watermark robustness studies primarily evaluate compression, resampling, and noise [7,8,9,10], whereas face-swap research focuses on optimizing identity transfer and realism [11,12,13,14,15,16]. At the same time, watermarking has been explored in broader multimedia domains, including 3D cultural heritage protection, where subtle geometric alterations serve as a robust and imperceptible watermarking strategy [17]. Such works highlight the cross-domain importance of balancing imperceptibility, robustness, and usability—challenges that are echoed in the image-based setting studied here. Recent studies have also begun to explore the interaction between watermarking and generative models such as GANs and diffusion networks. For example, Li et al. [18] proposed a concealed attack using generative models and perceptual loss, showing that GAN-based attacks can actively compromise watermark readability, while steganography and color-conversion approaches highlight how global changes in image statistics can compromise embedded signals. Similarly, diffusion-based pipelines introduce multi-scale transformations that may unintentionally distort watermark energy across both local and background regions. These findings motivate a region-aware analysis of watermark robustness under modern generative transformations. Prior work rarely couples region-aware analysis with a comparison of visible and invisible marks; this study addresses this gap with controlled sweeps on VGGFace2 and a locality baseline [19]. Still images and face swap are taken as the manipulation type, which aligns with common image-centric sensing scenarios.
To enable controlled comparisons, parameterizable variants of the watermarking and face-swap methods were implemented for the experiments. Although other manipulation families (reenactment, lip animation, full-face synthesis) may also affect watermarking, face swap is selected as the primary case study to establish a precise reference and a dedicated baseline for localized edits.
Next, the watermarking and face-swap methods, the region-aware protocol, and the results are outlined, followed by implications for robustness evaluation.

2. Materials and Methods

2.1. Watermark

A digital watermark is identification information intentionally incorporated into an image, sound, video, or document that remains associated with the file during processing. Its basic functions include confirming the source or owner of the content, tracking its distribution, and, in some systems, user authorization. A watermark can be visible, such as a semi-transparent logo that discourages unauthorized use, or hidden, embedded in the spatial or frequency domain in a way that is invisible to the viewer but can be read after typical editing operations such as compression, scaling, or cropping [10,20].
The key features of a well-designed watermark include robustness, imperceptibility to the end user, capacity to store additional bits, and security against counterfeiting. In practice, this requires a compromise: the more robust the mark, the greater the interference with the data and the potential deterioration in quality; the more discreet it is, the more difficult it is to ensure its readability after aggressive processing. In the case of publicly published content, a hybrid approach is often employed—a visible logo deters simple copying, while a hidden identifier facilitates the enforcement of rights in the event of a dispute [21].
This analysis considers two extreme variants of watermarking—visible and invisible—as they represent two basic content protection strategies, directly noticeable or completely invisible to the end user. The research does not focus on preserving the semantic content encoded in the watermark, but on assessing its resistance to DeepFake-type modifications.

2.1.1. Visible Watermark

The analysis employed an explicit watermark in the form of a QR code spanning the entire frame. This solution ensures the uniform distribution of the mark’s pixels in the image and eliminates the risk of omitting any area when assessing the marking’s impact. During preliminary experiments, alternative visible patterns, such as uniform overlays and small localized marks, were also tested. However, these proved less consistent for systematic evaluation: homogeneous marks often interfered with face detection and alignment, while localized marks were difficult to analyze statistically because their placement did not always overlap with the manipulated region. By contrast, the QR code provides both global coverage and interleaved transparent areas, making it a balanced choice that captures the types of distortions observed with other patterns while remaining analyzable across all scenarios.
For clarity, a single example—a synthetic face image—is presented in three variants: (a) reference image without a watermark, (b) the difference between the image with a watermark and the reference image, and (c) a composition of both images with a selected level of transparency (Figure 1a–c). This presentation enables the evaluation of the degree to which an explicit watermark affects the image’s details, even before DeepFake methods are applied.
The impact of transparency on image distortion was determined using two commonly used quality metrics: PSNR (Peak Signal-to-Noise Ratio) and SSIM (Structural Similarity Index) [22]. Figure 2 illustrates the dependence of these metrics on the opacity parameter, which ranges from 0 to 1 and corresponds to the level of watermark visibility. The observed curves illustrate a decrease in image quality as the visibility of the mark increases. These graphs serve as a reference point in both the subsequent sections and the experimental part, where the impact of different transparency levels on the effectiveness of DeepFake methods is analyzed.
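For illustration, the alpha blending of the visible mark and the quality sweep over opacity can be expressed as follows. This is a minimal sketch, assuming a pre-rendered QR pattern of the same size as the image and using scikit-image for the metrics; function names are illustrative rather than taken from the actual experimental code.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def blend_visible_watermark(image: np.ndarray, qr: np.ndarray, opacity: float) -> np.ndarray:
    """Alpha-blend a full-frame QR pattern into the image.

    image, qr: float arrays in [0, 1] with identical shape (H, W, 3).
    opacity:   0.0 (mark absent) to 1.0 (fully opaque mark).
    """
    return (1.0 - opacity) * image + opacity * qr

def quality_vs_opacity(image: np.ndarray, qr: np.ndarray, opacities) -> list:
    """Return (opacity, PSNR, SSIM) of the marked image versus the clean one."""
    results = []
    for a in opacities:
        marked = blend_visible_watermark(image, qr, a)
        psnr = peak_signal_noise_ratio(image, marked, data_range=1.0)
        ssim = structural_similarity(image, marked, channel_axis=-1, data_range=1.0)
        results.append((a, psnr, ssim))
    return results

# Example sweep (opacity 0 would yield infinite PSNR, so it is excluded):
# curve = quality_vs_opacity(image, qr, np.linspace(0.05, 1.0, 20))
```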

2.1.2. Invisible Watermark

The second option analyzed is a hidden watermark embedded using a neural network. Its purpose is to remain completely invisible to the recipient while remaining resistant to typical image processing operations. Many methods of this type have been described in the literature; however, in most cases, it is not possible to precisely control the strength of the interference [7,8]. In studies focused solely on assessing the effectiveness of watermark reading or its impact on a selected task (e.g., classification) [9], such a limitation may be acceptable. However, in a broader analysis—especially when the goal is to generalize the results to an entire group of algorithms (in our case, local face replacement)—it can significantly complicate interpretation.
For this reason, a proprietary model has been developed that allows for smooth adjustment of the watermark signal amplitude—from virtually undetectable to deliberately visible. This allows for a precise examination of the relationship between the strength of the mark and its susceptibility to local modifications, such as DeepFakes.
The designed architecture is based on the classic encoder–decoder approach. The encoder receives an image and a message to be embedded, and produces a watermark matched to that image. The generated watermark, controlled by the “watermark strength” parameter, is then added to the original image. The resulting image can be manipulated in any way (e.g., face swap), and its degraded version is sent to the decoder, whose task is to recover the encoded message. The invisible watermark used here is purposefully a controllable test instrument rather than a proposed state-of-the-art algorithm. Its novelty for this paper lies in its practicality: the strength parameter is continuously tunable, which enables calibrated sweeps that isolate how watermark energy interacts with generator-induced transforms. This controllability is required to compare visible vs. invisible marks under identical experimental conditions and is not intended as a claim of algorithmic novelty in watermarking.
The encoder was built on a modified U-Net architecture [23], equipped with FiLM (Feature-wise Linear Modulation) layers [24], which enable the injection of message information at different resolution levels. Residual connections have also been added [25] between successive U-Net levels, which improves gradient flow—a crucial aspect in architectures where part of the cost function is calculated only after the decoder. The encoder output is transformed by a tanh function (with a range of −1 to 1) and then scaled by the “watermark strength” parameter (default: 0.1). The default value of 0.1 was adopted experimentally as the midpoint of the strength range used in the training process. It ensured a clear yet moderate level of interference with the image, allowing for tests of both greater subtlety and higher visibility of the mark.
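The embedding step itself reduces to adding a bounded, scaled residual to the input image. The following is a minimal sketch under the assumption of an encoder module `encoder(image, message)` that returns a full-resolution residual; the FiLM-conditioned U-Net and the decoder are omitted, and the clamping to a valid image range is an assumption rather than a documented detail.

```python
import torch
import torch.nn as nn

def embed_watermark(encoder: nn.Module,
                    image: torch.Tensor,     # (B, 3, 128, 128), values in [0, 1]
                    message: torch.Tensor,   # (B, 64), message bits as floats in {0, 1}
                    strength: float = 0.1) -> torch.Tensor:
    """Additive embedding: residual bounded by tanh and scaled by the strength knob."""
    residual = torch.tanh(encoder(image, message))  # bounded to [-1, 1]
    watermarked = image + strength * residual
    return watermarked.clamp(0.0, 1.0)              # assumed: keep a displayable range
```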
To increase the generality of the model and avoid situations where the network hides the watermark only in selected locations, a set of random perturbations was used during training: Gaussian noise, motion blur, Gaussian blur, brightness and contrast changes, resized crop, and random erasing. These were not intended to teach resistance to specific attacks (e.g., face swap) but to force the even distribution of the watermark throughout the image.
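A sketch of such a perturbation stack is shown below, using standard torchvision transforms; the parameter values are assumptions, and motion blur (not available in core torchvision) is omitted.

```python
import torch
from torchvision import transforms as T

# Illustrative training-time degradation stack; values are assumptions, not the
# exact settings used for the model described above.
perturbations = T.Compose([
    T.RandomResizedCrop(128, scale=(0.8, 1.0)),
    T.ColorJitter(brightness=0.2, contrast=0.2),
    T.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0)),
    T.RandomErasing(p=0.5, scale=(0.02, 0.1)),
])

def perturb(marked: torch.Tensor) -> torch.Tensor:
    """Randomly degrade a single watermarked image tensor of shape (3, 128, 128)."""
    noisy = perturbations(marked)
    # Additive Gaussian noise applied as a separate step
    return (noisy + 0.02 * torch.randn_like(noisy)).clamp(0.0, 1.0)
```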
The decoder is based on the ResNet architecture [26], whose task is to reduce a tensor containing a degraded image to a vector representing the encoded message.
The learning process involved two primary components of the cost function. The first concerned the correctness of message reading by the decoder:
$$L_{\mathrm{msg}} = \mathrm{BCEWithLogitsLoss}(z, m)$$
where $z$ is the output tensor from the decoder and $m$ is the encoded message.
The second component was responsible for minimizing the visibility of the watermark. For this purpose, a combination of mean square error (MSE) and LPIPS metrics was used [27], better reflecting the difference between images as perceived by humans:
$$L_{\mathrm{vis}} = \mathrm{MSE}(\hat{x}, x) + 0.2 \cdot \mathrm{LPIPS}(\hat{x}, x)$$
where $\hat{x}$ is the image with the watermark and $x$ is the original image.
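Taken together, the two terms can be computed as in the sketch below, assuming the publicly available lpips package and PyTorch; the relative weighting of the two components in the overall training objective is not specified here, and the names are illustrative.

```python
import torch
import torch.nn.functional as F
import lpips  # pip install lpips

lpips_fn = lpips.LPIPS(net="alex")  # perceptual distance; backbone choice is an assumption

def watermark_losses(logits: torch.Tensor, message: torch.Tensor,
                     marked: torch.Tensor, original: torch.Tensor,
                     vis_weight: float = 0.2):
    """Compute L_msg and L_vis as defined above.

    logits:   decoder output, shape (B, 64)
    message:  ground-truth bits in {0, 1} as floats, shape (B, 64)
    marked:   watermarked image in [0, 1], shape (B, 3, H, W)
    original: clean image in [0, 1], shape (B, 3, H, W)
    """
    l_msg = F.binary_cross_entropy_with_logits(logits, message)
    # LPIPS expects inputs scaled to [-1, 1]
    perceptual = lpips_fn(marked * 2.0 - 1.0, original * 2.0 - 1.0).mean()
    l_vis = F.mse_loss(marked, original) + vis_weight * perceptual
    return l_msg, l_vis
```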
All images were scaled to a resolution of 128 × 128 pixels, which is lower than typical in practical applications. This limitation resulted from the computational requirements associated with training multiple deepfake networks and a watermarking model in real time, given the available hardware resources. For the comparative analysis, maintaining consistent experimental conditions was prioritized over achieving a higher image resolution. The message length was fixed at 64 bits, generated randomly during training. VGGFace2 [28] served as the dataset, containing photos of different people, which ensured consistency with the rest of the experiments.
In the case of this model, the key indicator in the context of comparative analysis is not the fact that the message was read correctly (full effectiveness was achieved during training), but the impact of the “strength” parameter of the watermark on its actual visibility and level of interference with the image.
Figure 3 illustrates an example of an image in the default configuration (watermark strength = 0.1) along with its corresponding difference map. Figure 4 illustrates the impact of the “strength” parameter on PSNR and SSIM metrics compared to the original image [8].

2.2. Face Swap

Face swap is a class of generative algorithms whose goal is to insert a source face into a target image or video in a way that is believable to a human observer, with no visible traces of modification [11,29]. The typical process includes: face detection and alignment, extraction of its semantic representation (embedding), reconstruction or conditional generation of a new texture, and applying it with a blending mask to the output frame [11].
Although theoretically the modification should be limited to the face area, in practice many models—especially those based on generative adversarial networks (GANs)—also affect the background, lighting, and global color statistics. There are many reasons for this behavior, ranging from the nature of the cost functions used to the specifics and complexity of the model architecture.
In the context of watermarking, this leads to two significant consequences. First, even local substitution can unintentionally distort the signal hidden throughout the image, reducing the effectiveness of invisible watermarking techniques. Second, if the visible watermark is located near a face or its pattern resembles an image artifact, the model may attempt to “correct” it, resulting in reduced legibility of the mark.
For further analysis, popular end-to-end networks such as SimSwap [12] and FaceShifter [13] were selected, as well as newer designs incorporating additional segmentation models, key point generation (Ghost [14], FastFake [15]), or closed solutions—InsightFace [16]. Each of these methods controls the scope of editing differently and achieves a different compromise between photorealism and precise control of the modification region.
In some cases, it was necessary to reimplement or adapt the models to meet the established experiment criteria, which may result in slight differences from the results presented by the authors of the original algorithms. Where possible, the same architectures and cost functions as in the original implementations were retained.
To establish a reference point, a proprietary reference method was also developed, based on classic inpainting in the face segmentation mask area. This method edits only the face region, preserving the background pixels, which allows for estimating the minimal impact of a perfectly localized face swap on the watermark. A comparison of these approaches enables the determination of the extent to which modern, intensely trained models interfere with image content outside the target modification area and allows the theoretical reasons for this behavior to be presented.
While our experiments operate on still images, this setup mirrors the inference mode of many production face-swap pipelines, which process frames independently and compose them into a video. Temporal stabilization (when present) is typically added as a post hoc stage or auxiliary loss and does not alter the core per-frame identity transfer mechanism analyzed in this study. The non-local background effects reported in the results are therefore expected to persist in video, with temporal consistency potentially modulating their magnitude rather than their nature.

2.2.1. SimSwap

SimSwap [12] is one of the first publicly available architectures that enable identity swapping for arbitrary pairs of faces without requiring the network to be retrained. It combines the simplicity of a single encoder–decoder–GAN setup with the ability to work in many-to-many mode.
The key element of SimSwap is the ID Injection module. After encoding the target frame, the identity vector from the source—obtained from a pre-trained ArcFace model [30]—is injected into the deep layers of the generator using Adaptive Instance Normalization (AdaIN) blocks [31].
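The AdaIN operation underlying this injection can be summarized as follows. This is a minimal sketch of the mechanism only, assuming per-channel scale and bias predicted from the identity embedding by a small network (not shown); it is not SimSwap’s exact implementation.

```python
import torch

def adain(content: torch.Tensor, id_scale: torch.Tensor, id_bias: torch.Tensor,
          eps: float = 1e-5) -> torch.Tensor:
    """Adaptive Instance Normalization: re-style per-channel statistics of a
    generator feature map using parameters derived from the identity embedding.

    content:           (B, C, H, W) feature map of the target frame
    id_scale, id_bias: (B, C) affine parameters predicted from the ArcFace vector
    """
    mean = content.mean(dim=(2, 3), keepdim=True)
    std = content.std(dim=(2, 3), keepdim=True) + eps
    normalized = (content - mean) / std
    return normalized * id_scale[..., None, None] + id_bias[..., None, None]
```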
To preserve the facial expressions, pose, and lighting of the target image, the authors introduced Weak Feature Matching Loss, which compares the deep discriminator representations of the generated image with those of the target image. This function promotes the consistency of visual attributes by treating the discriminator as a measure of realism rather than identity consistency.
Identity is enforced through a cost function based on the cosine distance between ArcFace embedding vectors. The realism of the generated images is improved by classic hinge-GAN loss and gradient penalty [32]. Additional Reconstruction Loss is activated when the source and target images depict the same person—in this case, the network learns to minimize changes in the image.
In practice, this combination of cost functions means that modifications are concentrated mainly in the face area, while the background and clothing elements remain largely unaffected. However, the lack of an explicit segmentation mask means that subtle color corrections may occur throughout the frame when there are substantial exposure changes or low contrast.

2.2.2. FaceShifter

FaceShifter [13] is a two-stage face swap network designed to preserve the identity of the source without requiring training for each pair. In the first phase (AEI-Net), the following components are combined: an identity embedding obtained from ArcFace [30] and multi-level attribute maps generated by a U-Net encoder.
Integration is achieved using the Adaptive Attentional Denormalization (AAD) mechanism, which dynamically determines whether a given feature fragment should originate from the embedding ID or the attribute maps. In addition to identity loss, adversarial loss, and reconstruction loss, the cost function also uses attribute loss, which enforces attribute consistency between the target image and the replaced image.
The lack of an explicit segmentation mask means that, in cases of significant differences in lighting or color, AAD can also modify the background, which, from a watermarking perspective, increases the risk of distorting the invisible watermark. At the same time, precise attention masks within AAD keep the primary energy of changes within the face.

2.2.3. Ghost

The authors of GHOST [14] presented a comprehensive, single-shot pipeline that covers all stages—from face detection to generation and super-resolution. However, only the GAN core is relevant in the context of this analysis. The basic architecture is a variation of the AEI-Net known from FaceShifter, but with several significant modifications.
Similarly to FaceShifter, the identity vector obtained from the ArcFace model is injected into the generator using Adaptive Attentional Denormalization (AAD) layers. A new feature is the use of an additional network targeting the eye region, along with a redesigned cost function—specifically, eye loss—which enables the stable reproduction of gaze direction in the generated image.
The second improvement is the adaptive blending mechanism, which dynamically expands or narrows the face mask based on the differences between the landmarks of the source and target images. This solution enhances the fit of the face shape and edges, thereby increasing the realism of the generated image.
In this study, the adaptive blending and super-resolution stages were omitted to focus solely on the analysis of pixel destruction introduced by the generator itself. Furthermore, some of the elements introduced in GHOST, such as adaptive blending, are not differentiable, which could disrupt training if a watermarking (tagging) model were trained with face swaps treated as a noise layer in the learning process.

2.2.4. FastFake

FastFake [15] is one of the newer examples of a lightweight GAN-based face swap, where the priority is fast and stable learning on small datasets rather than achieving photographic perfection in each frame. The core of the model is a generator with Adaptive Attentional Denormalization (AAD) blocks borrowed from FaceShifter [13], but the overall architecture was designed in the spirit of FastGAN [33], featuring fewer channels, a skip-layer excitation mechanism [34], and a discriminator capable of reconstructing images, which helps limit the phenomenon of mode collapse.
The key difference from the previously discussed models lies in the way segmentation is utilized. The authors include a mask from the BiSeNet network [35] only at the loss calculation stage—pixels outside the face area are sent to the reconstructive MSE, and features obtained from the parser are additionally blurred and compared with analogous maps of the generated image. As a result, the generator learns to ignore the background, because any unjustified change in color increases the loss value. During inference, the mask is no longer used, keeping the computation flow clean and fast.
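The background-preservation part of this objective can be illustrated as below. This is a hedged sketch assuming a binary face mask from a parser such as BiSeNet (1 inside the face, 0 outside); the blurred feature-map comparison is omitted, and the pixel-level term is a simplification of FastFake’s full loss.

```python
import torch

def background_preservation_loss(generated: torch.Tensor,
                                 target: torch.Tensor,
                                 face_mask: torch.Tensor) -> torch.Tensor:
    """Penalize any change outside the face region during training.

    generated, target: (B, 3, H, W) images
    face_mask:         (B, 1, H, W), 1.0 inside the face, 0.0 in the background
    """
    background = 1.0 - face_mask
    diff = (generated - target) * background
    # Normalize by the number of background pixels so the term is resolution-invariant
    return diff.pow(2).sum() / background.sum().clamp(min=1.0)
```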
From the perspective of analyzing the impact of DeepFake on watermarking, this approach has significant implications. The scope of FastFake interference is even narrower than in SimSwap or FaceShifter—global color statistics change minimally, which potentially favors the protection of watermarks placed outside the face area. In theory, the GAN cost function could still compete to some extent with the component that enforces background preservation, so it cannot be ruled out that the generator will harm unusual elements of the image, such as watermarks.
Thanks to its low data requirements and fast learning process, FastFake is a representative example of the “economical” branch of face swap methods, which will be compared with other models in terms of their impact on the durability and legibility of watermarks later in this article.

2.2.5. InsightFace

The InsightFace Team [16] has not published a formal article describing the Inswapper module; however, this model is widely used in open-source tools, including Deep-Live-Cam [36], and functions as an informal “market standard” in the field of face swapping.
Similarly to the previously discussed methods, Inswapper uses a pre-trained ArcFace model to determine the target’s identity. Although the implementation details are not fully known, a significant difference is the surrounding pipeline: InsightFace provides a complete SDK with its own face detection module and predefined cropping, which also includes arms and a portion of the background. If the detector does not detect a face or rates its quality below a certain threshold, the frame remains unchanged. In the context of watermarking, this means that elements outside the detected face mask can remain completely intact. This feature is also valuable for experiments—it allows the assessment of whether the degradation of the watermark is significant enough to prevent the image from being used by popular face swap algorithms.
For this paper, the analysis is limited to the generator block and the mandatory face detector, omitting subsequent stages of the pipeline, such as skin smoothing and super-resolution. In this context, Inswapper serves as a realistic but minimal attack: any violation of the watermark is solely the result of identity substitution—provided that the face is detected and passed on for processing—which reflects a typical use case in popular consumer tools.

2.2.6. Baseline

The last face swap algorithm analyzed is a proprietary reference method explicitly developed for this comparison. Although it does not achieve SOTA results on its own, it stands out with its local face replacement range and an interesting approach to separating information from sources of the same type. Unlike the previously discussed GAN models, the training process uses elements characteristic of currently popular diffusion models [37].
The algorithm consists of three main components:
  • U-Net—typical for diffusion models, responsible for removing the noise.
  • Identity encoder—compresses input data into a one-dimensional hidden space; receives a photo of the same person, but in a different shot, pose, or lighting.
  • Attribute encoder—also compresses data into a hidden space, but accepts the target image in its original form.
At the input, U-Net receives an image with a noisy face area (following the diffusion model approach) and a set of conditions: noise level, attribute vector, and identity vector.
The goal of the model is to recreate the input image based on additional information provided through conditioning. During inference, when a face of another person is fed to the identity encoder, the U-Net generates an image with the identity swapped, while preserving the pose, facial expressions, and lighting resulting from the attribute vector.
The key challenge is to motivate the model to utilize information from the identity encoder, rather than solely from the attribute encoder, which, in the absence of constraints, could contain all the data necessary for reconstruction. To prevent this, three modifications to the attribute vector were applied:
  • Masking—randomly zeroing fragments of a vector, which forces the model to draw information from the identity encoder, as they may be insufficient on their own.
  • Dropout—increases the dispersion of information in the vector, preventing data concentration in rarely masked fragments.
  • Normal distribution constraint (KL divergence loss)—inspired by the VAE approach [38]; forces the elements of the attribute vector to carry a limited amount of information about the details of a specific image.
Thanks to these modifications, the model distributes information more evenly in the hidden space and obtains most of the identification features from the identity encoder.
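The three regularizers can be illustrated with the following minimal, hypothetical sketch, assuming a VAE-style attribute encoder that outputs a mean and log-variance; the actual Baseline masks contiguous fragments of the vector and conditions a diffusion U-Net, both of which are omitted here.

```python
import torch
import torch.nn.functional as F

def regularize_attribute_vector(mu: torch.Tensor, logvar: torch.Tensor,
                                mask_prob: float = 0.3, dropout_p: float = 0.1):
    """Masking, dropout, and a KL penalty applied to the attribute vector.

    mu, logvar: (B, D) parameters of the attribute distribution (VAE-style).
    Returns the perturbed attribute vector and the KL term to add to the loss.
    """
    # Reparameterized sample of the attribute vector
    attr = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
    # Masking: element-wise zeroing here for brevity (fragments in the actual model)
    keep = (torch.rand_like(attr) > mask_prob).float()
    attr = attr * keep
    # Dropout spreads information across the vector
    attr = F.dropout(attr, p=dropout_p, training=True)
    # KL divergence toward a standard normal prior limits per-element information
    kl = -0.5 * torch.mean(1.0 + logvar - mu.pow(2) - logvar.exp())
    return attr, kl
```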
During inference, both the noise step (typical for diffusion models) and the masking level of the attribute vector can be controlled. Only one diffusion step was applied in the study—subsequent steps would improve the visual quality of the face swap without significantly affecting the hidden watermark.
Masking levels within the Baseline method were selected based on a series of preliminary tests with a trained network. The parameters were selected to ensure that the effect of masking on the extent of image modification was subjectively visible, while maintaining a comparable quality of the generated face. This differentiation enabled the analysis of how varying degrees of attribute isolation impact the degradation of the watermark.
The ability to control the noise level enables the generation of multiple test samples for various initial settings. In addition to analyzing the impact on the watermark, this approach can be used to augment training data for face-swap-resistant tagging systems.

2.2.7. Examples of Implemented Deepfakes

Images from the VGGFace2 [28] dataset were used for unit testing, as shown in Figure 5: (a) the image of the target face—the one that will be replaced; (b) the source identity for the face swap algorithms.
Figure 5. Reference images used in unit and comparative tests: (a) target image, (b) source image.
Table 1 presents the sample face replacement results obtained for all models discussed (SimSwap, FaceShifter, GHOST, FastFake, InsightFace, and Baseline), along with heat maps illustrating the changes made. For the baseline model, three variants are presented, corresponding to different levels of attribute vector masking (high, medium, low).
The table shows that some deepfake algorithms also introduce changes in the background of the image. This is particularly evident in the SimSwap and FaceShifter methods, where the right side of the heat maps indicates significant modifications outside the face area.
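Such change maps can be obtained in principle as a normalized per-pixel absolute difference between the two outputs; the sketch below is illustrative and does not necessarily match the exact normalization or colormap used for the figures.

```python
import numpy as np

def change_heatmap(img_a: np.ndarray, img_b: np.ndarray) -> np.ndarray:
    """Per-pixel change map between two aligned images, normalized to [0, 1].

    img_a, img_b: float arrays in [0, 1] with shape (H, W, 3). The mean absolute
    difference over channels is returned; warmer colormap values then correspond
    to larger modifications.
    """
    diff = np.abs(img_a.astype(np.float32) - img_b.astype(np.float32)).mean(axis=-1)
    peak = diff.max()
    return diff / peak if peak > 0 else diff
```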

2.3. Experiments

Several research scenarios were conducted as part of the analysis. Due to the intertwining nature of the individual experiments, they were divided into two main blocks: visible tagging and hidden tagging. Each block broadly covers analogous types of analysis, specifically examining the impact of tagging on face swapping on a global scale, broken down into face and background areas.
First, the results for visible tagging will be presented, followed by those for hidden tagging. In both cases, individual examples of the impact of tagging with different parameter configurations are presented, followed by a statistical overview based on a test set.
The primary metrics used in the analyses are the following:
  • ArcFace and CurricularFace [39] distance—the cosine distance between feature vectors extracted by the models, which allows assessing whether the persons depicted in the compared images are recognized as the same.
  • Pearson correlation—Pearson correlation coefficient between the image with the watermark and the image after applying face swap to the material containing the watermark.
  • PSNR (Peak Signal-to-Noise Ratio)—used additionally in local analyses for a selected area (background).
The study analyzed two fundamental aspects (a minimal sketch of the corresponding region-wise computations is given after this list):
The impact of watermarks on face swaps:
  • Heatmaps of differences between the image after a face swap performed on material with a watermark and the image after a face swap performed on material without a watermark were compared.
  • The ArcFace and CurricularFace distance between these two variants was calculated to assess the impact of marking on identity recognition.
Watermark retention:
  • The Pearson correlation coefficient was calculated between the original image with the watermark and the image after face swapping was performed on the same image.
  • Correlation maps and heat maps showing the distribution of changes in the image were generated.
  • In local analyses, the background area was examined separately by calculating the PSNR for this region to estimate the impact of face swap outside the face area.
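A minimal sketch of these metrics is given below, assuming a binary face mask from a segmentation step and pre-extracted identity embeddings; function names are illustrative and not taken from the experimental code.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio

def pearson_correlation(a: np.ndarray, b: np.ndarray) -> float:
    """Global Pearson correlation between two images (flattened)."""
    return float(np.corrcoef(a.ravel(), b.ravel())[0, 1])

def background_psnr(reference: np.ndarray, swapped: np.ndarray,
                    face_mask: np.ndarray) -> float:
    """PSNR restricted to background pixels (face_mask == 0).

    reference: watermarked image before face swap, float in [0, 1], shape (H, W, 3)
    swapped:   result of applying face swap to the watermarked image
    face_mask: (H, W) binary mask, 1 inside the face region
    """
    background = face_mask == 0
    if not np.any(background):
        return float("inf")
    return peak_signal_noise_ratio(reference[background], swapped[background],
                                   data_range=1.0)

def identity_similarity(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Cosine similarity between identity embeddings (e.g., ArcFace features);
    higher values indicate greater identity similarity."""
    return float(np.dot(emb_a, emb_b) /
                 (np.linalg.norm(emb_a) * np.linalg.norm(emb_b) + 1e-8))
```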
All experiments were performed on the VGGFace2 dataset. The VGGFace2 dataset includes photographs with varying lighting conditions, quality, compression, and cropping, which allowed the natural incorporation of the diversity typical of images obtained from the web into the experiments. VGGFace2 was chosen because it is widely used to train the face-swap families evaluated in this study, ensuring consistency between the training distribution of the generators and the test distribution of the watermarking experiments. Using a different dataset could have introduced additional variability due to distribution shifts, potentially confounding the interactions between the watermark and the generator. The focus was therefore on establishing controlled, region-aware comparisons under conditions representative of mainstream face-swap pipelines. Extending the evaluation to additional datasets, including those with different demographics and capture conditions, remains an important direction for future work to assess generalizability further. For each combination of parameters, calculations were performed on 1000 examples, allowing for statistically significant comparisons. The results are compared with a local baseline, which allows the quantitative demonstration of the non-locality of editing outside the face area.

3. Results

3.1. Visible Watermark

This section presents the results for a visible watermark, unrelated to any neural network training process, and its interaction with deepfake algorithms.

3.1.1. The Impact of Watermarks on the Face Swap Algorithm

The first study analyzed the impact of introducing a visible watermark on the performance of face swap algorithms, both in terms of generation quality and differences compared to the unmarked variant. Since this is a visible mark, a natural effect is a decrease in metrics such as PSNR as the opacity parameter increases.
Table 2 presents examples illustrating the differences between images after face swapping is performed on material with and without a watermark, along with corresponding heatmaps that show areas of change depending on the opacity value. In this section, the focus is on the face region. Differences in the heatmaps inside the mask indicate whether the visible mark (QR pattern) interferes with identity transfer or whether the generator collapses. A strongly structured QR pattern acts like an occlusion; if its imprint is clearly visible in the face area after swapping, the method either learned to “correct” occlusions in a specific way or partially failed to perform a stable swap. By contrast, a more uniform heatmap confined to facial boundaries suggests that the method edits locally without amplifying the mark.
The qualitative patterns group the methods. SimSwap, FaceShifter, and GHOST show larger, face-internal differences that grow with opacity, which means the watermark perturbs their swap pipeline and can trigger failures (including mode collapse in some settings). FaceShifter is the most sensitive: identity consistency drops already for small occlusions, consistent with its reconstruction/attribute losses—these favor realism but can overfit and become brittle when the face is partially masked. GHOST is strongly tied to ArcFace-style identity losses, which helps enforce identity but also makes it prone to “forcing” identity while drifting the background. FastFake and InsightFace behave differently: FastFake tends to preserve non-face regions (segmentation-weighted objective during training) and confines edits, whereas InsightFace often avoids modifying low-quality/occluded faces because the detector rejects them, so at high opacity, the frame can remain unchanged. The Baseline is intentionally invasive inside the face mask and therefore “cuts through” the watermark in that region while leaving the background intact; it serves as the lower-bound reference for locality.
Figure 6 presents the results in the form of a numerical metric—the cosine distance between feature vectors obtained from the ArcFace and CurricularFace models. A higher value of this metric means greater identity similarity between images. Results from CurricularFace closely mirror those of ArcFace; for brevity, ArcFace curves are discussed, and it is noted that CurricularFace leads to the same qualitative conclusions.
Overall, the joint reading of Table 2 (spatial differences) and Figure 6 (identity distance) is essential: heatmaps reveal where the watermark perturbs the pipeline (i.e., face vs. background), while the identity metric shows how these perturbations translate into recognition outcomes.

3.1.2. Watermark Resistance

Another analysis concerned the behavior of the watermark after processing by face swap. The Pearson correlation between the original image with the watermark and the image after face swap was measured (Figure 7a). A high correlation value indicates that a significant portion of the image remained consistent, and the watermark was preserved.
Table 3 supplements these graphs with visual correlation maps and heatmaps. In the upper rows, colors represent local Pearson correlation: dark red indicates perfect correlation (watermark preserved), while cooler or lighter tones indicate reduced similarity. In the lower rows, the heatmaps show pixel-level differences: colder areas mean fewer changes, while warmer areas highlight more substantial modifications.
An ideal outcome would combine high correlation in the background (dark red outside the face) with visible changes only in the swapped region, similar to the Baseline. In such a case, the watermark is preserved outside the swapped face while the face is modified as expected. For example, at medium opacity (e.g., 20%), FastFake’s maps clearly reveal facial outlines in the correlation map, while the rest of the image remains dark red—indicating that the watermark survives in the background.
In practice, deviations from this “ideal” appear. SimSwap, FaceShifter, and Ghost exhibit broader color variation in the upper rows and warmer areas in the lower rows, which extend into the background. This indicates that watermark information is degraded even in regions not directly edited, which is consistent with their training setup (no explicit segmentation or background-preservation loss). InsightFace is distinct: at low opacity, it behaves similarly to SimSwap; however, once the opacity exceeds approximately 40%, the detector rejects the face and no swap is performed, so the PSNR for those images is infinite. This results in the truncated red line in Figure 7b: at 75% and 100% opacity, the images remain unchanged, yielding artificially high correlation and PSNR values.
Figure 7b further highlights differences between methods. The baseline remains constant because it only modifies the face area, leaving the background unchanged, so it is not included. FastFake achieves the highest background PSNR among GAN-based models, confirming its ability to preserve non-face areas. SimSwap and FaceShifter consistently degrade background quality, while Ghost exhibits intermediate behavior. InsightFace is truncated for the reason explained above.
Overall, Table 3 and Figure 7 should be interpreted jointly. Strong results appear as dark red correlation in the background with localized warmer areas in the face (Baseline, FastFake). Poorer results appear as irregular, widespread variation in both rows (SimSwap, FaceShifter, Ghost), where the watermark is degraded across the whole frame.

3.2. Hidden Watermark

This section presents the results for a watermark trained using a neural network, designed to remain invisible to the viewer.

3.2.1. The Impact of Watermarks on the Face Swap Algorithm

As in the case of visible watermarks, the first stage of the analysis involved comparing the impact of introducing a hidden watermark on the performance of face swap algorithms. Table 4 presents example heatmaps illustrating the differences between images after face swap with and without a hidden watermark. Unlike the visible case, the hidden watermark does not produce a clear visual pattern; therefore, the distortions are subtler and must be interpreted in conjunction with the identity metrics.
Figure 8 reports cosine distance for ArcFace and CurricularFace embeddings. The two metrics are consistent. Higher values indicate stronger identity preservation.
Several patterns emerge when reading Table 4 and Figure 8 together. SimSwap and FaceShifter maintain identity transfer more consistently than in the visible case and do not collapse in the unit examples. However, their heatmaps still show non-local changes extending into the background, which reduces correlation and partly explains fluctuations in the metric curves. GHOST behaves differently: here, the watermark induces global degradation, visible as widespread heatmap differences, and its identity curve drops accordingly. This reflects its strong dependence on ArcFace-style identity losses, which can compromise stability in favor of identity.
FastFake again shows the most localized edits. In Table 4, the heatmaps reveal clearer preservation of the background compared to other GAN families, with distortions mostly confined to facial features. Interestingly, its ArcFace metric reaches relatively high values even when subjective quality suggests identity loss, highlighting a gap between numerical embeddings and visual perception. This indicates that background preservation (beneficial for watermark retention) can sometimes distort identity embeddings, creating a mismatch between subjective human judgment and model-based evaluation.
InsightFace differs from the visible case: since the hidden watermark does not strongly occlude facial regions, the detector continues to operate across the opacity range. Its identity metric gradually declines, possibly reflecting detector uncertainty, but no abrupt failure is observed.
In summary, hidden watermarks are less likely to induce catastrophic failures than visible ones. Still, the trade-off shifts: methods such as FastFake preserve the background more effectively (favoring watermark survival) but may produce ArcFace distances that overestimate identity consistency. SimSwap and FaceShifter remain more sensitive to background distortions, while GHOST is most vulnerable to global degradation.

3.2.2. Watermark Resistance

As in the case of the visible mark, the resistance of the hidden watermark to face swap processing was assessed using Pearson correlation and background-only PSNR. Figure 9 shows the global trends, while Table 5 provides spatial detail through correlation maps (upper rows) and heatmaps of differences (lower rows). In the correlation maps, similar to before, dark red indicates perfect similarity (with the watermark preserved), while cooler tones signal degraded correlation. In the heatmaps, colder regions indicate small changes, while warmer regions highlight more significant modifications.
The ideal outcome is similar to the visible case: high background correlation with localized changes confined to the swapped face. FastFake comes closest to this behavior. At moderate opacity levels (e.g., 20–30%), its maps clearly show facial outlines but preserve most of the background as dark red, meaning the hidden mark survives outside the swap region.
SimSwap and FaceShifter behave less favorably. Their correlation maps reveal widespread cooler areas that extend into the background, suggesting that the hidden watermark is degraded globally rather than just in the manipulated region. This is consistent with their training setups, which lacked explicit segmentation-based objectives to constrain edits.
GHOST exhibits the most severe background degradation. Its heatmaps display strong warm patches across the frame, showing that the generator alters even regions untouched by the swap mask.
InsightFace shows a different pattern. Unlike in the visible case, it operates across the entire opacity range, so its curves in Figure 9 remain continuous. Its correlation maps reveal that the method detects and processes a slightly different area, which affects interpretation. Still, its behavior is relatively stable compared to SimSwap or GHOST.
Overall, the combination of Table 5 and Figure 9 confirms that FastFake provides the most localized interference, closer to the Baseline, while SimSwap, FaceShifter, and especially GHOST spread degradation across the image. InsightFace remains stable but reflects the influence of its detector. These findings reinforce that the survival of hidden watermarks depends strongly on the architecture: methods with segmentation-aware objectives or detector-based constraints tend to preserve the background signal more effectively, while globally trained GANs degrade it more widely.

4. Discussion

The results indicate that invisible watermarks, when trained for robustness to common degradations, generally retain more of their signal after face swap than visible marks. Importantly, the dominant effect is not confined to the face region. Several families of face-swap models introduce non-local changes that also degrade watermarks placed far from the manipulated area. This challenges the common assumption that placing a mark away from the face is sufficient for protection. Dependencies between watermark strength, retention, and identity transfer are also non-monotonic and architecture-dependent. In some models, stronger marks simultaneously harm both transfer quality and retention, while in others, weaker marks are unintentionally smoothed out. Segmentation-weighted models better confine edits to the face and therefore preserve background watermarks more reliably, similar to how robust feature learning improves parsing in complex scenes [40].
These findings refine the evaluation of robustness. Classical tests, such as blur, noise, or resampling, approximate only part of the distortions caused by modern generators. Nonlinear, model-specific transforms produce irregular behaviors that monotonic assumptions cannot capture. Robustness evaluation must therefore be region-aware, strength-aware, and architecture-aware. Allocating watermark energy near facial boundaries or with high contrast carries a high risk of being reinterpreted as artifacts by the generator. Likewise, tuning strength requires careful consideration to avoid detector rejection or global statistical shifts.
This study has limits that point to clear directions for future work. The analysis focused on still images at a resolution of 128 × 128 to keep training multiple face-swap families and the watermarking model computationally feasible. Many pipelines in practice operate at 256 × 256 or higher and often incorporate super-resolution or enhancement modules, which can also affect watermark retention. Extending the experiments to higher resolutions, as well as to video, will help test whether the same non-local effects persist under more realistic conditions. In videos, temporal consistency could either help recovery by exploiting correlations across frames or further suppress weak marks by enforcing smoothness. Other manipulation families, such as reenactment, lip-sync, or full-face synthesis, should also be investigated. Broader datasets, additional recognition backbones, and recovery metrics such as bit error rate (BER) under dedicated training with specific model-induced degradations would also be valuable to investigate. Since the invisible watermark was not trained against face-swap transformations, the BER evaluation in the present setup would fail trivially. Future work should therefore couple BER analysis with watermark models explicitly trained for resistance to deepfake-induced perturbations.
For practitioners, the most actionable heuristic is to train watermarking systems with deepfake generators explicitly modeled as perturbations. In practice, this means augmenting training with generative noise or embedding a generator-in-the-loop, so that the decoder learns to recover messages under realistic non-local edits rather than only under classical distortions. Simple placement strategies such as “away from the face” are not sufficient. The results show that robustness depends strongly on the generator family: segmentation-weighted architectures preserve background signal better, while globally trained GANs often induce drift across the whole frame. Beyond the swap itself, additional stages such as super-resolution and enhancement can introduce comparable degradations. Robustness should therefore be validated across the entire pipeline rather than only the identity transfer step.
Deepfake generation and digital watermarking sit at the intersection of multimedia security, intellectual property protection, and public trust in visual media. While our contribution is methodological, the underlying technologies can be misused for disinformation, identity abuse, or unauthorized redistribution. At the same time, watermarking provides a means to enhance attribution and accountability. It is therefore emphasized that research in this area should be accompanied by guidelines for responsible use, including transparency in reporting, restrictions on dataset access, and a clear separation between forensic evaluation and generative enhancement.
Finally, even lightweight architectures such as FastFake, which appear perceptually stable to the human eye, introduced measurable statistical shifts in the analysis. This shows that human inspection alone is insufficient. Future work should aim to establish standardized, region-aware benchmarks that integrate deepfake generators into robustness testing pipelines. Such benchmarks would ensure that watermarking methods are evaluated under realistic adversarial conditions and would guide practitioners toward generator-aware placement and strength strategies.

Author Contributions

Investigation, T.W.; methodology, T.W.; supervision, Z.P.; validation, T.W.; resources, T.W.; writing—original draft, T.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research work was funded by the project of the Military University of Technology titled: “New Neural Network Architectures for Signal and Data Processing in Radiocommunications and Multimedia.” Project No. UGB 22-054/2025.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The dataset underlying the experiments is VGGFace2, which the authors cannot redistribute due to license terms. Upon reasonable request, we will provide per-image and aggregated metrics, lists of VGGFace2 image identifiers used, configuration files and scripts to reproduce the experiments, and trained checkpoints of the baseline watermarking and face-swap components developed in this work, subject to third-party license restrictions.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AAD: Adaptive Attentional Denormalization
AdaIN: Adaptive Instance Normalization
AEI-Net: Adaptive Embedding Integration Network
ArcFace: Additive angular margin face recognition model
BCE: Binary Cross-Entropy
BCEWithLogitsLoss: Binary Cross-Entropy with logits
BiSeNet: Bilateral Segmentation Network
CV: Computer Vision
DIP: Digital Image Processing
FiLM: Feature-wise Linear Modulation
GAN: Generative Adversarial Network
ID: Identity embedding
KL: Kullback–Leibler divergence
LPIPS: Learned Perceptual Image Patch Similarity
MSE: Mean Squared Error
PSNR: Peak Signal-to-Noise Ratio
QR: Quick Response code
ResNet: Residual Network
SDK: Software Development Kit
SSIM: Structural Similarity Index Measure
SOTA: State of the Art
U-Net: U-shaped convolutional neural network
VAE: Variational Autoencoder
VGGFace2: VGG Face dataset (version 2)

References

  1. Qureshi, A.; Megías, D.; Kuribayashi, M. Detecting Deepfake Videos Using Digital Watermarking. In Proceedings of the 2021 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Tokyo, Japan, 14–17 December 2021; pp. 1786–1793. [Google Scholar]
  2. Duszejko, P.; Walczyna, T.; Piotrowski, Z. Detection of Manipulations in Digital Images: A Review of Passive and Active Methods Utilizing Deep Learning. Appl. Sci. 2025, 15, 881. [Google Scholar] [CrossRef]
  3. Mahmud, B.U.; Sharmin, A. Deep Insights of Deepfake Technology: A Review. DUJASE 2023, 5, 13–23. [Google Scholar]
  4. Westerlund, M. The Emergence of Deepfake Technology: A Review. Technol. Innov. Manag. Rev. 2019, 9, 40–53. [Google Scholar] [CrossRef]
  5. Amerini, I.; Barni, M.; Battiato, S.; Bestagini, P.; Boato, G.; Bruni, V.; Caldelli, R.; De Natale, F.; De Nicola, R.; Guarnera, L.; et al. Deepfake Media Forensics: Status and Future Challenges. J. Imaging 2025, 11, 73. [Google Scholar] [CrossRef] [PubMed]
  6. Lai, Z.; Yao, Z.; Lai, G.; Wang, C.; Feng, R. A Novel Face Swapping Detection Scheme Using the Pseudo Zernike Transform Based Robust Watermarking. Electronics 2024, 13, 4955. [Google Scholar] [CrossRef]
  7. Zhu, J.; Kaplan, R.; Johnson, J.; Li, F.-F. HiDDeN: Hiding Data with Deep Networks. In Proceedings of the Computer Vision—ECCV 2018: 15th European Conference, Munich, Germany, 8–14 September 2018. [Google Scholar]
  8. Tancik, M.; Mildenhall, B.; Ng, R. StegaStamp: Invisible Hyperlinks in Physical Photographs. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  9. Yao, Y.; Grosz, S.; Liu, S.; Jain, A. Hide and Seek: How Does Watermarking Impact Face Recognition? arXiv 2024, arXiv:2404.18890. [Google Scholar] [CrossRef]
  10. Begum, M.; Uddin, M.S. Digital Image Watermarking Techniques: A Review. Information 2020, 11, 110. [Google Scholar] [CrossRef]
  11. Walczyna, T.; Piotrowski, Z. Quick Overview of Face Swap Deep Fakes. Appl. Sci. 2023, 13, 6711. [Google Scholar] [CrossRef]
  12. Chen, R.; Chen, X.; Ni, B.; Ge, Y. SimSwap: An Efficient Framework for High Fidelity Face Swapping. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12 October 2020; pp. 2003–2011. [Google Scholar]
  13. Li, L.; Bao, J.; Yang, H.; Chen, D.; Wen, F. FaceShifter: Towards High Fidelity and Occlusion Aware Face Swapping. arXiv 2020, arXiv:1912.13457. [Google Scholar] [CrossRef]
  14. Groshev, A.; Maltseva, A.; Chesakov, D.; Kuznetsov, A.; Dimitrov, D. GHOST—A New Face Swap Approach for Image and Video Domains. IEEE Access 2022, 10, 83452–83462. [Google Scholar] [CrossRef]
  15. Walczyna, T.; Piotrowski, Z. Fast Fake: Easy-to-Train Face Swap Model. Appl. Sci. 2024, 14, 2149. [Google Scholar] [CrossRef]
  16. Deepinsight/Insightface 2025. Available online: https://github.com/deepinsight/insightface (accessed on 1 September 2025).
  17. Vasiljević, I.; Obradović, R.; Đurić, I.; Popkonstantinović, B.; Budak, I.; Kulić, L.; Milojević, Z. Copyright Protection of 3D Digitized Artistic Sculptures by Adding Unique Local Inconspicuous Errors by Sculptors. Appl. Sci. 2021, 11, 7481. [Google Scholar] [CrossRef]
  18. Li, Q.; Wang, X.; Ma, B.; Wang, X.; Wang, C.; Gao, S.; Shi, Y. Concealed Attack for Robust Watermarking Based on Generative Model and Perceptual Loss. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 5695–5706. [Google Scholar] [CrossRef]
  19. Zhao, Y.; Wang, C.; Zhou, X.; Qin, Z. DARI-Mark: Deep Learning and Attention Network for Robust Image Watermarking. Mathematics 2023, 11, 209. [Google Scholar] [CrossRef]
  20. Kaczyński, M.; Piotrowski, Z. High-Quality Video Watermarking Based on Deep Neural Networks and Adjustable Subsquares Properties Algorithm. Sensors 2022, 22, 5376. [Google Scholar] [CrossRef] [PubMed]
  21. Wadhera, S.; Kamra, D.; Rajpal, A.; Jain, A.; Jain, V. A Comprehensive Review on Digital Image Watermarking. arXiv 2022, arXiv:2207.06909. [Google Scholar] [CrossRef]
  22. Horé, A.; Ziou, D. Image Quality Metrics: PSNR vs. SSIM. In Proceedings of the 2010 20th International Conference on Pattern Recognition, Istanbul, Turkey, 23–26 August 2010; pp. 2366–2369. [Google Scholar]
  23. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015, Munich, Germany, 5–9 October 2015; Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F., Eds.; Springer International Publishing: Cham, Switzerland, 2015; pp. 234–241. [Google Scholar]
  24. Perez, E.; Strub, F.; de Vries, H.; Dumoulin, V.; Courville, A. FiLM: Visual Reasoning with a General Conditioning Layer. arXiv 2017, arXiv:1709.07871. [Google Scholar] [CrossRef]
  25. Huang, G.; Liu, Z.; van der Maaten, L.; Weinberger, K.Q. Densely Connected Convolutional Networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  26. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  27. Zhang, R.; Isola, P.; Efros, A.A.; Shechtman, E.; Wang, O. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
  28. Cao, Q.; Shen, L.; Xie, W.; Parkhi, O.M.; Zisserman, A. VGGFace2: A Dataset for Recognising Faces across Pose and Age. In Proceedings of the 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), Xi’an, China, 15–19 May 2018. [Google Scholar]
  29. Binderiya Usukhbayar Deepfake Videos: The Future of Entertainment 2020. Available online: https://www.researchgate.net/publication/340862112_Deepfake_Videos_The_Future_of_Entertainment (accessed on 1 September 2025).
  30. Deng, J.; Guo, J.; Yang, J.; Xue, N.; Kotsia, I.; Zafeiriou, S. ArcFace: Additive Angular Margin Loss for Deep Face Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 5962–5979. [Google Scholar] [CrossRef]
  31. Huang, X.; Belongie, S. Arbitrary Style Transfer in Real-Time with Adaptive Instance Normalization. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar]
  32. Arjovsky, M.; Chintala, S.; Bottou, L. Wasserstein Generative Adversarial Networks. In Proceedings of the 34th International Conference on Machine Learning, PMLR, Sydney, Australia, 17 July 2017; pp. 214–223. [Google Scholar]
  33. Liu, B.; Zhu, Y.; Song, K.; Elgammal, A. Towards Faster and Stabilized GAN Training for High-Fidelity Few-Shot Image Synthesis. arXiv 2021, arXiv:2101.04775. [Google Scholar]
  34. Hu, J.; Shen, L.; Albanie, S.; Sun, G.; Wu, E. Squeeze-and-Excitation Networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
  35. Yu, C.; Wang, J.; Peng, C.; Gao, C.; Yu, G.; Sang, N. BiSeNet: Bilateral Segmentation Network for Real-Time Semantic Segmentation. arXiv 2018, arXiv:1808.00897. [Google Scholar]
  36. Estanislao, K. Hacksider/Deep-Live-Cam 2025. Available online: https://github.com/hacksider/Deep-Live-Cam (accessed on 1 September 2025).
  37. Ho, J.; Jain, A.; Abbeel, P. Denoising Diffusion Probabilistic Models. arXiv 2020, arXiv:2006.11239. [Google Scholar] [CrossRef]
  38. Kingma, D.P.; Welling, M. An Introduction to Variational Autoencoders. Found. Trends® Mach. Learn. 2019, 12, 307–392. [Google Scholar] [CrossRef]
  39. Huang, Y.; Wang, Y.; Tai, Y.; Liu, X.; Shen, P.; Li, S.; Li, J.; Huang, F. CurricularFace: Adaptive Curriculum Learning Loss for Deep Face Recognition. arXiv 2020, arXiv:2004.00288. [Google Scholar] [CrossRef]
  40. Liu, Y.; Wang, C.; Lu, M.; Yang, J.; Gui, J.; Zhang, S. From Simple to Complex Scenes: Learning Robust Feature Representations for Accurate Human Parsing. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 5449–5462. [Google Scholar] [CrossRef]
Figure 1. A single example of implemented explicit watermarking: (a) image without a watermark, (b) difference map (image with watermark minus image without watermark), (c) image with watermark (50% opacity).
Figure 2. Impact of the transparency level of an explicit watermark on image distortion—comparison of an image with a watermark and a reference image: (a) PSNR, (b) SSIM.
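For reference, the visible-watermark setup evaluated in Figures 1 and 2 can be approximated by a simple alpha-blending step followed by full-frame PSNR/SSIM measurement. The sketch below is illustrative only (NumPy/scikit-image, not the exact code used in our experiments); the function names and the synthetic host/logo inputs are assumptions made for the example.

```python
# Minimal sketch of the visible-watermark setup behind Figures 1 and 2:
# a logo is alpha-blended onto the host image at a chosen opacity, and
# PSNR/SSIM are computed against the unmarked reference image.
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity


def blend_visible_watermark(image, logo_rgb, logo_alpha, opacity):
    """Blend a logo onto the host image.

    image:      HxWx3 float array in [0, 1]
    logo_rgb:   HxWx3 float array with the logo colours
    logo_alpha: HxW float array in [0, 1], 1 inside the logo, 0 elsewhere
    opacity:    global transparency level (the parameter swept in Figure 2)
    """
    a = (opacity * logo_alpha)[..., None]  # per-pixel blend weight
    return np.clip((1.0 - a) * image + a * logo_rgb, 0.0, 1.0)


def distortion_metrics(reference, marked):
    """PSNR and SSIM between the unmarked reference and the watermarked image."""
    psnr = peak_signal_noise_ratio(reference, marked, data_range=1.0)
    ssim = structural_similarity(reference, marked, channel_axis=-1, data_range=1.0)
    return psnr, ssim


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    host = rng.random((256, 256, 3))       # stand-in for a test frame
    logo_rgb = np.ones_like(host)          # white square as a stand-in logo
    logo_alpha = np.zeros(host.shape[:2])
    logo_alpha[96:160, 96:160] = 1.0
    for opacity in (0.05, 0.10, 0.30, 0.50):
        marked = blend_visible_watermark(host, logo_rgb, logo_alpha, opacity)
        print(opacity, distortion_metrics(host, marked))
```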
Figure 3. An example of implemented hidden watermarking: (a) image without watermark, (b) difference map, (c) image with watermark (strength 0.1).
Figure 4. The influence of the “strength” parameter of a hidden watermark on image distortion—comparison of the watermarked image with the original image: (a) PSNR, (b) SSIM.
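The hidden watermark in Figures 3 and 4 exposes an analogous “strength” parameter. The sketch below illustrates one common way such a knob is realized (an assumption about the general mechanism, not the trained encoder used in this study): the encoder-produced residual is scaled before being added back to the host image, so larger strength trades image fidelity for embedding energy.

```python
# Illustrative strength control for an invisible (hidden) watermark: scale an
# encoder-produced residual before adding it to the host image. The residual
# below is random noise, used purely as a placeholder for the encoder output.
import numpy as np


def apply_hidden_watermark(image, residual, strength):
    """image, residual: HxWx3 float arrays (image in [0, 1], residual ~ zero-mean).
    strength: the parameter swept in Figure 4."""
    return np.clip(image + strength * residual, 0.0, 1.0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    host = rng.random((256, 256, 3))
    residual = 0.05 * rng.standard_normal(host.shape)  # placeholder payload pattern
    marked = apply_hidden_watermark(host, residual, strength=0.1)
```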
Figure 6. ArcFace (a) and CurricularFace (b) cosine distance between the face-swapped image with a watermark and the face-swapped image without a watermark.
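Figures 6 and 8 quantify how the watermark perturbs identity transfer: the face-swapped outputs produced with and without the watermark are embedded by a face-recognition backbone and compared by cosine distance. A minimal sketch of that comparison follows; `embed`, `face_swap`, and `watermark` are hypothetical helper names, with `embed` standing in for any ArcFace or CurricularFace implementation that returns a fixed-length embedding.

```python
# Sketch of the identity-transfer check behind Figures 6 and 8: cosine distance
# between the identity embeddings of the two face-swapped renders.
import numpy as np


def cosine_distance(a, b):
    """1 - cosine similarity between two identity embeddings."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))


# Example usage (all names are placeholders):
# e_plain  = embed(face_swap(source, target))             # swap without watermark
# e_marked = embed(face_swap(source, watermark(target)))  # swap of the watermarked target
# print(cosine_distance(e_marked, e_plain))
```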
Figure 7. (a) Pearson correlation, (b) PSNR of the background between the image with the watermark and the image after face swap—visible watermark case.
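Figures 7 and 9 report region-aware retention, i.e., PSNR and Pearson correlation restricted to pixels outside the face mask. The sketch below shows one straightforward way to compute both from a boolean background mask (an assumed formulation consistent with the captions; the mask could come, for example, from a BiSeNet face parser).

```python
# Sketch of the background-only retention metrics used in Figures 7 and 9.
import numpy as np


def background_metrics(marked, swapped, face_mask):
    """PSNR and Pearson correlation over background pixels only.

    marked, swapped: HxWx3 float images in [0, 1] (before / after face swap)
    face_mask:       HxW boolean array, True inside the face region
    """
    bg = ~face_mask
    x = marked[bg].ravel()   # background pixels of the watermarked input
    y = swapped[bg].ravel()  # background pixels of the face-swapped output
    mse = float(np.mean((x - y) ** 2))
    psnr = float("inf") if mse == 0.0 else 10.0 * float(np.log10(1.0 / mse))
    pearson = float(np.corrcoef(x, y)[0, 1])
    return psnr, pearson
```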
Figure 8. ArcFace (a) and CurricularFace (b) cosine distance between the face-swapped image with a watermark and the face-swapped image without a watermark—hidden watermark case.
Figure 9. (a) Pearson correlation, (b) PSNR of the background between the image with the watermark and the image after face swap—hidden watermark case.
Table 1. Sample face swap results and corresponding heat maps for the tested models.
[Image grid: columns: SimSwap, FaceShifter, Ghost, FastFake, InsightFace, Baseline High Mask, Baseline Medium Mask, Baseline Low Mask; rows: Face Swapped, Heatmap.]
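The heatmaps in Tables 1–5 visualize where each generator changes the image. A per-pixel absolute difference averaged over colour channels, as sketched below, is one simple way to produce such maps (an assumed visualization consistent with the table captions, not necessarily the exact rendering used).

```python
# Simple difference-heatmap construction for two renders of the same frame.
import numpy as np


def difference_heatmap(img_a, img_b):
    """Return an HxW map of the mean absolute difference across colour channels."""
    return np.abs(img_a.astype(np.float32) - img_b.astype(np.float32)).mean(axis=-1)
```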
Table 2. Heatmaps showing differences between images after face swap with a watermark and images after face swap without a watermark—visible watermark case.
[Image grid: columns: Image with Watermark, SimSwap, FaceShifter, Ghost, FastFake, InsightFace, Baseline High Mask, Baseline Medium Mask, Baseline Low Mask; rows: watermark levels 5%, 10%, 20%, 30%, 50%, 75%, 100%.]
Table 3. Local Pearson correlation and heatmaps of differences between the image with the watermark and the image after face swap—visible watermark case.
[Image grid: columns: SimSwap, FaceShifter, Ghost, FastFake, InsightFace, Baseline High Mask, Baseline Medium Mask, Baseline Low Mask; rows: watermark levels 0%, 5%, 10%, 20%, 30%, 50%, 75%, 100%, with two image rows per level (local correlation map and difference heatmap).]
Table 4. Heatmaps of differences between images after face swap with a watermark and images after face swap without a watermark—hidden watermark case.
[Image grid: columns: Image with Watermark, SimSwap, FaceShifter, Ghost, FastFake, InsightFace, Baseline High Mask, Baseline Medium Mask, Baseline Low Mask; rows: watermark levels 5%, 10%, 20%, 30%, 50%, 75%, 100%.]
Table 5. Local Pearson correlation and heatmaps of differences between the image with the watermark and the image after face swap—hidden watermark.
[Image grid: columns: SimSwap, FaceShifter, Ghost, FastFake, InsightFace, Baseline High Mask, Baseline Medium Mask, Baseline Low Mask; rows: watermark levels 0%, 5%, 10%, 20%, 30%, 50%, 75%, 100%, with two image rows per level (local correlation map and difference heatmap).]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
