Article

RoSe-Mix: Robust and Secure Deep Neural Network Watermarking in Black-Box Settings via Image Mixup

by
Tamara El Hajjar
1,†,
Mohammed Lansari
1,2,*,†,
Reda Bellafqira
1,3,
Gouenou Coatrieux
1,3,
Katarzyna Kapusta
2 and
Kassem Kallas
3
1
IMT Atlantique, Inserm UMR 1101, 29200 Brest, France
2
CortAIx Labs, Thales, 91120 Palaiseau, France
3
National Institute of Health and Medical Research (Inserm), UMR 1101 Latim, 29238 Brest, France
*
Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Mach. Learn. Knowl. Extr. 2025, 7(2), 32; https://doi.org/10.3390/make7020032
Submission received: 12 February 2025 / Revised: 12 March 2025 / Accepted: 26 March 2025 / Published: 30 March 2025
(This article belongs to the Section Privacy)

Abstract

Because they are costly to develop, deep neural networks (DNNs) are valuable assets whose intellectual property (IP) needs to be protected. DNN watermarking has therefore attracted significant interest, since it allows DNN owners to prove ownership. Various methods that embed ownership information in the model's behavior have been proposed. They must satisfy several requirements, among them security, which measures how difficult it is for an attacker to break the watermarking scheme, and robustness, which quantifies the resistance against watermark removal techniques. Existing methods, however, generally fail to meet both of these standards simultaneously. This paper presents RoSe-Mix, a robust and secure deep neural network watermarking technique designed for black-box settings. It addresses limitations of existing DNN watermarking approaches by integrating key features from two established methods: RoSe, which uses cryptographic hashing to ensure security, and Mixer, which employs image Mixup to enhance robustness. Experimental results demonstrate that RoSe-Mix achieves security across various architectures and datasets, with robustness to removal attacks exceeding 99%.

1. Introduction

Recent advances in machine learning (ML) have revolutionized many fields, like natural language processing [1], time series analysis [2], and computer vision [3]. Deep neural networks (DNNs) are the ML models that contribute the most to these advancements. However, their development requires substantial resources: large, high-quality datasets that are often difficult to obtain due to confidentiality constraints; expensive hardware such as GPUs; and scarce expertise, both from specialists for data annotation and from data science engineers who design well-adapted model architectures and hyper-parameters [4].
Training a DNN is thus a costly task, which makes trained models valuable assets that need to be protected in terms of intellectual property (IP). The most promising solution to safeguard this IP is DNN watermarking [5,6,7], which allows the owner to embed a secret in the model in such a way that it proves ownership. DNN watermarking techniques are classified into two settings, white-box and black-box, defined by the level of access required by the watermark extractor. In the white-box setting [5,8,9,10,11,12,13,14], the internal parameters (i.e., the weights) of the DNN model are accessible. In the black-box setting, only the final output of the DNN model is observed, and the watermark is extracted by querying the model with a set of trigger inputs [15,16,17,18,19,20,21,22]. This trigger set consists of crafted inputs and outputs that embed a backdoor in the model to prove ownership. However, these watermarking modulations have two main limitations. First, the samples used to embed the watermark are the same as those used for verification; a malicious verifier can therefore steal this set and claim model ownership. Second, the choice of target labels is generally naive, so a potential usurper needs only minimal work to reverse engineer input–output pairs a posteriori and build an illegitimate IP proof. In this paper, we present RoSe-Mix, a black-box watermarking technique that combines two existing black-box watermarking techniques, RoSe [23] and Mixer [25], to exponentially increase the work an attacker must perform to claim an illegitimate model. Our contributions are summarized as follows:
  • We present the strengths of RoSe [23] and Mixer [25] and demonstrate how combining them into RoSe-Mix enhances the security and robustness of the watermark.
  • We provide a security analysis of this method by formalizing the complexity for an attacker to break the watermarking scheme, demonstrating that breaking the watermarking protocol is exponentially hard with respect to the size of the selected secret key.
  • We conduct experiments on several datasets and models to evaluate the performance of our proposed method.
Section 2 provides an overview of DNN watermarking techniques. Section 3 describes our motivation to merge RoSe and Mixer and presents RoSe-Mix. Finally, Section 4 presents experimental results using different datasets and models.

2. Related Works

In this section, we introduce the notion of DNN watermarking by defining its requirements and properties. We then present state-of-the-art methods, particularly focusing on RoSe [23] and Mixer [25], the two key methods used in our proposal.
DNN watermarking is a technique that consists of embedding a secret in a model. This secret can then be extracted later and used by its owner to identify or track the model’s illegal copies. To be usable in a real scenario, the watermarking modulation should respect several properties [26]. The first one concerns the model’s fidelity, which quantifies the performance degradation on the main task of the model after watermark embedding. The second property is the robustness of the watermark, which defines whether the watermark resists various model modifications (intentional or not). Finally, security corresponds to the difficulty for an attacker to estimate the secret key linked to the watermark.
The first DNN watermarking modulation was proposed by Uchida et al. [5] and consists of embedding a binary string in the model parameters using a regularization term and a secret key. When the model owner wants to extract the secret binary string, it is mandatory to have full access to the model parameters. This constraint categorizes the method as a white-box technique [8,10,11,12,27,28]. On the other hand, black-box techniques rely on the fact that only the inputs and outputs of the model are accessible during the verification phase. This scenario fits well with machine learning as a service application where DNNs are available through an API.
Relying on this assumption, black-box watermarking techniques hide the secret in the model's behavior. Adi et al. [15] were the first to propose such a technique. Their main contribution is the definition of the trigger set $S_e$, which is a set of crafted inputs, $\tilde{X}$, and outputs, $\tilde{Y}$, used to create a backdoor in the model. This backdoor is then used by the owner to recognize its possibly stolen model. In their paper, $\tilde{X}$ is composed of unrelated images (i.e., out-of-distribution images), while $\tilde{Y}$ is randomly selected from the set of classes, $\mathcal{C}$. Several methods have been developed focusing on how to build the trigger set. We can also cite Zhang et al. [17], who used training images with additional patterns (text or noise) to trigger the model. Other methods craft $S_e$ using different techniques, such as adversarial examples [18] or invisible masks [16]. While the majority of existing methods focus on the machine-learning aspect of the problem, i.e., how to embed a watermark that respects the fidelity and robustness requirements, few are interested in the security side.
To improve the security of the watermark, Kallas et al. [23] proposed a black-box watermarking protocol called RoSe. This method uses one-way hashing functions and the injection of image-label pairs into the DNN during training, with these pairs serving as the trigger set for verifying ownership of a possibly stolen model. This method demonstrates strong security guarantees since breaking the watermarking scheme is NP-hard. Figure 1 illustrates this process for the MNIST dataset.
In subsequent work, the same authors introduced Mixer [25], a black-box watermarking technique where the triggers used during training differ from those used for ownership verification. To achieve this, they use the principle of Mixup, a data augmentation technique originally proposed by Zhang et al. [24], where new training samples are created by combining pairs of samples and their corresponding labels. By doing so, the method breaks the dependency on a fixed trigger set and its limited size. Figure 2 shows the Mixer process for generating one sample with $k = 2$.

3. Proposed Method

In this section, we discuss the limitations of RoSe and Mixer, which have motivated the development of our new proposal, RoSe-Mix. Rather than directly addressing all of these limitations, the new method takes advantage of the strengths of both techniques to improve security.

3.1. Motivation

Although RoSe and Mixer offer strong protection against usurpers, where any drop in watermark recovery accuracy is closely tied to a drop in the main task accuracy, each method has inherent drawbacks:
  • Trigger set vulnerability: RoSe’s main limitation lies in its trigger set, which acts as the Owner’s key. Since the same trigger set is used for both watermark embedding and verification, it is vulnerable to a potentially untrustworthy verifier. Furthermore, increasing the number of watermark samples from the training data can degrade the model’s performance on the main task, limiting the scalability of RoSe [23].
  • Domain space overlap: In Mixer, the verification trigger set differs from the trigger samples used during watermark embedding. However, since both sets are derived from the same data domain, represented as a sphere in the feature space, there remains a risk that an attacker could generate keys λ and μ that are very similar to the Owner’s keys. This could place the attacker in the same domain space, making it difficult for the verifier to distinguish between the legitimate Owner and the attacker in case of a dispute [25].
Although some of the limitations persist, our proposed method does not aim to resolve all of them. Instead, it combines the strong security features of both RoSe and Mixer, capitalizing on their complementary strengths to enhance the security and robustness of the watermarking process. By merging their protective mechanisms, the new approach offers a higher level of security against potential attacks.

3.2. RoSe-Mix: A Hybrid Watermarking Approach

In this section, we introduce our proposed method, which combines the secret-keyed hash function from RoSe with the Dirichlet distribution from Mixer to generate the mixing coefficients, denoted by $\lambda$ and $\mu$. These coefficients are then used to create trigger samples and their corresponding labels, leveraging the strengths of both methods. This fusion improves the security of the watermark generation, allowing for ownership verification using unseen samples.
The generation of the trigger set begins by selecting a set of $n_e$ images, $\{x_1, x_2, \ldots, x_{n_e}\}$, from the training set $\mathcal{D}_{\text{train}} = \{(x_i, y_i)\}_{i=1}^{n} \subset \mathcal{X} \times \mathcal{C}$, where $n$ is the number of training samples, $\mathcal{X}$ the input space, and $\mathcal{C} = \{1, \ldots, c\}$ the set of classes. These images are concatenated and hashed using a cryptographic hash function, $H$, keyed with the secret key, $sk$, resulting in a unique bitstream:
$$\text{bitstream} = H(x_1 \,\|\, x_2 \,\|\, \cdots \,\|\, x_{n_e};\ sk). \tag{1}$$
This bitstream is then transformed into a numeric seed by applying the function $O_H(\cdot)$:
$$\text{seed} = O_H(\text{bitstream}). \tag{2}$$
Using this seed, a deterministic sequence of random numbers is generated and used to form the vector $\alpha = (\alpha_1, \ldots, \alpha_k)$. This vector serves as the parameter of a Dirichlet distribution that generates the mixing coefficients $\lambda = (\lambda_1, \ldots, \lambda_k)$:
$$\lambda \sim \mathrm{Dirichlet}(\alpha). \tag{3}$$
The same procedure is repeated to generate the coefficients in $\mu$. These coefficients are used to generate the new pairs of samples by merging them according to the Mixup principle described in Mixer [25]. For the inputs, a key input $\tilde{x}_s$ is generated with the following rule:
$$\tilde{x}_s = \mathrm{clip}\!\left(\sum_{i=1}^{k} \lambda_i x_i + x_{\text{overlay}}\right), \tag{4}$$
where $x_{\text{overlay}}$ is an additive overlay added to the Mixup sample to enhance secrecy, and $\mathrm{clip}$ ensures that the pixel values remain within valid limits. These Mixer samples form the set of inputs $\tilde{X} = \{\tilde{x}_s\}_{s=1}^{n_e}$. Figure 3 shows examples of trigger set samples generated as described above; for instance, Figure 3d is a mix between a car, $x_1$, and a horse, $x_2$, where $\lambda_1 > \lambda_2$.
Next, the corresponding labels for the Mixer samples are generated. For each Mixer sample, $\tilde{x}_s$, the true labels, $y_i$, of the constituent images, $x_i$, are converted to their one-hot encoded forms, $\mathbf{y}_i$. The key label, $\tilde{y}_s$, for the Mixer sample is a combination of these one-hot encoded labels, weighted by the $\mu_i$ coefficients:
$$\tilde{y}_s = \sum_{i=1}^{k} \mu_i \mathbf{y}_i, \tag{5}$$
where $\sum_{i=1}^{k} \mu_i = 1$. The set of all generated labels, $\tilde{Y}$, corresponds to the Mixer samples in $\tilde{X}$ and forms the trigger set $S_e = \{(\tilde{x}_s, \tilde{y}_s)\}_{s=1}^{n_e}$.
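To make the construction concrete, the following sketch traces the full generation pipeline: keyed hash of the selected images, derivation of the seed, Dirichlet sampling of $\lambda$ and $\mu$, and Mixup of inputs and one-hot labels. It is a minimal illustration only; the choice of HMAC-SHA-256 for $H$, the truncation used for $O_H$, the way $\alpha$ is derived from the seeded generator, and the overlay pattern are assumptions rather than the exact design of RoSe-Mix.

```python
import hashlib
import hmac
import numpy as np

def generate_trigger_set(images, labels, sk, k=2, n_e=500, num_classes=10, overlay=None):
    """Sketch of RoSe-Mix trigger-set generation (illustrative, not the exact scheme).

    images: float array (n, H, W, C) with pixel values in [0, 1]
    labels: int array (n,) of class indices
    sk:     secret key (bytes) for the keyed hash H
    """
    rng0 = np.random.default_rng(0)                    # only to pick which images enter the hash
    idx = rng0.choice(len(images), size=n_e, replace=False)

    # H(x_1 || x_2 || ... || x_{n_e}; sk): keyed hash of the concatenated images (assumption: HMAC-SHA-256)
    digest = hmac.new(sk, images[idx].tobytes(), hashlib.sha256).digest()
    seed = int.from_bytes(digest[:8], "big")           # O_H: digest -> numeric seed (assumed truncation)

    rng = np.random.default_rng(seed)                  # deterministic sequence driven by the seed
    alpha = rng.random(k) + 0.5                        # Dirichlet parameters alpha (illustrative choice)

    X_trig, Y_trig = [], []
    for _ in range(n_e):
        lam = rng.dirichlet(alpha)                     # mixing coefficients lambda for the inputs
        mu = rng.dirichlet(alpha)                      # mixing coefficients mu for the labels
        picks = rng.choice(len(images), size=k, replace=False)
        x = sum(l * images[p] for l, p in zip(lam, picks))
        if overlay is not None:
            x = x + overlay                            # secret additive overlay x_overlay
        x = np.clip(x, 0.0, 1.0)                       # keep pixel values within valid limits
        y = np.zeros(num_classes)
        for m, p in zip(mu, picks):
            y[labels[p]] += m                          # soft key label: mu-weighted one-hot mix
        X_trig.append(x)
        Y_trig.append(y)
    return np.stack(X_trig), np.stack(Y_trig)
```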
Once the Mixer trigger set, $S_e$, is created, it is combined with the training dataset, $\mathcal{D}_{\text{train}}$, during the training phase of the model. It can also be applied with fine-tuning in the case where we distribute the model among several clients (called traitor tracking in watermarking [30]). By including these watermarked samples in the training process, the model learns to associate specific patterns with the watermarked outputs, embedding the watermark in its behavior. This process ensures that the model is capable of recognizing these mixed-up patterns, even for unseen data during verification.
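Because each key label $\tilde{y}_s$ is a probability vector rather than a hard class, the embedding step can be implemented with a soft-target cross-entropy alongside the usual loss on $\mathcal{D}_{\text{train}}$. The PyTorch-style sketch below is only illustrative: the optimizer, the equal weighting of the two losses, and the joint batching are assumptions, not the exact training recipe of the paper.

```python
import torch
import torch.nn.functional as F

def embed_watermark(model, train_loader, trigger_loader, epochs=100, lr=1e-3):
    """Sketch: jointly train on D_train (hard labels) and S_e (soft Mixer labels)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for (x, y), (x_t, y_t) in zip(train_loader, trigger_loader):
            opt.zero_grad()
            loss_main = F.cross_entropy(model(x), y)                       # main classification task
            # soft-target cross-entropy on the Mixer trigger pairs (x_t, y_t)
            loss_wm = -(y_t * F.log_softmax(model(x_t), dim=1)).sum(dim=1).mean()
            (loss_main + loss_wm).backward()                               # equal weighting (assumption)
            opt.step()
    return model
```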
To proceed with the verification, the owner provides the verifier with the secret key, $sk$, the hash function, $H$, and the vectors $\lambda$ and $\mu$. The verifier uses this information to regenerate Mixer samples and queries the model $F : \mathcal{X} \rightarrow \mathcal{C}$ with the newly generated set $S_d$ of size $n_d$. The outputs of the model are evaluated based on the ratio $\rho_{n_d}$, which is the proportion of correctly predicted labels:
$$\rho_{n_d} = \frac{\left|\{\tilde{x}_s \in S_d \mid F(\tilde{x}_s) = \tilde{y}_s\}\right|}{n_d}. \tag{6}$$
If $\rho_{n_d}$ exceeds a predefined threshold, $\tau$, the model is verified to be watermarked, confirming ownership.
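In code, the verification step amounts to regenerating $S_d$ with the disclosed secrets, querying the model, and comparing the hard predictions against the dominant class of each soft key label. A minimal sketch, assuming the regenerated inputs are already in the model's expected tensor layout:

```python
import numpy as np
import torch

def verify(model, X_d, Y_d, tau=0.25):
    """Black-box verification sketch.

    X_d: regenerated Mixer samples (n_d, ...) in the model's input layout
    Y_d: corresponding soft key labels, shape (n_d, num_classes)
    tau: decision threshold on the recovery ratio rho
    """
    model.eval()
    with torch.no_grad():
        preds = model(torch.as_tensor(X_d, dtype=torch.float32)).argmax(dim=1).numpy()
    targets = np.asarray(Y_d).argmax(axis=1)   # dominant class of each soft key label
    rho = float(np.mean(preds == targets))     # proportion of correctly predicted labels
    return rho >= tau, rho
```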

3.3. Security Analysis

In this section, we analyze and derive an approximation of the computational work required by an attacker to break the RoSe-Mix watermarking scheme.
The main goal of the attacker is to find or approximate all secrets to build the ownership proof. These secrets are divided into two parts, which correspond to the work of each original method:
a.
The bitstream (in Equation (1)) that is used to generate the seed in the RoSe method. (We can approximate the work required for RoSe-Mix by adopting the RoSe analysis, where, instead of guessing the correct labels, the attacker must guess the correct bitstream and thereby the seed. This assumes that the function $O_H$ is known.)
b.
$\lambda$, $\mu$, and $x_{\text{overlay}}$, which are the parameters used in the Mixer method.
In the rest of this section, we define the work required for RoSe and Mixer separately and merge the effort to give a lower bound of the resources needed for an attacker to break the watermarking scheme.

3.3.1. RoSe

In the RoSe method with the highest level of security, the attacker's objective is, given a fixed generated key, $sk$, to generate adversarial examples (by modifying the least significant bits of the images) that yield a bitstream for which $\rho_{n_a} \geq \tau$.
The computational effort required for this attack involves two primary components:
a.
Generating black-box adversarial examples.
b.
Regenerating hash-based triggers for verification.
The approximate lower bound of the computational effort for a usurper to claim the model is estimated as:
$$W_{\mathrm{RoSe}} = R\, 2^t\, \omega_F \log_2(c) + R\,(2^R - 1)\, \omega_H + \omega_F \log_2(c), \tag{7}$$
where $t$ is the number of iterations used to generate an adversarial sample, $c$ is the number of classes, $\omega_F$ is the cost of a forward pass, $\omega_H$ is the cost of a hash computation, and the term $R$ represents the rarity, which is a function of the probability $P(\rho_{n_a} \geq \tau)$. The details of this estimation can be found in [23].

3.3.2. Mixer

In contrast, the Mixer method requires the attacker to predict the correct mixing vectors, $\lambda$ and $\mu$, which are drawn from a Dirichlet distribution; before this, the attacker must also guess the $k$ classes to mix from the total set of $c$ possible classes.
The probability of successfully guessing the correct combination of $k$ classes among the $c$ possible classes is $1/\binom{c}{k}$, and the probability of predicting the mixing coefficients in $\lambda$ and $\mu$ within a tolerance $\delta$ is determined by integrating the Dirichlet density function over the $\delta$-neighborhood around the actual values of $\lambda$ and $\mu$. This probability can be approximated by dividing the space of possible values for $\lambda$ and $\mu$ into $N^k$ discrete steps. Therefore, the probability of correctly predicting $\lambda$ within this range is:
$$P_{\text{values}} = \int_{\hat{\lambda}\,:\,|\lambda_i - \hat{\lambda}_i| \leq \delta} f(\hat{\lambda}; \alpha)\, d\hat{\lambda}, \tag{8}$$
where $\lambda$ is the true vector, $\hat{\lambda}$ the guessed vector, and $f(\hat{\lambda}; \alpha)$ is the Dirichlet density function. The same formula can be applied to $\mu$. The overall probability of success for the attacker is the product of the probability of correctly guessing the combination of classes and the probability of predicting the mixing coefficients:
$$P_{\text{overall}} = \frac{1}{\binom{c}{k}} \times P_{\text{values}}. \tag{9}$$
Now we transform these probabilities into work. The first step is to guess the positions of the $k$ non-zero elements among the $c$ possible positions. The number of possible combinations is given by the binomial coefficient $\binom{c}{k}$.
Once the positions are guessed, the attacker needs to guess the values of the $k$ non-zero elements within the specified tolerance $\delta$. The Dirichlet distribution generates continuous values, so an exact match is nearly impossible, but within a tolerance $\delta$ the number of trials can be estimated as follows. If we assume that $\lambda$ and $\mu$ are discretized into $N$ steps, where $\delta = 1/N$ represents the tolerance within which the attacker must guess the correct coefficients, the corresponding work is $N^k$. Finally, the total work to find $\lambda$ for the Mixer method is given by:
$$W_\lambda = \binom{c}{k} \times N^k, \tag{10}$$
where $\binom{c}{k}$ is the binomial coefficient that represents the number of ways to choose $k$ classes from $\mathcal{C}$, and $N^k$ represents the number of possible combinations for the coefficients in $\lambda$ within the tolerance $\delta$. The same work can be calculated for $\mu$:
$$W_\mu = \binom{c}{k} \times N^k. \tag{11}$$

3.3.3. RoSe-Mix

In the RoSe-Mix method, the attacker faces a combined challenge: they must not only regenerate the bitstream as in RoSe, but also correctly predict the mixing coefficients $\lambda$ and $\mu$ as in Mixer. This significantly increases the complexity of the attack. To break RoSe-Mix, the attacker must (1) generate a trigger set that yields the correct bitstream, (2) guess the correct combination of $k$ classes, and (3) predict the coefficients of $\lambda$ and $\mu$ within a tolerance $\delta$.
Given this, the total work required for the RoSe-Mix method is the product of the work required for RoSe (Equation (7)) and the work required to guess the correct mixing coefficients and the number of classes (Equations (10) and (11)). Therefore, the total work for the attacker is:
$$W_{\text{RoSe-Mix}} = \underbrace{2 \times \binom{c}{k} \times N^k}_{W_\lambda + W_\mu} \times W_{\mathrm{RoSe}}. \tag{12}$$
This work is significantly larger than either RoSe or Mixer alone because the attacker must now handle both the complexity of generating adversarial examples for the hash-based triggers and guessing the correct mixing coefficients from a Dirichlet distribution. The number of possible combinations for λ and μ grows exponentially with k and N, while the work to generate adversarial examples grows with t, the number of iterations needed to generate an adversarial example, and R, the rarity.
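To get a sense of how quickly this bound grows, Equation (12) can be evaluated for concrete parameter values. The sketch below treats $W_{\mathrm{RoSe}}$ as a given constant, since its terms (rarity $R$, iteration count $t$, forward-pass and hash costs) come from the RoSe analysis in [23]; the example values of $c$, $k$, and $N$ are purely illustrative.

```python
from math import comb, log2

def attacker_work(c, k, N, W_rose):
    """Lower bound on the usurper's work for RoSe-Mix (Equation (12))."""
    W_lambda = comb(c, k) * N ** k   # choose k classes, then hit each coefficient within delta = 1/N
    W_mu = comb(c, k) * N ** k       # same effort for the label coefficients
    return (W_lambda + W_mu) * W_rose

# Illustrative values: 10 classes, k = 2 mixed classes, tolerance delta = 1/100,
# with W_RoSe normalised to 1 to isolate the Mixer-related factor.
print(log2(attacker_work(c=10, k=2, N=100, W_rose=1)))   # about 19.8 bits on top of W_RoSe
```

Even with $W_{\mathrm{RoSe}}$ factored out, guessing $\lambda$ and $\mu$ for a 10-class problem with $k = 2$ and $\delta = 1/100$ already costs on the order of $2^{20}$ trials, and this factor grows exponentially with $k$.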

4. Experimental Results

4.1. Settings

4.1.1. Dataset and Models

To assess our proposed method's performance and general applicability, we conducted experiments on four datasets with several models. Respectively, we use LeNet-5 [31] with MNIST [32] and FashionMNIST [33], VGG19 [34] with CIFAR-10 [29], and ResNet50 [35] with GTSRB [36]. We also used a ResNet50 pre-trained on ImageNet [37] with CIFAR-10 for the transfer-learning experiments; this last model is used to test our method when the watermark is embedded via fine-tuning. Models for MNIST and FashionMNIST are trained over 100 epochs; CIFAR-10, GTSRB, and transfer-learning models are trained over 200 epochs. A batch size of 64 is used for all models. Datasets are split into 80% training, 10% validation, and 10% fine-tuning.

4.1.2. Metrics

The performance of the proposed approach is evaluated using four metrics: TA, $Rec_{Tr}$, $Rec_{Ts}$, and USR. TA, the test accuracy, represents the model's accuracy on the original task. $Rec_{Tr}$ is the recovery rate of the watermark samples $S_e$ used for the embedding, while $Rec_{Ts}$ measures the recovery rate on the unseen set $S_d$. Finally, the usurper success rate (USR) reflects the correct watermark recovery rate for samples generated by a usurper, based on 500 random fake keys. Relying on [38], we also define the threshold $\tau = 25\%$, which gives us a reliable ownership demonstration with confidence greater than $1 - 2^{-64}$ for $k = 500$.
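The threshold can be sanity-checked with a simple false-positive computation in the spirit of [38]: if a non-watermarked model answers each verification query correctly only by chance, with probability $1/c$, the probability of reaching $\rho \geq \tau$ is a binomial tail. The sketch below makes this uniform-guessing assumption and interprets the 500 in the bound as the number of verification queries; both are simplifications, not the exact derivation of [38].

```python
from math import ceil, comb

def false_claim_probability(n_d, c, tau):
    """P(rho >= tau) for a non-watermarked model answering each of the
    n_d verification queries correctly by chance (probability 1/c)."""
    p = 1.0 / c
    m_min = ceil(tau * n_d)   # minimum number of correct answers needed to reach tau
    return sum(comb(n_d, m) * p**m * (1 - p)**(n_d - m) for m in range(m_min, n_d + 1))

# With 500 verification queries, c = 10 classes, and tau = 0.25:
print(false_claim_probability(500, 10, 0.25))   # astronomically small, well below 2**-64
```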

4.2. Modulation Parameters

In Section 3.2, we described our proposed method and defined several parameters; in this subsection, we perform experiments to determine optimal values for $k$ (the number of classes used to generate Mixup watermark samples) and $n_e$ (the size of the trigger dataset). We evaluated the impact of $k$ and $n_e$ on the performance of the model for each dataset. To find the best $k$, we tested values in the range $k \in \{2, \ldots, c\}$, with $n_e$ fixed at 500, selecting the $k$ that produced the highest TA and $Rec_{Tr}$. After identifying the optimal $k$, we plotted the curves over $n_e$ to further assess its influence on performance, ultimately selecting the value that resulted in the best TA and $Rec_{Tr}$ for each dataset.
The results are shown in Figure 4 for $k$ and Figure 5 for $n_e$. For the $k$ plots, we omitted some datasets with similar behavior to conserve space. The watermark recovery rates, $Rec_{Tr}$ and $Rec_{Ts}$, are measured on the watermark sets of sizes $n_e$ and $n_d$, respectively. For $Rec_{Tr}$, we performed experiments with $n_e \in \{100, 300, 500, 700, 1000, 3000, 5000, 7000, 10{,}000\}$. The TA of each model without a watermark can be found in Table 1. To determine the optimal $n_e$, we plotted the accuracy of the model as a function of $n_e$ and selected the value that produced the highest TA and $Rec_{Tr}$. The best $k$ was 2 for MNIST, FashionMNIST, and CIFAR-10, and 7 for GTSRB. Additionally, starting from $n_e = 500$, both TA and $Rec_{Tr}$ improved significantly and stabilized across all datasets. However, increasing $n_e$ much further significantly decreases TA, probably because the watermark task takes on more importance relative to the main task during learning, leading to task interference [39], in which we observe a drop in main-task performance for complex problems (e.g., Figure 5c). As a result, we used $n_e = 500$ for the remaining experiments.

4.3. Fidelity

In this experiment, we evaluate the fidelity of the model, i.e., whether the watermark embedding process degrades the test accuracy (TA). In Table 1, the first two columns show that embedding the watermark has minimal impact on the performance of the model for the main task, as the difference in TA between the host and watermarked models is negligible. Both trigger sets $S_e$ and $S_d$ yield recovery rates ($Rec_{Tr}$ and $Rec_{Ts}$) of at least 88% for the watermarked model. In contrast, when the host model is tested with those trigger sets, the recovery rates remain very low, indicating the watermark's integrity.
An additional observation is that the TA of the watermarked model sometimes surpasses that of the host model, which can be attributed to the data augmentation effect of the Mixup samples. The high recovery rates for $Rec_{Tr}$ and $Rec_{Ts}$ also suggest that the model learns a secret manifold rather than a collection of isolated points. This eliminates the need for the verifier to reuse the original trigger samples that served for watermark embedding, making the watermarking method even more efficient.

4.4. Robustness

The watermark must also be robust against various attacks, such as fine-tuning, quantization, and pruning. Since RoSe-Mix operates in a black-box setting, its performance is evaluated using the standard recovery metrics: test accuracy (TA), training-set watermark recovery rate ($Rec_{Tr}$), and test-set watermark recovery rate ($Rec_{Ts}$). The recovery metric is commonly used in black-box watermarking methods, but we extend it to both training and testing trigger sets. Together, the recovery and test accuracy metrics reflect watermark retrieval success based on the model's behavior (its outputs to triggers) rather than its internal parameters. Unlike white-box watermarking, where the bit error rate (BER) or weight perturbation analysis might be applicable, black-box watermarking relies on behavioral patterns observable from the model's predictions. This ensures that the verification process remains practical and does not require access to the model's internal structure. Additionally, we incorporate dynamic quantization, full unsigned 8-bit quantization, and full signed 8-bit quantization into our robustness evaluation to ensure a comprehensive assessment of RoSe-Mix against commonly used model compression techniques. We can also highlight that each watermarked model has a better TA than its version without a watermark. This effect is often observed in backdoor injection [15], which acts as learning with noisy-labeled samples [40].
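As an illustration of how such removal attacks are applied before re-measuring the recovery rate, the sketch below runs dynamic quantization and L1 magnitude pruning on a trained PyTorch model; the pruned layer types, the pruning amount, and the restriction of dynamic quantization to linear layers are illustrative choices rather than the exact experimental configuration.

```python
import copy
import torch
import torch.nn.utils.prune as prune

def recovery_rate(model, X_trig, Y_trig):
    """Rec: fraction of trigger samples whose dominant key class is predicted.

    X_trig: trigger inputs as a float tensor in the model's layout
    Y_trig: soft key labels as a tensor of shape (n, num_classes)
    """
    model.eval()
    with torch.no_grad():
        preds = model(X_trig).argmax(dim=1)
    return (preds == Y_trig.argmax(dim=1)).float().mean().item()

def evaluate_robustness(model, X_trig, Y_trig, prune_amount=0.4):
    results = {"clean": recovery_rate(model, X_trig, Y_trig)}

    # Dynamic quantization of the linear layers (post-training compression)
    q_model = torch.quantization.quantize_dynamic(
        copy.deepcopy(model), {torch.nn.Linear}, dtype=torch.qint8)
    results["dyn_quant"] = recovery_rate(q_model, X_trig, Y_trig)

    # L1 magnitude pruning of all conv/linear weights
    p_model = copy.deepcopy(model)
    for module in p_model.modules():
        if isinstance(module, (torch.nn.Conv2d, torch.nn.Linear)):
            prune.l1_unstructured(module, name="weight", amount=prune_amount)
    results["pruned"] = recovery_rate(p_model, X_trig, Y_trig)
    return results
```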
We begin our analysis with Table 1, which shows that RoSe-Mix demonstrates high robustness to fine-tuning and quantization attacks, as test accuracy (TA) and watermark recovery rates ($Rec_{Tr}$ and $Rec_{Ts}$) remain almost unaffected across most datasets. For instance, in MNIST, FashionMNIST, and CIFAR-10, $Rec_{Ts}$ remains larger than 88% across all tested attack scenarios, highlighting that RoSe-Mix preserves the watermark even under significant post-processing. The slight increase in $Rec_{Tr}$ for FashionMNIST under quantization (by 0.8%) suggests that the Mixup augmentation process improves the generalization of the model rather than degrading its fidelity or performance on the primary task.
The impact of JPEG compression is more significant. JPEG compression at a quality factor of 55 leads to a performance drop of 2.58%, 6.21%, and 3.65% in TA for CIFAR-10, transfer learning, and GTSRB, respectively. Additionally, GTSRB experiences the largest $Rec_{Tr}$ and $Rec_{Ts}$ reductions (7.2% and 8.6%), indicating that lower-resolution distortions can interfere with the embedded watermark. However, even in this extreme scenario, the watermark recovery rate remains sufficiently high ($> \tau$), ensuring that ownership verification is still feasible.
The robustness of RoSe-Mix can be attributed to two primary factors:
a.
Mixup-based embedding: Unlike traditional backdoor-based watermarking schemes, RoSe-Mix does not rely on a fixed set of trigger inputs. Instead, Mixup continuously generates new trigger samples, preventing an adversary from isolating and removing the watermark through fine-tuning or pruning. This also explains why $Rec_{Ts}$ remains significantly higher than TA under aggressive pruning: the watermark is encoded as a distributed manifold rather than discrete patterns.
b.
Cryptographic hashing mechanism: The use of one-way hash functions ensures that, even if an attacker attempts to regenerate trigger samples, they cannot feasibly recover the correct key-label mappings without access to the private key. This is evident from the minimal USR values in Table 2, showing that unauthorized attempts to extract the watermark result in high failure rates.
Pruning evaluations (Figure 6) further validate RoSe-Mix's robustness. While MNIST and FashionMNIST experience 30% and 90% TA reductions at a pruning rate of 0.4, their respective watermark recovery rates remain higher than 80%. In contrast, CIFAR-10 suffers greater TA drops (to 55% and 40%) at pruning rates of 0.1, yet $Rec_{Ts}$ remains higher than or comparable to TA. This suggests that while pruning reduces the model's expressiveness, it does not completely remove the watermark. The interplay between Mixup-generated triggers and deep feature representations ensures that the watermark remains embedded even when substantial portions of the network are removed.
Finally, any drop in R e c T r and R e c T s closely follows a drop in the accuracy of the main task of the model, ensuring that an attacker cannot remove the watermark without significantly deteriorating the usability of the model. This property makes RoSe-Mix highly resistant to watermark removal attacks, since the attacker must sacrifice the core performance of the model to erase its ownership markers.

4.5. Security

Let us now consider a scenario where a usurper attempts to steal the watermarked model by forging a new trigger set, assuming they know the algorithm. To do this, the usurper must first generate the correct values for λ and μ , and then craft adversarial examples and fine-tune the model on them. For the sake of analysis, we make the unrealistic assumption that the usurper knows several factors: the statistical distribution used to generate λ and μ , as well as the watermark generation algorithm—though without knowing the exact values of the parameters. This allows us to evaluate the usurper’s performance in the extreme worst-case scenario. We then measure the usurper success rate (USR), which is the watermark recovery rate of the usurper over 1000 random fake keys, with the results presented in Table 2.
Compared to Mixer [25], the usurper success rate (USR) in RoSe-Mix is approximately 2.45 times lower. This is due to the increased difficulty in guessing the multiple secrets involved in the proposed method, as reflected by the higher amount of work required in RoSe-Mix, expressed in Equation (12), which exceeds the effort needed for Mixer or RoSe alone. Even when the usurper is unrealistically granted several secret factors, the watermark recovery rate remains low, making it challenging to influence the verifier’s decision regarding model ownership.

5. Conclusions

In this paper, we introduced RoSe-Mix, a black-box DNN watermarking technique that effectively combines the security of cryptographic hashing from RoSe with the robustness of image Mixup from Mixer. The use of cryptographic hashing ensures the security of the watermark, while the Mixup-based trigger generation enhances robustness against various attacks, such as pruning, quantization, and adversarial modifications. We demonstrated that the increased computational complexity for attackers attempting to break the watermarking scheme makes RoSe-Mix more secure than previous methods. Additionally, the USR is reduced by approximately 2.5× compared to Mixer. Our experimental results, conducted on architectures such as ResNet and VGG using datasets like CIFAR-10 and ImageNet, show the effectiveness of RoSe-Mix, highlighting its potential as a practical solution for protecting intellectual property in deep-learning models. We have demonstrated that our method can achieve up to 100% Rec with a degradation in TA of around 10% under the most severe attack. Future work could explore the application of RoSe-Mix to other domains, such as natural language processing or time series analysis. Additionally, investigating the performance of RoSe-Mix under more advanced attack scenarios, like model extraction or evasion attacks, could further validate its security properties.

Author Contributions

Conceptualization, T.E.H. and K.K. (Kassem Kallas); methodology, T.E.H. and K.K. (Kassem Kallas); software, T.E.H., M.L. and K.K. (Kassem Kallas); validation, T.E.H., M.L. and K.K. (Kassem Kallas); formal analysis, T.E.H. and K.K. (Kassem Kallas); investigation, T.E.H., M.L. and K.K. (Kassem Kallas); resources, T.E.H., M.L. and K.K. (Kassem Kallas); data curation, T.E.H. and K.K. (Kassem Kallas); writing—original draft preparation, T.E.H., M.L., R.B. and K.K. (Kassem Kallas); writing—review and editing, T.E.H., M.L., R.B., G.C. and K.K. (Kassem Kallas); visualization, T.E.H., M.L. and K.K. (Kassem Kallas); supervision, G.C., R.B., K.K. (Katarzyna Kapusta) and K.K. (Kassem Kallas); project administration, G.C., K.K. (Katarzyna Kapusta) and K.K. (Kassem Kallas); funding acquisition, G.C., K.K. (Katarzyna Kapusta) and K.K. (Kassem Kallas). All authors have read and agreed to the published version of the manuscript.

Funding

This work is funded by the European Union under Grant Agreement 101070222. Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the European Commission (granting authority). Neither the European Union nor the granting authority can be held responsible for them. This work is also supported by the CYBAILE industrial chair, which is led by Inserm with the support of the Brittany Region Council. And the French government grants managed by the Agence Nationale de la Recherche under the France 2030 program under the reference ANR-22PESN-0006 (PEPR digital health TracIA project).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The data used in this article are publicly available.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Azzouzi, M.E.; Coatrieux, G.; Bellafqira, R.; Delamarre, D.; Riou, C.; Oubenali, N.; Cabon, S.; Cuggia, M.; Bouzillé, G. Automatic de-identification of French electronic health records: A cost-effective approach exploiting distant supervision and deep learning models. BMC Med Informat. Decis. Mak. 2024, 24, 54. [Google Scholar]
  2. Mohammadi Foumani, N.; Miller, L.; Tan, C.W.; Webb, G.I.; Forestier, G.; Salehi, M. Deep learning for time series classification and extrinsic regression: A current survey. ACM Comput. Surv. 2024, 56, 1–45. [Google Scholar]
  3. Nafea, A.A.; Alameri, S.A.; Majeed, R.R.; Khalaf, M.A.; AL-Ani, M.M. A Short Review on Supervised Machine Learning and Deep Learning Techniques in Computer Vision. Babylon. J. Mach. Learn. 2024, 2024, 48–55. [Google Scholar]
  4. Buchholz, K. The Extreme Cost of Training AI Models. 2024. Available online: https://www.statista.com/chart/33114/estimated-cost-of-training-selected-ai-models/ (accessed on 3 March 2025).
  5. Uchida, Y.; Nagai, Y.; Sakazawa, S.; Satoh, S. Embedding watermarks into deep neural networks. In Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval, Bucharest, Romania, 6–9 June 2017; pp. 269–277. [Google Scholar]
  6. Sun, Y.; Liu, L.; Yu, N.; Liu, Y.; Tian, Q.; Guo, D. Deep Watermarking for Deep Intellectual Property Protection: A Comprehensive Survey. Available online: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4697020 (accessed on 25 March 2025). [CrossRef]
  7. Lansari, M.; Bellafqira, R.; Kapusta, K.; Thouvenot, V.; Bettan, O.; Coatrieux, G. When federated learning meets watermarking: A comprehensive overview of techniques for intellectual property protection. Mach. Learn. Knowl. Extr. 2023, 5, 1382–1406. [Google Scholar] [CrossRef]
  8. Darvish Rouhani, B.; Chen, H.; Koushanfar, F. Deepsigns: An end-to-end watermarking framework for ownership protection of deep neural networks. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, Providence, RI, USA, 13–17 April 2019; pp. 485–497. [Google Scholar]
  9. Fan, L.; Ng, K.W.; Chan, C.S. Rethinking deep neural network ownership verification: Embedding passports to defeat ambiguity attacks. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019. [Google Scholar]
  10. Wang, T.; Kerschbaum, F. Riga: Covert and robust white-box watermarking of deep neural networks. In Proceedings of the Web Conference 2021, Ljubljana, Slovenia, 19–23 April 2021; pp. 993–1004. [Google Scholar]
  11. Bellafqira, R.; Coatrieux, G. Diction: Dynamic robust white box watermarking scheme. arXiv 2022, arXiv:2210.15745. [Google Scholar]
  12. Lv, P.; Li, P.; Zhang, S.; Chen, K.; Liang, R.; Ma, H.; Zhao, Y.; Li, Y. A robustness-assured white-box watermark in neural networks. IEEE Trans. Dependable Secur. Comput. 2023, 20, 5214–5229. [Google Scholar]
  13. Chen, H.; Liu, C.; Zhu, T.; Zhou, W. When deep learning meets watermarking: A survey of application, attacks and defenses. Comput. Stand. Interfaces 2024, 89, 103830. [Google Scholar]
  14. Lansari, M.; Bellafqira, R.; Kapusta, K.; Kallas, K.; Thouvenot, V.; Bettan, O.; Coatrieux, G. FedCrypt: A Dynamic White-Box Watermarking Scheme for Homomorphic Federated Learning. TechRxiv 2024. [Google Scholar] [CrossRef]
  15. Adi, Y.; Baum, C.; Cisse, M.; Pinkas, B.; Keshet, J. Turning your weakness into a strength: Watermarking deep neural networks by backdooring. In Proceedings of the 27th USENIX Security Symposium (USENIX Security 18), Baltimore, MD, USA, 15–17 August 2018; pp. 1615–1631. [Google Scholar]
  16. Guo, J.; Potkonjak, M. Watermarking deep neural networks for embedded systems. In Proceedings of the 2018 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), San Diego, CA, USA, 5–8 November 2018; pp. 1–8. [Google Scholar]
  17. Zhang, J.; Gu, Z.; Jang, J.; Wu, H.; Stoecklin, M.P.; Huang, H.; Molloy, I. Protecting intellectual property of deep neural networks with watermarking. In Proceedings of the 2018 on Asia Conference on Computer and Communications Security, Incheon, Republic of Korea, 4–8 June 2018; pp. 159–172. [Google Scholar]
  18. Le Merrer, E.; Perez, P.; Trédan, G. Adversarial frontier stitching for remote neural network watermarking. Neural Comput. Appl. 2020, 32, 9233–9244. [Google Scholar] [CrossRef]
  19. Yadollahi, M.M.; Shoeleh, F.; Dadkhah, S.; Ghorbani, A.A. Robust black-box watermarking for deep neural network using inverse document frequency. In Proceedings of the 2021 IEEE Intl Conf on Dependable, Autonomic and Secure Computing, Intl Conf on Pervasive Intelligence and Computing, Intl Conf on Cloud and Big Data Computing, Intl Conf on Cyber Science and Technology Congress (DASC/PiCom/CBDCom/CyberSciTech), Virtual, 25–28 October 2021; pp. 574–581. [Google Scholar]
  20. Wang, Y.; Wu, H. Protecting the intellectual property of speaker recognition model by black-box watermarking in the frequency domain. Symmetry 2022, 14, 619. [Google Scholar] [CrossRef]
  21. Gloaguen, T.; Jovanović, N.; Staab, R.; Vechev, M. Black-box detection of language model watermarks. arXiv 2024, arXiv:2405.20777. [Google Scholar]
  22. Leroux, S.; Vanassche, S.; Simoens, P. Multi-bit Black-box Watermarking of Deep Neural Networks in Embedded Applications. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 2121–2130. [Google Scholar]
  23. Kallas, K.; Furon, T. Rose: A robust and secure dnn watermarking. In Proceedings of the 2022 IEEE International Workshop on Information Forensics and Security (WIFS), Online, 12–16 December 2022; pp. 1–6. [Google Scholar]
  24. Zhang, H.; Cissé, M.; Dauphin, Y.N.; Lopez-Paz, D. mixup: Beyond empirical risk minimization. arXiv 2017, arXiv:1710.09412. [Google Scholar]
  25. Kallas, K.; Furon, T. Mixer: Dnn watermarking using image mixup. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar]
  26. Boenisch, F. A systematic review on model watermarking for neural networks. Front. Big Data 2021, 4, 729663. [Google Scholar]
  27. Oh, G.; Kim, S.; Cho, W.; Lee, S.; Chung, J.; Song, D.; Yu, Y. SEAL: Entangled White-box Watermarks on Low-Rank Adaptation. arXiv 2025, arXiv:2501.09284. [Google Scholar]
  28. Downer, J.; Wang, R.; Wang, B. Watermarking Graph Neural Networks via Explanations for Ownership Protection. arXiv 2025, arXiv:2501.05614. [Google Scholar]
  29. Krizhevsky, A.; Hinton, G. Learning Multiple Layers of Features from Tiny Images. 2009. Available online: https://www.cs.utoronto.ca/~kriz/learning-features-2009-TR.pdf (accessed on 25 March 2025).
  30. Liang, J.; Wang, R. Fedcip: Federated client intellectual property protection with traitor tracking. arXiv 2023, arXiv:2306.01356. [Google Scholar]
  31. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar] [CrossRef]
  32. LeCun, Y.; Cortes, C. MNIST Handwritten Digit Database. 2010. Available online: https://www.semanticscholar.org/paper/The-mnist-database-of-handwritten-digits-LeCun-Cortes/dc52d1ede1b90bf9d296bc5b34c9310b7eaa99a2 (accessed on 25 March 2025).
  33. Xiao, H.; Rasul, K.; Vollgraf, R. Fashion-mnist: A novel image dataset for benchmarking machine learning algorithms. arXiv 2017, arXiv:1708.07747. [Google Scholar]
  34. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  35. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  36. Stallkamp, J.; Schlipsing, M.; Salmen, J.; Igel, C. The German Traffic Sign Recognition Benchmark: A multi-class classification competition. In Proceedings of the IEEE International Joint Conference on Neural Networks, San Jose, CA, USA, 31 July–5 August 2011; pp. 1453–1460. [Google Scholar]
  37. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar]
  38. Szyller, S.; Atli, B.G.; Marchal, S.; Asokan, N. Dawn: Dynamic adversarial watermarking of neural networks. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual, 20–24 October 2021; pp. 4417–4425. [Google Scholar]
  39. Pascal, L.; Michiardi, P.; Bost, X.; Huet, B.; Zuluaga, M.A. Maximum roaming multi-task learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; Volume 35, pp. 9331–9341. [Google Scholar]
  40. Natarajan, N.; Dhillon, I.S.; Ravikumar, P.K.; Tewari, A. Learning with noisy labels. In Proceedings of the 27th International Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA, 5–10 December 2013; Volume 26. [Google Scholar]
Figure 1. RoSe method applied to the MNIST dataset for trigger set generation. The four images at the top of the figure are the inputs sampled from the training set with their native labels, $y_i$. The images are concatenated and given to a hash function, $H$, parametrized by a secret key, $sk$. The output is a pseudo-random sequence that generates the new labels, $\tilde{y}_i$, forming the key images and labels.
Figure 2. Mixer method applied to the MNIST dataset for trigger generation with $k = 2$. From a Dirichlet distribution parameterized by $\alpha$, two vectors are generated, $\lambda$ and $\mu$, which are used to mix up, respectively, $(x_i, x_j)$ and $(y_i, y_j)$ to generate $(\tilde{x}, \tilde{y})$.
Figure 3. Examples of trigger set samples created using RoSe-Mix. All samples are built from the CIFAR-10 dataset [29], with $k = 2$. The small squares represent the overlay, $x_{\text{overlay}}$.
Figure 4. TA, $Rec_{Tr}$, and $Rec_{Ts}$ computed for different values of $k$ for FashionMNIST and MNIST.
Figure 5. TA, $Rec_{Tr}$, and $Rec_{Ts}$ computed for different values of $n_e$.
Figure 6. Comparison of the pruning attack on MNIST, FashionMNIST, and CIFAR-10.
Table 1. RoSe-Mix performance metrics: TA, $Rec_{Tr}$, and $Rec_{Ts}$ under various attacks.
Metric | Host DNN | Watermarked DNN | Fine-Tune | Dyn. Quant. | Full Uint8 Quant. | Full Int8 Quant. | Float16 Quant. | JPEG55
MNIST
TA | 98.64 | 99.11 | 99.11 | 99.14 | 99.14 | 99.14 | 99.11 | 99.09
$Rec_{Tr}$ | 13.54 | 100 | 100 | 100 | 100 | 100 | 100 | 100
$Rec_{Ts}$ | 12.50 | 100 | 100 | 100 | 100 | 100 | 100 | 100
FashionMNIST
TA | 88.44 | 88.63 | 88.63 | 88.62 | 88.62 | 88.62 | 88.63 | 87.73
$Rec_{Tr}$ | 8.33 | 89.4 | 89.4 | 90.2 | 90.2 | 90.2 | 89.4 | 88.6
$Rec_{Ts}$ | 7.8 | 88 | 88 | 88 | 88 | 88 | 88 | 88.4
CIFAR-10
TA | 70.52 | 78 | 78 | 78.04 | 78.04 | 78.04 | 77.99 | 75.42
$Rec_{Tr}$ | 10.42 | 100 | 100 | 100 | 100 | 100 | 100 | 100
$Rec_{Ts}$ | 7.29 | 96.8 | 96.8 | 96.8 | 96.8 | 96.8 | 96.8 | 96.4
Transfer Learning
TA | 86.54 | 86.57 | 71.8 | 86.63 | 86.63 | 86.63 | 86.56 | 80.36
$Rec_{Tr}$ | 11.46 | 100 | 67 | 100 | 100 | 100 | 100 | 100
$Rec_{Ts}$ | 9.38 | 100 | 70.2 | 100 | 100 | 100 | 100 | 100
GTSRB
TA | 78.89 | 90.1 | 80.84 | 90.08 | 90.08 | 90.08 | 90.1 | 86.45
$Rec_{Tr}$ | 2.2 | 100 | 89.4 | 100 | 100 | 100 | 100 | 92.8
$Rec_{Ts}$ | 1.4 | 88 | 94.8 | 88.6 | 88.6 | 88.6 | 88.2 | 79.4
Table 2. USR of RoSe-Mix compared to Mixer.
Dataset | Mixer USR (%) | RoSe-Mix USR (%)
MNIST | 51.1 | 20.78
FashionMNIST | 39.3 | 15.32
CIFAR-10 | 38.4 | 15.0
Transfer Learning | 39.5 | 14.7
GTSRB | 12.2 | 4.69
