Article

Unleashing the Potential of Pre-Trained Diffusion Models for Generalizable Person Re-Identification

by Jiachen Li and Xiaojin Gong *
College of Information Science and Electronic Engineering, Zhejiang University, Hangzhou 310027, China
* Author to whom correspondence should be addressed.
Sensors 2025, 25(2), 552; https://doi.org/10.3390/s25020552
Submission received: 29 December 2024 / Revised: 17 January 2025 / Accepted: 17 January 2025 / Published: 18 January 2025

Abstract: Domain-generalizable re-identification (DG Re-ID) aims to train a model on one or more source domains and evaluate its performance on unseen target domains, a task that has attracted growing attention due to its practical relevance. While numerous methods have been proposed, most rely on discriminative or contrastive learning frameworks to learn generalizable feature representations. However, these approaches often fail to mitigate shortcut learning, leading to suboptimal performance. In this work, we propose a novel method called diffusion model-assisted representation learning with a correlation-aware conditioning scheme (DCAC) to enhance DG Re-ID. Our method integrates a discriminative and contrastive Re-ID model with a pre-trained diffusion model through a correlation-aware conditioning scheme. By incorporating ID classification probabilities generated from the Re-ID model with a set of learnable ID-wise prompts, the conditioning scheme injects dark knowledge that captures ID correlations to guide the diffusion process. Simultaneously, feedback from the diffusion model is back-propagated through the conditioning scheme to the Re-ID model, effectively improving the generalization capability of Re-ID features. Extensive experiments on both single-source and multi-source DG Re-ID tasks demonstrate that our method achieves state-of-the-art performance. Comprehensive ablation studies further validate the effectiveness of the proposed approach, providing insights into its robustness.

1. Introduction

Person re-identification (Re-ID) aims to match a query person’s image across different cameras based on the similarity of feature representations. Although supervised Re-ID methods based on convolutional neural networks (CNNs) [1,2,3,4] and vision transformers [5] have made significant advancements, their performance degrades dramatically when applied to out-of-distribution (OOD) data that are dissimilar to the training scenes. To address this problem, domain-generalizable (DG) Re-ID has garnered increasing interest in recent years. In DG Re-ID, a model is trained on one or multiple source domains and then tested on completely different and unseen domains.
Numerous DG Re-ID methods have been developed so far. Existing studies concentrate on domain-invariant and domain-specific feature disentanglement [6,7,8], normalization and domain alignment [9,10,11,12,13,14,15], or employ meta-learning [10,16,17,18] and other techniques like semantic expansion [19,20] and sample generation [21] to enhance the generalization capability of Re-ID models. Although various techniques have been designed, almost all methods learn feature representations within discriminative or contrastive learning frameworks, which are considered unable to prevent shortcut learning [22], leading to suboptimal performance.
Recently, diffusion models, such as Imagen [23] and Stable Diffusion [24], have demonstrated remarkable capabilities in image synthesis and other generative tasks. Moreover, their potential for representation learning has been increasingly recognized in recent studies [25,26,27]. On the one hand, pre-training on extensive multi-modal data equips diffusion models with rich semantic information, enabling exceptional generalization and robustness in out-of-distribution scenarios [28]. On the other hand, the denoising process inherently promotes the learning of meaningful semantic representations [29]. Inspired by these observations, this work explores leveraging a diffusion model to enhance the generalization ability of representations initially learned within discriminative and contrastive learning frameworks, thereby improving domain-generalizable person Re-ID.
To this end, we propose a generalizable Re-ID framework comprising a baseline discriminative and contrastive Re-ID model, a generative diffusion model, and a conditioning scheme that bridges the two models. Unlike existing diffusion-based representation learning techniques, which either jointly train a denoising decoder and a classifier with a shared encoder [26,30] or intertwine feature extraction and feature denoising [31], our approach opts to preserve the integrity of the pre-trained diffusion model, as shown in Figure 1a–d. This choice allows the semantic knowledge embedded in the diffusion model to be effectively transferred to the Re-ID baseline. Furthermore, unlike SODA [27] and DIVA [32], which adopt an instance-level conditioning mechanism, we introduce a new conditioning scheme that is ID-wise and explicitly aware of ID correlations.
More specifically, as recognized in knowledge distillation [33] and illustrated in Figure 1e, dark knowledge embedded in the logits of classifiers captures the relationships among IDs, reflecting nuanced similarities and differences that go beyond hard class labels. Therefore, we design a correlation-aware conditioning scheme that integrates classification probabilities with learnable ID-wise prompts as the guidance of the diffusion model. This conditioning scheme makes the diffusion model less sensitive to intra-ID variances and background interference compared to instance-level conditioning and more expressive and adaptable than one-hot ID labels, enabling it to better capture complex inter-ID relationships and improve generalizable Re-ID performance. Additionally, we employ LoRA [34] adapters to enable the parameter-efficient fine-tuning (PEFT) of the diffusion model alongside the full fine-tuning of the Re-ID model. This approach allows the diffusion model to adapt effectively and efficiently to Re-ID data while preserving the knowledge embedded in the pre-trained diffusion model, thereby mitigating the risk of catastrophic forgetting often associated with full model fine-tuning [35].
The main contributions of this work are summarized as follows:
  • We investigate the feasibility of leveraging a pre-trained diffusion model as an expert to enhance generalizable feature learning for DG Re-ID by collaboratively training a discriminative Re-ID model and efficiently fine-tuning a generative diffusion model;
  • We propose a simple yet effective correlation-aware conditioning scheme that combines the dark knowledge embedded in ID classification probabilities with learnable ID-wise prompts to guide the diffusion model, unleashing its generalization knowledge to the discriminative Re-ID model through gradient feedback;
  • Extensive experiments on both single-source and multi-source DG Re-ID tasks demonstrate the effectiveness of our approach, achieving state-of-the-art performance. Additionally, comprehensive ablation studies are conducted to analyze the proposed method in depth.

2. Related Work

2.1. Generative Diffusion Models

Diffusion models [36,37], which simulate a Markov chain to learn the transition from noise to a real data distribution, have shown remarkable performance in generation tasks. Representative diffusion models include Imagen [23], stable diffusion [24], and DiT [38]. Imagen predicts noise in a pixel space and generates high-resolution outputs using super-resolution modules. In contrast, stable diffusion and DiT denoise images in latent spaces, significantly reducing computation costs. Specifically, stable diffusion maps an image into the latent space via a pre-trained variational autoencoder (VAE) and predicts noise with a U-Net structure [39] containing cross-attention modules to fuse conditions. DiT further replaces the U-Net with visual transformers and improves the condition injection with the adaLN-zero strategy for scalable high-quality image generation. Considering computational efficiency, we choose to adopt stable diffusion in this work.

2.2. Diffusion Models for Representation Learning

Although diffusion models are primarily designed for generation tasks, their ability to learn semantic representations has also been recognized in recent years [29]. For example, Baranchuk et al. [25] and DDAE [40] leverage the intermediate activations of pre-trained diffusion models as features for segmentation and classification, respectively. HybViT [26] and JDM [30] jointly learn discriminative and generative tasks with a shared encoder to enhance feature representation. SODA [27] turns diffusion models into strong self-supervised representation learners by imposing a bottleneck between an encoder and a denoising decoder. DIVA [32] employs the feedback of a frozen pre-trained diffusion model to boost the fine-grained perception capability of CLIP [41] via a post-training approach. Additionally, diffusion models have been exploited as zero-shot classifiers [42,43] by estimating noise given class names as conditions, exhibiting strong generalization robustness in out-of-distribution scenarios [28]. Inspired by these studies, we explore the utilization of a pre-trained diffusion model to enhance representation learning for generalizable Re-ID tasks.

2.3. Diffusion Models for Person Re-ID

Diffusion models have also been applied to various person Re-ID tasks. For instance, VI-Diff [44] employs a diffusion model to enhance visible-infrared Re-ID by generating new samples across modalities, thereby reducing the annotation cost of paired images. Diverse person [45] proposes a diffusion-based framework to edit original dataset images with attribute texts, efficiently generating high-quality text-based person search datasets. PIDM [46] also focuses on new data generation, using body pose and image style as guidance. Asperti et al. [47] decouple the person ID from other factors like poses and backgrounds to control new image sample generation. These works share a common characteristic of modifying existing data or generating new data for Re-ID related tasks. Additionally, DenoiseReID [31] unifies feature extraction and feature denoising to improve feature discriminative capabilities for Re-ID. PISL [48] proposes a spatial diffusion model to refine patch sampling to enhance unsupervised Re-ID. PSDiff [49] formulates the person search as a dual denoising process from noisy boxes and Re-ID embeddings to ground truths. In contrast to these, we focus on generalizable representation learning assisted by the feedback back-propagated from a pre-trained diffusion model.

2.4. Generalizable Person Re-ID

Generalizable person Re-ID has been extensively studied over the past years. Existing methods can be roughly categorized into the following groups: domain-invariant and specific feature disentanglement [6,7,8], normalization and domain alignment [9,10,11,12,13,14,15], learning domain-adaptive mixture-of-experts [13,50,51], meta-learning [10,16,17,18], semantic expansion [19,20], large-scale pre-training [52,53,54], and so on. While various mechanisms have been designed, most of these methods learn feature representations within discriminative [6,9] or contrastive learning [16,52] frameworks. In contrast, we aim to leverage a pre-trained generative diffusion model to enhance the domain-invariant feature learning for more robust generalizable Re-ID.

3. Diffusion Preliminaries

In this section, we briefly recap the preliminaries of classical diffusion models [36,37]. Diffusion models are generative models defined on a Markov chain, where the forward and reverse processes are modeled by a forward diffusion kernel (FDK) $q(\mathbf{x}_t \mid \mathbf{x}_{t-1})$ and a learnable reverse diffusion kernel (RDK) $p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$. In the forward process, a real sample $\mathbf{x}_0$ is gradually disturbed via the FDK towards a final state $\mathbf{x}_T$ that is quite close to pure Gaussian noise. In the reverse process, the RDK is trained to denoise from $\mathbf{x}_T$ back to $\mathbf{x}_0$. The real distribution of $\mathbf{x}_0$ can be constructed by integrating over every possible path $\mathbf{x}_{1:T}$, with an optional condition $\mathbf{c}$ as guidance:

$$p_\theta(\mathbf{x}_0 \mid \mathbf{c}) = \int_{\mathbf{x}_{1:T}} p(\mathbf{x}_T) \prod_{t=1}^{T} p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{c}) \, d\mathbf{x}_{1:T}. \quad (1)$$

To estimate the denoising model parameters $\theta$, the negative log-likelihood $-\log p_\theta(\mathbf{x}_0 \mid \mathbf{c})$ should be minimized, but the integral is intractable. Thus, the variational lower bound $\mathcal{L}_{ELBO}$ is optimized as an alternative:

$$\mathcal{L}_{ELBO} = \mathbb{E}_{q}\left[\log \frac{q(\mathbf{x}_{1:T} \mid \mathbf{x}_0)}{p_\theta(\mathbf{x}_{0:T}, \mathbf{c})}\right] \geq -\log p_\theta(\mathbf{x}_0 \mid \mathbf{c}), \quad (2)$$

where $\mathcal{L}_{ELBO}$ can be further expanded and simplified to reveal its essence [36,37]: it actually learns to predict the noise added at each timestep $t$ via a mean squared error:

$$\mathcal{L}_{mse} = \mathbb{E}_{t}\left[\|\boldsymbol{\epsilon}_t - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, \mathbf{c})\|^2\right], \quad (3)$$

where $\mathbf{x}_t$ is the noisy input, obtained as follows:

$$\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\, \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\, \boldsymbol{\epsilon}_t. \quad (4)$$

The noise $\boldsymbol{\epsilon}_t$ is sampled from the isotropic Gaussian distribution $\mathcal{N}(\mathbf{0}, \mathbf{I})$, and the timestep $t$ is sampled from $\{1, 2, \ldots, T\}$, which controls the noise strength through the scheduled diffusion rate $\bar{\alpha}_t$.
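For illustration, the following minimal PyTorch sketch shows how Equations (3) and (4) are typically realized in code; the linear noise schedule and the toy conditional noise predictor here are illustrative assumptions rather than the models used in this work.

```python
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # scheduled diffusion rates beta_t
alphas_bar = torch.cumprod(1.0 - betas, dim=0)   # \bar{alpha}_t

def q_sample(x0, t, eps):
    """Forward kernel of Eq. (4): x_t = sqrt(a_bar_t) x_0 + sqrt(1 - a_bar_t) eps_t."""
    a_bar = alphas_bar[t].view(-1, 1)
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps

# A toy conditional noise predictor standing in for eps_theta(x_t, c).
eps_theta = nn.Sequential(nn.Linear(16 + 8, 64), nn.SiLU(), nn.Linear(64, 16))

x0 = torch.randn(4, 16)                          # a batch of "clean" samples
c = torch.randn(4, 8)                            # optional conditions
t = torch.randint(0, T, (4,))                    # random timesteps
eps = torch.randn_like(x0)                       # Gaussian noise eps_t

x_t = q_sample(x0, t, eps)
loss_mse = ((eps - eps_theta(torch.cat([x_t, c], dim=1))) ** 2).mean()  # Eq. (3)
```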

4. The Proposed Method

As illustrated in Figure 2, the overall framework comprises a baseline Re-ID model, a pre-trained diffusion model, and a correlation-aware conditioning scheme that bridges the two models. The Re-ID model learns feature representations by optimizing a discriminative ID loss and a prototypical contrastive loss. The ID classification probabilities generated from the Re-ID model are used to inject the dark knowledge [33] of different IDs into a correlation-aware condition that guides the diffusion process. Simultaneously, the gradients of the diffusion model are back-propagated through the condition to the Re-ID model, transferring generalization knowledge to enhance Re-ID feature learning. During test time, only the image encoder of the Re-ID model is used for feature extraction.

4.1. The Baseline Re-ID Model

The baseline Re-ID model comprises an image encoder $E_\psi$, together with a classifier supervised by a discriminative ID loss $\mathcal{L}_{id}$ and an additional prototypical contrastive loss (PCL) $\mathcal{L}_{pcl}$. Our image encoder is constructed based on the pre-trained CLIP [41] image encoder, appended with a batch normalization neck (BNNeck) [1]. Leveraging the CLIP encoder enables our model to acquire a certain level of generalization ability, attributed to its extensive language-image pre-training.
Formally, given an input image $\mathbf{x}$ and its feature $\mathbf{z} = E_\psi(\mathbf{x}) \in \mathbb{R}^{1 \times d}$, we define $\mathcal{L}_{id}$ and $\mathcal{L}_{pcl}$ as follows:

$$\mathcal{L}_{id} = -\sum_{j=1}^{N} q_j \log \frac{\exp(\mathbf{z}\mathbf{w}_j^\top)}{\sum_{k=1}^{N}\exp(\mathbf{z}\mathbf{w}_k^\top)}, \quad (5)$$

$$\mathcal{L}_{pcl} = -\log \frac{\exp(\mathbf{z}\, M[y]^\top / \tau)}{\sum_{k=1}^{N}\exp(\mathbf{z}\, M[k]^\top / \tau)}, \quad (6)$$

where $N$ is the number of IDs, $\mathbf{w}_j \in \mathbb{R}^{1 \times d}$ denotes the classifier weights of the $j$-th ID, $q_j$ denotes the smoothed ground-truth label, and $\tau$ is a temperature factor. Moreover, $M \in \mathbb{R}^{N \times d}$ represents the prototypical memory bank. Each entry of $M$ is initialized with the feature centroid of the images belonging to the corresponding ID at the beginning of every epoch. Subsequently, it is updated in a moving-average manner with momentum $\gamma$:

$$M[y] \leftarrow \gamma M[y] + (1 - \gamma)\, \mathbf{z}_{hard}, \quad (7)$$

in which $y$ denotes the ID label of the image $\mathbf{x}$, and $\mathbf{z}_{hard}$ is the hardest sample [55] of the corresponding ID within a batch.
Then, the baseline Re-ID model learns feature representations by optimizing the following loss:

$$\mathcal{L}_{ReID} = \mathcal{L}_{id} + \mathcal{L}_{pcl}. \quad (8)$$
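A minimal PyTorch sketch of the baseline objective in Equations (5)–(8) is given below; the number of IDs, the feature dimension, and the batch-hard selection rule are illustrative assumptions, not the exact implementation.

```python
import torch
import torch.nn.functional as F

N, d, tau, gamma = 751, 512, 0.01, 0.2
W = torch.randn(N, d, requires_grad=True)        # classifier weights w_j
M = F.normalize(torch.randn(N, d), dim=1)        # prototypical memory bank

def reid_losses(z, y, smooth=0.1):
    """z: (B, d) encoder features, y: (B,) ID labels; returns L_id + L_pcl."""
    logits = z @ W.t()                                            # z w_j^T
    q = torch.full((z.size(0), N), smooth / N)                    # smoothed labels q_j
    q.scatter_(1, y.unsqueeze(1), 1.0 - smooth + smooth / N)
    l_id = -(q * F.log_softmax(logits, dim=1)).sum(dim=1).mean()  # Eq. (5)
    l_pcl = F.cross_entropy(z @ M.t() / tau, y)                   # Eq. (6)
    return l_id + l_pcl

def update_memory(z, y):
    """Eq. (7): M[y] <- gamma M[y] + (1 - gamma) z_hard, with z_hard the batch
    sample of ID y least similar to the current prototype (one common choice)."""
    for pid in y.unique():
        feats = z[y == pid]
        hard = feats[(feats @ M[pid]).argmin()]
        M[pid] = gamma * M[pid] + (1.0 - gamma) * hard.detach()
```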

4.2. The Generative Diffusion Model

Our work aims to leverage both the semantic knowledge acquired from a pre-trained diffusion model and the assistance provided by the denoising process to enhance the feature learning capabilities of the baseline Re-ID model’s encoder. To this end, rather than directly utilizing intermediate activations of the diffusion model as features [25,40] or training a denoising decoder alongside the Re-ID model’s classifier using a shared encoder like [26,30], we opt to employ a complete pre-trained diffusion model and adapt it to Re-ID data using LoRA [34].
More specifically, we adopt the pre-trained Stable Diffusion model [24] in our work. This diffusion model employs a variational autoencoder (VAE) composed of an encoder $E_{vae}$ and a decoder $D_{vae}$ to map an input image into a latent space, facilitating a more efficient diffusion process. Moreover, it integrates cross-attention layers into a U-Net [39] architecture $E_\theta$ to denoise latent features, enabling the incorporation of various types of conditions. To effectively adapt the diffusion model to Re-ID data while preserving the generalization capabilities acquired during pre-training, we employ LoRA [34] adapters for fine-tuning.
The LoRA [34] adapters are applied only to the transformation matrices in the attention layers, including the query, key, value, and output transformation matrices in the attention computation and the linear transformation matrices in the feed-forward networks, as shown in Figure 2. Formally, a LoRA adapter introduces low-rank projection matrices $A \in \mathbb{R}^{d_{in} \times r}$ and $B \in \mathbb{R}^{r \times d_{out}}$ to modify the original output features as follows:

$$\mathbf{h}' = \mathbf{h}W + \frac{1}{r}\, \mathbf{h}AB, \quad (9)$$

where $\mathbf{h} \in \mathbb{R}^{1 \times d_{in}}$ and $\mathbf{h}' \in \mathbb{R}^{1 \times d_{out}}$ denote the input and output features, respectively, and $W \in \mathbb{R}^{d_{in} \times d_{out}}$ denotes any of the original transformation matrices mentioned above. The rank $r$ controls the size of the low-dimensional space, with $r \ll \min(d_{in}, d_{out})$, where $d_{in}$ and $d_{out}$ are the dimensions of the input and output features, respectively. Throughout the entire training process, we freeze the diffusion model while keeping the $A$ and $B$ matrices of the LoRA adapters trainable. The purposes of utilizing LoRA adapters in the diffusion model are two-fold: (1) they reduce computational overhead compared with other fine-tuning methods, and (2) they mitigate the risk of catastrophic forgetting [35] of the knowledge learned during pre-training. We further discuss the effectiveness of the LoRA adapters in Section 5.4.
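As a concrete reference, a minimal sketch of a LoRA-augmented linear layer implementing Equation (9) is shown below; the layer sizes and initialization are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base transformation W plus a trainable low-rank update (1/r) * h A B."""
    def __init__(self, d_in, d_out, r=8):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad_(False)        # frozen pre-trained W
        self.A = nn.Parameter(torch.randn(d_in, r) * 0.01)
        self.B = nn.Parameter(torch.zeros(r, d_out))   # zero init: no change at the start
        self.r = r

    def forward(self, h):
        return self.base(h) + (h @ self.A @ self.B) / self.r   # Eq. (9)

# E.g., wrapping a query projection in a cross-attention layer of the U-Net.
layer = LoRALinear(768, 768, r=8)
out = layer(torch.randn(2, 77, 768))
```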
When an image $\mathbf{x}$ is input to the image encoder of the Re-ID model, the image is also resized to an image $\mathbf{x}_0$ to match the input size of the diffusion model. Then, $\mathbf{x}_0$ is fed into the VAE encoder $E_{vae}$ to produce a latent feature $F_0 = E_{vae}(\mathbf{x}_0)$. With a random noise $\boldsymbol{\epsilon}_t$ sampled from the isotropic Gaussian distribution $\mathcal{N}(\mathbf{0}, \mathbf{I})$, the noisy feature $F_t$ at timestep $t$ is obtained, as mentioned in Equation (4):

$$F_t = \sqrt{\bar{\alpha}_t}\, F_0 + \sqrt{1 - \bar{\alpha}_t}\, \boldsymbol{\epsilon}_t. \quad (10)$$

Afterward, $F_t$ and its corresponding condition $\mathbf{c}$ (to be introduced in the following subsection) are forwarded to the U-Net $E_\theta$ to minimize the expected noise estimation error, as mentioned in Equation (3), with the new notation $\mathcal{L}_{dif}$:

$$\mathcal{L}_{dif} = \mathbb{E}_{t}\left[\|\boldsymbol{\epsilon}_t - E_\theta(F_t, \mathbf{c})\|^2\right]. \quad (11)$$
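For reference, the sketch below illustrates the latent noising of Equation (10) and the conditioned denoising loss of Equation (11) using the Hugging Face diffusers library; the tiny U-Net configuration and the random tensor standing in for the VAE latent $F_0$ are illustrative assumptions, not the pre-trained Stable Diffusion weights used in our experiments.

```python
import torch
import torch.nn.functional as F
from diffusers import UNet2DConditionModel, DDPMScheduler

# A deliberately small conditional U-Net standing in for the pre-trained E_theta.
unet = UNet2DConditionModel(
    sample_size=16, in_channels=4, out_channels=4, layers_per_block=1,
    block_out_channels=(32, 64), cross_attention_dim=768,
    down_block_types=("CrossAttnDownBlock2D", "DownBlock2D"),
    up_block_types=("UpBlock2D", "CrossAttnUpBlock2D"),
)
scheduler = DDPMScheduler(num_train_timesteps=1000)

F0 = torch.randn(2, 4, 16, 16)                   # stands in for E_vae(x_0)
eps = torch.randn_like(F0)                       # Gaussian noise eps_t
t = torch.randint(0, 1000, (2,))
Ft = scheduler.add_noise(F0, eps, t)             # Eq. (10)

cond = torch.randn(2, 1, 768)                    # correlation-aware condition c (Section 4.3)
pred = unet(Ft, t, encoder_hidden_states=cond).sample
l_dif = F.mse_loss(pred, eps)                    # Eq. (11)
```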

4.3. The Correlation-Aware Conditioning Scheme

We design a conditioning scheme to bridge the Re-ID model and the diffusion model, enabling mutual interaction. This scheme enables the use of information from the Re-ID model to guide the diffusion process while simultaneously enabling the feedback from the diffusion model to be back-propagated to improve the Re-ID model. A straightforward conditioning scheme is to take the instance feature encoded by the Re-ID model as the condition, similarly to SODA [27] and DIVA [32]. However, such instance-level features are sensitive to intra-ID variations and background changes, making them less robust to domain shifts and resulting in suboptimal generalization performance.
Therefore, we opt to design the condition in an ID-wise manner. Notably, the ID classification probabilities produced by the baseline Re-ID model not only indicate the ID class to which an image belongs but also encapsulate dark knowledge about the correlations among different IDs, which has been shown to enhance generalization capabilities [56,57,58]. Building on this insight, we incorporate the classification probabilities along with a set of learnable ID prompts to define the condition.
Specifically, we create a set of learnable ID prompts $E_{id} = \{\mathbf{e}_1, \ldots, \mathbf{e}_j, \ldots, \mathbf{e}_N\} \in \mathbb{R}^{N \times d}$. The prompt $\mathbf{e}_j \in \mathbb{R}^{1 \times d}$ corresponds to the $j$-th ID, where $d$ here denotes the dimension of each prompt. Then, the condition $\mathbf{c} \in \mathbb{R}^{1 \times d}$ for the image instance $\mathbf{x}$ is generated by a linear combination of all ID prompts weighted by the classification probabilities of the input instance:

$$\mathbf{c} = \sum_{j=1}^{N} p_j \mathbf{e}_j = \sum_{j=1}^{N} \frac{\exp(\mathbf{z}\mathbf{w}_j^\top / \tau_c)}{\sum_{k=1}^{N}\exp(\mathbf{z}\mathbf{w}_k^\top / \tau_c)}\, \mathbf{e}_j, \quad (12)$$

where $p_j$ denotes the probability of instance $\mathbf{x}$ being classified into the $j$-th ID class, as defined on the right-hand side of the equation, $\mathbf{z}$ and $\mathbf{w}_j$ are the image feature and the $j$-th classifier weights defined in Section 4.1, and $\tau_c$ is a temperature factor that regulates the probability distribution. Unlike the linear combination used in MVI2P [59], which integrates multiple discriminative feature maps for Re-ID training, our approach generates conditions that guide the diffusion model.
We refer to this design as the correlation-aware conditioning scheme. The correlation-aware condition $\mathbf{c}$ describes the current image instance $\mathbf{x}$ in terms of all possible IDs and improves the robustness of representation learning by combining the ID-wise prompts with the ID classification probabilities. Despite its simplicity, our experiments demonstrate that this design outperforms more intricate alternatives.
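A minimal sketch of Equation (12) is given below; the number of IDs, the feature and prompt dimensions, and the temperature value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

N, d_feat, d_prompt, tau_c = 751, 512, 768, 0.6
E_id = torch.nn.Parameter(torch.randn(N, d_prompt) * 0.02)    # learnable ID-wise prompts e_j

def correlation_aware_condition(z, W):
    """z: (B, d_feat) Re-ID features, W: (N, d_feat) classifier weights w_j."""
    probs = F.softmax(z @ W.t() / tau_c, dim=1)   # p_j with temperature tau_c
    return probs @ E_id                           # c = sum_j p_j e_j, Eq. (12)

W = torch.randn(N, d_feat)
c = correlation_aware_condition(torch.randn(4, d_feat), W)    # shape (4, 768)
```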

4.4. The Entire Training Loss

Taking a comprehensive view, the total loss $\mathcal{L}_{total}$ of our framework is a composite of the Re-ID loss $\mathcal{L}_{ReID}$ and the diffusion loss $\mathcal{L}_{dif}$, formulated as follows:

$$\mathcal{L}_{total} = \mathcal{L}_{ReID} + \lambda \mathcal{L}_{dif}, \quad (13)$$

where $\lambda$ is a balancing factor.
At this point, we have introduced all the losses involved in both the discriminative and generative objectives. The Re-ID loss $\mathcal{L}_{ReID}$ integrates the ID loss $\mathcal{L}_{id}$ in Equation (5) and the prototypical contrastive loss $\mathcal{L}_{pcl}$ in Equation (6); these losses optimize the discriminative capabilities of the Re-ID model. The diffusion loss $\mathcal{L}_{dif}$ in Equation (11) minimizes the expected noise estimation error in the latent space during the diffusion process and ensures that the diffusion model effectively contributes to the learning of generalizable features guided by our proposed correlation-aware conditioning scheme. By combining $\mathcal{L}_{ReID}$ and $\mathcal{L}_{dif}$, $\mathcal{L}_{total}$ augments discriminative learning with generative capabilities, enhancing the Re-ID model’s performance across diverse domains.
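To show how the two objectives interact in a single optimization step, a compact self-contained sketch of Equation (13) follows; the toy encoder, toy denoiser, fixed noising coefficients, and the plain cross-entropy standing in for $\mathcal{L}_{id} + \mathcal{L}_{pcl}$ are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

N, d, d_prompt, lam = 751, 512, 768, 1.0
encoder = nn.Linear(3 * 32 * 32, d)              # toy stand-in for the Re-ID encoder E_psi
classifier = nn.Linear(d, N, bias=False)         # ID classifier (weights w_j)
prompts = nn.Parameter(torch.randn(N, d_prompt) * 0.02)
denoiser = nn.Linear(16 + d_prompt, 16)          # toy stand-in for E_theta(F_t, c)

params = list(encoder.parameters()) + list(classifier.parameters()) \
       + [prompts] + list(denoiser.parameters())
opt = torch.optim.Adam(params, lr=5e-6)

x = torch.randn(8, 3 * 32 * 32)                  # flattened toy images
y = torch.randint(0, N, (8,))
F0 = torch.randn(8, 16)                          # stands in for the VAE latents of x

z = encoder(x)
logits = classifier(z)
l_reid = F.cross_entropy(logits, y)              # stands in for L_ReID = L_id + L_pcl
c = F.softmax(logits / 0.6, dim=1) @ prompts     # correlation-aware condition, Eq. (12)
eps = torch.randn_like(F0)
Ft = 0.9 * F0 + 0.436 * eps                      # noised latent at some fixed timestep
l_dif = F.mse_loss(denoiser(torch.cat([Ft, c], dim=1)), eps)

loss = l_reid + lam * l_dif                      # L_total = L_ReID + lambda * L_dif, Eq. (13)
opt.zero_grad()
loss.backward()                                  # diffusion feedback reaches the encoder
opt.step()
```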

5. Experiments

5.1. Datasets and Evaluation Protocols

We conduct experiments on the following datasets: Market1501 [60], DukeMTMC-reID [61], MSMT17 [62], and CUHK03-NP [63], abbreviated as MA, D, MS, and C3, respectively. Table 1 presents detailed information of each dataset.
The performance of generalizable Re-ID is evaluated using both single-source and multi-source generalization protocols. In the single-source protocol, the Re-ID model is trained on one dataset and tested on another target dataset. For example, we denote the experiment as MS→MA when training on MSMT17 [62] and testing on Market1501 [60], and likewise for the others. In the multi-source protocol, a leave-one-out strategy is employed, where one dataset is used for testing while the training sets of the remaining datasets are used for training. In both protocols, we adopt the mean average precision (mAP) and the cumulative matching characteristic (CMC) at Rank-1 (R1) as evaluation metrics, without applying re-ranking post-processing [63].
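As a reference for the metrics, a minimal sketch of mAP and CMC Rank-1 computation is shown below; it assumes L2-normalized query and gallery features as NumPy arrays and omits the same-camera filtering that a full Re-ID evaluation protocol applies.

```python
import numpy as np

def evaluate(qf, gf, q_ids, g_ids):
    """qf: (Q, d), gf: (G, d) features; q_ids, g_ids: integer ID arrays."""
    sim = qf @ gf.T                                    # cosine similarity (normalized feats)
    aps, rank1_hits = [], 0
    for i in range(qf.shape[0]):
        order = np.argsort(-sim[i])                    # gallery sorted by similarity
        matches = (g_ids[order] == q_ids[i]).astype(float)
        if matches[0] == 1:
            rank1_hits += 1                            # CMC Rank-1 hit
        hits = np.cumsum(matches)
        precision = hits / (np.arange(len(matches)) + 1)
        aps.append((precision * matches).sum() / max(matches.sum(), 1))
    return float(np.mean(aps)), rank1_hits / qf.shape[0]   # (mAP, Rank-1)
```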

5.2. Implementation Details

We implement our model in PyTorch 1.13.1 [64] and conduct experiments on an NVIDIA RTX A6000 GPU. We adopt the image encoder of the pre-trained CLIP ViT-B-16 [41] for our baseline Re-ID model, in which the patch projection layer is frozen for stability while the other parameters are trainable. The input image size for the Re-ID model is 256 × 128, and the dimension of the encoded feature is 512. The momentum γ used for updating the prototypical memory bank is set to 0.2, and the temperature factor τ in the PCL loss is set to 0.01. For the diffusion model, we adopt the pre-trained stable-diffusion-v1-5 weights [24] from Hugging Face. The input image size for the diffusion model is 128 × 128. The rank r of the LoRA adapters is set to 8 for all single-source experiments except those trained on MSMT17 [62]; for single-source experiments trained on MSMT17 [62] and all multi-source experiments, r is set to 32. In the correlation-aware conditioning scheme, the prompt dimension is set to 768 to match the diffusion model. The temperature factor τ_c is closely related to the size of the training dataset; accordingly, it is set to 0.1 for training on CUHK03-NP [63], 0.6 for training on Market1501 [60] and DukeMTMC-reID [61], and 1.0 for training on MSMT17 [62] and the multi-source settings. Moreover, the balancing factor λ in the overall loss is set to 1.
During training, random horizontal flipping, cropping, and erasing [65] augmentations are used for the input of the Re-ID model, while no data augmentation is used for the diffusion model. We train the entire framework for 60 epochs using the Adam optimizer [66] with a base learning rate of 5 × 10⁻⁶ regulated by a step scheduler, starting from a learning rate of 5 × 10⁻⁷ with a linear warmup over the first 10 epochs. The learning rate is multiplied by a factor of 0.1 at the 30th and 50th epochs. We adopt a weight decay of 1 × 10⁻⁴. The batch size is set to 64, employing a PK-sampling strategy [67] with 16 randomly selected IDs and 4 samples per ID.
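The learning-rate schedule and PK sampling described above can be sketched as follows; the per-epoch warmup granularity and the sampling-with-replacement fallback for identities with fewer than four images are assumptions.

```python
import random

def learning_rate(epoch, base_lr=5e-6, warmup_lr=5e-7, warmup_epochs=10):
    """Linear warmup to base_lr, then decay by 0.1 at epochs 30 and 50."""
    if epoch < warmup_epochs:
        return warmup_lr + (base_lr - warmup_lr) * epoch / warmup_epochs
    factor = 0.1 if epoch >= 30 else 1.0
    factor *= 0.1 if epoch >= 50 else 1.0
    return base_lr * factor

def pk_sample(id_to_indices, P=16, K=4):
    """PK sampling: P random IDs with K samples each (batch size P*K = 64)."""
    batch = []
    for pid in random.sample(list(id_to_indices), P):
        idxs = id_to_indices[pid]
        picks = random.sample(idxs, K) if len(idxs) >= K else random.choices(idxs, k=K)
        batch.extend(picks)
    return batch
```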

5.3. Comparison with State-of-the-Arts

5.3.1. Single-Source DG Re-ID

We first compare the proposed method, named DCAC, with several representative and state-of-the-art methods using the single-source protocol. We conduct all possible single-source generalization experiments with four datasets for comprehensive evaluations. The results are separately presented in Table 2 and Table 3. The source domain is selected from Market1501 [60] and DukeMTMC-reID [61] in Table 2 and from MSMT17 [62] and CUHK03-NP [63] in Table 3. The results show that some previous methods, such as TransMatcher [68] and MDA [17], exhibit strong performance in MS→MA generalization. However, their performance significantly deteriorates when applied to the more challenging MA→MS generalization task. In contrast, our approach demonstrates more balanced improvements in all single-source generalization tests without requiring a larger input size as in TransMatcher [68] or test-time updating as in MDA [17].

5.3.2. Multi-Source DG Re-ID

More experiments are carried out using the multi-source protocol to further validate the effectiveness of the proposed approach. The results are presented in Table 4. Although some recent works such as UDSX [20] and SALDG [51] perform better when MA is selected as the target domain, their improvements are limited when generalizing to the other target domains, with average performances of 43.4% mAP and 61.7% Rank-1 for UDSX [20] and 40.0% mAP and 58.6% Rank-1 for SALDG [51]. In contrast, our approach delivers more balanced improvements across all four target domains, with an average mAP of 46.4% and Rank-1 of 63.9%, surpassing UDSX [20] by a fair margin of 3.0% on mAP and 2.2% on Rank-1. In particular, on the most challenging dataset, MS, our approach greatly outperforms the current best result by 5.9% on mAP and 9.1% on Rank-1. These results demonstrate the effectiveness of our approach on multi-source DG Re-ID, achieving state-of-the-art performance.

5.4. Ablation Studies

5.4.1. Effectiveness of the CLIP-Based Re-ID Model

Our baseline Re-ID model is built upon the pre-trained CLIP image encoder and fine-tuned on Re-ID datasets using both a discriminative ID loss and a prototypical contrastive loss. Benefiting from pre-training on extensive text-image paired data, the CLIP encoder equips our baseline model with a certain level of generalization capability, as demonstrated in Table 2 and Table 3.

5.4.2. Effectiveness of the Diffusion Model Assistance

We conduct a series of experiments to validate the effectiveness of the diffusion model for learning generalizable representations. To investigate whether the knowledge learned from pre-training or the denoising process itself is beneficial, we carry out a comparison among the following model variants: (1) using the baseline Re-ID model without diffusion, (2) using the pre-trained diffusion model while keeping it frozen, (3) using the pre-trained diffusion model with LoRA for adaptation, (4) using the pre-trained diffusion model with only the output blocks trainable, (5) using the pre-trained diffusion model with only the middle and output blocks trainable, (6) using the pre-trained diffusion model with all parameters trainable, and (7) using a randomly initialized diffusion model with all parameters trained from scratch. The results are presented in Table 5.
According to the results, we observe that fully freezing the diffusion model prevents it from effectively enhancing generalization abilities and may even slightly harm it. We attribute this to the domain gap between the diffusion model’s pre-training dataset and the Re-ID dataset. Since the diffusion model lacks specific knowledge about Re-ID, it provides invalid feedback.
A classical solution for adapting to downstream knowledge is to freeze the shallow blocks and train only the deep blocks of the model, referred to as partial fine-tuning. The diffusion U-Net contains input, middle, and output blocks. We gradually unfreeze the blocks of the U-Net, denoted as partial fine-tuning 1 and 2: in 1, only the output blocks are trainable, and in 2, both the middle and output blocks are trainable. According to the results, partial fine-tuning yields a certain level of improvement in generalization and achieves the best MS→MA Rank-1 result when the middle and output blocks are trainable. However, it fails to maintain this advantage on the more challenging MA→MS generalization, where the source domain contains fewer IDs and samples and is thus more prone to overfitting. This reveals that partial fine-tuning cannot effectively preserve the pre-trained generalized knowledge during downstream adaptation.
When the pre-trained diffusion model is fine-tuned with all trainable parameters (from all input, middle, and output blocks) or when a randomly initialized diffusion model is trained without pre-training, both variant models adapt sufficiently to the Re-ID data, resulting in better performance compared to a fully frozen model. However, these models either suffer from the significant forgetting of pre-trained knowledge [35] or lack any pre-training knowledge, leading to only limited improvement.
In contrast, the approach utilizing LoRA, which only fine-tunes the low-rank adapters of the attention layers of the pre-trained diffusion model on Re-ID data, achieves balanced and significant enhancement in generalization ability across different target domains. This result highlights that the synergy between pre-trained knowledge and the diffusion process contributes most effectively to improving generalization, and illustrates that the LoRA-based fine-tuning best fits our framework.

5.4.3. Ablations on Computational Overhead

Table 6 studies the computational overhead of different fine-tuning variants of the diffusion model, including the major approaches mentioned in Section 5.4.2, with a batch size of 64. In the training stage, the baseline model offers the best efficiency, with 140.3 ms latency and 1.46 TFLOPs per forward pass and 7.58 GB memory consumption with 85.94 M trainable parameters, but it suffers from limited generalization performance. Frozen fine-tuning fails to surpass the baseline in either generalization performance or computational efficiency, with a slight increase in trainable parameters due to the incorporation of the learnable ID prompts. In addition, other fine-tuning methods such as partial and full fine-tuning do not demonstrate effective gains in generalization, even though more parameters are allowed to be optimized. Note that the TFLOPs remain unchanged at 8.50 for frozen, full, and partial fine-tuning, since the number of floating-point operations in forward propagation is unaffected by whether parameters are frozen.
In contrast, our LoRA-based strategy achieves a good tradeoff between computational overhead and generalization performance, allowing it to be trained with an acceptable cost increase due to the newly introduced adapters while delivering the best performance. In the inference stage, only the Re-ID image encoder $E_\psi$ with the updated parameters is required to extract person features; thus, the overhead of the diffusion model is dropped. Moreover, the cost can be further reduced because neither gradient computation nor the classifiers used in training are needed.

5.4.4. Effectiveness of the Conditioning Scheme

In our design, we claim that the proposed correlation-aware conditioning scheme is the most appropriate mechanism for guiding generalization feedback from the pre-trained diffusion model, where each condition is generated by a linear combination of multiple learnable ID-wise prompts weighted by the classification probabilities.
In Table 7, we compare different conditioning schemes. It is obvious that the instance-wise condition provides only a small contribution to the generalization improvements. To further validate that the correlation among IDs, i.e., the dark knowledge, is the key to transferring generalization knowledge from the diffusion model, we conduct an experiment with a simplified class-wise condition, where the softmax weighting in Equation (12) is replaced by one-hot selection, which keeps only the probability score of the corresponding class and resets the others to zero. The dark knowledge contained in the probability distribution is thereby erased. From the results, we find that the class-wise condition, which considers only a single ID, cannot effectively enhance the generalization capability and even degrades the baseline on MS→MA generalization, whereas our correlation-aware condition brings the most salient enhancement, validating the importance of dark knowledge in condition generation.
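The distinction between the class-wise (one-hot) condition in this ablation and the correlation-aware condition of Equation (12) can be made concrete with a tiny sketch; the toy logits and prompt dimension are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[4.0, 3.5, 0.1]])             # toy logits over 3 IDs
prompts = torch.randn(3, 768)                        # ID-wise prompts e_j

p_soft = F.softmax(logits / 0.6, dim=1)              # retains inter-ID correlations (dark knowledge)
p_hard = F.one_hot(logits.argmax(dim=1), 3).float()  # one-hot selection: correlations erased

c_corr = p_soft @ prompts                            # correlation-aware condition
c_class = p_hard @ prompts                           # class-wise condition (a single prompt only)
```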
Moreover, we conduct additional experiments on the test sets of the source domains to further investigate performance under the source-domain data distribution. As shown in Table 8, our correlation-aware conditioning scheme does not exhibit obvious performance degradation on the source domains, indicating that our approach improves generalization while preserving performance on the source domains.

5.4.5. Impact of the Hyper-Parameters

Table 9 investigates the impact of the rank r in the LoRA adapters. For simplicity, we choose MA as the representative of the small-scale source domains, i.e., MA, D, and C3, and analyze the rank value on MA→MS generalization. We observe that a lower rank r = 8 is optimal for training on small datasets. For the larger-scale dataset, i.e., MS, we evaluate MS→MA generalization and find that a higher rank r = 32 is optimal.

5.4.6. Impact of More Intricate Conditioning Schemes

We investigate more intricate alternative designs of the conditioning scheme by applying further transformations on top of the original correlation-aware condition obtained by linear combination. These variants examine the influence of two other operations that frequently appear in neural networks, namely non-linear mapping and normalization, rather than the linear operation alone.
In Table 10, we compare these alternatives with the baseline and standard DCAC. “Non-linearity” denotes that we apply the SiLU [77] activation function, which is the same as the one activating latent features in the diffusion model, on the correlation-aware conditions to introduce non-linearity. “BatchNorm” denotes that a batch normalization layer [78] is appended after the linear combination of the ID-wise prompts to eliminate the internal covariate shift. “ConditionNet” denotes that a multi-layer perceptron network, including the mixture of linear, non-linear, and normalization layers, is employed on generated conditions. Surprisingly, we find that the more complicated variants do not seem to provide further improvements in generalization capability compared with the simple but effective correlation-aware conditioning.

5.4.7. Visualization Results

In Figure 3, we use GradCAM [79] visualization to investigate the Re-ID model’s capability of capturing correlations across different IDs, which reflects its generalization capability. Specifically, we select several visually similar IDs, #27, #76, and #649, and compute the activation maps under another similar ID, #28. A well-generalized Re-ID model is expected to focus on similar ID-relevant areas even when the image does not belong to the target ID class.
From the results, we observe that the baseline model mainly focuses on background areas while largely ignoring the person’s body, indicating its poor ability to learn ID correlations. When the diffusion knowledge feedback is adopted, the attentive areas shift to shared ID-relevant attributes such as the purple T-shirt and black trousers. Furthermore, using our correlation-aware conditioning scheme helps cover more body parts and suppresses attention to background areas, demonstrating the effectiveness of our approach.
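For readers who wish to reproduce this kind of visualization, a minimal Grad-CAM sketch is given below; it uses a torchvision ResNet-18 and a random input purely as stand-ins, whereas the figures above are produced with our CLIP-based Re-ID encoder and its ID classifier.

```python
import torch
from torchvision.models import resnet18

model = resnet18(weights=None).eval()
feats, grads = {}, {}
layer = model.layer4                                   # last convolutional block
layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
layer.register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0]))

img = torch.randn(1, 3, 224, 224)                      # stand-in for a person image
target_id = 28                                         # e.g., activation under ID #28
model(img)[0, target_id].backward()

w = grads["a"].mean(dim=(2, 3), keepdim=True)          # channel-wise importance weights
cam = torch.relu((w * feats["a"]).sum(dim=1))          # Grad-CAM heatmap, shape (1, H, W)
cam = cam / (cam.max() + 1e-8)                         # normalize to [0, 1]
```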

6. Conclusions

In this work, we explore the feasibility of leveraging a pre-trained diffusion model to enhance generalizable feature learning for DG Re-ID. By adopting a simple yet effective correlation-aware conditioning scheme, we use the ID classification probabilities to guide the diffusion model, unleashing its generalization knowledge and transferring it to the Re-ID model via gradient feedback. Extensive experiments on both single-source and multi-source DG Re-ID settings demonstrate the effectiveness of our approach, which achieves state-of-the-art performance.

Author Contributions

Conceptualization, J.L.; methodology, J.L.; software, J.L.; validation, J.L.; formal analysis, J.L.; investigation, J.L.; resources, X.G.; supervision, X.G.; visualization, J.L.; writing—original draft, J.L.; writing—review and editing, X.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Zhejiang Province Pioneer Research and Development Project “Research on Multi-modal Traffic Accident Holographic Restoration and Scene Database Construction Based on Vehicle-cloud Intersection” under grant no. 2024C01017.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All datasets [60,61,62,63] used in this article are publicly accessible.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Luo, H.; Gu, Y.; Liao, X.; Lai, S.; Jiang, W. Bag of tricks and a strong baseline for deep person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Long Beach, CA, USA, 16–17 June 2019. [Google Scholar]
  2. Lian, Y.; Huang, W.; Liu, S.; Guo, P.; Zhang, Z.; Durrani, T.S. Person re-identification using local relation-aware graph convolutional network. Sensors 2023, 23, 8138. [Google Scholar] [CrossRef] [PubMed]
  3. Li, G.; Liu, P.; Cao, X.; Liu, C. Dynamic Weighting Network for Person Re-Identification. Sensors 2023, 23, 5579. [Google Scholar] [CrossRef] [PubMed]
  4. Zhou, J.; Zhao, S.; Li, S.; Cheng, B.; Chen, J. Research on Person Re-Identification through Local and Global Attention Mechanisms and Combination Poolings. Sensors 2024, 24, 5638. [Google Scholar] [CrossRef]
  5. He, S.; Luo, H.; Wang, P.; Wang, F.; Li, H.; Jiang, W. Transreid: Transformer-based object re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 15013–15022. [Google Scholar]
  6. Jin, X.; Lan, C.; Zeng, W.; Chen, Z.; Zhang, L. Style normalization and restitution for generalizable person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 3143–3152. [Google Scholar]
  7. Zhang, Y.F.; Zhang, Z.; Li, D.; Jia, Z.; Wang, L.; Tan, T. Learning domain invariant representations for generalizable person re-identification. IEEE Trans. Image Process. 2022, 32, 509–523. [Google Scholar] [CrossRef] [PubMed]
  8. Yuan, Y.; Chen, W.; Chen, T.; Yang, Y.; Ren, Z.; Wang, Z.; Hua, G. Calibrated domain-invariant learning for highly generalizable large scale re-identification. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Snowmass, CO, USA, 1–5 March 2020; pp. 3589–3598. [Google Scholar]
  9. Zhuang, Z.; Wei, L.; Xie, L.; Zhang, T.; Zhang, H.; Wu, H.; Ai, H.; Tian, Q. Rethinking the distribution gap of person re-identification with camera-based batch normalization. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 140–157. [Google Scholar]
  10. Choi, S.; Kim, T.; Jeong, M.; Park, H.; Kim, C. Meta batch-instance normalization for generalizable person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 3425–3435. [Google Scholar]
  11. Jiao, B.; Liu, L.; Gao, L.; Lin, G.; Yang, L.; Zhang, S.; Wang, P.; Zhang, Y. Dynamically transformed instance normalization network for generalizable person re-identification. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 285–301. [Google Scholar]
  12. Liu, J.; Huang, Z.; Li, L.; Zheng, K.; Zha, Z.J. Debiased batch normalization via gaussian process for generalizable person re-identification. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Online, 22 February–1 March 2022; Volume 36, pp. 1729–1737. [Google Scholar]
  13. Xu, B.; Liang, J.; He, L.; Sun, Z. Mimic embedding via adaptive aggregation: Learning generalizable person re-identification. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 372–388. [Google Scholar]
  14. Han, G.; Zhang, X.; Li, C. One-Shot Unsupervised Cross-Domain Person Re-Identification. IEEE Trans. Circuits Syst. Video Technol. 2023, 34, 1339–1351. [Google Scholar] [CrossRef]
  15. Peng, W.; Chen, H.; Li, Y.; Sun, J. Invariance Learning under Uncertainty for Single Domain Generalization Person Re-Identification. IEEE Trans. Instrum. Meas. 2024, 73, 5031911. [Google Scholar] [CrossRef]
  16. Zhao, Y.; Zhong, Z.; Yang, F.; Luo, Z.; Lin, Y.; Li, S.; Sebe, N. Learning to generalize unseen domains via memory-based multi-source meta-learning for person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 6277–6286. [Google Scholar]
  17. Ni, H.; Song, J.; Luo, X.; Zheng, F.; Li, W.; Shen, H.T. Meta distribution alignment for generalizable person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 2487–2496. [Google Scholar]
  18. Zhang, L.; Liu, Z.; Zhang, W.; Zhang, D. Style uncertainty based self-paced meta learning for generalizable person re-identification. IEEE Trans. Image Process. 2023, 32, 2107–2119. [Google Scholar] [CrossRef] [PubMed]
  19. Ang, E.P.; Shan, L.; Kot, A.C. Dex: Domain embedding expansion for generalized person re-identification. In Proceedings of the British Machine Vision Conference (BMVC), Online, 22–25 November 2021. [Google Scholar]
  20. Ang, E.P.; Lin, S.; Kot, A.C. A unified deep semantic expansion framework for domain-generalized person re-identification. Neurocomputing 2024, 600, 128120. [Google Scholar] [CrossRef]
  21. Syed, M.A.; Ou, Y.; Li, T.; Jiang, G. Lightweight Multimodal Domain Generic Person Reidentification Metric for Person-Following Robots. Sensors 2023, 23, 813. [Google Scholar] [CrossRef] [PubMed]
  22. Robinson, J.; Sun, L.; Yu, K.; Batmanghelich, K.; Jegelka, S.; Sra, S. Can contrastive learning avoid shortcut solutions? In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Online, 7 December 2021; Volume 34, pp. 4974–4986. [Google Scholar]
  23. Saharia, C.; Chan, W.; Saxena, S.; Li, L.; Whang, J.; Denton, E.L.; Ghasemipour, K.; Gontijo Lopes, R.; Karagol Ayan, B.; Salimans, T.; et al. Photorealistic text-to-image diffusion models with deep language understanding. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), New Orleans, LA, USA, 28 November–9 December 2022; Volume 35, pp. 36479–36494. [Google Scholar]
  24. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 10684–10695. [Google Scholar]
  25. Baranchuk, D.; Voynov, A.; Rubachev, I.; Khrulkov, V.; Babenko, A. Label-Efficient Semantic Segmentation with Diffusion Models. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual, 25 April 2022. [Google Scholar]
  26. Yang, X.; Shih, S.M.; Fu, Y.; Zhao, X.; Ji, S. Your vit is secretly a hybrid discriminative-generative diffusion model. arXiv 2022, arXiv:2208.07791. [Google Scholar]
  27. Hudson, D.A.; Zoran, D.; Malinowski, M.; Lampinen, A.K.; Jaegle, A.; McClelland, J.L.; Matthey, L.; Hill, F.; Lerchner, A. Soda: Bottleneck diffusion models for representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 23115–23127. [Google Scholar]
  28. Jaini, P.; Clark, K.; Geirhos, R. Intriguing properties of generative classifiers. arXiv 2023, arXiv:2309.16779. [Google Scholar]
  29. Fuest, M.; Ma, P.; Gui, M.; Fischer, J.S.; Hu, V.T.; Ommer, B. Diffusion models and representation learning: A survey. arXiv 2024, arXiv:2407.00783. [Google Scholar]
  30. Deja, K.; Trzciński, T.; Tomczak, J.M. Learning data representations with joint diffusion models. In Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Turin, Italy, 18–22 September 2023; Springer: Berlin/Heidelberg, Germany, 2023; pp. 543–559. [Google Scholar]
  31. Xu, Z.; Wang, G.; Huang, X.; Sang, J. DenoiseReID: Denoising Model for Representation Learning of Person Re-Identification. arXiv 2024, arXiv:2406.08773. [Google Scholar]
  32. Wang, W.; Sun, Q.; Zhang, F.; Tang, Y.; Liu, J.; Wang, X. Diffusion feedback helps clip see better. arXiv 2024, arXiv:2407.20171. [Google Scholar]
  33. Hinton, G. Distilling the Knowledge in a Neural Network. arXiv 2015, arXiv:1503.02531. [Google Scholar]
  34. Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. Lora: Low-rank adaptation of large language models. arXiv 2021, arXiv:2106.09685. [Google Scholar]
  35. Biderman, D.; Portes, J.; Ortiz, J.J.G.; Paul, M.; Greengard, P.; Jennings, C.; King, D.; Havens, S.; Chiley, V.; Frankle, J.; et al. LoRA Learns Less and Forgets Less. arXiv 2024, arXiv:2405.09673. [Google Scholar]
  36. Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Virtual, 6–12 December 2020; Volume 33, pp. 6840–6851. [Google Scholar]
  37. Song, J.; Meng, C.; Ermon, S. Denoising Diffusion Implicit Models. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual, 3–7 May 2021. [Google Scholar]
  38. Peebles, W.; Xie, S. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 4195–4205. [Google Scholar]
  39. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer Assisted Intervention (MICCAI), Munich, Germany, 5–9 October 2015; Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241. [Google Scholar]
  40. Xiang, W.; Yang, H.; Huang, D.; Wang, Y. Denoising diffusion autoencoders are unified self-supervised learners. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 15802–15812. [Google Scholar]
  41. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning (ICML). PMLR, Virtual, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
  42. Li, A.C.; Prabhudesai, M.; Duggal, S.; Brown, E.; Pathak, D. Your diffusion model is secretly a zero-shot classifier. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 2206–2217. [Google Scholar]
  43. Clark, K.; Jaini, P. Text-to-image diffusion models are zero shot classifiers. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Vancouver, BC, USA, 10–15 December 2024; Volume 36. [Google Scholar]
  44. Huang, H.; Huang, Y.; Wang, L. Vi-diff: Unpaired visible-infrared translation diffusion model for single modality labeled visible-infrared person re-identification. arXiv 2023, arXiv:2310.04122. [Google Scholar]
  45. Song, Z.; Hu, G.; Zhao, C. Diverse Person: Customize Your Own Dataset for Text-Based Person Search. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 4943–4951. [Google Scholar]
  46. Bhunia, A.K.; Khan, S.; Cholakkal, H.; Anwer, R.M.; Laaksonen, J.; Shah, M.; Khan, F.S. Person image synthesis via denoising diffusion model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 5968–5976. [Google Scholar]
  47. Asperti, A.; Fiorilla, S.; Orsini, L. A generative approach to person reidentification. Sensors 2024, 24, 1240. [Google Scholar] [CrossRef] [PubMed]
  48. Tao, X.; Kong, J.; Jiang, M.; Lu, M.; Mian, A. Unsupervised Learning of Intrinsic Semantics With Diffusion Model for Person Re-Identification. IEEE Trans. Image Process. 2024, 33, 6705–6719. [Google Scholar] [CrossRef]
  49. Jia, C.; Luo, M.; Dang, Z.; Dai, G.; Chang, X.; Wang, J. PSDiff: Diffusion Model for Person Search with Iterative and Collaborative Refinement. IEEE Trans. Circuits Syst. Video Technol. 2024. [Google Scholar] [CrossRef]
  50. Dai, Y.; Li, X.; Liu, J.; Tong, Z.; Duan, L.Y. Generalizable person re-identification with relevance-aware mixture of experts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 16145–16154. [Google Scholar]
  51. Guo, Y.; Dou, X.; Zhu, Y.; Wang, X. Domain generalization person re-identification via style adaptation learning. Int. J. Mach. Learn. Cybern. 2024, 15, 4733–4746. [Google Scholar] [CrossRef]
  52. Dou, Z.; Wang, Z.; Li, Y.; Wang, S. Identity-seeking self-supervised representation learning for generalizable person re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 15847–15858. [Google Scholar]
  53. Xiang, S.; Gao, J.; Guan, M.; Ruan, J.; Zhou, C.; Liu, T.; Qian, D.; Fu, Y. Learning robust visual-semantic embedding for generalizable person re-identification. arXiv 2023, arXiv:2304.09498. [Google Scholar]
  54. Xiang, S.; Chen, H.; Ran, W.; Yu, Z.; Liu, T.; Qian, D.; Fu, Y. Deep multimodal fusion for generalizable person re-identification. arXiv 2022, arXiv:2211.00933. [Google Scholar]
  55. Dai, Z.; Wang, G.; Yuan, W.; Zhu, S.; Tan, P. Cluster contrast for unsupervised person re-identification. In Proceedings of the Asian Conference on Computer Vision (ACCV), Macao, China, 4–8 December 2022; pp. 1142–1160. [Google Scholar]
  56. Phuong, M.; Lampert, C. Towards understanding knowledge distillation. In Proceedings of the International Conference on Machine Learning (ICML). PMLR, Long Beach, CA, USA, 9–15 June 2019; pp. 5142–5151. [Google Scholar]
  57. Müller, R.; Kornblith, S.; Hinton, G.E. When does label smoothing help? In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada, 8–14 December 2019; Volume 32. [Google Scholar]
  58. Wang, Y.; Li, H.; Chau, L.p.; Kot, A.C. Embracing the dark knowledge: Domain generalization using regularized knowledge distillation. In Proceedings of the 29th ACM International Conference on Multimedia (ACM MM), Virtual, 20–24 October 2021; pp. 2595–2604. [Google Scholar]
  59. Dong, N.; Yan, S.; Tang, H.; Tang, J.; Zhang, L. Multi-view information integration and propagation for occluded person re-identification. Inf. Fusion 2024, 104, 102201. [Google Scholar] [CrossRef]
  60. Zheng, L.; Shen, L.; Tian, L.; Wang, S.; Wang, J.; Tian, Q. Scalable person re-identification: A benchmark. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1116–1124. [Google Scholar]
  61. Zheng, Z.; Zheng, L.; Yang, Y. Unlabeled samples generated by gan improve the person re-identification baseline in vitro. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 3754–3762. [Google Scholar]
  62. Wei, L.; Zhang, S.; Gao, W.; Tian, Q. Person transfer gan to bridge domain gap for person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 79–88. [Google Scholar]
  63. Zhong, Z.; Zheng, L.; Cao, D.; Li, S. Re-ranking person re-identification with k-reciprocal encoding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1318–1327. [Google Scholar]
  64. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. Pytorch: An imperative style, high-performance deep learning library. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada, 8–14 December 2019; Volume 32. [Google Scholar]
  65. Zhong, Z.; Zheng, L.; Kang, G.; Li, S.; Yang, Y. Random erasing data augmentation. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), New York, NY, USA, 7–12 February 2020; Volume 34, pp. 13001–13008. [Google Scholar]
  66. Kingma, D.P. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  67. Hermans, A.; Beyer, L.; Leibe, B. In defense of the triplet loss for person re-identification. arXiv 2017, arXiv:1703.07737. [Google Scholar]
  68. Liao, S.; Shao, L. Transmatcher: Deep image matching through transformers for generalizable person re-identification. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Online, 6–14 December 2021; Volume 34, pp. 1992–2003. [Google Scholar]
  69. Liao, S.; Shao, L. Interpretable and generalizable person re-identification with query-adaptive convolution and temporal lifting. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 456–474. [Google Scholar]
  70. Liao, S.; Shao, L. Graph sampling based deep metric learning for generalizable person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 7359–7368. [Google Scholar]
  71. Li, Y.; Song, J.; Ni, H.; Shen, H.T. Style-controllable generalized person re-identification. In Proceedings of the 31st ACM International Conference on Multimedia (ACM MM), Ottawa, ON, Canada, 29 October–3 November 2023; pp. 7912–7921. [Google Scholar]
  72. Ni, H.; Li, Y.; Gao, L.; Shen, H.T.; Song, J. Part-aware transformer for generalizable person re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 11280–11289. [Google Scholar]
  73. Sun, J.; Li, Y.; Chen, L.; Chen, H.; Peng, W. Multiple integration model for single-source domain generalizable person re-identification. J. Vis. Commun. Image Represent. 2024, 98, 104037. [Google Scholar] [CrossRef]
  74. Sun, Y.; Zheng, L.; Yang, Y.; Tian, Q.; Wang, S. Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline). In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 480–496. [Google Scholar]
  75. Wang, G.; Yuan, Y.; Chen, X.; Li, J.; Zhou, X. Learning discriminative features with multiple granularities for person re-identification. In Proceedings of the 26th ACM International Conference on Multimedia (ACM MM), Seoul, Republic of Korea, 22–26 October 2018; pp. 274–282. [Google Scholar]
  76. Zhou, K.; Yang, Y.; Cavallaro, A.; Xiang, T. Omni-scale feature learning for person re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3702–3712. [Google Scholar]
  77. Elfwing, S.; Uchibe, E.; Doya, K. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Netw. 2018, 107, 3–11. [Google Scholar] [CrossRef] [PubMed]
  78. Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the International Conference on Machine Learning (ICML), Lille, France, 6–11 July 2015; PMLR; Volume 37, pp. 448–456. [Google Scholar]
  79. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 618–626. [Google Scholar]
Figure 1. Illustration of different diffusion-based representation learning designs. (a) A separate denoising decoder and a classifier with a shared encoder, (b) intertwined feature extraction and feature denoising, (c) a diffusion model with a separate image encoder for instance-wise conditioning, and (d) a diffusion model and a classification model bridged by a correlation-aware ID-wise conditioning scheme. In addition, (e) illustrates that the dark knowledge embedded in the logits of classifiers is able to capture the ID relationships, including nuanced similarities and differences beyond the hard ID labels, which helps generate better conditions to guide the diffusion model.
Figure 2. An overview of the proposed framework. It consists of a baseline Re-ID model, a pre-trained diffusion model, and a correlation-aware conditioning scheme based on learnable ID-wise prompts. The Re-ID model is built upon the pre-trained CLIP image encoder [41] and a BN Neck [1], optimized by an ID loss and a prototypical contrastive loss. The diffusion model is built upon pre-trained Stable Diffusion [24], with LoRA [34] for efficient adaptation. The informative classification probabilities predicted by the Re-ID model are used to produce a correlation-aware condition that guides the diffusion model to unleash its generalization-related knowledge, while gradients are back-propagated to the Re-ID model for enhanced generalizable feature learning.
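To make the interaction sketched in Figure 2 concrete, the following is a minimal PyTorch-style sketch of how a baseline Re-ID model and the correlation-aware conditioning scheme could be wired together. All class and variable names (ReIDBaseline, CorrelationAwareCondition, id_prompts) are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class ReIDBaseline(nn.Module):
    """Illustrative baseline: a pre-trained CLIP image encoder, a BN neck,
    and an ID classifier (module names are hypothetical)."""

    def __init__(self, clip_image_encoder, feat_dim, num_ids):
        super().__init__()
        self.encoder = clip_image_encoder            # pre-trained CLIP image encoder
        self.bnneck = nn.BatchNorm1d(feat_dim)       # BN neck
        self.classifier = nn.Linear(feat_dim, num_ids, bias=False)

    def forward(self, images):
        feat = self.encoder(images)                  # (B, feat_dim) global features
        feat_bn = self.bnneck(feat)
        logits = self.classifier(feat_bn)            # ID logits carrying "dark knowledge"
        return feat, logits


class CorrelationAwareCondition(nn.Module):
    """Turn ID classification probabilities into a diffusion condition by
    linearly combining learnable ID-wise prompts with the probabilities."""

    def __init__(self, num_ids, cond_dim):
        super().__init__()
        self.id_prompts = nn.Parameter(torch.randn(num_ids, cond_dim) * 0.02)

    def forward(self, logits):
        probs = logits.softmax(dim=-1)               # (B, num_ids)
        cond = probs @ self.id_prompts               # (B, cond_dim), correlation-aware
        return cond
```

Because the condition is a differentiable function of the classifier logits, the denoising loss of the diffusion model can back-propagate through it into the Re-ID classifier and encoder, which is the feedback path described in the caption.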
Figure 3. GradCAM [79] visualization of several visually similar IDs selected from the Market1501 [60] dataset. In groups (a–c), the activation maps are computed with the images of ID #27, #76, and #649 under ID #28, respectively, which reflects the Re-ID model’s capability of capturing correlations among IDs. From left to right, each group contains the original image and the activation maps of the baseline model, the instance-wise condition-guided model, and our correlation-aware-condition-guided model, respectively.
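The Grad-CAM maps in Figure 3 are obtained by attributing a chosen ID logit back to a spatial feature map. Below is a minimal, self-contained sketch of that procedure, assuming a model that maps an image batch to ID logits and a target layer producing a (C, H, W) feature map (for a ViT-style encoder, token features would first have to be reshaped into a grid); it is not the exact visualization code used for the figure.

```python
import torch
import torch.nn.functional as F


def grad_cam_for_id(model, target_layer, image, target_id):
    """Minimal Grad-CAM: activation map of `image` w.r.t. the logit of
    `target_id` (e.g., ID #28 in Figure 3). `model` is an nn.Module that
    returns ID logits of shape (B, num_ids)."""
    feats, grads = {}, {}

    def fwd_hook(module, inputs, output):
        feats["act"] = output.detach()               # (1, C, H, W)

    def bwd_hook(module, grad_input, grad_output):
        grads["grad"] = grad_output[0].detach()      # gradient w.r.t. the activation

    h1 = target_layer.register_forward_hook(fwd_hook)
    h2 = target_layer.register_full_backward_hook(bwd_hook)

    logits = model(image.unsqueeze(0))               # (1, num_ids)
    model.zero_grad()
    logits[0, target_id].backward()                  # attribute the chosen ID logit
    h1.remove()
    h2.remove()

    act, grad = feats["act"], grads["grad"]
    weights = grad.mean(dim=(2, 3), keepdim=True)    # channel-wise importance
    cam = F.relu((weights * act).sum(dim=1))         # (1, H, W)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    return cam                                       # upsample to image size for overlay
```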
Table 1. The statistics of four public Re-ID datasets and their composition on different splits, including the training set (denoted by “Train”) and testing set (denoted by “Query” and “Gallery”).
| Dataset | Cameras | IDs | Train | Query | Gallery |
|---|---|---|---|---|---|
| Market1501 [60] | 6 | 1501 | 12,936 | 3368 | 15,913 |
| DukeMTMC-reID [61] | 8 | 1812 | 16,522 | 2228 | 17,661 |
| MSMT17 [62] | 15 | 4101 | 32,621 | 11,659 | 82,161 |
| CUHK03-NP [63] | 2 | 1467 | 7365 | 1400 | 5332 |
Table 2. Comparison with SOTA methods on the single-source DG Re-ID setting with source domains Market1501 [60] and DukeMTMC-reID [61]. † indicates that the method uses target domain data for test-time updating. ‡ indicates that the method uses an input image size larger than 256 × 128. The best and second-best results are marked with bold and underline, respectively. Results are collected from the corresponding works.
| Model | MA→D (mAP/R1) | MA→MS (mAP/R1) | MA→C3 (mAP/R1) | D→MA (mAP/R1) | D→MS (mAP/R1) | D→C3 (mAP/R1) |
|---|---|---|---|---|---|---|
| SNR [6] | 33.6/55.1 | – | – | 33.9/66.7 | – | – |
| CBN [9] | – | 9.5/25.3 | – | – | 9.5/25.3 | – |
| QAConv [69] | 28.7/48.8 | 7.0/22.6 | – | 27.2/58.6 | 8.9/29.0 | – |
| TransMatcher ‡ [68] | – | 18.4/47.3 | – | – | – | – |
| MetaBIN [10] | 33.1/55.2 | – | – | 35.9/69.2 | – | – |
| DTIN-Net [11] | 36.1/57.0 | – | – | 37.4/69.8 | – | – |
| QAConv-GS [70] | – | 15.0/41.2 | – | – | – | – |
| MDA † [17] | 34.4/56.7 | 11.8/33.5 | – | 38.0/70.3 | – | – |
| Li et al. [71] | – | 21.8/47.5 | – | – | – | – |
| SuA-SpML [18] | 34.8/55.5 | 11.1/30.1 | – | 36.3/65.8 | 13.6/37.8 | – |
| DIR-ReID [7] | 33.0/54.5 | – | – | 35.2/68.2 | – | – |
| GN [14] | 34.0/52.3 | 10.3/28.6 | 14.5/14.4 | 34.3/64.3 | 12.3/33.8 | 10.3/10.2 |
| GN+SNR [14] | 34.7/55.4 | – | 15.2/15.1 | 36.9/68.5 | – | 11.5/11.0 |
| PAT [72] | – | 18.2/42.8 | – | – | – | – |
| LDU [15] | 38.0/59.5 | 13.5/35.7 | 18.2/18.5 | 42.3/73.2 | 16.7/44.2 | 14.2/14.2 |
| MTI [73] | 36.4/57.8 | – | 16.2/16.3 | 38.2/70.5 | – | 13.3/13.3 |
| Baseline (ours) | 47.8/67.3 | 22.0/50.1 | 30.0/30.4 | 41.7/70.5 | 19.0/46.9 | 22.1/23.1 |
| DCAC (ours) | 49.5/69.1 | 23.4/52.1 | 32.5/33.2 | 42.3/71.5 | 19.7/47.4 | 23.0/23.5 |
Table 3. Comparison with SOTA methods on the single-source DG Re-ID setting, where the source domains are MSMT17 [62] and CUHK03-NP [63]. † indicates that the method uses target domain data for test-time updating. ‡ indicates that the method uses input image size larger than 256 × 128. The best and second-best results are marked with bold and underline, respectively. Data are collected from the corresponding works.
| Model | MS→MA (mAP/R1) | MS→D (mAP/R1) | MS→C3 (mAP/R1) | C3→MA (mAP/R1) | C3→D (mAP/R1) | C3→MS (mAP/R1) |
|---|---|---|---|---|---|---|
| PCB [74] | 26.7/52.7 | – | – | – | – | – |
| MGN [75] | 25.1/48.7 | – | – | – | – | – |
| OSNet-IBN [76] | 37.2/66.5 | 45.6/67.4 | – | – | – | – |
| SNR [6] | 41.4/70.1 | 50.0/69.2 | – | – | – | – |
| CBN [9] | 45.0/73.7 | – | – | – | – | – |
| TransMatcher ‡ [68] | 52.0/80.1 | – | – | – | – | – |
| QAConv-GS [70] | 46.7/75.1 | – | – | – | – | – |
| MDA † [17] | 53.0/79.7 | – | – | – | – | – |
| GN [14] | – | – | – | 40.6/67.6 | 31.2/50.0 | 11.9/33.4 |
| GN+SNR [14] | 37.5/68.0 | 45.4/66.2 | 18.3/17.4 | – | – | – |
| LDU [15] | 44.8/74.6 | 48.9/69.2 | 21.3/21.3 | 37.5/68.1 | 29.5/51.8 | 12.6/36.9 |
| MTI [73] | 42.7/72.9 | 47.7/67.5 | 16.0/15.4 | – | – | – |
| Baseline (ours) | 51.0/76.5 | 57.1/73.8 | 32.7/32.9 | 39.6/66.2 | 41.4/62.7 | 16.6/45.2 |
| DCAC (ours) | 52.1/77.9 | 58.4/75.0 | 34.1/34.4 | 42.0/68.6 | 43.2/64.8 | 17.8/47.3 |
Table 4. Comparison with SOTA methods on the multi-source DG Re-ID setting. The best and second-best results are marked with bold and underline, respectively. Data are collected from corresponding works.
| Model | Target: MA (mAP/R1) | Target: D (mAP/R1) | Target: MS (mAP/R1) | Target: C3 (mAP/R1) | Average (mAP/R1) |
|---|---|---|---|---|---|
| QAConv50 [69] | 39.5/68.6 | 43.4/64.9 | 10.0/29.9 | 19.2/22.9 | 28.0/46.6 |
| M3L [16] | 48.1/74.5 | 50.5/69.4 | 12.9/33.0 | 29.9/30.7 | 35.4/51.9 |
| M3L-IBN [16] | 50.2/75.9 | 51.1/69.2 | 14.7/36.9 | 32.1/33.1 | 37.0/53.8 |
| RaMoE [50] | 56.5/82.0 | 56.9/73.6 | 13.5/34.1 | 35.5/36.6 | 40.6/56.6 |
| PAT [72] | 51.7/75.2 | 56.5/71.8 | 21.6/45.6 | 31.5/31.1 | 40.3/55.9 |
| DEX [19] | 55.2/81.5 | 55.0/73.7 | 18.7/43.5 | 33.8/36.7 | 40.7/58.9 |
| UDSX [20] | 60.4/85.7 | 55.8/74.7 | 20.2/47.6 | 37.2/38.9 | 43.4/61.7 |
| SALDG [51] | 57.6/82.3 | 52.0/71.2 | 18.1/46.5 | 32.4/34.5 | 40.0/58.6 |
| DCAC (ours) | 56.7/80.0 | 58.9/75.4 | 27.5/56.7 | 42.5/43.6 | 46.4/63.9 |
Table 5. Ablations on various diffusion model fine-tuning methods. θa and θna denote the trainable parameters in the attention and non-attention layers of the denoising U-Net, respectively. “PT” denotes whether pre-trained diffusion model weights are used. Partial fine-tuning 1 and 2 are variants of full fine-tuning in which only the output blocks, or both the middle and output blocks, of the U-Net are trainable, respectively. ✓ and × denote whether the corresponding option is adopted. The best results are marked with bold.
| Model | PT | θa trainable | θna trainable | MA→MS (mAP/R1) | MS→MA (mAP/R1) |
|---|---|---|---|---|---|
| Baseline | – | – | – | 22.0/50.1 | 51.0/76.5 |
| DCAC (frozen) | ✓ | × | × | 21.3/49.7 | 50.8/76.5 |
| DCAC (LoRA adaptation) | ✓ | ✓ | × | 23.4/52.1 | 52.1/77.9 |
| DCAC (partial fine-tuning 1) | ✓ | ✓ | ✓ | 22.1/50.3 | 47.1/74.6 |
| DCAC (partial fine-tuning 2) | ✓ | ✓ | ✓ | 22.1/50.4 | 51.9/78.5 |
| DCAC (full fine-tuning) | ✓ | ✓ | ✓ | 21.8/50.1 | 51.5/76.6 |
| DCAC (train from scratch) | × | ✓ | ✓ | 21.9/50.3 | 51.7/77.6 |
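As a rough illustration of the LoRA-adaptation row above, the sketch below wraps the linear projections of the denoising U-Net’s attention layers (θa) with low-rank adapters while keeping the pre-trained weights frozen. The keyword-based module matching (to_q, to_k, to_v, to_out) follows common Stable Diffusion naming conventions and is an assumption, as is the whole helper; the rank r corresponds to the hyper-parameter swept in Table 9.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Wrap a frozen nn.Linear with a low-rank update W x + (alpha/r) * B A x."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                      # keep pre-trained weights frozen
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)


def add_lora_to_attention(unet, r=8, keywords=("to_q", "to_k", "to_v", "to_out")):
    """Replace attention-projection linears of a denoising U-Net with LoRA-wrapped
    versions; the naming keywords are an assumption about the backbone."""
    target_names = [
        name for name, module in unet.named_modules()
        if isinstance(module, nn.Linear) and any(k in name for k in keywords)
    ]
    for name in target_names:                            # replace after collecting, to avoid double-wrapping
        parent_name, _, child_name = name.rpartition(".")
        parent = unet.get_submodule(parent_name) if parent_name else unet
        setattr(parent, child_name, LoRALinear(getattr(parent, child_name), r=r))
    return unet
```

Only the lora_a and lora_b parameters (and the conditioning modules) then need to be passed to the optimizer, which is consistent with the small trainable-parameter count reported for LoRA adaptation in Table 6.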
Table 6. Ablations on the computational overhead of various diffusion model fine-tuning methods. Generalization performance is reported on MA→MS. For computational overhead, we report the latency and the number of tera floating-point operations (TFLOPs) of a forward pass to measure time efficiency. Additionally, GPU memory consumption and the number of trainable parameters are reported to measure space efficiency. The best results are marked with bold.
| Mode | Model | MA→MS (mAP/R1) | Time (ms) | TFLOPs | Memory (GB) | Parameters (M) |
|---|---|---|---|---|---|---|
| Training | Baseline | 22.0/50.1 | 140.3 | 1.46 | 7.58 | 85.94 |
| Training | DCAC (frozen) | 21.3/49.7 | 464.7 | 8.50 | 17.22 | 86.51 |
| Training | DCAC (LoRA adaptation) | 23.4/52.1 | 578.1 | 8.52 | 18.15 | 89.01 |
| Training | DCAC (partial fine-tuning 1) | 22.1/50.3 | 647.1 | 8.50 | 23.92 | 599.37 |
| Training | DCAC (partial fine-tuning 2) | 22.1/50.4 | 662.1 | 8.50 | 25.01 | 696.41 |
| Training | DCAC (full fine-tuning) | 21.8/50.1 | 735.9 | 8.50 | 28.24 | 946.03 |
| Inference | – | – | 40.3 | 1.46 | 2.27 | 85.55 |
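Latency, memory, and parameter counts of the kind reported in Table 6 can be approximated with standard PyTorch utilities. The sketch below is one hedged way to collect them (FLOP counting is omitted because it requires a separate profiler); it is not the exact measurement script used for the table.

```python
import time
import torch


def measure_forward_overhead(model, batch, device="cuda", warmup=10, iters=50):
    """Rough per-iteration forward latency, peak GPU memory, and trainable
    parameter count for a given model and input batch."""
    model = model.to(device).train()
    batch = batch.to(device)
    torch.cuda.reset_peak_memory_stats(device)

    for _ in range(warmup):                       # warm-up to stabilize CUDA kernels
        model(batch)
    torch.cuda.synchronize(device)

    start = time.perf_counter()
    for _ in range(iters):
        model(batch)
    torch.cuda.synchronize(device)
    latency_ms = (time.perf_counter() - start) / iters * 1000.0

    memory_gb = torch.cuda.max_memory_allocated(device) / 1024 ** 3
    trainable_m = sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6
    return latency_ms, memory_gb, trainable_m
```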
Table 7. Ablations on the conditioning scheme. “Instance-wise” denotes that the instance feature is directly adopted as the diffusion condition. “Class-wise” denotes that the classification probability belonging to the real ID class of the image is adopted to generate the diffusion condition. “Correlation-aware” denotes the proposed method, which adopts all ID classification probabilities to generate the diffusion condition in a linear combination manner. The best results are marked with bold.
| Conditioning Scheme | MA→MS (mAP/R1) | MS→MA (mAP/R1) |
|---|---|---|
| Baseline | 22.0/50.1 | 51.0/76.5 |
| DCAC (instance-wise) | 22.4/50.8 | 51.6/77.7 |
| DCAC (class-wise) | 22.6/51.2 | 50.5/76.5 |
| DCAC (correlation-aware) | 23.4/52.1 | 52.1/77.9 |
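For clarity, the three conditioning schemes compared in Table 7 can be summarized by the sketch below. The class-wise variant in particular is our illustrative reading of the caption (scaling the ground-truth ID prompt by its predicted probability), not a verbatim reproduction of the implementation.

```python
import torch


def build_condition(scheme, feat, probs, labels, id_prompts):
    """Illustrative tensors: feat (B, D) instance features, probs (B, K) ID
    probabilities, labels (B,) ground-truth ID indices, id_prompts (K, D)
    learnable ID-wise prompts."""
    if scheme == "instance-wise":
        return feat                                      # instance feature used directly as the condition
    if scheme == "class-wise":
        # only the ground-truth ID contributes (assumed interpretation)
        return probs.gather(1, labels[:, None]) * id_prompts[labels]
    if scheme == "correlation-aware":
        return probs @ id_prompts                        # linear combination over all IDs
    raise ValueError(f"unknown scheme: {scheme}")
```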
Table 8. Ablations on source domain Re-ID performance. The best results are marked with bold.
| Model | Market1501 (mAP/R1) | MSMT17 (mAP/R1) |
|---|---|---|
| Baseline | 86.4/94.4 | 70.4/88.1 |
| DCAC (instance-wise) | 86.4/94.7 | 70.4/88.5 |
| DCAC (class-wise) | 86.6/94.5 | 70.6/88.2 |
| DCAC (correlation-aware) | 86.8/94.9 | 70.1/88.3 |
Table 9. Parameter analysis on the rank r of LoRA adapters. The best results are marked with bold.
| Rank r | MA→MS (mAP/Rank-1) | MS→MA (mAP/Rank-1) |
|---|---|---|
| 8 | 23.4/52.1 | 51.4/77.3 |
| 16 | 22.1/50.7 | 51.7/77.9 |
| 32 | 22.4/51.4 | 52.1/77.9 |
| 64 | 22.9/51.6 | 51.1/77.5 |
Table 10. Further studies on conditioning schemes more sophisticated than the correlation-aware conditioning via linear combination. Non-linear activations (SiLU [77]) and normalization layers (batch normalization [78]) are applied to the generated conditions. ✓ and × denote whether the corresponding option is adopted. The best results are marked with bold.
| Model | SiLU | BN | MA→MS (mAP/R1) | MS→MA (mAP/R1) |
|---|---|---|---|---|
| Baseline | – | – | 22.0/50.1 | 51.0/76.5 |
| DCAC | × | × | 23.4/52.1 | 52.1/77.9 |
| + Non-linearity | ✓ | × | 22.4/50.8 | 50.4/76.9 |
| + BatchNorm | × | ✓ | 22.5/50.9 | 51.0/76.8 |
| + ConditionNet | ✓ | ✓ | 21.9/50.1 | 50.5/76.2 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
