Article

PSF-C-Net: A Counterfactual Deep Learning Model for Person Re-Identification Based on Random Cropping Patch and Shuffling Filling

1 College of Information Science and Engineering, Northeastern University, Shenyang 110819, China
2 School of Opto-Electronic Engineering, Zaozhuang University, Zaozhuang 277160, China
* Author to whom correspondence should be addressed.
Mathematics 2024, 12(13), 1957; https://doi.org/10.3390/math12131957
Submission received: 28 May 2024 / Revised: 17 June 2024 / Accepted: 21 June 2024 / Published: 24 June 2024

Abstract:
In the task of person re-identification (re-ID), capturing the long-range dependency of instances is crucial for accurate identification. The existing methods excel at extracting local features but often overlook the global information of instance images. To address this limitation, we propose a convolution-based counterfactual learning framework, called PSF-C-Net, to focus on global information rather than local detailed features. PSF-C-Net adopts a parameter-sharing dual-path structure to perform counterfactual operations in the prediction space. It takes both the actual instance image and a counterfactual instance image that disrupts the contextual relationship as the input. The counterfactual framework enables the interpretable modeling of global features without introducing additional parameters. Additionally, we propose a novel method for generating counterfactual instance images, which effectively constructs an explicit counterfactual space, to reliably implement counterfactual strategies. We have conducted extensive experiments to evaluate the performance of PSF-C-Net on the Market-1501 and Duke-MTMC-reID datasets. The results demonstrate that PSF-C-Net achieves state-of-the-art performance.

1. Introduction

In recent years, person re-identification has become increasingly important in criminal investigations, security prevention, and various public environments (such as shopping malls and streets) to enhance people’s safety [1].
Person re-identification refers to the process of searching, querying, and accurately matching target individuals in gallery images captured by various cameras from different angles [2]. Due to the complexity of the environments in which people appear, casually captured images of pedestrians are affected by various factors such as the surrounding conditions and object occlusion. Consequently, it is vital to develop a model that not only minimizes the loss of image feature information but also accurately identifies the target individual [3].
In the field of person re-identification, convolutional models are widely used as the backbone network structures. Examples of these models include ResNet101, ResNet152, SeResNet101, and VGGNet [4]. These models are skilled at extracting local features and are frequently expanded upon by researchers. However, convolutional models fail to capture the long-range dependencies within instances during feature extraction, information that is crucial for person re-identification tasks (e.g., at the segment scale). Additionally, because the existing datasets are relatively small, incorporating global information can help to expand the semantic space of features. Therefore, many current methods utilize Transformer structures or extract multi-scale features to model global features. For instance, Jia et al. [5] proposed using Transformer structures to decouple person occlusion features without alignment, while Zhang et al. [6] proposed the AAB attribute module, which uses reinforcement learning to eliminate noise attributes and aggregate attribute attention to improve Re-ID performance. Gong et al. also proposed the Local Attention Guided Network (LAG-Net) [7], which leverages a Local Attention System (LAS) to extract important local details and exploit the most salient areas among different people. However, these methods require significant computational resources and lack interpretability. To address this issue, this paper proposes an interpretable counterfactual framework that can simultaneously model local and global features with a reduced computational burden. The proposed framework offers a clearer and more concise structure compared to other methods.
Generating effective counterfactual data poses a major challenge when implementing a counterfactual strategy. Currently, generative models like generative adversarial networks (GANs) or variational autoencoders (VAEs) are widely employed for generating counterfactual data. For example, Zheng et al. introduced DGNet [8], which utilizes a GAN to transfer clothing items across individuals by manipulating the appearance and structural details, thereby generating a broader range of diverse counterfactual data. In contrast, Yang et al. proposed the causal VAE [9], which utilizes a VAE to disentangle instance features and generate a substantial amount of counterfactual data. However, these methods typically require a larger quantity of raw data and are challenging to train. Another approach to generating counterfactual spaces involves modifying feature spaces. For instance, Rao et al. developed CAL [10], which generates counterfactual spaces by manipulating attention maps. However, these methods may lead to slower model convergence and the truncation of local features in convolutional extraction. Therefore, this paper presents a counterfactual strategy for generating counterfactual data that directly constructs counterfactual instances within the instance space. This strategy facilitates the reliable and effective generation of counterfactual data.
Due to the limited size of the existing datasets, researchers must address the issue of overfitting in deep network models. Currently, most methods for mitigating overfitting directly disrupt the contextual relationships within instances. For instance, random erasing [11] randomly selects a rectangular area in an image and replaces its pixels with random values. However, existing random erasing methods often erase connected areas, leading to a loss of local image features and reduced model robustness. Moreover, filling the erased regions with fixed pixel values alters the pixel value distribution of the original image; consequently, local image features are sacrificed, and the final predictions are affected. As depicted in Figure 1, this paper introduces a counterfactual data generation method based on random cropping patch and shuffling filling (RCPSF). This method generates counterfactual instances that disrupt the global information while preserving the intricate local image features, without introducing additional noise.
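For concreteness, the random erasing baseline [11] can be sketched as follows; the parameter names (`sl`, `sh`, `r1`) and their defaults are illustrative placeholders, not the original implementation:

```python
import math
import random
import torch

def random_erasing(img: torch.Tensor, p: float = 0.5,
                   sl: float = 0.02, sh: float = 0.4, r1: float = 0.3):
    """Erase one random rectangle in a CxHxW float image tensor (values
    in [0, 1]), filling it with random values -- a sketch of [11]."""
    if random.random() > p:
        return img
    _, h, w = img.shape
    for _ in range(100):  # retry until a rectangle fits inside the image
        area = random.uniform(sl, sh) * h * w        # target erased area
        ratio = random.uniform(r1, 1.0 / r1)         # target aspect ratio
        eh = int(round(math.sqrt(area * ratio)))
        ew = int(round(math.sqrt(area / ratio)))
        if eh < h and ew < w:
            y = random.randint(0, h - eh)
            x = random.randint(0, w - ew)
            img[:, y:y + eh, x:x + ew] = torch.rand(img.size(0), eh, ew)
            return img
    return img
```

Note how the erased region is a single connected rectangle, which is exactly the loss of contiguous local detail that RCPSF is designed to avoid.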
This study introduces a novel approach, the counterfactual deep convolutional network, which incorporates the RCPSF techniques to address the challenges mentioned earlier. The network effectively captures global features and takes into account the long-range dependencies among instances. Additionally, the proposed counterfactual strategy enables the generation of counterfactual data in a simple and reliable manner. By utilizing the RCPSF method during the data generation process, the network preserves the local detailed features of the original images while eliminating their contextual information. The main contributions of this paper can be summarized as follows:
  • In this study, we propose a counterfactual framework aimed at improving the accuracy of person re-identification by explicitly incorporating long-range dependent features. The proposed framework adopts a dual-path counterfactual architecture with shared convolutional parameters, simultaneously accepting both factual and counterfactual instances as inputs to disrupt contextual relationships. The framework also implements controllable counterfactual operations in the prediction space to interpret global features without the need for additional parameters.
  • In this study, we propose a method for generating counterfactual data at the instance level. This method utilizes a counterfactual strategy to construct counterfactual instances directly in the instance space. The main objective of our proposed method is to efficiently create an explicit counterfactual space that enables the reliable implementation of counterfactual predictions. Specifically, we randomly crop patches of equal size from the original image and shuffle them before filling the image back. This process breaks the context relationship within the image without introducing any additional noise.
  • In this study, we conducted an evaluation of the proposed method using widely recognized benchmarks for person re-identification, namely the Market-1501 and Duke-MTMC-ReID datasets. The experimental results demonstrate that our proposed method consistently achieves state-of-the-art performance across various scenarios.
The remainder of this article is structured as follows. Section 2 provides an overview of the existing work in the field of person re-identification. In Section 3, we present a detailed description of our proposed person re-identification network framework. Section 4 covers the two major benchmark datasets used in the experiments, along with the evaluation protocols and implementation specifics; a comparative analysis of the experimental results on these benchmark datasets is also included. Finally, in Section 5, we conclude with a summary and discuss potential future directions for this research.

2. Related Work

Person re-identification has been the subject of extensive research due to its crucial significance. The existing methods often employ techniques such as stripe-based local features or incorporate additional semantic information. In contrast, our approach focuses solely on utilizing global features. In this section, we provide a concise overview of the relevant literature on dual-path deep convolutional networks and causal counterfactual inference, on which the proposed counterfactual framework is based.

2.1. Deep Model for Person Re-Identification

The main challenge in person re-identification is the effectiveness of feature extraction. Two multi-branch deep networks, namely the PCB network proposed by Sun et al. [12] and the MGN network proposed by Wang et al. [13], have achieved state-of-the-art performance on various benchmarks such as Market-1501, DukeMTMC-reID, CUHK03, VIPeR, and PRID2011. The MGN network [13] enhances person re-identification by using multi-scale features. It divides images into horizontal bands processed through separate branches, each extracting fine-grained and coarse-grained features. This approach captures detailed local and global context, resulting in comprehensive person representation. The hierarchical MGN structure, with a ResNet50 backbone and three branches, ensures robust feature extraction and aggregation, improving the model’s handling of pose, occlusion, and background variations. MGN combines global average pooling and fully connected layers to reduce feature map dimensionality while retaining essential information, achieving state-of-the-art performance with 95.7 % Rank-1 accuracy and 86.9 % mAP on Market-1501, and high accuracy on DukeMTMC-reID. The SCPNet network [14] extracts features from each channel set to represent a specific spatial region of the person’s body and utilizes spatial-channel correlation to supervise the network in learning more robust features. Additionally, some researchers have proposed incorporating human pose information into Re-ID networks. For instance, Miao et al. [15] introduced a pose-guided feature alignment (PGFA) method that utilizes human semantic key-points to guide the matching of a probe and gallery. Furthermore, attention mechanisms have been employed in certain Re-ID models to avoid the need for image segmentation. However, effectively guiding the network to focus on relevant regions of interest remains a challenge.
The use of a single convolutional neural network often limits the detailed feature extraction of the original instance. In contrast, models based on dual-path deep neural networks employ a dual-path structure with shared parameters to perform multi-feature extraction and learning. For example, Fu et al. [16] proposed a dual-path spatial segmentation network, where the dual-path deep network separately extracts and learns global and local features. Farooq et al. [17] utilized a visual and verbal dual-path deep network to incorporate textual information about the person and match their identity. Yang et al. [18] fused the features extracted by the dual-path neural network to obtain a comprehensive spatiotemporal representation of personal information. Chen et al. [19] introduced a dual-path deep network to learn the appearance representation of a person with identity discrimination. By using a dual-path deep neural network, researchers aim not only to embed input images more effectively compared to a single network but also to further reduce the false matching rate. In this work, we propose a dual-path deep network based on counterfactual inference, which explicitly complements the long-range dependent features during semantic feature extraction. Our proposed neural network solely utilizes global features, resulting in a lower computational cost and a simple and clear model structure while maintaining a certain level of competitiveness in person re-identification.

2.2. Causal Counterfactual in Vision

Researchers have integrated causal inference tools with deep learning and machine learning in their studies. Causal inference tools enable us to eliminate the influence of confounding factors and identify the causal relationships between the variables in the model [20]. For example, Rao et al. [10] proposed causal counterfactual attention learning to assist the network model in learning the true region of interest. Yang et al. [9] performed correlation decomposition on person re-identification domain classes, while Jin et al. [21] introduced a double causal loss constraint on the signal-to-noise ratio to encourage the separation of person-related and person-independent features. Building upon the extensive utilization of causal reasoning tools, this paper presents a neural network framework that combines causal counterfactual inference with a counterfactual data generation strategy. By extracting comprehensive global context information and robust features, this framework uncovers the causal relationship between the shuffled filling patch regions in the counterfactual images and the corresponding regions in the original factual images. Consequently, it enhances the overall feature representation of instance images and reduces the prediction bias of person instance images caused by the shuffled filling patch areas.
Note: Despite the significant advancements, the existing person re-identification models still face several challenges. Many models focus heavily on local features, potentially missing out on important global contextual information. They also often struggle with variations in pose, occlusions, and background clutter, which can degrade performance. Additionally, some models tend to overfit to specific training data, reducing their generalization capability in real-world scenarios. High computational costs and a lack of interpretability are further issues that hinder the practical deployment and understanding of these models. Addressing these challenges is essential for developing more robust, efficient, and interpretable person re-identification systems.

3. Methods

In this section, we describe the proposed counterfactual structure of PSF-C-Net in Section 3.1, which includes a parameter-sharing dual-path structure and counterfactual prediction at the output layer. We then present the RCPSF method for counterfactual data generation in Section 3.2. In Section 3.3, we introduce a counterfactual feature learning strategy designed for PSF-C-Net, which utilizes interpretable global features to mitigate the prediction bias of person instance images resulting from broken context relationships. Furthermore, we explain the training and inference strategies in detail. The proposed network framework and method are illustrated in Figure 2.

3.1. Person Instance Long-Range Feature Extraction

We propose a parameter-sharing counterfactual dual-path structure to serve as a feature extractor that considers the long-range dependencies of instances, allowing for the modeling of both local and global features of a person simultaneously.
We construct a dual-path network structure with shared parameters, where the first path takes the raw factual data as input and the other path takes the corresponding counterfactual data as input. During data loading, we randomly select $K$ identities and $M$ images for each person to form a training batch, so the batch size is $B = K \times M$. For a factual image $x$ and a counterfactual image $x'$, the original factual image sequence is denoted as $T_n = [I_1^{G_n}, I_2^{G_n}, \dots, I_B^{G_n}] \in \mathbb{R}^{B_G \times h_e \times w_e \times 3}$, and the counterfactual image sequence is represented as $T_n' = [I_1^{F_n}, I_2^{F_n}, \dots, I_B^{F_n}] \in \mathbb{R}^{B_F \times h_e \times w_e \times 3}$. The CNN backbone generates feature maps for the factual and counterfactual images, denoted as $X = \mathrm{CNN}(x) \in \mathbb{R}^{C \times H \times W}$ and $X' = \mathrm{CNN}(x') \in \mathbb{R}^{C \times H \times W}$, where $C$, $H$, and $W$ denote the channel dimension, height, and width of the feature maps, respectively.
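A minimal sketch of this parameter-sharing dual-path forward pass is given below. The backbone and last-stride setting follow the implementation details of Section 4.1.3; the class name `DualPathNet` and the pooling/classifier head are illustrative assumptions rather than the released code:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class DualPathNet(nn.Module):
    """Factual and counterfactual paths share one set of CNN weights
    (a sketch under the assumptions stated above)."""
    def __init__(self, num_classes: int):
        super().__init__()
        backbone = resnet50(pretrained=True)               # ImageNet weights [25]
        backbone.layer4[0].downsample[0].stride = (1, 1)   # last stride = 1
        backbone.layer4[0].conv2.stride = (1, 1)
        self.features = nn.Sequential(*list(backbone.children())[:-2])
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.classifier = nn.Linear(2048, num_classes)

    def forward(self, x: torch.Tensor, x_cf: torch.Tensor):
        # Both inputs pass through the *same* parameters.
        f = self.pool(self.features(x)).flatten(1)         # factual path
        f_cf = self.pool(self.features(x_cf)).flatten(1)   # counterfactual path
        return self.classifier(f), self.classifier(f_cf)
```

Because the two paths are literally the same module called twice, the counterfactual branch adds no backbone parameters, which is what allows the framework to model global features "without introducing additional parameters".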

3.2. RCPSF of Counterfactual Data Generation

Considering the small scale of the existing datasets, most current methods, such as random erasing, disrupt the contextual relationships on the instance, which erases many attribute features of the original instance image. To address this problem, and because CNN-based methods have advantages for fine-grained local feature extraction, we advocate the use of RCPSF. Firstly, several patches are extracted from the instance image; the patch areas are pixel-erased at the same time as the features are extracted; and the patches are then shuffled and filled back into the instance image to generate a fine-grained counterfactual image that retains its features. In short, the RCPSF method randomly selects and crops multiple rectangular patches from the original image, shuffles these patches to disrupt contextual relationships, and then fills them back into their original locations, preserving local details while altering the global context.
In the model training process, we apply RCPSF to the training batch with a certain probability. Specifically, given an image $X$ in a batch, we set the probability of recombination to $p$ and the probability of no change to $1 - p$. As illustrated in Figure 2, we randomly select rectangular areas of the same size $(W_e, H_e)$ in the image and save them as counterfactual patches. The area ratio of a counterfactual patch is denoted as $r_e = S_e / S$, where $S_e$ and $S$ represent the area of the counterfactual patch and the image, respectively. The list of counterfactual patches is denoted as $I_e = [I_{e_1}, I_{e_2}, \dots, I_{e_n}]$. By shuffling the extracted counterfactual patches, a shuffled patch list $I_e' = [I_{e_{x_1}}, I_{e_{x_2}}, \dots, I_{e_{x_n}}]$ is generated, where $x_i \in [1, n]$. Then, the shuffled counterfactual patches are filled back into the original images to produce counterfactual training images.
The random cropping and shuffled filling of pixel patches enables the extraction of fine-grained features while global features are extracted. In practical applications, flipping operations are also applied to the pixel patches; increasing the perturbation helps to improve the robustness of the person re-identification model. Compared with other methods, our approach eliminates supervised signals for fine-grained local detail features, resulting in a more concise method. We are particularly interested in randomly cropped patches because they introduce variability and perturbation that can enhance the model's ability to generalize and resist overfitting.
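The procedure above can be condensed into the following sketch. The patch count `n`, the patch size, and the flip probability are illustrative placeholders; in the paper they are governed by the area ratio $r_e$ and the probability $p$ described above:

```python
import random
import torch

def rcpsf(img: torch.Tensor, p: float = 1.0, n: int = 4,
          patch_hw: tuple = (32, 16), flip: bool = True) -> torch.Tensor:
    """Random Cropping Patch and Shuffling Filling (sketch).
    Crops n equal-size patches, shuffles them (optionally flipping),
    and writes them back into the original locations, breaking global
    context while keeping local detail. Overlapping patches are simply
    overwritten in order."""
    if random.random() > p:
        return img                                   # no change with prob 1 - p
    _, h, w = img.shape
    ph, pw = patch_hw
    coords = [(random.randint(0, h - ph), random.randint(0, w - pw))
              for _ in range(n)]
    patches = [img[:, y:y + ph, x:x + pw].clone() for (y, x) in coords]
    random.shuffle(patches)                          # disrupt the context
    for (y, x), patch in zip(coords, patches):
        if flip and random.random() < 0.5:
            patch = torch.flip(patch, dims=[2])      # horizontal flip
        img[:, y:y + ph, x:x + pw] = patch
    return img
```

Because the filled pixels all come from the image itself, the pixel-value distribution is unchanged, which is the key difference from filling erased regions with fixed or random values.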

3.3. Counterfactual Data Learning

Researchers often use causal diagrams or structural causal models to describe real-world correlation features and their interactions. We use the directed acyclic graph $G = (N, \varepsilon)$ to redefine and represent the causal counterfactual structure module in Figure 3. Each variable in the model is a node in $N$, and $\varepsilon$ represents the interactions between the variables. As shown in Figure 3, we represent the proposed Re-ID framework as nodes of a directed acyclic graph, including the original factual image $I$, the counterfactual image generated using the RCPSF counterfactual data generation method, and the final predicted result $Y$. The link $I \rightarrow I_e$ indicates the cropped counterfactual pixel patches, and $(I, I_e) \rightarrow Y$ indicates that the prediction is determined by the original factual image and the pixel patches. The causal relationships between nodes are encoded in the links $\varepsilon$, where we call node $I$ the causal parent of $I_e$, and $Y$ the causal child of $I$ and $I_e$.
Sufficient counterfactual examples are effective in improving the generalization ability of models and reducing overfitting. However, few people have paid attention to the impact of counterfactual instances on the final prediction. In this paper, we use causal reasoning as a tool to analyze the causal relationship between the original factual image, the generated counterfactual image, and the final prediction. We suggest using causality to fully supplement person ID information and help the network to learn a more complete global context feature relationship, thereby reducing the dataset training bias caused by counterfactual images.
Based on the causal relationship of the variables in the causal diagram, we directly manipulate several of the causal variables and analyze the causal relationship between the variables. In our proposed methodological framework, we use counterfactual shuffling filling patches instead of the original factual image region to perform controlled counterfactual intervention operations on the input images.
During the learning process, we use $Y_{\mathrm{effect}}$ to represent the effect of the counterfactual pixel patches on the final person prediction. According to Figure 3, we can obtain
$$Y_{\mathrm{effect}} = \left[\, Y(I_e = I_e,\ I = I) - Y(I_e = I_e',\ I = I) \,\right].$$
In this work, we introduce a strong supervisory signal for the factual and counterfactual features, which are extracted from the parameter-sharing dual-path structure. This signal not only measures the distance between the generated counterfactual image and the original factual image but also provides us with tools and methods to analyze the impact of the counterfactual shuffled filling patches on the final prediction.
$$\mathcal{L} = \mathcal{L}_{ce}(Y_{\mathrm{effect}},\ y) + \mathcal{L}_{\mathrm{others}}.$$
Here, $y$ is the classification label, $\mathcal{L}_{ce}$ is the cross-entropy loss, and $\mathcal{L}_{\mathrm{others}}$ represents the original objective, which contains the standard classification loss, triplet loss, and center loss.
By optimizing the proposed new objectives, we aim to achieve two goals. (1) The proposed model should explicitly and succinctly supplement long-range dependent features and extract more valuable feature information. (2) We penalize counterfactual predictions in counterfactual frameworks, which compels the classifier to mitigate the impact of biased training sets due to insufficient information.
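A minimal sketch of this objective is shown below, assuming the dual-path logits from the earlier sketch. Here `lambda_cf` stands for the balance weight $\lambda$ of Section 4.3.2, and the triplet and center terms of $\mathcal{L}_{\mathrm{others}}$ are noted but not implemented:

```python
import torch
import torch.nn.functional as F

def counterfactual_loss(logits: torch.Tensor, logits_cf: torch.Tensor,
                        labels: torch.Tensor, lambda_cf: float = 1.0):
    """L = L_ce(Y_effect, y) + L_others (sketch).
    Y_effect is the prediction gap between the factual path and the
    counterfactual (shuffled) path; penalizing its cross-entropy pushes
    the model to rely on the global context that shuffling destroys."""
    y_effect = logits - logits_cf                 # effect of shuffled patches
    loss_cf = F.cross_entropy(y_effect, labels)   # counterfactual term
    loss_id = F.cross_entropy(logits, labels)     # standard ID loss
    # The triplet and center losses of the strong baseline [27] would be
    # added here as well; their implementations are omitted for brevity.
    return loss_id + lambda_cf * loss_cf
```

Training on $Y_{\mathrm{effect}}$ rather than on the factual logits alone is what makes the counterfactual intervention controllable at the output layer.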

4. Experiments

4.1. Datasets and Evaluation Metrics

4.1.1. Dataset

We used two datasets for our experiments. The Market-1501 [22] dataset contains 32,668 images of 1501 individuals captured from 6 camera views. Following the standard setting, the entire dataset is split into a training set with 12,936 images of 751 individuals and a testing set with 19,732 images of 750 individuals. The Duke-MTMC-reID [23] dataset contains 36,411 images of 1812 people from 8 cameras. A total of 16,522 images of 702 individuals are randomly selected as the training set, and the remaining images are divided into the testing set, which consists of 2228 query images and 17,661 gallery images.

4.1.2. Evaluation Metric

We adopt the cumulative matching characteristic (CMC) and the mean average precision (mAP) as evaluation metrics to assess the performance of the proposed approach.
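For reference, a simplified version of these metrics can be computed as follows; this sketch omits the camera-ID and junk-image filtering used in the official Market-1501 and Duke-MTMC-reID protocols:

```python
import numpy as np

def cmc_map(dist: np.ndarray, q_ids: np.ndarray, g_ids: np.ndarray,
            topk: int = 5):
    """Compute the CMC curve (up to topk) and mAP from a query-by-gallery
    distance matrix. Simplified: no camera-ID/junk filtering."""
    num_q = dist.shape[0]
    cmc = np.zeros(topk)
    aps = []
    for i in range(num_q):
        order = np.argsort(dist[i])                     # closest gallery first
        matches = (g_ids[order] == q_ids[i]).astype(np.float32)
        if matches.sum() == 0:
            continue                                    # no ground truth match
        first = int(np.argmax(matches))                 # rank of first hit
        if first < topk:
            cmc[first:] += 1                            # hit at rank >= first
        hits = np.cumsum(matches)                       # precision at each rank
        precision = hits / (np.arange(len(matches)) + 1)
        aps.append((precision * matches).sum() / matches.sum())
    return cmc / num_q, float(np.mean(aps))
```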

4.1.3. Implementation Details

We conducted our experiments using the ResNet50 [24] architecture, pre-trained on the ImageNet dataset [25]. The input image size was set to 256 × 128, and the last stride of the backbone network was set to 1. We formed a training batch by randomly sampling 16 labeled individuals with 4 images of each person. The probability of applying the RCPSF counterfactual data generation method was set to $p = 1.0$. The model was trained using the Adam optimizer, with an initial learning rate of 0.0004, attenuated by a factor of 0.1 every 40 epochs [26]. All models used in the experiments were implemented in Python with PyTorch 1.10.1 and were run on computers equipped with RTX 3090 GPUs.
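These settings correspond roughly to the training configuration sketched below, reusing the hypothetical `DualPathNet`, `rcpsf`, and `counterfactual_loss` from the earlier sketches; the epoch budget and the PK-sampled data loader are assumptions:

```python
import torch
from torch.optim.lr_scheduler import StepLR

K, M = 16, 4                         # identities x images per identity
batch_size = K * M                   # B = K x M = 64

model = DualPathNet(num_classes=751).cuda()      # 751 training IDs (Market-1501)
optimizer = torch.optim.Adam(model.parameters(), lr=4e-4)
scheduler = StepLR(optimizer, step_size=40, gamma=0.1)   # x0.1 every 40 epochs

for epoch in range(120):             # epoch budget is an assumption
    for imgs, labels in train_loader:             # PK-sampled loader (assumed)
        imgs, labels = imgs.cuda(), labels.cuda()
        # Build the counterfactual batch with RCPSF applied at p = 1.0.
        imgs_cf = torch.stack([rcpsf(im.clone(), p=1.0) for im in imgs])
        logits, logits_cf = model(imgs, imgs_cf)
        loss = counterfactual_loss(logits, logits_cf, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```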

4.2. Ablation Experiments

We conducted ablation studies on the Market-1501 and Duke-MTMC-reID datasets to validate the effectiveness of the counterfactual framework and the counterfactual data generation method. Our baseline model is the strong baseline [27] with ID (cross-entropy) loss, triplet loss, and center loss.

4.2.1. Superiority of the Model

The counterfactual framework is represented as B + T w/ $\mathcal{L}_{ce}$. As shown in Table 1, the network framework combined with counterfactual feature learning improves the recognition performance of the baseline. B + T w/ $\mathcal{L}_{ce}$ improves the Rank-1 and mAP by 0.3% and 0.7% over the benchmark network on the Market-1501 dataset and also achieves good results on Duke-MTMC. These results demonstrate that the counterfactual framework has a strong ability to model local and global features. Moreover, they verify the effectiveness of the proposed method and confirm the effectiveness of counterfactual learning for the person Re-ID task.

4.2.2. Training Parameters

As shown in Table 2, we report the number of training parameters for PSF-C-Net and the baseline [27] to show the complexity of each method. Overall, our PSF-C-Net only requires one forward propagation per epoch, with a slight increase of 2M parameters compared to the baseline [27], while achieving better performance.

4.2.3. RCPSF Improves Different Baseline Models

We evaluated the RCPSF counterfactual data generation method described in Section 3.2. In this experiment, we designed a model, denoted as B + S, which incorporates only the proposed counterfactual data generation method into the baseline model while maintaining the original training strategy during the training process. As shown in Table 3, the proposed counterfactual data generation method improves the evaluation metrics. Considering the small scale of person re-identification datasets, the counterfactual data generation method not only preserves the detailed local characteristics of the original images but also achieves good results against overfitting.
The three baselines used in person Re-ID are IDE [28], baseline [27], and Pyramid [29]. All networks follow their own original training settings, and the input images are resized to 256 × 128. When implementing RCPSF in these baseline models, we observe that RCPSF consistently improves the Rank-1 accuracy and mAP, as shown in Table 3. Specifically, on Market-1501, RCPSF improves the Rank-1 by 2.50% and the mAP by 4.94% for IDE using ResNet-50. On Duke-MTMC-reID, RCPSF increases the Rank-1 accuracy from 71.99% to 74.86% and the mAP from 51.29% to 56.53% for IDE with ResNet-50. In addition, we used more complex network structures on top of the baseline to verify the reliability of our method. As shown in Table 3, as the network deepens, the proposed method reduces the risk of overfitting and improves Re-ID performance. We have also applied the method to the Pyramid model and likewise obtained good results.
Table 4 shows the performance comparison against other state-of-the-art data augmentation methods. In the experiments, PSF-C-Net was also trained with regularization methods such as random erasing, AugMix, and self-augmentation. RCPSF increases the mAP metric by 0.5% or more.

4.3. Choice of Parameter Setting

4.3.1. Counterfactual Data Generation Probability p

We conducted a study of the relevant parameter settings in the PSF-C-Net network by setting them to different values and evaluating the Re-ID performance. The experimental results on the Market-1501 dataset are presented in Figure 4. First, we analyzed the effect of the probability parameter $p$, which represents the probability of applying RCPSF. As shown in Figure 4a, as the value of $p$ increases, the Rank-1 and mAP also increase, by 0.7% and 1.4% at $p = 1.0$. The overall trend of the two evaluation metrics indicates that the patches generated by the RCPSF method effectively disrupt the image context information while preserving the local detailed features of the original image. As $p$ increases, the relationship between the local detailed features of the patch areas in PSF-C-Net and the instance context becomes increasingly significant.

4.3.2. Counterfactual Loss Weights

Figure 4b illustrates the effect of the scale factor $\lambda$, which represents the equilibrium weight in counterfactual learning. As the value of $\lambda$ increases, the Rank-1 and mAP increase, by 0.7% and 1.4% at $\lambda = 1.0$, indicating that counterfactual learning is beneficial for learning better local detailed features. However, the performance decreases as $\lambda$ continues to increase, since the weights of the ID loss and triplet loss decrease, affecting the feature distribution of the ID and triplet losses in the feature space.

4.4. Effectiveness of Each Loss

In Table 5, we show the contribution of each loss to the final performance on the Market-1501 dataset.
The ID loss $\mathcal{L}_{ce}$ [34] treats the Re-ID training process as an image classification problem, where different images of the same individual are regarded as one category. In this network, we incorporated label smoothing [35] to prevent the overfitting of the training IDs. We observed a significant drop in performance when $\mathcal{L}_{ce}$ was removed, with a decrease of 13.5% in Rank-1 score and 25% in mAP. This suggests that $\mathcal{L}_{ce}$ can effectively predict the person's information and attribute characteristics.
The triplet loss $\mathcal{L}_{Tri}$ treats the Re-ID training process as a problem of retrieving similar images of the same identity by learning the similarity between images. Without $\mathcal{L}_{Tri}$, the performance dropped by 5.8% in mAP and 3.4% in Rank-1. This result indicates that $\mathcal{L}_{Tri}$ is more effective in discriminating similar images.
The counterfactual loss $\mathcal{L}_{cl}$ regards the Re-ID training process as a problem of reducing prediction bias through causal counterfactual inference. We construct counterfactual images to force the convolutional model to extract more representative detailed features. $\mathcal{L}_{cl}$ is able to mitigate the prediction bias of person instance images arising from the counterfactual shuffled filling patch regions. Our experiments show that the contribution of the combination of $\mathcal{L}_{ce}$ and $\mathcal{L}_{cl}$ is similar to that of $\mathcal{L}_{ce}$ and $\mathcal{L}_{Tri}$. Without $\mathcal{L}_{cl}$, the performance drops by 4.8% in mAP and 3.4% in Rank-1.
The ID center loss $\mathcal{L}_{center}$ [34] is integrated into the learning process of the deep features of each class center, treating Re-ID training as a central clustering problem. Removing $\mathcal{L}_{center}$ leads to a decrease in performance of 1.2% mAP and 0.4% Rank-1. $\mathcal{L}_{center}$ can achieve better clustering and loss balance.

4.5. Ranking Results

We present some retrieval examples, with five retrieved images for each query, in Figure 5. As the visualization shows, our PSF-C-Net obtains better retrieval results than the baseline. The first results show strong robustness: despite the challenges of low resolution and similar clothing colors, the PSF-C-Net features can still robustly represent the first person's identity. The results of the second and third query images are also impressive, where our proposed PSF-C-Net establishes both global features and local fine-grained features. Based on detailed clues such as white and red short sleeves, morphological signs, etc., even without a backpack, the target person can be accurately matched by relying on the comparison of fine-grained features.

4.6. Comparison with State-of-the-Art Methods

We compare our proposed network with the state-of-the-art methods on two datasets, Market-1501 and Duke-MTMC, and report the results in Table 6. We consider four types of person re-identification methods for comparison: stripe-feature-based methods (PCB+RPP [12], MGN [13], Pyramid [29], Auto-ReID [37], and GCP [38]); attention-based approaches (IANet [39], CASN+PCB [36], CAMA [40], MHN-6 [41], SCAL [42], and CAL [10]); methods based on additional semantic features, such as SPReID [43], spatiotemporal information like st-reID [45] and InSTD [44], semantic feature alignment using DSA-reID [46], ID attribute information including AANet [47], GPS [48], and TransReID [49], and human pose points using HONet [50]; and global-feature-based approaches, such as DMML [51], SFT [52], Circle [53], and the baseline [27]. From the results in Table 6, we observe that the GCP [38] network based on stripe features and the SCAL [42] network using an attention mechanism achieve excellent Re-ID recognition results. However, when the GCP [38] network uses only global features, its Rank-1 and mAP indicators are not ideal. Additionally, the modified attention mechanism approach that focuses on a more specific region of interest has proven effective, but it requires more training parameters and involves significant computational costs during training. Transformer-based decoupled characterization methods like DRL-Net [5] achieve recognition accuracy similar to our proposed method, but our method is lighter, with a simpler and more effective loss function. Some methods, such as InSTD [44], integrate a time-series model, a spatial-sequence model, and visual feature extraction across three dimensions based on additional semantic features, whereas our method only extracts visual features; if we expanded our method to incorporate spatiotemporal characteristics, we could achieve even more remarkable accuracy. Among all the methods that use only global features, our proposed network achieves the best results and outperforms other approaches on the Market-1501 and Duke-MTMC-reID datasets.
The proposed PSF-C-Net achieves the best performance on the Market-1501 dataset, with a Rank-1 of 95.2% and an mAP of 87.3%. Our experimental results further indicate that PSF-C-Net is more effective in improving the mAP indicator, which reflects the average ranking of the target image, indicating that it can effectively improve the matching ranking of the target task image. On the Duke-MTMC dataset, individuals often carry personal attributes such as backpacks and other items. Our proposed PSF-C-Net network, which only extracts global features without attribute guidance, achieves a Rank-1 of 87.1% and an mAP of 76.9%. Some methods based on extra attribute guidance prove less competitive than the proposed PSF-C-Net network. By contrast, our proposed counterfactual framework can fully exploit the potential and important personal context relationships and robust features.

4.7. Baseline Meets State of the Art

In our study, we have integrated our proposed method into several popular Re-ID algorithms to compare their performance before and after modification. Given the numerous excellent algorithms available for Re-ID, we selected several typical models, namely the modified counterfactual attention method CAL [10], which uses an attention mechanism, and the MGN [13] network, which is based on stripe features. To ensure a fair comparison, we trained the models using the same losses reported in their respective papers. For instance, in the CAL [10] model, we eliminated the center loss in the original benchmark and used only the ID loss, triplet loss, and our proposed counterfactual loss. A comprehensive summary of the experiment's details and results is provided in Table 7, where † denotes the network newly implemented based on our counterfactual framework.
We modified CAL [10] based on PSF-C-Net from scratch. On top of PSF-C-Net, CAL† adds the counterfactual-adversarial correction of fake attention and is 0.2% and 0.3% higher in Rank-1 and mAP than the CAL [10] network. Our counterfactual framework in PSF-C-Net bridges the gap of local feature truncation in CAL [10] and extends the causal adversarial space. When modifying the stripe-feature-based MGN [13] network, we separated the dual-path deep networks in the counterfactual framework, used the convolution layers on the counterfactual data for global feature extraction, and used the factual data for stripe segmentation of features, so as to perform controllable causal adversarial prediction at the output layer. Because multiple stripe features are integrated, the weight of the global instance features is reduced, so the improvement for MGN in Rank-1 and mAP is less pronounced. The CAL and MGN experiments further demonstrate the scalability of our proposed network framework.
Through further experiments, our proposed method achieved good results in improving the mAP and Rank-1 indicators. Moreover, our method requires no additional human semantic information, local features, or attention modules, and is relatively simple compared with other state-of-the-art methods.

5. Conclusions

This paper proposes a counterfactual convolutional framework, PSF-C-Net, based on causal inference. This method concisely and explicitly supplements long-range dependency features by making controllable counterfactual predictions at the output layer. Using causal counterfactual inference, PSF-C-Net performs global inference to reduce the dependence on local detailed features and additional semantic information and to reduce the prediction bias of person images. In addition, to efficiently construct an effective explicit counterfactual space, we propose a counterfactual data generation method, RCPSF, which constructs counterfactual data that destroy context relationships and preserve local features without introducing additional noise. The PSF-C-Net framework we propose can be extended to various person re-identification methods and achieves good results. The advantage of this method lies in constructing a controllable counterfactual space using causal inference and avoiding dependence on local features. During the counterfactual prediction process, explicit supplementary long-range dependency features can be added to improve the model's robustness. Our proposed counterfactual data generation method can also effectively construct an explicit counterfactual space, thereby further improving the model's performance. Our extensive experimental evaluations on several benchmarks demonstrate the effectiveness and superiority of the proposed method. Moreover, our method provides new ideas for considering the global information of instances and constructing counterfactual instances to achieve better feature representations, which can be applied to computer vision tasks beyond person re-identification.

Author Contributions

Conceptualization, R.S. and Q.C.; methodology, R.S., H.D. and H.Z.; software, R.S., H.D. and H.Z.; validation, Q.C.; formal analysis, R.S. and M.W.; investigation, R.S.; data curation, Q.C.; writing—original draft preparation, R.S.; writing—review and editing, M.W.; visualization, R.S. and H.D.; supervision, M.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

No additional data are available.

Acknowledgments

The authors thank the reviewers for their constructive comments.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Wang, Z.; Jiang, J.; Yu, Y.; Satoh, S. Incremental Re-Identification by Cross-Direction and Cross-Ranking Adaption. IEEE Trans. Multimedia 2019, 21, 2376–2386. [Google Scholar] [CrossRef]
  2. Zheng, L.; Yang, Y.; Hauptmann, A.G. Person re-identification: Past, present and future. arXiv 2016, arXiv:1610.02984. [Google Scholar]
  3. Wang, C.; Zhang, Q.; Huang, C.; Liu, W.; Wang, X. Mancs: A multi-task attentional network with curriculum sampling for person re-identification. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 365–381. [Google Scholar]
  4. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-scale Image Recognition. In Proceedings of the International Conference on Learning Representation, San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
  5. Jia, M.; Cheng, X.; Lu, S.; Zhang, J. Learning Disentangled Representation Implicitly Via Transformer for Occluded Person Re-Identification. IEEE Trans. Multimedia 2022, 25, 1294–1305. [Google Scholar] [CrossRef]
  6. Zhang, J.; Niu, L.; Zhang, L. Person Re-Identification with Reinforced Attribute Attention Selection. IEEE Trans. Image Process. 2021, 30, 603–616. [Google Scholar] [CrossRef] [PubMed]
  7. Gong, X.; Yao, Z.; Li, X.; Fan, Y.; Luo, B.; Fan, J.; Lao, B. LAG-Net: Multi-Granularity Network for Person Re-Identification via Local Attention System. IEEE Trans. Multimedia 2022, 24, 217–229. [Google Scholar] [CrossRef]
  8. Zheng, Z.; Yang, X.; Yu, Z.; Zheng, L.; Yang, Y.; Kautz, J. Joint Discriminative and Generative Learning for Person Re-Identification. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 2133–2142. [Google Scholar] [CrossRef]
  9. Yang, M.; Liu, F.; Chen, Z.; Shen, X.; Hao, J.; Wang, J. CausalVAE: Disentangled Representation Learning via Neural Structural Causal Models. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 9588–9597. [Google Scholar] [CrossRef]
  10. Rao, Y.; Chen, G.; Lu, J.; Zhou, J. Counterfactual Attention Learning for Fine-Grained Visual Categorization and Re-identification. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 1005–1014. [Google Scholar] [CrossRef]
  11. Zhong, Z.; Zheng, L.; Kang, G.; Li, S.; Yang, Y. Random erasing data augmentation. arXiv 2017, arXiv:1708.04896. [Google Scholar] [CrossRef]
  12. Sun, Y.; Zheng, L.; Yang, Y.; Tian, Q.; Wang, S. Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline). In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 480–496. [Google Scholar]
  13. Wang, G.; Yuan, Y.; Chen, X.; Li, J.; Zhou, X. Learning discriminative features with multiple granularities for person re-identification. In Proceedings of the 26th ACM International Conference on Multimedia, Seoul, Republic of Korea, 22–26 October 2018; pp. 274–282. [Google Scholar]
  14. Fan, X.; Luo, H.; Zhang, X.; He, L.; Zhang, C.; Jiang, W. SCPNet: Spatial-Channel Parallelism Network for Joint Holistic and Partial Person Re-identification. In Computer Vision; Jawahar, C., Li, H., Mori, G., Schindler, K., Eds.; Springer: Berlin/Heidelberg, Germany, 2019. [Google Scholar]
  15. Miao, J.; Wu, Y.; Liu, P.; Ding, Y.; Yang, Y. Pose-guided feature alignment for occluded person re-identification. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 542–551. [Google Scholar]
  16. Fu, X.; Huang, F.; Zhou, Y.; Ma, H.; Xu, X.; Zhang, L. Cross-Modal Cross-Domain Dual Alignment Network for RGB-Infrared Person Re-Identification. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 6874–6887. [Google Scholar] [CrossRef]
  17. Farooq, A.; Awais, M.; Kittler, J.; Akbari, A.; Khalid, S.S. Cross Modal Person Re-identification with Visual-Textual Queries. In Proceedings of the 2020 IEEE International Joint Conference on Biometrics (IJCB), Houston, TX, USA, 28 September–1 October 2020; pp. 1–8. [Google Scholar] [CrossRef]
  18. Yang, X.; Liu, L.; Wang, N.; Gao, X. A Two-Stream Dynamic Pyramid Representation Model for Video-Based Person Re-Identification. IEEE Trans. Image Process. 2021, 30, 6266–6276. [Google Scholar] [CrossRef] [PubMed]
  19. Chen, H.; Wang, Y.; Shi, Y.; Yan, K.; Geng, M.; Tian, Y.; Xiang, T. Deep Transfer Learning for Person Re-Identification. In Proceedings of the 2018 IEEE 4th International Conference on Multimedia Big Data (BigMM), Xi’an, China, 13–16 September 2018; pp. 1–5. [Google Scholar] [CrossRef]
  20. Pearl, J.; Mackenzie, D. The Book of Why: The New Science of Cause and Effect; Basic Books: New York, NY, USA, 2018. [Google Scholar]
  21. Jin, X.; Lan, C.; Zeng, W.; Chen, Z.; Zhang, L. Style Normalization and Restitution for Generalizable Person Re-Identification. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 3140–3149. [Google Scholar] [CrossRef]
  22. Zheng, L.; Shen, L.; Tian, L.; Wang, S.; Wang, J.; Tian, Q. Scalable person re-identification: A benchmark. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 13–16 December 2015; pp. 1116–1124. [Google Scholar]
  23. Ristani, E.; Solera, F.; Zou, R.; Cucchiara, R.; Tomasi, C. Performance measures and a data set for multi-target, multi-camera tracking. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, Amsterdam, The Netherlands, 8–10 and 15–16 October 2016. [Google Scholar]
  24. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  25. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2012, 25. [Google Scholar] [CrossRef]
  26. Fan, X.; Jiang, W.; Luo, H.; Fei, M. SphereReID: Deep hypersphere manifold embedding for person re-identification. J. Vis. Commun. Image Represent. 2019, 60, 51–58. [Google Scholar] [CrossRef]
  27. Luo, H.; Gu, Y.; Liao, X.; Lai, S.; Jiang, W. Bag of Tricks and a Strong Baseline for Deep Person Re-Identification. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Long Beach, CA, USA, 15–20 June 2019; pp. 1487–1495. [Google Scholar] [CrossRef]
  28. Zheng, Z.; Zheng, L.; Yang, Y. Unlabeled samples generated by gan improve the person re-identification baseline in vitro. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017. [Google Scholar]
  29. Zheng, F.; Deng, C.; Sun, X.; Jiang, X.; Guo, X.; Yu, Z.; Huang, F.; Ji, R. Pyramidal person re-identification via multi-loss dynamic training. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 8514–8522. [Google Scholar]
  30. Yun, S.; Han, D.; Oh, S.J.; Chun, S.; Choe, J.; Yoo, Y. CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6023–6032. [Google Scholar] [CrossRef]
  31. Liu, X.; Shen, F.; Zhao, J.; Nie, C. RandoMix: A mixed sample data augmentation method with multiple mixed modes. Multimedia Tools Appl. 2024, 1–17. [Google Scholar] [CrossRef]
  32. Hendrycks, D.; Mu, N.; Cubuk, E.D.; Zoph, B.; Gilmer, J.; Lakshminarayanan, B. AugMix: A Simple Data Processing Method to Improve Robustness and Uncertainty. arXiv 2020, arXiv:1912.02781. [Google Scholar]
  33. Seo, J.-W.; Jung, H.-G.; Lee, S.-W. Self-augmentation: Generalizing deep networks to unseen classes for few-shot learning. Neural Netw. 2021, 138, 140–149. [Google Scholar] [CrossRef] [PubMed]
  34. Zheng, Z.; Zheng, L.; Yang, Y. A Discriminatively Learned CNN Embedding for Person Reidentification. ACM Trans. Multimedia Comput. Commun. Appl. 2018, 14, 1–20. [Google Scholar] [CrossRef]
  35. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the Inception Architecture for Computer Vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826. [Google Scholar]
  36. Zheng, M.; Karanam, S.; Wu, Z.; Radke, R.J. Re-identification with consistent attentive Siamese networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 5735–5744. [Google Scholar]
  37. Quan, R.; Dong, X.; Wu, Y.; Zhu, L.; Yang, Y. Auto-ReID: Searching for a part-aware ConvNet for person re-identification. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3750–3759. [Google Scholar]
  38. Park, H.; Ham, B. Relation Network for Person Re-Identification. Proc. AAAI Conf. Artif. Intell. 2020, 34, 11839–11847. [Google Scholar] [CrossRef]
  39. Hou, R.; Ma, B.; Chang, H.; Gu, X.; Shan, S.; Chen, X. Interaction-and-aggregation network for person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 9317–9326. [Google Scholar]
  40. Yang, W.; Huang, H.; Zhang, Z.; Chen, X.; Huang, K.; Zhang, S. Towards rich feature discovery with class activation maps augmentation for person re-identification. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 1389–1398. [Google Scholar]
  41. Chen, B.; Deng, W.; Hu, J. Mixed high-order attention network for person re-identification. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 371–381. [Google Scholar]
  42. Chen, G.; Lin, C.; Ren, L.; Lu, J.; Zhou, J. Self-critical attention learning for person re-identification. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9637–9646. [Google Scholar]
  43. Kalayeh, M.M.; Basaran, E.; Gokmen, M.; Kamasak, M.E.; Shah, M. Human semantic parsing for person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 1062–1071. [Google Scholar]
  44. Ren, M.; He, L.; Liao, X.; Liu, W.; Wang, Y.; Tan, T. Learning Instance-level Spatial-Temporal Patterns for Person Re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 14910–14919. [Google Scholar] [CrossRef]
  45. Wang, G.; Lai, J.; Huang, P.; Xie, X. Spatial-temporal person re-identification. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27–28 January 2019; Volume 33. No. 01. [Google Scholar]
  46. Zhang, Z.; Lan, C.; Zeng, W.; Chen, Z. Densely semantically aligned person re-identification. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 667–676. [Google Scholar]
  47. Tay, C.-P.; Roy, S.; Yap, K.-H. AANet: Attribute attention network for person re-identifications. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 7134–7143. [Google Scholar]
  48. Nguyen, B.X.; Nguyen, B.D.; Do, T.; Tjiputra, E.; Tran, Q.D.; Nguyen, A. Graph-based Person Signature for Person Re-Identifications. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Virtual, 19–25 June 2021; pp. 3487–3496. [Google Scholar] [CrossRef]
  49. He, S.; Luo, H.; Wang, P.; Wang, F.; Li, H.; Jiang, W. TransReID: Transformer-based Object Re-Identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 14993–15002. [Google Scholar] [CrossRef]
  50. Wang, G.; Yang, S.; Liu, H.; Wang, Z.; Yang, Y.; Wang, S.; Yu, G.; Zhou, E.; Sun, J. High-order information matters: Learning relation and topology for occluded person re-identification. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 6449–6458. [Google Scholar]
  51. Chen, G.; Zhang, T.; Lu, J.; Zhou, J. Deep meta metric learning. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9547–9556. [Google Scholar]
  52. Luo, C.; Chen, Y.; Wang, N.; Zhang, Z.-X. Spectral feature transformation for person re-identification. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 4976–4985. [Google Scholar]
  53. Sun, Y.; Cheng, C.; Zhang, Y.; Zhang, C.; Zheng, L.; Wang, Z.; Wei, Y. Circle loss: A unified perspective of pair similarity optimization. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 6398–6407. [Google Scholar]
Figure 1. Examples of random cropping patch and shuffling filling. This method can generate counterfactual instances that destroy global information while preserving the fine-grained local features of the image without introducing additional noise.
Figure 2. The proposed framework, called PSF-C-Net, is depicted to the left of the dotted line. Within this network framework, we extract semantic features from both real images and counterfactual images. We introduce causal counterfactual inference and enhance the local features of the ID embedded space through reverse optimization. This allows us to achieve controllable counterfactual prediction. On the right of the dotted line, random cropping patch (RCP) and shuffling filling (SF) are presented to generate the necessary counterfactual data.
Figure 3. Counterfactual framework causal diagram.
Figure 4. Rank-1 and mAP for the probability of occurrence $p$ and the scale factor $\lambda$.
Figure 5. Top-5 ranking lists for some query images on the Market-1501 [22] dataset by PSF-C-Net and the baseline.
Table 1. Ablation study over the two Re-ID datasets.

                                        Market-1501         Duke-MTMC
Methods                                 Rank-1    mAP       Rank-1    mAP
B                                       94.5      85.9      86.4      76.4
B + S                                   94.9      86.4      86.6      76.6
B + T w/ $\mathcal{L}_{ce}$             94.8      86.6      86.8      76.7
B + S + T w/ $\mathcal{L}_{ce}$ (overall)  95.2   87.3      87.1      76.9

Note: B, S, and T w/ $\mathcal{L}_{ce}$ denote the baseline, RCPSF, and the counterfactual learning framework, respectively.
Table 2. Comparison of the parameters of PSF-C-Net and the baseline.

Methods            Market-1501 #nParam (K)    Duke-MTMC #nParam (K)
Baseline           25,668                     25,892
Proposed method    27,688                     27,616
Table 3. Person re-identification performance with RCPSF on Market-1501 and Duke-MTMC-reID.

                                        Market-1501         Duke-MTMC
Method     Model          RCPSF         Rank-1    mAP       Rank-1    mAP
IDE        ResNet18       NO            79.87     57.37     67.73     46.87
                          YES           82.83     63.02     71.20     51.83
           ResNet34       NO            82.93     62.34     71.63     49.71
                          YES           85.23     66.02     74.02     54.63
           ResNet50       NO            83.14     63.56     71.99     51.29
                          YES           85.64     68.50     74.86     56.53
Baseline   ResNet50       NO            94.5      85.9      86.4      76.4
                          YES           94.9      86.4      86.6      76.6
           ResNet101      NO            94.5      87.1      87.6      77.6
                          YES           94.9      87.4      88.1      77.8
           IBN-Net50-a    NO            95.0      88.2      90.1      79.1
                          YES           95.5      88.4      90.5      79.3
Pyramid    ResNet50       NO            95.7      88.2      89.0      79.0
                          YES           96.1      88.6      89.4      79.3
Table 4. Comparison against other state-of-the-art data augmentation methods on the Market-1501 and Duke-MTMC datasets.

                                                       Market-1501         Duke-MTMC
Method      Model      Data Augmentation               Rank-1    mAP       Rank-1    mAP
PSF-C-Net   ResNet50   Random Crop                     94.3      86.3      86.2      76.3
                       CutMix [30]                     94.5      86.5      86.5      76.4
                       Random Erasing [11]             94.6      86.7      86.6      76.5
                       RandoMix [31]                   94.3      86.8      86.4      76.3
                       AugMix [32]                     94.9      86.6      86.3      76.2
                       Self-Augmentation [33]          94.8      86.4      86.5      76.1
                       RCPSF (proposed method)         95.2      87.3      87.1      76.9
Table 5. Evaluation of the importance of each component of the overall loss function on Market-1501.

$\mathcal{L}_{ce}$   $\mathcal{L}_{Tri}$   $\mathcal{L}_{cl}$   $\mathcal{L}_{center}$   Rank-1   mAP
                     ✓                     ✓                    ✓                        81.7     62.3
✓                                          ✓                    ✓                        91.8     81.5
✓                    ✓                                          ✓                        91.8     82.5
✓                    ✓                     ✓                                             94.8     86.1
✓                    ✓                     ✓                    ✓                        95.2     87.3
Table 6. Comparison with the state-of-the-art methods.

                                                              Market-1501                   Duke-MTMC
Type                    Methods                 Backbone      Rank-1   Rank-5   mAP         Rank-1   Rank-5   mAP
Stripe-Based            PCB+RPP [12]            ResNet50      93.8     97.5     81.6        83.3     -        69.2
                        MGN [13]                ResNet50      95.7     -        86.9        88.7     -        78.4
                        Pyramid [29]            ResNet50      95.7     98.4     88.2        89.0     94.7     79.0
                        Auto-ReID [37]          Searched      94.5     -        85.1        -        -        -
                        GCP [38]                ResNet50      95.2     -        88.9        89.7     -        78.6
Attention-Based         IANet [39]              ResNet50      94.4     -        83.1        87.1     -        73.4
                        CASN+PCB [36]           ResNet50      94.4     -        82.8        87.7     -        73.7
                        CAMA [40]               ResNet50      94.7     98.1     84.5        85.8     -        72.9
                        MHN-6 [41]              ResNet50      95.1     98.1     85.0        89.1     94.6     77.2
                        SCAL [42]               ResNet50      95.8     98.5     88.9        89.0     95.1     79.6
                        CAL [10]                ResNet50      95.5     98.5     89.5        90.0     96.1     80.5
Extra Semantics-Based   SPReID [43]             ResNet152     92.5     -        81.3        84.4     -        71.0
                        AANet [47]              ResNet50      93.9     -        82.5        86.4     -        72.6
                        DSA-reID [46]           ResNet50      95.7     -        87.6        86.2     -        74.3
                        HONet [50]              ResNet50      94.2     -        84.9        86.9     -        75.6
                        GPS [48]                ResNet50      95.2     98.4     87.8        88.2     95.2     78.7
                        TransReID [49]          ViT-B/16      95.2     -        89.5        90.7     -        82.6
                        st-reID [45]            ResNet50      98.1     99.3     87.6        94.4     97.4     83.9
                        InSTD [44]              ResNet50      97.6     99.5     90.8        95.7     97.2     89.1
Global Feature          DMML [51]               ResNet50      93.5     -        81.6        85.9     -        73.7
                        SFT [52]                ResNet50      93.4     -        82.7        86.9     -        73.2
                        Circle [53]             ResNet50      94.2     -        84.9        -        -        -
                        Baseline [27]           ResNet50      94.5     -        85.9        86.4     -        76.4
                        Baseline [27] (RK)      ResNet50      95.4     -        94.2        90.3     -        89.1
                        PSF-C-Net (Ours)        ResNet50      95.2     98.7     87.3        87.1     93.9     76.9
                        PSF-C-Net (Ours) (RK)   ResNet50      96.5     98.7     94.8        91.2     93.8     89.8

Note: The methods in the first group are based on stripes, those in the second group on attention, those in the third group on extra semantics, and those in the fourth group on global features. The last two rows are our approach; (RK) denotes re-ranking.
Table 7. Our proposed PSF-C-Net reproduces the performance of some of the most advanced methods.

                           Market-1501                   Duke-MTMC
Methods      Backbone      Rank-1   Rank-5   mAP         Rank-1   Rank-5   mAP
CAL [10]     ResNet50      95.5     98.5     89.5        90.0     96.1     80.5
CAL†         ResNet50      95.7     98.6     89.8        90.5     96.2     80.7
MGN [13]     ResNet50      95.7     -        86.9        87.7     -        78.4
MGN†         ResNet50      95.8     -        87.2        88.9     -        78.6

Note: † denotes the result of our reproduction based on the proposed counterfactual framework.

