Peer-Review Record

Swin Transformer Based on Two-Fold Loss and Background Adaptation Re-Ranking for Person Re-Identification

Electronics 2022, 11(13), 1941; https://doi.org/10.3390/electronics11131941

by Qi Wang^1,†, Hao Huang^1,†, Yuling Zhong^1,†, Weidong Min^2,3,4,*, Qing Han^2, Desheng Xu^1 and Changwen Xu^5

Reviewer 1: Zhenyu Zhou

Reviewer 2: Anonymous

Submission received: 17 May 2022 / Revised: 7 June 2022 / Accepted: 16 June 2022 / Published: 21 June 2022

(This article belongs to the Section Artificial Intelligence)

Round 1

Reviewer 1 Report

This paper proposed a novel approach to identify pedestrians from photos, with the consideration of different backgrounds. The authors leveraged 2-fold loss, foreground extraction and reranking with background noises to optimize the model. Experiment results show that TL-TransNet outperforms related approaches even with background interference.

Generally speaking this is an interesting paper, achieving promising results for an important problem. The most outstanding shortcoming is the presentation. The current version has many (1) typos and grammar errors; (2) undefined notations; (3) undefined concepts; and (4) disconnected thought flow. These make the current draft hard to follow and reproduce for future readers.

Besides that, I would also highly encourage the authors to provide more intuitions besides jumping into implementation details. Otherwise, the current draft lacks the big picture of the rationale of your design choices, such as why you choose to use the current definition of the loss.

Please find my detailed per-section comments below:

intro

- "the background variations of person Re- ID mainly exist in two aspects": I'm not sure why these are 2 aspects. Isn't the ranking problem affected by inaccurate background extraction during the training stage? It's just like you cannot argue that low training accuracy and low testing accuracy are 2 independent issues. (EDIT: Later in the related work section, I find you may need a specific algorithm for "reranking". If so, please make this clear in intro, before you first introduce the ranking stage. That's another example of the problem I pointed out in the next point.)

- This section is generally not well organized. You listed too many related works without explaining them in detail, without introducing the big picture. Most related work should be in the next section, and for intro, you should really focus on why the problem is important and why it is hard. As for the related works, you should try to classify them instead of simply listing them out, as none of them makes sense to readers if you simply mention their names. For example, I'm seeking the answer for this question: what makes the proposed approach outstanding for filtering out background noise? I don't find any clues for this question besides simply stating "filter out background-related information and focus on pedestrian body information". I'm not convinced that there is no previous work considering the background at all. And if so, why does the proposed approach work better for filtering out the background? As a result, I'm not convinced by your intro that the contribution is novel and significant.

Related Work

- "which outperforms previous algorithms": what are "previous algorithms"?

- "ResNet-50 model": are you sure resnet50 is only designed for this problem?

- In this section, could you please also specify where the proposed approach stands, i.e., its similarities to and differences from the related work?

Proposed Method

- "First, a TL-TransNet based on swin transformer and two types of loss supervision is developed to train input data, which captures pedestrian body information intensively": if the TL-TransNet is already able to capture the pedestrian, why do you still need DeepLabV3+ to remove bg noise? Why do you have both steps, which appear to be executed in parallel?

- How do you define "class"? Per "gradient contribution of positive and negative samples to each other", sounds like you only have 2 classes?

- "let us assume that there are K within-class similarity scores and L between-class similarity scores associated with x": Does K equal the number of elements within the same class as x? Does L equal the number of classes?

- In formula (7), what is \gamma?

- "where f_img is an image feature. f_img = F_img(I)... of CNN": what is "image feature"? What is "I"? (Is it the "I" you defined in (12) afterwards?) Where do you use CNN?

- "W_share = [W_1, W_2, ... , W_n]": what is "n"? especially when you already defined s_n^j in (8).

- "Every weight W_k is a 2048-dim vector": why 2048?

- "L denotes the loss and P denotes": you don't have L and P in (9) and (10).

- "Each pedestrian across the multi-camera has a different background and then generates background interference": (1) "a different background"; (2) why is bg different for each pedestrian? Do you mean each pedestrian has multiple bg?

- what is "retrieval stage"? What stages do you have? You should define the stages first.

- "in response to the above challenges": what are the "challenges"? I only find one challenge as the bg noise.

- Is "DeepLabv3+" your contribution? Or do you just leverage it in your approach? Also, the description for DeepLabv3+ is fairly hard to follow as it's unclear which part in fig4 you're talking about.

- "In order to alleviate the influence of background-related features": don't you already extract the foreground? Does that mean your foreground extraction doesn't work well?

- "we are given a probe p": what is "a probe"?

- "by sorting this distance": what is "this distance"?

nit:

- "Existing researchers proposed various...": previous researchers or existing researches are proposed

- "And introduced a unique region-level triplet loss...": is this sentence complete?

- "B. De Baets [21] propose the kernel method": a kernel method

- "two-fold loss is designed to supervise training": grammar

- "it's simple": "it is". abbr is highly discouraged for academic writing.

Author Response

Response to Reviewer 1 Comments

Point 1: "the background variations of person Re-ID mainly exist in two aspects": I'm not sure why these are 2 aspects. Isn't the ranking problem affected by inaccurate background extraction during the training stage? It's just like you cannot argue that low training accuracy and low testing accuracy are 2 independent issues. (EDIT: Later in the related work section, I find you may need a specific algorithm for "reranking". If so, please make this clear in intro, before you first introduce the ranking stage. That's another example of the problem I pointed out in the next point.)

Response 1: Thank you very much for your suggestions. We have modified Figure 1 and related description in Section 1. We also have added the definition description of the re-ranking algorithm (yellow highlighted region).

Point 2: This section is generally not well organized. You listed too many related works without explaining them in detail, without introducing the big picture. Most related work should be in the next section, and for intro, you should really focus on why the problem is important and why it is hard. As for the related works, you should try to classify them instead of simply listing them out, as none of them makes sense to readers if you simply mention their names. For example, I'm seeking the answer for this question: what makes the proposed approach outstanding for filtering out background noise? I don't find any clues for this question besides simply stating "filter out background-related information and focus on pedestrian body information". I'm not convinced that there is no previous work considering the background at all. And if so, why does the proposed approach work better for filtering out the background? As a result, I'm not convinced by your intro that the contribution is novel and significant.

Response 2: Thank you very much for your suggestions. We have rewritten the description according to your comments in Section 1 (yellow highlighted region).

Point 3: "which outperforms previous algorithms": what are "previous algorithms"?

Response 3: Thank you very much for your suggestions. “previous algorithms” refers to hand-crafted feature extraction algorithms. We have corrected this error and cited references [14] and [15].

Point 4: "ResNet-50 model": are you sure resnet50 is only designed for this problem?

Response 4: Thank you very much for your suggestions. "ResNet-50 model" here does not match reference [7]’s network model. We have corrected the model mismatch error.

Point 5: In this section, could you please also specify where the proposed approach stands, i.e., its similarities to and differences from the related work?

Response 5: Thank you very much for your suggestions. We have added some descriptions in Section 2 (yellow highlighted region).

Point 6: "First, a TL-TransNet based on swin transformer and two types of loss supervision is developed to train input data, which captures pedestrian body information intensively": if the TL-TransNet is already able to capture the pedestrian, why do you still need DeepLabV3+ to remove bg noise? Why do you have both steps, which appear to be executed in parallel?

Response 6: Thank you very much for your suggestions. TL-TransNet is used to improve feature extraction in the training phase, and DeepLabV3+ is used to remove the background in the retrieval phase. They are not executed in parallel. We have added further descriptions in Section 1 and Section 3.1 (yellow highlighted region).

Point 7: How do you define "class"? Per "gradient contribution of positive and negative samples to each other", sounds like you only have 2 classes?

Response 7: Thank you very much for your suggestions. We have added the descriptions to define the class in Section 3.2 (yellow highlighted region).

Point 8: "let us assume that there are K within-class similarity scores and L between-class similarity scores associated with x": Does K equal the number of elements within the same class as x? Does L equal the number of classes?

Response 8: Thank you very much for your suggestions. We have further described the definitions of parameter K and L in Section 3.2 (yellow highlighted region).

Point 9: In formula (7), what is \gamma?

Response 9: Thank you very much for your suggestions. We have defined and described the parameter gamma in formula 7 (yellow highlighted region).
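
The roles of K, L, and \gamma raised in Points 8 and 9 can be illustrated with a minimal sketch. It assumes the loss follows the published circle loss formulation; the manuscript's own equations are not reproduced in this record and may differ. The function name `circle_loss`, the `margin` parameter, and the default values are illustrative assumptions, not taken from the paper.

```python
import math


def _logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def circle_loss(s_p, s_n, gamma=64.0, margin=0.25):
    """Circle loss for a single sample x (sketch, assumed formulation).

    s_p:   the K within-class similarity scores associated with x
    s_n:   the L between-class similarity scores associated with x
    gamma: scale factor applied to every weighted similarity term
    """
    # Optimum values and decision margins from the published formulation.
    o_p, o_n = 1.0 + margin, -margin
    delta_p, delta_n = 1.0 - margin, margin

    # Each score gets its own non-negative weight, so scores far from
    # their optimum contribute larger gradients.
    logits_p = [-gamma * max(o_p - s, 0.0) * (s - delta_p) for s in s_p]
    logits_n = [gamma * max(s - o_n, 0.0) * (s - delta_n) for s in s_n]

    # log(1 + sum_j exp(n_j) * sum_i exp(p_i)) = softplus(lse(n) + lse(p))
    z = _logsumexp(logits_n) + _logsumexp(logits_p)
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


# Well-separated scores should give a smaller loss than confused ones.
well_separated = circle_loss([0.9, 0.8], [0.1, 0.2])
confused = circle_loss([0.3, 0.2], [0.7, 0.8])
```

In this sketch a larger \gamma sharpens the penalty on scores that violate the margins, which is one plausible reading of why the reviewer asked for the parameter to be defined.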

Point 10: "where f_img is an image feature. f_img = F_img(I)...of CNN": what is "image feature"? What is "I"? (Is it the "I" you defined in (12) afterwards?) Where do you use CNN?

Response 10: Thank you very much for your suggestions. We have modified the descriptions of the image feature, I, and CNN in Section 3.2 (yellow highlighted region).

Point 11: "W_share = [W_1, W_2, ... , W_n]": what is "n"? especially when you already defined s_n^j in (8).

Response 11: Thank you very much for your suggestions. We have used the new parameter “p” to correct the error of defining “n” many times, and added the description of “p” in Section 3.2 (yellow highlighted region).

Point 12: "Every weight W_k is a 2048-dim vector": why 2048?

Response 12: Thank you very much for your suggestions. Since the total number of identity classes in the two person Re-ID datasets ranges between 1024 and 2048, each weight is a 2048-dim vector in order to unify the hyperparameters of the network. We have also updated the above description in Section 3.2 (yellow highlighted region).

Point 13: "L denotes the loss and P denotes": you don't have L and P in (9) and (10).

Response 13: Thank you very much for your suggestions. We have changed the wrong parameter names to and in Section 3.2 (yellow highlighted region).
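
Points 11 to 13 concern a list of classification weights, a loss L, and a probability P. A minimal sketch of how these typically fit together is given here, assuming the instance loss is a softmax cross-entropy over identity classes computed from a shared weight matrix. The names `instance_loss`, `feature`, `w_share`, and `target` are hypothetical, and the toy dimensions stand in for the 2048-dim vectors discussed above.

```python
import math


def instance_loss(feature, w_share, target):
    """Softmax cross-entropy over identity classes (sketch, assumed form).

    feature: image feature as a list of d floats
    w_share: list of n weight vectors, one per class, each of length d
    target:  index of the correct class
    Returns (L, P): the loss and the probability over all n classes.
    """
    # One logit per class: dot product of the class weight and the feature.
    logits = [sum(w * f for w, f in zip(weights, feature)) for weights in w_share]

    # Stable log-softmax.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    log_probs = [v - log_z for v in logits]

    probs = [math.exp(lp) for lp in log_probs]
    loss = -log_probs[target]
    return loss, probs


# Toy example with d = 2 and n = 3 classes.
toy_feature = [1.0, 0.0]
toy_weights = [[2.0, 0.0], [0.0, 2.0], [-2.0, 0.0]]
loss_correct, probs = instance_loss(toy_feature, toy_weights, target=0)
loss_wrong, _ = instance_loss(toy_feature, toy_weights, target=2)
```

Under this reading, the length of each weight vector must match the feature dimension, while the number of weight vectors equals the number of classes.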

Point 14: "Each pedestrian across the multi-camera has a different background and then generates background interference": (1) "a different background"; (2) why is bg different for each pedestrian? Do you mean each pedestrian has multiple bg?

Response 14: Thank you very much for your suggestions. We have corrected this grammatical error in Section 3.3 (yellow highlighted region).

Point 15: what is "retrieval stage"? What stages do you have? You should define the stages first.

Response 15: Thank you very much for your suggestions. We have defined the relevant stage in Section 1 (yellow highlighted region).

Point 16: "in response to the above challenges": what are the "challenges"? I only find one challenge as the bg noise.

Response 16: Thank you very much for your suggestions. We have corrected this grammatical error in Section 3.3 (yellow highlighted region).

Point 17: Is "DeepLabv3+" your contribution? Or do you just leverage it in your approach? Also, the description for DeepLabv3+ is fairly hard to follow as it's unclear which part in fig4 you're talking about.

Response 17: Thank you very much for your suggestions. DeepLabv3+ is leveraged in the background removal step. We have also added the description corresponding to Figure 4 in Section 3.3 (yellow highlighted region).

Point 18: "In order to alleviate the influence of background-related features": don't you already extract the foreground? Does that mean your foreground extraction doesn't work well?

Response 18: Thank you very much for your suggestions. We have revised the contribution description of the proposed method in Section 1 (yellow highlighted region). We also have added some descriptions in Section 3.4 (yellow highlighted region).

Point 19: "we are given a probe p": what is "a probe"?

Response 19: Thank you very much for your suggestions. We have corrected this grammatical error in Section 3.4 (yellow highlighted region).

Point 20: "by sorting this distance": what is "this distance"?

Response 20: Thank you very much for your suggestions. “this distance” means the pairwise distance between a query and the gallery set. We have corrected the misrepresentation of this sentence in Section 3.4 (yellow highlighted region).
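
The initial ranking described in Points 19 and 20 can be sketched as follows, assuming Euclidean distance between feature vectors. The record does not state which distance the paper uses, and the re-ranking step itself is not shown. The name `rank_gallery` and the toy vectors are hypothetical.

```python
import math


def rank_gallery(probe, gallery):
    """Rank gallery features by distance to a probe (query) feature.

    probe:   feature vector of the query image
    gallery: list of feature vectors, one per gallery image
    Returns (order, dists): gallery indices from nearest to farthest,
    and the pairwise distance from the probe to each gallery item.
    """
    # Assumption: Euclidean distance; another metric could be swapped in.
    dists = [math.dist(probe, g) for g in gallery]
    order = sorted(range(len(gallery)), key=lambda i: dists[i])
    return order, dists


# Toy example: three gallery items at distances 5, 1, and 2 from the probe.
order, dists = rank_gallery([0.0, 0.0], [[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]])
```

A re-ranking method would take this initial list as input and reorder it using additional information.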

Point 21: "Existing researchers proposed various...": previous researchers or existing researches are proposed.

Response 21: Thank you very much for your suggestions. We have corrected this grammatical error in Section 1 (yellow highlighted region).

Point 22: "And introduced a unique region-level triplet loss...": is this sentence complete?

Response 22: Thank you very much for your suggestions. We have corrected this grammatical error in Section 1 (yellow highlighted region).

Point 23: "B. De Baets [21] propose the kernel method": a kernel method.

Response 23: Thank you very much for your suggestions. We have corrected this grammatical error in Section 2.2 (yellow highlighted region).

Point 24: "two-fold loss is designed to supervise training": grammar.

Response 24: Thank you very much for your suggestions. We have corrected this grammatical error in Section 3.2 (yellow highlighted region).

Point 25: "it's simple": "it is". abbr is highly discouraged for academic writing.

Response 25: Thank you very much for your suggestions. We have corrected this grammatical error in Section 3.3 (yellow highlighted region).

Author Response File: Author Response.docx

Reviewer 2 Report

This paper develops a Transformer-based model for person re-id. The model is trained jointly with a circle loss and an instance loss. A background adaptation re-ranking method is developed to alleviate the impact of background features in the test phase. The results look good. However, several issues should be addressed:

Circle loss is improved from triplet loss; however, I am wondering whether contrastive loss, which is a more popular metric learning technique, can also be used to further improve the performance. Please refer to the papers "Exploring Cross-Image Pixel Contrast for Semantic Segmentation" and "Regional Semantic Contrast and Aggregation for Weakly Supervised Semantic Segmentation" and discuss them.

How is the semantic segmentation model trained? Which dataset is used for training? It is also not clear why DeepLabv3+ is selected as the semantic segmentation model. How will different semantic segmentation models affect the performance? A brief discussion of semantic segmentation models should be provided in Sec. 2. Please refer to "Rethinking Semantic Segmentation: A Prototype View".

How will the two different losses affect the performance? Some ablative experiments should be performed.

Author Response

Response to Reviewer 2 Comments

Point 1: Circle loss is improved from triplet loss; however, I am wondering whether contrastive loss, which is a more popular metric learning technique, can also be used to further improve the performance. Please refer to the papers "Exploring Cross-Image Pixel Contrast for Semantic Segmentation" and "Regional Semantic Contrast and Aggregation for Weakly Supervised Semantic Segmentation" and discuss them.

Response 1: Thank you very much for your suggestions. We have added Table 3 and some ablation analysis in Section 4.3 (yellow highlighted region). We also have cited the references [35] and [36].

Point 2: How is the semantic segmentation model trained? Which dataset is used for training? It is also not clear why DeepLabv3+ is selected as the semantic segmentation model. How will different semantic segmentation models affect the performance? A brief discussion of semantic segmentation models should be provided in Sec. 2. Please refer to "Rethinking Semantic Segmentation: A Prototype View".

Response 2: Thank you very much for your suggestions. We have added some descriptions in Section 3.3 (yellow highlighted region). We also have cited reference [37] and added a brief discussion of semantic segmentation models in Section 2.4.

Point 3: How will the two different losses affect the performance? Some ablative experiments should be performed.

Response 3: Thank you very much for your suggestions. We have added some analysis in Section 4.3 (yellow highlighted region) and ablative experiments in Table 4 and Figure 6.

Author Response File: Author Response.docx

Round 2

Reviewer 1 Report

Thanks for the response. Many minor concerns are resolved and the draft is more readable now.

Please find my remaining comments/concerns below. Essentially, I'm under the impression that at "some stages" you intend to remove the interference of background but at "some stages" you intend to leverage the information from background. It's not clear how you decide whether background should be filtered out and why the decision is better than related work. For example, there is related work completely filtering out the background, then why is it worse than your approach that only "focuses" more on the pedestrian?

"Learning without any background information can distinguish pedestrians with the same identity in different backgrounds, but performs poorly in the case of similar pedestrians with different identities": I'm not fully convinced by this argument. How can you distinguish these 4 cases: (1) person 1 with background 1; (2) person 1 with background 2; (3) person 2 with background 1; (4) person 2 with background 2. In other words, why could we associate the background with identity?

"so that the more positive examples appear in the final rank-list": I cannot parse this sentence. What will be caused by "more positive examples"? I'd expect something like "the more positive examples, the better xxx".

"recalculate the original features and background removal features, respectively": I don't understand what you meant by "respectively" here. So what recalculates the original features and what recalculates the background removal features?

"TL-TransNet is proposed to capture pedestrian body features more intensively": may I ask how could you capture pedestrian body features "more intensively" than the approaches removing background completely?

"That is to say, there are K within-class similarity scores and L between-class similarity score": may I ask why not K-1 and L-1 here?

nit:

"The deeplabv3 + model used in the paper is evaluated on the PASCAL": DeeplabV3+

"Existing researches are proposed various network models to optimize feature extraction of pedestrian images": grammar

Author Response

Response to Reviewer 1 Comments

Point 1: Please find my remaining comments/concerns below. Essentially, I'm under the impression that at "some stages" you intend to remove the interference of background but at "some stages" you intend to leverage the information from background. It's not clear how you decide whether background should be filtered out and why the decision is better than related work. For example, there is related work completely filtering out the background, then why is it worse than your approach that only "focuses" more on the pedestrian?

Response 1: Thank you very much for your suggestions. We have added some analysis in Section 1 (yellow highlighted region).

Point 2: "Learning without any background information can distinguish pedestrians with the same identity in different backgrounds, but performs poorly in the case of similar pedestrians with different identities": I'm not fully convinced by this argument. How can you distinguish these 4 cases: (1) person 1 with background 1; (2) person 1 with background 2; (3) person 2 with background 1; (4) person 2 with background 2. In other words, why could we associate the background with identity?

Response 2: Thank you very much for your suggestions. We have added some descriptions about the four cases you mentioned in Section 1 (i.e., “For pedestrians with the same identity in different backgrounds” corresponds to cases (1), (2) or (3), (4); “For pedestrians with different identities in the same background” corresponds to cases (1), (3) or (2), (4)). We have also added some analysis in terms of the association between background and identity in Section 1 (yellow highlighted region).

Point 3: "so that the more positive examples appear in the final rank-list": I cannot parse this sentence. What will be caused by "more positive examples"? I'd expect something like "the more positive examples, the better xxx".

Response 3: Thank you very much for your suggestions. We have modified the description of this sentence in Section 1 (yellow highlighted region).

Point 4: "recalculate the original features and background removal features, respectively": I don't understand what you meant by "respectively" here. So what recalculates the original features and what recalculates the background removal features?

Response 4: Thank you very much for your suggestions. We have modified the description of this sentence in Section 1 (yellow highlighted region).

Point 5: "TL-TransNet is proposed to capture pedestrian body features more intensively": may I ask how could you capture pedestrian body features "more intensively" than the approaches removing background completely?

Response 5: Thank you very much for your suggestions. We have corrected the misnomer and added some analysis in terms of the proposed method and the complete background-removal methods in Section 1 (yellow highlighted region).

Point 6: "That is to say, there are K within-class similarity scores and L between-class similarity score": may I ask why not K-1 and L-1 here?

Response 6: Thank you very much for your suggestions. We have corrected this error in Section 3.2 (yellow highlighted region).

Point 7: "The deeplabv3 + model used in the paper is evaluated on the PASCAL": DeeplabV3+. "Existing researches are proposed various network models to optimize feature extraction of pedestrian images": grammar.

Response 7: Thank you very much for your suggestions. We have corrected the grammar errors in Sections 1 and 3.3 (yellow highlighted region).

Author Response File: Author Response.docx

Reviewer 2 Report

The revision has addressed all my concerns.

Author Response

Response to Reviewer 2 Comments

Point 1: The revision has addressed all my concerns.

Response 1: Thank you very much for your suggestions.

Author Response File: Author Response.docx

Round 3

Reviewer 1 Report

Thanks for the comments! Most of my concerns are resolved and the draft is in better shape now.

A minor typo:

"several re-ranking methods only considers .. and does not": grammar

I'd also suggest that the authors proofread the entire draft, since the presentation is still rather poor. Please fix the typos before submitting and publishing.
