3.1. Pipeline and Overview
Our motivation is mainly to solve the Re-ID challenge of background variations. To alleviate the interference of background variations in both the training and retrieval phases, we propose TL-TransNet and a background adaptation re-ranking method for person re-identification. As shown in Figure 2, the pipeline of our framework consists of three parts (i.e., training with TL-TransNet, pedestrian background segmentation and removal, and background adaptation re-ranking).
First, TL-TransNet, based on a swin transformer and supervised by two types of loss, is developed to train on the input data; it intensively captures pedestrian body information while preserving a small amount of valuable background information. Then, DeepLabV3+ is applied to remove background interference from the probe and gallery sets. Finally, a background adaptation re-ranking method based on a mixed similarity metric is designed to combine original and background-removal features. It can obtain a ranking list that is more robust to background interference.
3.2. Training with TL-TransNet
Architecture. The key to enhancing the training of the pedestrian model is to reduce background noise interference. Simultaneously, some valuable background information should be retained for distinguishing similar pedestrians with different identities. The swin transformer [38] is utilized as the Re-ID benchmark model in this paper. Hence, in this section, TL-TransNet, based on a two-fold loss, is designed to pay more attention to the pedestrian identity embeddings during the training phase.
The swin transformer block constructs multi-head self-attention based on shifted windows. As shown in Figure 3, swin transformer blocks come in pairs. Each swin transformer block is composed of a LayerNorm (LN) layer, a multi-head self-attention module, a residual connection, and a multilayer perceptron (MLP) with two fully connected layers and a GELU non-linearity. The window-based multi-head self-attention (W-MSA) module and the shifted window-based multi-head self-attention (SW-MSA) module are used in the two successive transformer blocks, respectively.
modules are used. Based on such a window partitioning method, continuous swin transformer blocks may be defined as follows:
in which
represents the outputs of the
or
module, and
denotes the
module of the block
l.
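For concreteness, the following is a minimal PyTorch sketch of one such block pair. The attention modules are injected as constructor arguments (stand-ins for W-MSA and SW-MSA), and the MLP expansion ratio of 4 is the swin transformer default rather than a value reported in this paper:

```python
import torch.nn as nn

class SwinBlockPair(nn.Module):
    """Two consecutive swin transformer blocks implementing Equations (1)-(4):
    the first uses window attention (W-MSA), the second shifted-window
    attention (SW-MSA); both are followed by a two-layer GELU MLP."""

    def __init__(self, dim: int, w_msa: nn.Module, sw_msa: nn.Module, mlp_ratio: int = 4):
        super().__init__()
        self.w_msa, self.sw_msa = w_msa, sw_msa
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in range(4))
        def mlp():
            # Two fully connected layers with a GELU non-linearity.
            return nn.Sequential(nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
                                 nn.Linear(mlp_ratio * dim, dim))
        self.mlp1, self.mlp2 = mlp(), mlp()

    def forward(self, z):
        z_hat = self.w_msa(self.norms[0](z)) + z     # Equation (1)
        z = self.mlp1(self.norms[1](z_hat)) + z_hat  # Equation (2)
        z_hat = self.sw_msa(self.norms[2](z)) + z    # Equation (3)
        z = self.mlp2(self.norms[3](z_hat)) + z_hat  # Equation (4)
        return z
```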
Given a pedestrian input of size $H \times W \times C$, the swin transformer first reshapes the input to a $\frac{HW}{M^{2}} \times M^{2} \times C$ feature. In addition, each window has $M \times M$ patches, for a total of $\frac{HW}{M^{2}}$ windows. Then, self-attention is calculated within each window. For a window feature $X \in \mathbb{R}^{M^{2} \times d}$, the query, key, and value matrices $Q$, $K$, and $V$ are computed as:

$$Q = XP_{Q}, \quad K = XP_{K}, \quad V = XP_{V}, \tag{5}$$
where $P_{Q}$, $P_{K}$, and $P_{V}$ are projection matrices that are shared across different windows. Self-attention is computed as follows:

$$\text{Attention}(Q, K, V) = \text{SoftMax}\left(\frac{QK^{T}}{\sqrt{d}} + B\right)V, \tag{6}$$

where $Q, K, V \in \mathbb{R}^{M^{2} \times d}$, and $d$ represents the dimension of the query or key. Additionally, the values in $B$ are taken from the bias matrix $\hat{B} \in \mathbb{R}^{(2M-1) \times (2M-1)}$. In this way, the computational cost is greatly reduced.
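The sketch below renders Equations (5) and (6) as a single-head window attention module in PyTorch; restricting to one head and the tensor layout are our own simplifications of the multi-head module used in the swin transformer:

```python
import torch
import torch.nn as nn

class WindowAttention(nn.Module):
    """Single-head window self-attention with relative position bias,
    following Equations (5)-(6). M is the window side length, d the
    embedding dimension."""

    def __init__(self, d: int, M: int):
        super().__init__()
        self.d = d
        # Shared projection matrices P_Q, P_K, P_V (Equation (5)).
        self.P_q = nn.Linear(d, d, bias=False)
        self.P_k = nn.Linear(d, d, bias=False)
        self.P_v = nn.Linear(d, d, bias=False)
        # Learnable bias table B_hat with (2M-1) x (2M-1) entries; B indexes it.
        self.bias_table = nn.Parameter(torch.zeros((2 * M - 1) ** 2))
        coords = torch.stack(torch.meshgrid(
            torch.arange(M), torch.arange(M), indexing="ij")).flatten(1)
        rel = coords[:, :, None] - coords[:, None, :]   # (2, M^2, M^2)
        rel = rel.permute(1, 2, 0) + (M - 1)            # shift offsets to >= 0
        self.register_buffer("idx", rel[..., 0] * (2 * M - 1) + rel[..., 1])

    def forward(self, x):                   # x: (num_windows, M^2, d)
        Q, K, V = self.P_q(x), self.P_k(x), self.P_v(x)
        B = self.bias_table[self.idx]       # (M^2, M^2) relative position bias
        attn = (Q @ K.transpose(-2, -1)) / self.d ** 0.5 + B
        return attn.softmax(dim=-1) @ V     # Equation (6)
```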
Loss Function. In this work, a two-fold loss is designed to supervise the entire training process of TL-TransNet, which consists of circle loss [39] and instance loss [40].
Circle loss is introduced to strengthen the model's ability to discriminate pedestrian identities. As an improved triplet loss, it introduces weights that control the gradient contributions of positive and negative samples. Given $L$ classes in a person Re-ID dataset, a triplet input is composed of three samples $x_{a}$, $x_{p}$, and $x_{n}$. $x_{a}$ is an anchor sample belonging to class $a$, where $a \in \{1, 2, \ldots, L\}$. $x_{p}$ is a positive sample that comes from the same person class as $x_{a}$, while $x_{n}$ is a negative sample taken from a different person class than $x_{a}$. The circle loss is computed as follows:

$$\mathcal{L}_{circle} = \log\left[1 + \sum_{j=1}^{L}\exp\left(\gamma\alpha_{n}^{j}\left(s_{n}^{j} - \Delta_{n}\right)\right)\sum_{i=1}^{K}\exp\left(-\gamma\alpha_{p}^{i}\left(s_{p}^{i} - \Delta_{p}\right)\right)\right], \tag{7}$$

Assume that there are $K$ samples within the same class as the anchor and that $L$ is the number of classes in the whole dataset. That is to say, there are $K$ within-class similarity scores and $L$ between-class similarity scores associated with the anchor. We denote these similarity scores as $\{s_{p}^{i}\}\ (i = 1, 2, \ldots, K)$ and $\{s_{n}^{j}\}\ (j = 1, 2, \ldots, L)$, respectively. $\alpha_{p}^{i}$ and $\alpha_{n}^{j}$ are non-negative weighting factors, $\Delta_{p}$ and $\Delta_{n}$ are the within-class and between-class margins, and $\gamma$ is a scale factor for better within-class similarity separation.
During training, the gradient with respect to $s_{n}^{j}$ (respectively, $s_{p}^{i}$) is multiplied by $\alpha_{n}^{j}$ (respectively, $\alpha_{p}^{i}$) when backpropagated. When a similarity score deviates far from its optimum (i.e., $O_{n}$ for $s_{n}^{j}$ and $O_{p}$ for $s_{p}^{i}$), it should obtain a large weighting factor so as to receive an effective update with a large gradient. To this end, we define $\alpha_{p}^{i}$ and $\alpha_{n}^{j}$ in a self-paced manner:

$$\alpha_{p}^{i} = \left[O_{p} - s_{p}^{i}\right]_{+}, \qquad \alpha_{n}^{j} = \left[s_{n}^{j} - O_{n}\right]_{+}, \tag{8}$$

where $\left[\cdot\right]_{+}$ is the "cut-off at zero" operation to ensure that $\alpha_{p}^{i}$ and $\alpha_{n}^{j}$ are non-negative.
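The following PyTorch sketch implements Equations (7) and (8) for a single anchor; the optima $O_p = 1 + m$, $O_n = -m$ and margins $\Delta_p = 1 - m$, $\Delta_n = m$ follow the relaxed setting of the circle loss paper [39], and the default values of $m$ and $\gamma$ are assumptions rather than values reported in this excerpt:

```python
import torch
import torch.nn.functional as F

def circle_loss(s_p: torch.Tensor, s_n: torch.Tensor,
                m: float = 0.25, gamma: float = 64.0) -> torch.Tensor:
    """Circle loss for one anchor, Equations (7)-(8).

    s_p: (K,) within-class similarity scores; s_n: (L,) between-class
    similarity scores. Per [39]: O_p = 1 + m, O_n = -m,
    Delta_p = 1 - m, Delta_n = m."""
    O_p, O_n = 1.0 + m, -m
    delta_p, delta_n = 1.0 - m, m
    # Self-paced, non-negative weights (Equation (8)); detached so the
    # gradient w.r.t. each score is simply scaled by its weight.
    alpha_p = torch.clamp_min(O_p - s_p.detach(), 0.0)
    alpha_n = torch.clamp_min(s_n.detach() - O_n, 0.0)
    logit_p = -gamma * alpha_p * (s_p - delta_p)
    logit_n = gamma * alpha_n * (s_n - delta_n)
    # log(1 + sum_j exp(.) * sum_i exp(.)) computed stably (Equation (7)).
    return F.softplus(torch.logsumexp(logit_n, dim=0) +
                      torch.logsumexp(logit_p, dim=0))
```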
Instance loss is added to provide better weight initialization for TL-TransNet and to encourage the Re-ID model to find the fine-grained details that distinguish different pedestrians. As instance loss explicitly considers the inter-class distance, it can reduce the feature correlation between two pedestrian images. The instance loss is formulated as follows:

$$P = \text{softmax}\left(W^{T}f\right), \tag{9}$$

$$\mathcal{L}_{instance} = -\log\left(P(c)\right), \tag{10}$$

where $f$ is a feature vector extracted from TL-TransNet, $f = \mathcal{F}(I)$, in which $I$ denotes the input image and the function $\mathcal{F}(\cdot)$ is the forward propagation process of TL-TransNet. $W$ is the parameter of the final fully connected layer. It can be viewed as concatenated weights $W = [w_{1}, w_{2}, \ldots, w_{p}]$, where $p$ represents the total number of categories in the training process. Every weight $w_{i}$ from $W$ is a 2048-dim vector. Since the total number of identity classes in the two person Re-ID datasets ranges between 1024 and 2048, the weight is set to a 2048-dim vector in order to unify the hyperparameters of the network. $\mathcal{L}_{instance}$ denotes the loss, $P$ denotes the probability over all classes, and $P(c)$ is the predicted probability of the correct class $c$.
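A minimal PyTorch sketch of Equations (9) and (10), assuming 2048-dim features and a standard linear layer as the final fully connected layer (the names are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def instance_loss(f: torch.Tensor, fc: nn.Linear, target: torch.Tensor) -> torch.Tensor:
    """Instance loss, Equations (9)-(10): a softmax over the final fully
    connected layer followed by the negative log-probability of the
    correct class.

    f: (batch, 2048) TL-TransNet features; fc: nn.Linear(2048, p), whose
    weights play the role of the class vectors w_i; target: (batch,) labels."""
    log_p = F.log_softmax(fc(f), dim=1)  # log softmax(W^T f)
    return F.nll_loss(log_p, target)     # -log P(c), averaged over the batch
```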
The final loss function of TL-TransNet is the combination of circle loss and instance loss, which can be defined as follows:

$$\mathcal{L}_{total} = \mathcal{L}_{circle} + \beta\mathcal{L}_{instance}, \tag{11}$$

where $\beta$ is a predefined weight for the instance loss. The generalization ability of the two-fold loss is greatly improved by integrating the advantages of the two losses mentioned above, which benefits the model training of Re-ID.
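Combining the two sketches above, one training step would evaluate the two-fold loss along the following lines (`beta`, the feature tensors, and the labels are placeholders; the value of $\beta$ is not stated in this excerpt):

```python
# s_p, s_n: within-/between-class similarity scores for the current anchor,
# f: TL-TransNet output features, fc: the final fully connected layer,
# target: ground-truth identity labels. beta is a hypothetical setting.
loss = circle_loss(s_p, s_n) + beta * instance_loss(f, fc, target)  # Equation (11)
loss.backward()
```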
3.3. Pedestrian Background Segmentation and Removal
Each pedestrian captured across multiple cameras appears against multiple backgrounds, which generates background interference. This phenomenon causes the Re-ID model to absorb too much background information during the feature extraction stage, causing retrieval accuracy to suffer. That is to say, each pedestrian image contains a lot of background information that reduces the weight of the pedestrian instance features during the retrieval stage. In response to this challenge, some existing approaches incorporate an attention mechanism to make the network focus more on extracting the salient aspects of pedestrians. However, attention mechanisms are not robust in some complicated and variable scenarios. Thus, our purpose is to embed a robust pedestrian body representation in the feature extraction stage after filtering out the background with image segmentation technology.
The pedestrian segmentation results are not sufficient when using the original semantic segmentation method directly. Applying such background segmentation results to train the model directly would have a major impact on pedestrian Re-ID accuracy, due to the presence of distractors such as roadblocks and buildings, as well as incomplete pedestrian features. As shown in Figure 4, this paper employs the DeepLabV3+ network as the segmentation tool to refine the initial scene segmentation. Therefore, the background in the whole image can be better removed, and the foreground (pedestrian) can be precisely retained while cutting out the background.
The background segmentation process based on DeepLabV3+ in Figure 4 is divided into two stages, i.e., an encoder and a decoder. In the encoder part, the pedestrian input is passed through atrous convolutions to extract multi-scale features, and then the number of channels is reduced using a 1 × 1 convolution to produce the high-level features. In the decoder part, the high-level features are first upsampled and then concatenated with the corresponding low-level features from the network backbone. After the concatenation, a 3 × 3 convolution is applied to the fused features, which are then upsampled to obtain the background segmentation mask.
The DeepLabV3+ model used in this paper is evaluated on the PASCAL VOC 2012 semantic segmentation benchmark, which contains 20 foreground object classes and one background class. The original dataset contains 1464 (train), 1449 (val), and 1456 (test) pixel-level annotated images.
Before the test phase of person Re-ID, this paper segments all pedestrian images affected by background interference in the probe and gallery sets in order to limit the impact of the background on pedestrians as much as feasible. The redundant information is erased according to the mask: the background is erased by setting all of its pixel values to zero, resulting in pure black and leaving only the essential characteristics such as the pedestrian. The background removal is shown in Formula (12):

$$I = I_{o} \odot M, \tag{12}$$

The original image in the probe and gallery set is represented by $I_{o}$, and $M$ is the mask created by segmenting the original image using the DeepLabV3+ model, where $\odot$ denotes element-wise multiplication. Except for the foreground (pedestrian) pixels, all other pixels in the mask are black. $I$ denotes the background-removed pedestrian image, obtained by applying the mask to the original image pixel by pixel.
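As an illustration of Formula (12), the sketch below obtains a person mask and zeroes out the background. It uses torchvision's pretrained DeepLabV3 purely as an accessible stand-in for the DeepLabV3+ model described above (class index 15 is "person" in the PASCAL VOC label set); all other details are assumptions:

```python
import numpy as np
import torch
from PIL import Image
from torchvision import transforms
from torchvision.models.segmentation import deeplabv3_resnet101

model = deeplabv3_resnet101(weights="DEFAULT").eval()  # VOC-style classes

preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def remove_background(image: Image.Image) -> Image.Image:
    """Formula (12): I = I_o (x) M, zeroing every non-person pixel."""
    x = preprocess(image).unsqueeze(0)
    with torch.no_grad():
        logits = model(x)["out"][0]             # (21, H, W) class scores
    mask = (logits.argmax(0) == 15).numpy()     # binary person mask M
    masked = np.array(image) * mask[..., None]  # element-wise product
    return Image.fromarray(masked.astype(np.uint8))
```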
Figure 5 shows the whole process of background removal. Pedestrians in
Figure 5 are in a complex and changing public space, which prompts the Re-ID model to extract redundant environmental information during the testing phase. For example, the pedestrian’s legs are partially occluded by the box in
Figure 5. Therefore, the Re-ID model may incorrectly extract the occluding object as if it were pedestrian body information.
3.4. Background Adaptation Re-Ranking
In order to improve the final rank-list result, a background adaptation re-ranking method is designed to dig out more positive samples with lower rankings due to background variations. That is to say, the motivation for background adaptation re-ranking is to fuse the original results and background removal results to improve positive samples affected by background variations. The proposed re-ranking method mainly consists of two steps.
In the first step, we are given a probe pedestrian image $p$ and a gallery pedestrian image set $G = \{g_{i} \mid i = 1, 2, \ldots, N\}$ with $N$ examples. The initial ranking $\mathcal{L}(p, G) = \{g_{1}^{0}, g_{2}^{0}, \ldots, g_{N}^{0}\}$ is determined by sorting the pairwise distances between the query and the gallery set in ascending order. The pairwise distance is calculated as the Euclidean distance between the features output by TL-TransNet. The set $N(p, k)$ of k-nearest neighbors of the probe $p$ is defined as follows:

$$N(p, k) = \left\{g_{1}^{0}, g_{2}^{0}, \ldots, g_{k}^{0}\right\}, \quad |N(p, k)| = k, \tag{13}$$
Following [41], the k-reciprocal nearest neighbor set $R(p, k)$ can be defined as:

$$R(p, k) = \left\{g_{i} \mid g_{i} \in N(p, k) \wedge p \in N(g_{i}, k)\right\}, \tag{14}$$
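A small NumPy sketch of Equations (13) and (14); `dist` is assumed to be a symmetric pairwise Euclidean distance matrix over the probe and gallery features, with the probe at row `p` (the indexing conventions are ours):

```python
import numpy as np

def k_nearest(dist: np.ndarray, i: int, k: int) -> np.ndarray:
    """N(i, k): indices of the k samples closest to sample i (Equation (13))."""
    return np.argsort(dist[i])[:k]

def k_reciprocal(dist: np.ndarray, p: int, k: int) -> set:
    """R(p, k): neighbors g_i of p such that p is also a k-nearest
    neighbor of g_i (Equation (14))."""
    return {int(g) for g in k_nearest(dist, p, k)
            if p in k_nearest(dist, int(g), k)}
```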
Due to background variations, some positive samples tend to rank low in the initial rank-list. In the second step, the background adaptation set $R_{b}(p, k)$ of probe $p$ is designed based on $R(p, k)$. Its similarity distance is calculated by fusing the original features and the background removal ones. The fused feature $f_{fuse}$ adopts the method of dimension splicing, which can be computed as follows:

$$f_{fuse} = \left[f_{o}, f_{b}\right], \tag{15}$$

where $f_{o}$ and $f_{b}$ denote the original and background removal feature vectors, respectively, both extracted by TL-TransNet. The background adaptation set $R_{b}(p, k)$ can then be defined as:

$$R_{b}(p, k) = R(p, k) \cup N_{f}(p, k), \tag{16}$$

where $N_{f}(p, k)$ represents the k-nearest neighbor set obtained from the fused features.
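A short NumPy sketch of Equations (15) and (16); the union in `background_adaptation_set` follows our reading of Equation (16) above, and the distance matrix over fused features is assumed to be precomputed:

```python
import numpy as np

def fuse_features(f_o: np.ndarray, f_b: np.ndarray) -> np.ndarray:
    """Equation (15): dimension splicing of the original and the
    background-removed feature vectors from TL-TransNet."""
    return np.concatenate([f_o, f_b], axis=-1)

def background_adaptation_set(p: int, dist_fused: np.ndarray,
                              r_pk: set, k: int) -> set:
    """Equation (16): union of the k-reciprocal set R(p, k) with the
    k-nearest neighbor set N_f(p, k) computed on the fused features."""
    n_f = set(np.argsort(dist_fused[p])[:k].tolist())
    return r_pk | n_f
```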
As the Jaccard metric evaluates dissimilarity between samples, if two images belong to the same pedestrian but have different backgrounds, their k-reciprocal nearest neighbor sets still overlap. To add more positive candidates, the proposed re-ranking re-calculates the Jaccard distance from the background adaptation set $R_{b}(p, k)$.
The final distance $d^{*}$ of our re-ranking is composed of the original distance and the Jaccard distance metric, which is expressed as follows:

$$d^{*}(p, g_{i}) = (1 - \lambda)d_{J}(p, g_{i}) + \lambda d(p, g_{i}), \tag{17}$$

where $\lambda \in [0, 1]$; $d_{J}(p, g_{i})$ and $d(p, g_{i})$ express the Jaccard distance and the Euclidean distance between $p$ and $g_{i}$, respectively.
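Equation (17) amounts to a weighted blend of the two distance metrics; in the sketch below, the default $\lambda = 0.3$ mirrors the k-reciprocal re-ranking paper [41] and is an assumption, not a value reported here:

```python
def final_distance(d_jaccard: float, d_euclidean: float, lam: float = 0.3) -> float:
    """Equation (17): d*(p, g_i) = (1 - lam) * d_J(p, g_i) + lam * d(p, g_i)."""
    return (1.0 - lam) * d_jaccard + lam * d_euclidean
```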