In this section, we present the techniques designed to improve the transformer baseline. As demonstrated in Figure 2, the proposed PIT consists of three parts: the baseline, a patch pyramid network (PPN) for fine-grained patch feature extraction, and an identity information embedding module (IDE). The first part is the baseline; we use the ViT model as the baseline for the person re-identification task, with details introduced in Section 3.1. The second part is the PPN, which obtains fine-grained features that are invariant to global disturbances by compensating for the local feature information lost in the dicing operation; its details are introduced in Section 3.2. The third part is the IDE, which prevents appearance bias from interfering with the generated information encoding: the same person should have the same identity information encoding. Its details are introduced in Section 3.3.
3.1. Baseline
As shown in Figure 3, the input is an image $x \in \mathbb{R}^{C \times H \times W}$, where $C$, $H$, and $W$ represent the channel size, height, and width, respectively. We divide it into $N$ fixed-size patches $\{x_p^i \mid i = 1, 2, \ldots, N\}$ through the slicing operation. These patches are flattened into one-dimensional tensors and then projected into a lower-dimensional space $D$ using a linear transformation. The resulting feature vector of each patch is used as input to the subsequent layers. In addition, a learnable embedding token denoted as $x_{cls}$ is prepended to the input sequence, with its output serving as the global feature denoted by $f$. Learnable position embeddings are also incorporated to capture spatial information. The input sequence to the transformer layers can be represented as

$$Z_0 = [x_{cls};\, F(x_p^1);\, F(x_p^2);\, \ldots;\, F(x_p^N)] + E_{pos},$$

where $Z_0$ is the input sequence embedding of the baseline, $E_{pos} \in \mathbb{R}^{(N+1) \times D}$ is the location information embedding, and $F$ is the linear mapping that projects the diced patches to $D$ dimensions. Additionally, $l$ transformer layers are used to learn feature representations.
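The following is a minimal PyTorch sketch of how such an input sequence can be assembled; the class and parameter names (BaselineInput, embed_dim, num_patches, etc.) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class BaselineInput(nn.Module):
    """Builds Z_0 = [x_cls; F(x_p^1); ...; F(x_p^N)] + E_pos (illustrative sketch)."""
    def __init__(self, in_chans=3, patch_size=16, embed_dim=768, num_patches=128):
        super().__init__()
        # F: linear mapping of each flattened patch to D dimensions,
        # implemented as a strided convolution as in ViT.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))            # x_cls
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))  # E_pos

    def forward(self, x):                                   # x: (B, C, H, W)
        patches = self.proj(x).flatten(2).transpose(1, 2)   # (B, N, D)
        cls = self.cls_token.expand(x.size(0), -1, -1)      # (B, 1, D)
        return torch.cat([cls, patches], dim=1) + self.pos_embed  # (B, N+1, D)
```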
Patch Slicing. Patch splitting refers to the process of dividing an input image into several small, equally sized blocks called "patches". We use a sliding-window approach to divide an image of size $H \times W$ into $N$ fixed-size patches. The stride is denoted as $S$, and the patch size is denoted as $P$:

$$N = N_H \times N_W = \left\lfloor \frac{H - P}{S} + 1 \right\rfloor \times \left\lfloor \frac{W - P}{S} + 1 \right\rfloor,$$

where $\lfloor \cdot \rfloor$ is the floor function and $S$ is set the same as $P$ in the baseline, so the patches do not overlap. $N_H$ and $N_W$ represent the numbers of split patches in height and width, respectively.
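As a quick sanity check of this formula (a sketch; the 256 × 128 input resolution and P = S = 16 are assumed values, common in transformer-based re-ID):

```python
import math

H, W = 256, 128          # assumed person re-ID input resolution
P, S = 16, 16            # patch size and stride (S = P: non-overlapping)

N_H = math.floor((H - P) / S + 1)   # 16 patches along the height
N_W = math.floor((W - P) / S + 1)   # 8 patches along the width
N = N_H * N_W                       # 128 patches in total
print(N_H, N_W, N)
```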
Position Embeddings. The image resolution in person re-ID differs from the resolution ViT was originally trained on, so the pre-trained position embeddings from ImageNet cannot be loaded into the baseline directly. Therefore, we apply bilinear 2D interpolation to adapt the position embeddings to the input image resolution.
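A minimal sketch of this resizing step, assuming the pre-trained grid is 14 × 14 (ImageNet, 224 × 224 input with 16 × 16 patches) and the target grid is 16 × 8; the function name and defaults are illustrative:

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed, old_hw=(14, 14), new_hw=(16, 8)):
    """Bilinearly interpolate pre-trained position embeddings to a new patch grid.

    pos_embed: (1, 1 + old_H * old_W, D), with the cls-token embedding first.
    """
    cls_pe, grid_pe = pos_embed[:, :1], pos_embed[:, 1:]
    D = grid_pe.shape[-1]
    # (1, N, D) -> (1, D, old_H, old_W) so F.interpolate sees a 2D grid
    grid_pe = grid_pe.reshape(1, *old_hw, D).permute(0, 3, 1, 2)
    grid_pe = F.interpolate(grid_pe, size=new_hw, mode="bilinear",
                            align_corners=False)
    grid_pe = grid_pe.permute(0, 2, 3, 1).reshape(1, new_hw[0] * new_hw[1], D)
    return torch.cat([cls_pe, grid_pe], dim=1)
```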
Supervised Learning. We use the ID loss to optimize the baseline. The ID loss is the cross-entropy loss without label smoothing and can be expressed as

$$L_{ID} = -\sum_{k=1}^{K} q_k \log p_k,$$

where $K$ represents the total number of identities, $y$ represents the ground-truth ID label, $q_k$ represents the target probability ($q_k = 1$ if $k = y$ and $q_k = 0$ otherwise), and $p_k$ represents the ID prediction logit of class $k$. The smaller the cross-entropy loss, the closer the predicted result is to the true result.
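In PyTorch this corresponds to the standard cross-entropy over identity logits (a sketch; the batch size and identity count below are arbitrary examples):

```python
import torch
import torch.nn.functional as F

num_ids = 751                              # e.g., training identities in Market-1501
logits = torch.randn(32, num_ids)          # ID prediction logits for a batch of 32
labels = torch.randint(0, num_ids, (32,))  # ground-truth ID labels y

id_loss = F.cross_entropy(logits, labels)  # cross-entropy without label smoothing
```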
3.2. Patch Pyramid Network
Although the transformer encoder provides a larger receptive field, the dicing operation reduces the spatial correlation between patches, so fine-grained features cannot be extracted effectively.
As shown in Figure 4, the proposed PPN is a multi-scale patch fine-grained feature extraction network. Given an image $x \in \mathbb{R}^{C \times H \times W}$, where $C$, $H$, and $W$ represent the channel size, height, and width, respectively, we split it into $N$ fixed-size patches through the PPN. We explain the details of the PPN below.
Based on convolution operations with four different kernel sizes (2 × 2, 4 × 4, 8 × 8, 16 × 16), $x$ is divided into patches of four different sizes, $P_2$, $P_4$, $P_8$, and $P_{16}$; these convolutions only alter the size of the patches. The sizes of $P_2$, $P_4$, $P_8$, and $P_{16}$ are expressed as

$$P_k \in \mathbb{R}^{D \times \frac{H}{k} \times \frac{W}{k}}, \quad k \in \{2, 4, 8, 16\}.$$

Then, $P_2$, $P_4$, $P_8$, and $P_{16}$ are arranged in a pyramid structure, where the top of the pyramid is the largest patch scale $P_2$ (the finest patch grid) and the bottom is the smallest, $P_{16}$. Starting from the top of the pyramid, the downsampling operator $f_d$ is applied to scale the patches. After downsampling, patches at adjacent pyramid levels are summed together. These two operations are repeated until the patches reach the smallest scale, which is the target patch scale; accordingly, the object patches $P_{obj}$ are obtained. Moreover, the proposed feature extraction module $E(\cdot)$ is applied before each downsampling operation. It is worth noting that no feature extraction is performed at the minimum patch scale, since the shallow feature extraction operation at that level degrades model accuracy. Finally, the output $P_{obj}$ can be deduced, given by

$$P_{obj} = f_d\Big(E\Big(f_d\big(E\big(f_d(E(P_2)) + P_4\big)\big) + P_8\Big)\Big) + P_{16},$$

where $f_d$ represents the downsampling operation with a convolution kernel of 3 × 3 and $E(\cdot)$ is the extraction module described below.
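The pyramid can be sketched in PyTorch as follows; the channel width, module names, and exact fusion order are assumptions based on the description above, not the authors' code:

```python
import torch
import torch.nn as nn

class PatchPyramid(nn.Module):
    """Sketch of the PPN: multi-scale patch splitting, per-level extraction,
    3x3 strided downsampling, and summation with the next (coarser) level."""
    def __init__(self, in_chans=3, dim=768, extractor=None):
        super().__init__()
        # Four patch-splitting convolutions (kernel = stride = patch scale).
        self.split = nn.ModuleList(
            nn.Conv2d(in_chans, dim, kernel_size=k, stride=k) for k in (2, 4, 8, 16)
        )
        # f_d: downsampling with a 3x3 kernel, halving the patch-grid resolution.
        self.down = nn.Conv2d(dim, dim, kernel_size=3, stride=2, padding=1)
        # E: extraction module (see Figure 5); identity used here as a placeholder.
        self.extract = extractor if extractor is not None else nn.Identity()

    def forward(self, x):
        p2, p4, p8, p16 = (conv(x) for conv in self.split)
        out = self.down(self.extract(p2)) + p4     # top of the pyramid
        out = self.down(self.extract(out)) + p8
        out = self.down(self.extract(out)) + p16   # target (smallest) scale
        return out                                  # P_obj: (B, dim, H/16, W/16)
```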
As shown in Figure 5, the extraction module consists of four parts: the input $P$, the AC module, the AS module, and the output $P_{out}$. The AC module is an attention mechanism in the channel dimension, and the AS module is an attention mechanism in the spatial dimension. The two modules are connected in parallel within the extraction module. They extract channel features and spatial features from the patch block, respectively, and the two results are then added together to obtain the final fine-grained feature.
AC module. In the channel dimension, the AC module is an adaptive feature extraction module. Two $C \times 1 \times 1$ 2D maps $F_{max}^c$ and $F_{avg}^c$ are generated by the parallel maximum-pooling $f_{max}$ and average-pooling $f_{avg}$, which resize the input $P$ from $C \times H \times W$ to $C \times 1 \times 1$. Then, in order to avoid overcomplicating the model, the obtained $C \times 1 \times 1$ 2D feature maps are processed by a squeeze-and-permute operation $f_{sp}$ to obtain $\hat{F}_{max}^c$ and $\hat{F}_{avg}^c$. After this, a 1D convolution $f_{1D}$ with a kernel size of 3 is used to process $\hat{F}_{max}^c$ and $\hat{F}_{avg}^c$ separately. This realizes local cross-channel interaction, leading to $G_{max}^c$ and $G_{avg}^c$. Both are then added together to obtain $G^c$, and $A^c$ is acquired by applying a sigmoid activation function $\sigma$ to $G^c$. Finally, the weight of the channel dimension $W_c$ is obtained after a dimensional upgrade operation $f_{up}$. The operation of the AC module can be deduced by

$$W_c = f_{up}\Big(\sigma\big(f_{1D}(f_{sp}(f_{max}(P))) + f_{1D}(f_{sp}(f_{avg}(P)))\big)\Big).$$

The adaptive channel feature $P_c$ can be acquired by multiplying $W_c$ by the input patch $P$, giving

$$P_c = W_c \otimes P.$$
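A minimal PyTorch sketch of an AC module consistent with this description (ECA-style channel attention with both max- and average-pooled branches; the class name and kernel size are assumptions):

```python
import torch
import torch.nn as nn

class ACModule(nn.Module):
    """Adaptive channel attention: pooled C x 1 x 1 maps -> 1D conv -> sigmoid weight."""
    def __init__(self, k=3):
        super().__init__()
        # f_1D: local cross-channel interaction via a shared 1D convolution.
        self.conv1d = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, p):                              # p: (B, C, H, W)
        f_max = p.amax(dim=(2, 3))                     # f_max(P): (B, C)
        f_avg = p.mean(dim=(2, 3))                     # f_avg(P): (B, C)
        # f_sp: squeeze/permute so channels lie along the 1D conv axis: (B, 1, C)
        g = self.conv1d(f_max.unsqueeze(1)) + self.conv1d(f_avg.unsqueeze(1))
        w_c = self.sigmoid(g).transpose(1, 2).unsqueeze(-1)  # f_up: (B, C, 1, 1)
        return w_c * p                                 # adaptive channel feature P_c
```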
AS module. The AS module is an adaptive feature extraction module in the spatial dimension. Two $1 \times H \times W$ 2D maps $F_{max}^s$ and $F_{avg}^s$ are generated through the adaptive maximum-pooling $f_{max}$ and adaptive average-pooling $f_{avg}$, producing two $1 \times H \times W$ feature maps from the input $P$. After this, $F^s$ is obtained by concatenating $F_{max}^s$ and $F_{avg}^s$ through the concatenation operation $\oplus$. Then, $G^s$ is obtained by a convolution $f_{conv}$ with a kernel size of 3 × 3 that reduces the number of channels in $F^s$ to 1. The adaptive weight $W_s$ of the spatial dimension is acquired by applying a sigmoid activation function $\sigma$ to $G^s$. Accordingly, the operation of the AS module can be computed as

$$W_s = \sigma\big(f_{conv}(f_{max}(P) \oplus f_{avg}(P))\big).$$

Similarly, the adaptive spatial feature $P_s$ can be defined as

$$P_s = W_s \otimes P.$$
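A matching sketch of the AS module (CBAM-style spatial attention; again, the names are illustrative):

```python
import torch
import torch.nn as nn

class ASModule(nn.Module):
    """Adaptive spatial attention: two 1 x H x W pooled maps -> concat -> 3x3 conv -> sigmoid."""
    def __init__(self):
        super().__init__()
        # f_conv: 3x3 convolution reducing the 2 concatenated channels to 1.
        self.conv = nn.Conv2d(2, 1, kernel_size=3, padding=1, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, p):                               # p: (B, C, H, W)
        f_max = p.amax(dim=1, keepdim=True)             # F_max^s: (B, 1, H, W)
        f_avg = p.mean(dim=1, keepdim=True)             # F_avg^s: (B, 1, H, W)
        w_s = self.sigmoid(self.conv(torch.cat([f_max, f_avg], dim=1)))
        return w_s * p                                  # adaptive spatial feature P_s
```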
Based on the AC and AS operations, the output $P_{out}$ of the extraction module can be computed as

$$P_{out} = \alpha P_c + \beta P_s,$$

where $\alpha$ and $\beta$ are learnable weights, updated with the gradient, whose sum is 1.
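The constraint $\alpha + \beta = 1$ can be realized, for example, with a softmax over two learnable scalars; this is one possible reading of the description, reusing the ACModule and ASModule sketches above:

```python
import torch
import torch.nn as nn

class ExtractionModule(nn.Module):
    """Extraction module: P_out = alpha * P_c + beta * P_s with alpha + beta = 1."""
    def __init__(self):
        super().__init__()
        self.ac, self.as_ = ACModule(), ASModule()
        self.logits = nn.Parameter(torch.zeros(2))  # softmax keeps the weights summing to 1

    def forward(self, p):
        alpha, beta = torch.softmax(self.logits, dim=0)
        return alpha * self.ac(p) + beta * self.as_(p)
```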
3.3. Identity Information Embedding Module
Although fine-grained features are obtained, the impact of clothing changes cannot be ignored. In other words, the model may fail to discriminate different objects seen from the same angle due to the bias introduced by clothing information. Therefore, the proposed IDE incorporates the identity information of people into the embedding representation to obtain robust features.
Similar to positional embeddings, which employ learnable embeddings to encode positional information, we insert learnable one-dimensional embeddings to preserve identity information. In particular, as shown in Figure 1, we insert the identity information embedding into the transformer encoder together with the patch and position embeddings. Specifically, assuming that there are $M$ person IDs in total, we initialize the learnable identity information embedding as $E_{id} \in \mathbb{R}^{M \times D}$. If the ID of a person is $k$, its identity information embedding can be expressed as $E_{id}^k$. Unlike positional embeddings, which vary between patches, the identity information embedding $E_{id}^k$ is the same for all patches of an image.
As the identity information embedding, patch embedding, and position embedding are all linearly mapped to a $D$-dimensional space, their input sequences can be directly added for information integration. The IDE can be written as

$$Z_0' = Z_0 + \lambda E_{id}^k,$$

where $Z_0'$ is the input sequence after adding the identity information embedding, $Z_0$ is the original input sequence in the baseline, and $\lambda$ is the parameter weighting the identity information embedding. The transformer encoding layer can encode embeddings of different distribution types, so these features can be added directly.
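A small sketch of the IDE step, assuming the scalar weight $\lambda$ is learnable and the ground-truth ID labels are available during training (names are illustrative):

```python
import torch
import torch.nn as nn

class IdentityEmbedding(nn.Module):
    """Adds a per-identity learnable embedding to every token of the input sequence."""
    def __init__(self, num_ids, dim=768):
        super().__init__()
        self.id_embed = nn.Embedding(num_ids, dim)   # E_id in R^{M x D}
        self.lam = nn.Parameter(torch.tensor(1.0))   # lambda, the IDE weighting parameter

    def forward(self, z0, ids):       # z0: (B, N+1, D), ids: (B,) ground-truth IDs
        e_id = self.id_embed(ids).unsqueeze(1)       # (B, 1, D), shared by all patches
        return z0 + self.lam * e_id                  # Z_0' = Z_0 + lambda * E_id^k
```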