3.1. Design of Two-Branch Network
Since its introduction, ResNet has been widely used across many fields of deep learning owing to its simple and practical design. Many algorithms in the VI-ReID domain are likewise built on ResNet.
Two-branch networks were introduced as early as when Wu et al. [23] proposed the cross-modality person re-identification problem, and they are a common approach to feature extraction in cross-domain tasks. The two-branch structure first extracts single-modality features from the input RGB and IR images using two separate networks, and then projects the extracted RGB and IR features into the shared feature space of VI-ReID through a parameter-sharing network.
The data of the two modalities contain both modality-related information and identity-related information. To improve identity recognition under a unified feature representation, the network must separate modality information from identity information and map more identity information into the shared feature space. Therefore, a two-branch network is designed for feature extraction from the different modalities: the parameters of the shallow layers are kept independent to extract modality-specific information, which addresses the discrepancy caused by the modality gap between the two kinds of data. Meanwhile, the two-branch network adopts a partially shared structure that learns multimodal shareable features by extracting modality-specific and modality-shared information simultaneously.
The two-branch cross-modality person re-identification network proposed in this paper uses ResNet50 as the backbone network; its structure is shown in Figure 1.
According to the description in [37], the backbone network ResNet50 can be divided into five parts, as shown in Table 1, which gives the naming of the ResNet50 network structure and the parameters of the corresponding layers.
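For concreteness, the snippet below maps torchvision's ResNet50 onto the five parts named in Table 1; the Conv1-Conv5 labels follow the paper, while the attribute names (conv1, layer1, and so on) and the placement of the max-pooling layer inside Conv1 are illustrative assumptions, not the paper's code.

```python
# The five-part view of ResNet50 from Table 1, mapped onto torchvision's
# implementation for illustration.
import torch.nn as nn
from torchvision.models import resnet50

backbone = resnet50()

stages = {
    "Conv1": nn.Sequential(backbone.conv1, backbone.bn1,
                           backbone.relu, backbone.maxpool),  # 7x7 stem + max pool
    "Conv2": backbone.layer1,  # 3 bottleneck blocks, 256-channel output
    "Conv3": backbone.layer2,  # 4 bottleneck blocks, 512-channel output
    "Conv4": backbone.layer3,  # 6 bottleneck blocks, 1024-channel output
    "Conv5": backbone.layer4,  # 3 bottleneck blocks, 2048-channel output
}
```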
During information extraction by a neural network, the shallow convolutional layers mainly capture the low-level visual information of the image, while deeper layers extract higher-level semantic information that carries more identity-discriminative cues. In the cross-modality person re-identification task, the data of the two modalities contain shared identity features as well as modality-specific features, and the model needs to acquire more identity-related features to achieve good results. Therefore, the designed two-branch network uses a parameter-independent structure in the shallow layers to extract the low-level visual information of the two modalities separately, and parameter-sharing deep layers to extract the shared features.
The two-branch network structure has two main components: feature extraction and feature embedding. In the feature extraction part, the two branches take visible data and colored pseudo-visible data as input, respectively, capturing modality-specific information from the different images; the feature embedding part focuses on learning the shared space across modalities and characterizing the extracted features. The learning objectives mainly comprise cross-modality and intra-modality constraints.
Feature extraction. The model uses ResNet50 as the backbone of the feature extraction network, comprising the layers Conv1, Conv2, Conv3, Conv4, and Conv5, where Conv1 and Conv2 are shallow layers with unshared parameters, while Conv3, Conv4, and Conv5 are deep layers with shared parameters, so that more identity-discriminative features can be learned. As shown in Figure 1, given the visible-light input data $R$ or the infrared input data $FR$, the feature map $f_R$ or $f_{FR}$ is obtained through the backbone network ResNet50.
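A minimal PyTorch sketch of this two-branch extractor is given below; the class and function names are illustrative, and initializing the stages from ImageNet-pretrained weights, as is common practice, is an assumption rather than a detail stated here.

```python
# A minimal sketch of the two-branch extractor: Conv1-Conv2 are duplicated
# per modality, Conv3-Conv5 are shared across modalities.
import torch.nn as nn
from torchvision.models import resnet50

def shallow_stages():
    """Conv1 + Conv2: a fresh (parameter-independent) copy per modality."""
    m = resnet50()  # in practice, ImageNet-pretrained weights would be loaded
    return nn.Sequential(m.conv1, m.bn1, m.relu, m.maxpool, m.layer1)

class TwoBranchExtractor(nn.Module):
    def __init__(self):
        super().__init__()
        # Independent shallow layers capture modality-specific low-level cues.
        self.shallow_r = shallow_stages()
        self.shallow_fr = shallow_stages()
        # Shared deep layers (Conv3-Conv5) learn identity-related features.
        m = resnet50()
        self.deep_shared = nn.Sequential(m.layer2, m.layer3, m.layer4)

    def forward(self, r, fr):
        # r: visible images; fr: colored pseudo-visible images.
        f_r = self.deep_shared(self.shallow_r(r))     # feature map f_R
        f_fr = self.deep_shared(self.shallow_fr(fr))  # feature map f_FR
        return f_r, f_fr
```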
Feature embedding. To learn the discriminative information of the two different modalities, we introduce a fully connected layer after the two-branch feature extractor, whose parameters are shared in order to model the modality-shared information; otherwise, the features learned for the two modalities might lie in completely different subspaces. The shared structure serves as a projection function that maps the two modalities into a common space. The feature map $f_R$ or $f_{FR}$ obtained in feature extraction is cut into $n$ parts along the horizontal direction, and an $n \times 2048$-dimensional feature is obtained by applying a global pooling layer to each part. To further reduce the feature dimension, a convolution layer with $1 \times 1$ kernels and a BatchNorm layer reduce each 2048-dimensional part feature to a 256-dimensional representation. Therefore, each input image $R$ or $FR$ is eventually represented by $n$ 256-dimensional features $\{f_R^k\}_{k=1}^{n}$ or $\{f_{FR}^k\}_{k=1}^{n}$. These features are then trained through fully connected layers, where each feature $f_R^k$ or $f_{FR}^k$ is treated independently. The parameters of the fully connected layer designed for each person are shared, and each feature produces its corresponding probability prediction through the fully connected layer, which is used to calculate the identity loss $\mathcal{L}_{id}$.
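The sketch below illustrates this embedding head; the part count n_parts, the identity count num_ids, and the use of a separate reduction layer per part (in the style of PCB) are illustrative assumptions rather than details confirmed by the paper.

```python
# A sketch of the part-based feature embedding head described above.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PartEmbedding(nn.Module):
    def __init__(self, n_parts=6, num_ids=500):
        super().__init__()
        self.n_parts = n_parts
        # 1x1 conv + BatchNorm reduce each 2048-d part feature to 256-d.
        self.reduce = nn.ModuleList([
            nn.Sequential(nn.Conv2d(2048, 256, kernel_size=1),
                          nn.BatchNorm2d(256))
            for _ in range(n_parts)
        ])
        # Per-part classifiers; their parameters are shared by both
        # modalities, since the same head processes f_R and f_FR.
        self.classifiers = nn.ModuleList([
            nn.Linear(256, num_ids) for _ in range(n_parts)
        ])

    def forward(self, fmap):
        # fmap: (B, 2048, H, W) feature map from the two-branch extractor.
        strips = fmap.chunk(self.n_parts, dim=2)  # n horizontal parts
        feats, logits = [], []
        for k, strip in enumerate(strips):
            pooled = F.adaptive_avg_pool2d(strip, 1)  # (B, 2048, 1, 1)
            f = self.reduce[k](pooled).flatten(1)     # (B, 256)
            feats.append(f)                           # retrieval feature
            logits.append(self.classifiers[k](f))     # ID prediction
        return feats, logits
```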
3.2. Design of Loss Function
The loss is designed with the following two considerations: (1) a cross-modality constraint for the large inter-modality discrepancy, whose core idea is that the distance between different IDs across modalities should be greater than the distance between the same ID across modalities, and the distance between different IDs within the same modality should likewise be greater than the distance between the same ID across modalities; (2) an intra-modality constraint, i.e., an identity classification loss, to distinguish different samples within the same modality.
To handle the feature differences both within a modality and across modalities, the loss function is designed as follows. As shown in Figure 2, where the same color represents the same modality and the same border shape represents the same ID, the cross-modality constraint requires that the distance between different IDs within the same modality be greater than the distance between the same ID across modalities, i.e., $d(f_v^i, f_v^j) > d(f_v^i, f_t^i)$; similarly, the distance between different IDs across modalities should be greater than the distance between the same ID across modalities, i.e., $d(f_v^i, f_t^j) > d(f_v^i, f_t^i)$. The intra-modality constraint objective is to distinguish different IDs under the same modality.
In terms of objective function construction, a multi-objective joint optimization is performed as shown in Equation (1), including the identity loss $\mathcal{L}_{id}$ for sample identity discrimination within the same modality and the cross-modality constraint $\mathcal{L}_{cm}$:

$$\mathcal{L} = \mathcal{L}_{id} + \mathcal{L}_{cm} \qquad (1)$$
where $\mathcal{L}_{id} = -\frac{1}{n}\sum_{i=1}^{n} \log p(y_i \mid x_i)$, $n$ denotes the number of samples in each training batch, and $p(y_i \mid x_i)$ denotes the predicted probability, obtained by softmax classification, that the input image $x_i$ with class label $y_i$ is recognized as class $y_i$.
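As a minimal sketch, $\mathcal{L}_{id}$ is the standard averaged softmax cross-entropy, which PyTorch computes directly; the function name identity_loss is illustrative.

```python
# The identity loss L_id in Equation (1): F.cross_entropy applies softmax
# internally and averages over the batch, i.e. -(1/n) * sum_i log p(y_i | x_i).
import torch
import torch.nn.functional as F

def identity_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # logits: (n, num_ids) class scores; labels: (n,) identity labels y_i.
    return F.cross_entropy(logits, labels)
```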
$\mathcal{L}_{cm}$ in turn includes the constraint between the distance of the same ID across modalities and the distance of different IDs within the same modality, as well as the constraint between the distance of the same ID across modalities and the distance of different IDs across modalities:

$$\mathcal{L}_{cm} = \sum_{i \neq j} \max\big(0,\, p + d(f_v^i, f_t^i) - d(f_v^i, f_v^j)\big) + \sum_{i \neq j} \max\big(0,\, p + d(f_v^i, f_t^i) - d(f_v^i, f_t^j)\big) \qquad (2)$$
where $v$ denotes visible data, $t$ denotes infrared data, $i$ and $j$ denote sample IDs, $d(\cdot,\cdot)$ is the feature distance, and $p$ is a predefined threshold.
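For illustration, the sketch below implements the margin form of $\mathcal{L}_{cm}$ as reconstructed in Equation (2); it assumes batches in which the $i$-th visible and infrared samples share the same identity, and the function name cross_modality_loss and the default threshold p = 0.3 are illustrative, not from the paper.

```python
# A hedged sketch of the cross-modality constraint: two hinge terms with
# threshold p, pushing the same-ID cross-modality distance below both the
# intra-modality and cross-modality different-ID distances.
import torch
import torch.nn.functional as F

def cross_modality_loss(f_v: torch.Tensor, f_t: torch.Tensor,
                        ids: torch.Tensor, p: float = 0.3) -> torch.Tensor:
    # f_v, f_t: (B, D) visible / infrared features; ids: (B,) identity labels.
    # Assumes f_v[i] and f_t[i] belong to the same person.
    d_vt = torch.cdist(f_v, f_t)                 # d(f_v^i, f_t^j), shape (B, B)
    d_vv = torch.cdist(f_v, f_v)                 # d(f_v^i, f_v^j), shape (B, B)
    d_pos = d_vt.diagonal().unsqueeze(1)         # same-ID cross-modality d(f_v^i, f_t^i)
    diff = ids.unsqueeze(0) != ids.unsqueeze(1)  # mask of different-ID pairs (i, j)
    # Averaged (rather than summed) over the different-ID pairs for scale stability.
    l_cm1 = F.relu(p + d_pos - d_vv)[diff].mean()  # vs. intra-modality negatives
    l_cm2 = F.relu(p + d_pos - d_vt)[diff].mean()  # vs. cross-modality negatives
    return l_cm1 + l_cm2
```

The total objective of Equation (1) is then simply the sum identity_loss(logits, labels) + cross_modality_loss(f_v, f_t, ids).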