4.1. Datasets
To evaluate the performance of the MFF model, we conduct experiments on three datasets, i.e., Market-1501 [33], DukeMTMC-reID [6] and CUHK03 [34]. Each person re-identification dataset is divided into a training set, a verification set, a query set (Query) and a gallery set (Gallery). In our experiments, the network model is trained on the training set. We then calculate the similarity between features extracted from Query and Gallery images, which is used to find images in Gallery that match a given Query pedestrian. The pedestrian images in Gallery are ranked according to the similarity of their image features, as shown in Figure 5.
The Market-1501 [33] dataset includes 1501 identities and 32,668 detected pedestrian bounding boxes captured under six camera viewpoints. Each pedestrian in this dataset appears in at least two camera viewpoints. The training set consists of 751 identities, with an average of 17.2 training images per identity, and the test set is composed of 19,732 images of 750 identities. The pedestrian bounding boxes in the gallery are detected by DPM [35]. Here, we use mean Average Precision (mAP) to evaluate person re-identification algorithms.
The DukeMTMC-reID [6] dataset consists of 36,411 images of 1404 identities, collected by eight cameras, with one image sampled every 120 frames of video. The dataset is composed of 16,552 training images, 2228 query images and 17,661 gallery images. Half of the identities are randomly sampled as the training set and the rest serve as the test set. DukeMTMC-reID offers human-labeled bounding boxes.
The CUHK03 [34] dataset is composed of 13,164 images of 1467 identities, each identity captured by two cameras. Bounding boxes are provided in two ways: automatically detected boxes, as in the Market-1501 dataset, and hand-labeled boxes; we use both kinds in this paper. Throughout the experiments, we evaluate the single-query setting and adopt the new test protocol proposed in [36], which is similar to that of Market-1501. Under the new protocol, CUHK03 is divided into a training set of 767 pedestrians and a test set of 700 pedestrians. For each identity, a randomly selected image is used as the query while the rest serve as the gallery, so each pedestrian has multiple ground truths in the gallery.
The detailed information about these datasets is summarized in Table 2. These three widely used person re-identification datasets present many challenges, such as misalignment, low resolution, viewpoint variation and background clutter. In addition, Figure 6 shows some image samples from these datasets.
4.2. Evaluation Metrics
For each query image, we merge the five feature vectors into one and calculate the Euclidean distance between the query image and each pedestrian image in the gallery. We arrange the gallery images in ascending order of Euclidean distance, so that the higher an image ranks, the more similar it is to the query image, and we use the Cumulative Match Characteristic (CMC) curve to show the performance. In terms of performance measurement, we use Rank-1 accuracy and mean Average Precision (mAP).
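As a minimal sketch of this retrieval step (our illustration, not the authors' released code; the array layout is an assumption), assuming the five feature vectors are NumPy arrays and each gallery image already has one fused descriptor:

```python
import numpy as np

def rank_gallery(query_feats, gallery_feats):
    """Rank gallery images by Euclidean distance to one query.

    query_feats:   list of five 1-D feature vectors for the query image
                   (hypothetical layout; see lead-in).
    gallery_feats: (N, D) array with one fused descriptor per gallery image,
                   where D equals the total length of the five vectors.
    Returns gallery indices ordered from most to least similar.
    """
    # Merge the five feature vectors into a single descriptor.
    q = np.concatenate(query_feats)
    # Euclidean distance between the query and every gallery image.
    dists = np.linalg.norm(gallery_feats - q, axis=1)
    # Ascending distance corresponds to descending similarity.
    return np.argsort(dists)
```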
Mean Average Precision (mAP) is an important evaluation metric for person re-identification, and precision and recall are its building blocks. Precision is the ability of a model to identify only the relevant objects, while recall is the ability of a model to find all the relevant cases. They are expressed as follows:
$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN},$$
where TP, FP and FN denote the numbers of true positives, false positives and false negatives, respectively.
Average precision (AP) is the mean of the highest precision achieved at the different recall levels, which is expressed as follows:
$$AP = \frac{1}{|R|} \sum_{r \in R} \max_{\tilde{r} \ge r} \mathrm{Precision}(\tilde{r}),$$
where $R$ is the set of recall levels and $\mathrm{Precision}(\tilde{r})$ is the precision at recall $\tilde{r}$.
mAP is the average of the AP values over all queries, which is expressed as follows:
$$mAP = \frac{1}{Q} \sum_{q=1}^{Q} AP_q,$$
where $Q$ is the number of query images and $AP_q$ is the AP of the $q$-th query.
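As an illustrative sketch (ours, not the paper's evaluation code), assuming each query is represented by a binary vector marking which ranked gallery images share its identity, the interpolated AP above and the resulting mAP could be computed as:

```python
import numpy as np

def average_precision(matches):
    """AP for one query, following the interpolated definition above.

    matches: 1-D binary array over the ranked gallery (1 = same identity).
    Assumes the query has at least one ground truth in the gallery.
    """
    ranks = np.arange(1, len(matches) + 1)
    hits = np.cumsum(matches)
    # Precision at each position where a ground truth is retrieved.
    prec_at_hits = hits[matches == 1] / ranks[matches == 1]
    # Interpolate: take the highest precision at this recall level or beyond.
    interp = np.maximum.accumulate(prec_at_hits[::-1])[::-1]
    return interp.mean()

def mean_average_precision(all_matches):
    """mAP: average AP over all queries."""
    return np.mean([average_precision(m) for m in all_matches])

# Toy usage: ranked match vectors for two queries.
print(mean_average_precision([np.array([1, 0, 1, 0]), np.array([0, 1, 1])]))
```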
4.3. Comparison on Market-1501
A comparison of the proposed method with several recent state-of-the-art person re-identification methods on Market-1501 is detailed in Table 3. The compared methods include the hand-crafted bag-of-words model BoW+KISSME [33], SVDNet [34], which uses global features extracted by a deep learning model, and the part-aligned representation PAR [17], which uses part features extracted by a deep learning model. We can observe from Table 3 that the proposed MFF model achieves the best results in Rank-1, Rank-5 and Rank-10 accuracy. In this experiment, we use mean Average Precision (mAP) as an evaluation metric for person re-identification. The MFF model achieves 87.9% mAP on Market-1501, which is 18.8% higher than the best compared method. In addition, the MFF model achieves a Rank-1 accuracy of 96.0%, 11.1% higher than the best compared method, and a Rank-5 accuracy of 98.7%, 4.5% better than the best compared method. We attribute this to the MFF model fusing global and local features; adding PMN when extracting local features also helps to obtain better results.
4.4. Comparison on CUHK03
A comparison of the proposed method with state-of-the-art methods on CUHK03 is detailed in Table 4 and Table 5. We conduct experiments on the CUHK03-detected and CUHK03-labeled datasets, respectively, using only the single-query setting. Our model is compared with many methods, such as LOMO+KISSME [6], which uses a horizontal occurrence model, the pedestrian alignment network PAN [41], and HA-CNN [25], which uses a harmonious attention network. In this experiment, we use Rank-1 accuracy and mAP as evaluation metrics. As shown in Table 4, the MFF model achieves a Rank-1 accuracy of 67.4% on CUHK03-detected, 0.6% higher than the best compared result, and its mAP reaches 66.7%, 0.7% better than the best compared result. On CUHK03-labeled, we surpass MGN by 1.6% in Rank-1 accuracy under the single-query setting, and the MFF model reaches an mAP of 68.8%. Compared with other deep learning methods, our model is more discriminative, which we attribute to extracting both a global feature and a feature for each part. We believe that local feature extraction benefits from PMN, because PMN extracts low-to-high level features more comprehensively.
4.5. Comparison on DukeMTMC-reID
We compare the MFF model with state-of-the-art models on DukeMTMC-reID; comparative details are shown in Table 6. The methods in Table 6 extract features in different ways: for example, LOMO+KISSME [6] extracts local features with a horizontal occurrence model, whereas PAN [41] and SVDNet [34] use deep learning methods to extract global features.
We evaluate the MFF model on DukeMTMC-reID under the single-query setting, and its significant advantage can be observed in Table 6. Its Rank-1 accuracy reaches 86.0%, the highest among the compared methods. We also use mAP as an evaluation metric; the MFF model reaches 76.1% mAP. Extracting both local and global features enriches the features available when searching for target pedestrians, and adding classifiers at different levels of ResNet50, which helps extract part features, further increases the accuracy of our model. In addition, we visualize the top-10 ranking results on DukeMTMC-reID for some randomly selected query pedestrian images in Figure 7.
4.6. Effectiveness of PMN
We evaluate the MFF model on three classic datasets: Market-1501, CUHK03 and DukeMTMC-reID. PMN is proposed to extract local features from low-to-high level layers. To further explore the influence of PMN, we conduct two experiments on each dataset. First, we remove the PMN structure and fuse only the local and global features extracted from the entire backbone network; we denote this structure without PMN as GLB, as in Figure 8. Experiments on GLB directly test the performance of our model without the PMN structure. We then train the MFF model on the three datasets and report the performance in Figure 8. The difference between MFF and GLB is that MFF additionally fuses low-to-high level local features.
We train MFF and GLB on the three datasets separately and use Rank-1 to Rank-10 accuracy as the evaluation standard. In Figure 8, the comparison of the two models not only shows the improvement obtained by fusing low-to-high level local features, but also shows that the improvement brought by PMN differs across datasets. The PMN structure has the most significant effect on CUHK03, especially on CUHK03-labeled data, whereas its effect on Market-1501 is less significant. Figure 8 shows that the rank accuracy of MFF is higher than that of GLB on all three datasets, which demonstrates that the low-to-high level local features extracted by the PMN structure have a positive impact on person re-identification.
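A simplified conceptual sketch of the difference between the two variants (plain concatenation and the tensor shapes are our assumptions, not necessarily the paper's exact fusion operator):

```python
import torch

def fuse_glb(global_feat, part_feats):
    """GLB: concatenate the global feature with the part features
    taken from the final backbone stage only."""
    return torch.cat([global_feat, *part_feats], dim=1)

def fuse_mff(global_feat, part_feats, pmn_feats):
    """MFF: additionally concatenate the low-to-high level local
    features produced by the PMN branches."""
    return torch.cat([global_feat, *part_feats, *pmn_feats], dim=1)

# Toy shapes: batch of 4, 256-D global feature, six 256-D part features,
# three 256-D PMN branch features (all hypothetical dimensions).
g = torch.randn(4, 256)
parts = [torch.randn(4, 256) for _ in range(6)]
pmn = [torch.randn(4, 256) for _ in range(3)]
assert fuse_mff(g, parts, pmn).shape == (4, 256 * 10)
```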
4.7. Influence of the Number of Parts
In this paper, we extract local features by dividing a pedestrian image into several parts; a visualization of these parts is shown in Figure 9. Intuitively, the granularity of the part features affects the results. When the number of parts is one, the learned feature is a global feature, and as the number of parts increases, the retrieval accuracy increases. However, accuracy does not always grow with the number of parts, as shown in Figure 10. The Rank-1 accuracy on the three datasets shows that when the number of parts increases to eight, the performance drops dramatically; over-dividing the image actually compromises the extraction of local features. Therefore, we use six parts in our experiments.
Discussion: we divide the pedestrian image into six parts to obtain the best results, taking into account the different proportions and attributes of body parts. The image is divided into six parts according to the positions of the elbow joints, crotch, knee joints, etc., as shown in Figure 2. Because human motion is constrained at the joints, large movements are mostly confined within these six parts, so the local features of each part remain highly recognizable even when a pedestrian engages in a wide range of activities. In addition, we also consider the effect of attributes on the results. The relevant attributes in pedestrian images include clothing category (dress, shorts, etc.), clothing color, hat, hair, etc. Dividing the image into six parts also strengthens the recognition of the attribute features of each part.
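As a minimal sketch of a uniform six-part division (a PCB-style split over the backbone feature map; the pooling choice is our assumption, not necessarily the paper's exact implementation):

```python
import torch
import torch.nn.functional as F

def split_into_parts(feat_map, num_parts=6):
    """Split a backbone feature map into horizontal stripes and
    pool each stripe into one part-level feature vector.

    feat_map: (batch, channels, height, width) tensor, e.g. the
              output of the last ResNet50 stage.
    Returns a list of num_parts tensors of shape (batch, channels).
    """
    # Adaptive pooling collapses the width and leaves one row per part.
    pooled = F.adaptive_avg_pool2d(feat_map, (num_parts, 1))
    # Each row of the pooled map is one local (part) feature.
    return [pooled[:, :, i, 0] for i in range(num_parts)]

# Example: a dummy ResNet50 feature map (2048 channels, 24x8 spatial size).
parts = split_into_parts(torch.randn(4, 2048, 24, 8))
assert len(parts) == 6 and parts[0].shape == (4, 2048)
```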
4.8. Influence of the PMN Branches
Low-to-high level local features are extracted by Branch-1 to Branch-3, as in Figure 3. To verify the effectiveness of the different branches in PMN, we remove branches of PMN in several ways and compare the experimental results in Figure 11. The branches are removed as follows: (1) only Branch-1 is removed; (2) Branch-1 and Branch-2 are both removed; (3) the whole PMN structure (Branch-1 to Branch-3) is removed (GLB); (4) no branches are removed (MFF). In Figure 11, we can observe that the MFF model achieves the highest rank precision. Removing Branch-1 means that low-level local features are not extracted, which reduces the rank accuracy; likewise, the more PMN branches are removed, the lower the rank accuracy of the model. This experiment proves that sampling local features from different depths is effective for MFF.
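The four ablation settings can be viewed as a simple configuration over which branch outputs are kept; a trivial sketch (the branch names are hypothetical):

```python
def select_pmn_features(branch_feats, removed=()):
    """Collect PMN branch outputs while skipping ablated branches.

    branch_feats: dict such as {"branch1": f1, "branch2": f2, "branch3": f3}.
    removed:      branch names to ablate, e.g. ("branch1",) for setting (1),
                  ("branch1", "branch2") for setting (2), all three names
                  for GLB, and the empty tuple for the full MFF model.
    """
    return [f for name, f in branch_feats.items() if name not in removed]
```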
In future work, we may explore PMN variants with different network structures for feature extraction. In addition, the PMN branches could be applied to face recognition, extracting facial features from different network depths to learn more discriminative features. PMN has a wide range of potential applications and could also be adopted in other image recognition networks.