1. Introduction
Person re-identification is a typical computer vision problem that aims at matching pedestrians across disjoint camera views. It has attracted considerable research interest due to its significant application potential in areas such as visual recognition and surveillance [1,2]. A central task in person re-identification is to learn generic and robust feature representations of people. Recently, methods based on deep learning, which learn feature representations directly from the task, have shown significant improvement over hand-crafted feature extractors. State-of-the-art CNN architectures, such as the Inception network [3,4,5] and ResNet [6,7], are applied to learn feature representations for person re-identification.
Person re-identification is a challenging task due to the misalignment of body parts caused by pose variation, background clutter, detection errors, camera viewpoint variation, different accessories and occlusion.
Figure 1 illustrates some examples in person re-identification tasks. There are images of two persons in Figure 1, with one person in each row. Images in the top row are from the Market-1501 dataset [8], and those in the bottom row are from the DukeMTMC-reID dataset [9]. In the top row, poses, background, detection, camera viewpoints and accessories are quite different. In the bottom row, pose variation, background clutter, detection errors, camera viewpoint variation and occlusion also occur. Part misalignment occurs frequently and degrades the performance of person re-identification.
To solve this problem, many researchers have recently focused on person re-identification based on part alignment. Some methods divide the person image into stripes or grids to reduce the effects of part misalignment [7,10]. However, the division into grids or stripes is predefined and heuristic, so it cannot locate the parts precisely. Pose-based methods [5,11] employ a pose estimation model to infer the corresponding bounding boxes. However, missing parts are unavoidable, and they prevent the convolutional neural network from working properly.
This paper focuses on the problem of body part misalignment. It proposes a human parts semantic segmentation aware representation learning method for person re-identification. We employ a semantic segmentation network to infer the corresponding bounding boxes, and propose a DropParts method to solve the part missing problem. Experiments on standard benchmark datasets show the effectiveness of the proposed pipeline. The contributions of this paper are as follows:
(1) We design a four-branch convolutional neural network to deal with the part misalignment problem. The four-branch CNN learns a person's global appearance features together with the features of three local body parts. The bounding boxes of the three body parts are inferred from human parts semantic segmentation results, which are learned with the popular U_Net [12] semantic segmentation network.
(2) We propose a DropParts method to solve the part missing problem, with which the local features are weighted according to the parts absence vector and fused with the global feature. The DropParts method makes the four-branch convolutional neural network work properly when part missing occurs. It also improves the performance of person re-identification.
2. Related Work
In this section, we present a brief review of work on feature extraction and part alignment for person re-identification.
Early studies employed hand-crafted feature extractors, such as color histograms [13], Scale-Invariant Feature Transform (SIFT) [14], Local Binary Patterns (LBP) features [15], Bag of Words (BoW) [8] and Local Maximal Occurrence (LOMO) [16], for person representation. Recently, methods based on deep learning, which learn feature representations directly from the task, have shown significant improvement over hand-crafted feature extractors. Popular CNN architectures, such as the Inception network [3,4,5] and ResNet [6,7], are applied to learn feature representations for person re-identification. Additionally, different loss functions, such as Softmax loss [17], Siamese loss [4,18], Cluster loss [19], Triplet loss [20] and their combinations [21], are used to improve discriminative feature learning in person re-identification tasks. The Softmax loss function [17] is the loss function commonly used in recognition tasks.
Many researchers focus on person re-identification based on part alignment [7,10,17,22,23]. Early works divide the person image into stripes or grids to reduce the effects of part misalignment. The method in [10] divides the person image into three horizontal stripes and extracts CNN features from each stripe. The features are then concatenated and fused with a fully connected layer to represent a person image. Similarly, the DeepReID method [17] divides the person image into horizontal stripes and carries out patch matching within each stripe. SpindleNet [22], on the other hand, brings human body structure information into the person re-identification pipeline to help align the body part features of images. To represent a person image, features of different semantic levels are merged by a tree-structured fusion network based on human body regions, guided by multi-stage feature decomposition and tree-structured competitive feature fusion. The IDLA method [23] captures local relationships between the two input images on the basis of mid-level features of each input image, and computes a high-level summary of the outputs of this layer with a layer of patch summary features, which are then spatially integrated in subsequent layers. More stripe- and grid-based methods can be found in [7]. Although stripe- and grid-based methods reduce the risk of part misalignment, the division into grids or stripes is predefined and heuristic, so it cannot locate parts precisely.
Pose-based person re-identification methods leverage external cues from human pose estimation. The method in [11] incorporates a simple cue of the person's coarse pose (i.e., the captured view with respect to the camera) and the fine body pose (i.e., joint locations) to learn a discriminative representation of the person image. The PDC method [5] leverages human part cues to alleviate pose variations and learns feature representations from both the global image and different local parts. To match the features from the global human body and the local body parts, a pose-driven feature weighting sub-network is further designed to learn adaptive feature fusions. Pose-based methods thus rely on human pose estimation to infer the location of body parts. However, missing parts are unavoidable, and they prevent the convolutional neural network from working properly. It is also hard to find the right body part in a crowd, because an image may contain several parts with the same semantic label.
The attention mechanism has a large impact on neural computation: it selects the most pertinent pieces of information and focuses on specific parts of the visual input to compute adequate responses [24,25,26,27,28,29]. The method in [25] decomposes the human body into regions following learned attention maps that are sensitive to person re-identification. It then computes representations over the regions and aggregates the similarities computed between the corresponding regions of a pair of probe and gallery images as the overall matching score. The PersonNet method [26] learns attention maps at different scales for each module and applies the attention maps to different layers of the network. Finally, it learns features by fusing three attention modules with Softmax loss. Moreover, the HydraPlus-Net method [27] has several local feature extraction branches that learn a set of complementary attention maps, in which hard attention is used for the local branches and soft attention for the global branch. More methods based on the attention mechanism can be found in [28,29]. Methods based on the attention mechanism highlight the important regions of person images, but they also multiply the number of feature maps and bring a risk of over-fitting.
We use a semantic segmentation network to infer human body parts in this paper. Owing to the ensemble effect of the per-pixel labels, bounding boxes inferred from the semantic segmentation map are stable. We propose a DropParts method to solve the part missing problem; the method makes the four-branch convolutional neural network work properly when part missing occurs.
3. The Proposed Method
3.1. Overview of the Proposed Method
Given a probe person image, person re-identification retrieves the most similar persons from the gallery set according to the distance between appearance representations. Our objective is to learn generic and robust feature representations of persons.
Figure 2 illustrates the architecture of the proposed parts-aware person re-identification network, which consists of four CNN branches that learn feature maps of the person's appearance and of three body parts. The four feature maps are fused into an image descriptor. Three local patches (head, torso and lower-body) are inferred from a semantic segmentation map. The four image patches, namely the whole person image and the three part patches, are resized to fixed sizes and then input into the proposed four-branch network. Each branch learns the representation of one patch, and the branch outputs are finally fused by a concatenation layer and a fully connected layer. A softmax layer is used to classify the person ID.
INPUT: Given an image $I \in \mathbb{R}^{M \times N}$ and its semantic segmentation map $S \in \{0,1,2,3\}^{M \times N}$, where the semantic labels 0, 1, 2 and 3 represent the background, head, torso and lower-body pixels of the person image, respectively, and $M$ and $N$ are the height and width of the person image, respectively.
NETWORK: The bounding boxes $\{BB_i\}_{i=1,2,3}$ of the three local parts are determined by the minimum enclosing rectangles of the pixels with the same semantic label. The corresponding image patches are denoted as $\{P_i\}_{i=1,2,3}$ ($P_i$ is a null matrix if its corresponding part is missing). The person image $I$ and the three local part patches $\{P_1, P_2, P_3\}$ go through four network branches $\{CNN_i\}_{i=1,2,3,4}$, with each image passing through one branch. The feature vectors of the four network branches are $CNN_1(I)$, $CNN_2(P_1)$, $CNN_3(P_2)$ and $CNN_4(P_3)$, respectively, where $CNN_i(\cdot)$ denotes the mapping computed by the $i$-th branch.
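The minimum enclosing rectangle of each part can be computed directly from the label map. The following is a minimal NumPy sketch rather than the authors' implementation; the function name `part_bounding_boxes` and the inclusive `(top, left, bottom, right)` box convention are assumptions made for illustration.

```python
import numpy as np

def part_bounding_boxes(seg_map, part_labels=(1, 2, 3)):
    """Return {label: (top, left, bottom, right)} with inclusive coordinates.

    A label with no pixels in seg_map is reported as None (missing part).
    Labels follow the text: 1 = head, 2 = torso, 3 = lower-body.
    """
    boxes = {}
    for label in part_labels:
        rows, cols = np.nonzero(seg_map == label)
        if rows.size == 0:
            boxes[label] = None  # part is missing
        else:
            boxes[label] = (int(rows.min()), int(cols.min()),
                            int(rows.max()), int(cols.max()))
    return boxes

# Toy 8 x 4 segmentation map with a head and a torso but no lower-body.
seg = np.zeros((8, 4), dtype=np.int64)
seg[0:2, 1:3] = 1
seg[2:5, 0:4] = 2
boxes = part_bounding_boxes(seg)
```

In this toy map the lower-body label never occurs, so its entry is `None`, which corresponds to the null patch used for a missing part.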
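This paper uses a 3-dimensional vector to represent the absence of the three parts:

$$PA = [\alpha_1, \alpha_2, \alpha_3], \qquad (1)$$

where $\alpha_i = 0$ if part $i$ is missing and $\alpha_i = 1$ otherwise, $i = 1, 2, 3$.
The proposed DropParts method (detailed in Section 3.3) maps the parts absence vector $PA$ to another 3-dimensional vector $\widehat{PA}$:

$$\widehat{PA} = [\hat{\alpha}_1, \hat{\alpha}_2, \hat{\alpha}_3] = \mathrm{DropParts}(PA). \qquad (2)$$

The part feature vectors are scaled and concatenated with the whole-image feature vector to obtain a fusion vector:

$$F_{concate} = \left[ CNN_1(I),\ \hat{\alpha}_1 N(CNN_2(P_1)),\ \hat{\alpha}_2 N(CNN_3(P_2)),\ \hat{\alpha}_3 N(CNN_4(P_3)) \right], \qquad (3)$$

where $N(\cdot)$ is a normalization operator. This paper uses the batch normalization method [32] to normalize the features of each part branch.
A fully connected layer, which functions as metric learning [10,30], is then used to fuse the features of the whole person image and the three body part patches:

$$F_{fuse} = W F_{concate} + b, \qquad (4)$$

where $W$ and $b$ are the weight matrix and the bias vector of the fully connected layer.
The objective of this paper is to learn a stable and discriminative person representation.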
OUTPUT: Finally, a softmax classifier [17] is used to discriminate different person IDs according to their fused CNN features.
3.2. Person Parts Localization and Parts Alignment
Semantic segmentation associates each pixel of an image with a class label. Owing to the ensemble effect of the per-pixel labels, bounding boxes inferred from a semantic segmentation map are more stable and accurate than those produced by detection methods. This paper uses the semantic segmentation map to find the bounding boxes of human body parts.
U_Net [12] is a popular semantic segmentation method that performs well on biomedical image segmentation. The U_Net architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization. We make three modifications to adapt it to person parts segmentation. First, we reduce the number of pooling operators because of the small size of person images. Second, we add two residual structures to compensate for the depth reduction. Third, we do not reduce the size of the feature maps by 2 when passing through convolutional layers; as a result, the output segmentation maps have the same size as the input images.
Figure 3 illustrates the U_Net structure we used. We use its segmentation maps to find the bounding boxes of human body parts. Person images are resized to 192 × 88 and pass through the U_Net network. The size of the output semantic segmentation maps is also 192 × 88, and then the segmentation maps are resized to the same size as the original person images.
Figure 4 illustrates some examples of part segmentation by super-pixels.
Bounding boxes of person parts are determined from the parts semantic segmentation map. For a stable feature extractor, two points need to be considered: (1) large scale differences make the extracted features unstable; (2) large aspect ratio changes lead to part misalignment. This paper therefore discards two kinds of part regions: (1) part regions whose area is below 5‰ of the corresponding person image; (2) part regions whose aspect ratio is outside a reasonable range. We set the reasonable ranges to [0.75, 1.33] for the head region, [1, 3] for the torso region and [1, 3] for the lower-body region. We crop the person image with the minimum circumscribed rectangle of each part if the part is complete.
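The two rejection rules can be written as a small predicate. This is an illustrative sketch only: the text does not state whether the aspect ratio is height/width or width/height, nor whether the area is that of the bounding box or of the labelled pixels, so the sketch assumes height/width and bounding-box area. The names `ASPECT_RANGES`, `MIN_AREA_FRACTION` and `is_valid_part` are hypothetical.

```python
# Aspect-ratio ranges from the text, interpreted here as height / width (assumption).
ASPECT_RANGES = {
    "head": (0.75, 1.33),
    "torso": (1.0, 3.0),
    "lower_body": (1.0, 3.0),
}
MIN_AREA_FRACTION = 0.005  # 5 per mille of the person image area

def is_valid_part(part, box_h, box_w, img_h, img_w):
    """Return True if the part box passes both the area and aspect-ratio checks."""
    if box_h <= 0 or box_w <= 0:
        return False
    # Rule 1: discard boxes that are too small relative to the person image.
    if (box_h * box_w) / (img_h * img_w) < MIN_AREA_FRACTION:
        return False
    # Rule 2: discard boxes whose aspect ratio is outside the allowed range.
    low, high = ASPECT_RANGES[part]
    return low <= box_h / box_w <= high
```

A part that fails either check would be treated as missing, which is the case handled by the DropParts method in Section 3.3.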
After parts localization, the person image and three local patches are propagated forward through the proposed four-branch network, which completes parts alignment. An example is illustrated in Figure 4. Figure 4a is an example of parts misalignment: the red rectangle covers the head region in the left image, while the same location in the right image is background. We locate the three body parts and combine them with the whole image as the input of the proposed four-branch CNN. Figure 4b,c illustrates the two inputs of the proposed network, which correspond to the two images of Figure 4a. As seen from Figure 4b,c, the input patches are well aligned.
3.3. Part Missing Representation and DropParts Method
Part missing is another problem of person re-identification in complex environments; it happens under occlusion or when a part region is too small. It degrades the performance of person re-identification. This paper proposes the DropParts method to solve the part missing problem.
A normal feature fusion and the subsequent metric learning are formulated as follows:

$$F_{concate} = \left[ CNN_1(I),\ CNN_2(P_1),\ CNN_3(P_2),\ CNN_4(P_3) \right], \qquad (5)$$

$$F_{fuse} = W F_{concate} + b, \qquad (6)$$

where $W$ and $b$ are the weights and biases of the fully connected layer.
In Equation (5), the feature vectors of the whole person image and of the part patches may be either normalized or left unnormalized, because the subsequent metric learning layer in Equation (6) will reweight them.
When a part is missing, the usual approach sets its corresponding patch or feature to a zero matrix or a zero vector. However, this risks unstable training when all the values in a large block are zero. The norms of the feature fusion vector with and without zero blocks are quite different; as a result, the parameters $W$ and $b$ cannot meet the demands of both the part-missing and the part-present cases, and a compromise solution degrades the performance.
The key is to keep the norm of the feature fusion vector $F_{concate}$ stable when part missing happens. In this paper, inspired by Dropout [31], we propose a DropParts method to deal with the parts missing problem.
Dropout [31] is a technique to deal with the over-fitting problem of deep neural networks with a large number of parameters. For example, the $(l+1)$-th original hidden layer is formulated as:

$$z_i^{(l+1)} = w_i^{(l+1)} y^{(l)} + b_i^{(l+1)}, \qquad (7)$$

$$y_i^{(l+1)} = f\left(z_i^{(l+1)}\right), \qquad (8)$$

where $z_i^{(l+1)}$ denotes the $i$-th node of layer $l+1$, $y_i^{(l+1)}$ denotes the $i$-th activation value of layer $l+1$, $w_i^{(l+1)}$ and $b_i^{(l+1)}$ are the weights and biases of layer $l+1$, respectively, and $f(\cdot)$ is the activation function.
The key idea of Dropout is to randomly drop units (along with their connections) with probability $p$ from the neural network during training. During training, Dropout samples from an exponential number of different thinned networks. With Dropout, the $(l+1)$-th hidden layer is formulated as:

$$r_j^{(l)} \sim \mathrm{Bernoulli}(1 - p), \qquad (9)$$

$$\tilde{y}^{(l)} = r^{(l)} \odot y^{(l)}, \qquad (10)$$

$$z_i^{(l+1)} = w_i^{(l+1)} \tilde{y}^{(l)} + b_i^{(l+1)}, \qquad (11)$$

$$y_i^{(l+1)} = f\left(z_i^{(l+1)}\right). \qquad (12)$$

At test time, the effect of averaging the predictions of all these thinned networks is approximated by simply using a single un-thinned network that has smaller weights.
Dropout significantly reduces the risk of over-fitting and gives major improvements over other regularization methods.
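The training and test behaviour of Dropout described above can be illustrated with a single fully connected layer. The sketch below is a generic NumPy illustration of classic (non-inverted) Dropout, not code from this paper; `dropout_forward` is a hypothetical name, `p_drop` is the drop probability $p$, and ReLU is assumed as the activation $f$.

```python
import numpy as np

def dropout_forward(y, W, b, p_drop, rng=None, train=True):
    """Fully connected layer with classic Dropout applied to its input y."""
    if train:
        rng = np.random.default_rng() if rng is None else rng
        # Each input unit is kept with probability 1 - p_drop.
        mask = rng.random(y.shape) >= p_drop
        z = W @ (y * mask) + b
    else:
        # Test time: one un-thinned network whose weights are scaled down
        # by the keep probability.
        z = (W * (1.0 - p_drop)) @ y + b
    return np.maximum(z, 0.0)  # ReLU activation (assumption)

y = np.array([1.0, 2.0, 3.0])
W = np.ones((2, 3))
b = np.zeros(2)
test_out = dropout_forward(y, W, b, p_drop=0.5, train=False)
no_drop_out = dropout_forward(y, W, b, p_drop=0.0,
                              rng=np.random.default_rng(0), train=True)
```

With `p_drop = 0.5` the test-time weights are halved, so the test output is half of the output obtained when no unit is dropped.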
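In the proposed DropParts method, we formulate the feature fusion of the whole feature and the local part features as Equation (14):

$$F_{concate} = \left[ CNN_1(I),\ \frac{\alpha_1}{|PA|} N(CNN_2(P_1)),\ \frac{\alpha_2}{|PA|} N(CNN_3(P_2)),\ \frac{\alpha_3}{|PA|} N(CNN_4(P_3)) \right], \qquad (14)$$

where $|\cdot|$ is the L1 norm operator and $N(\cdot)$ is a normalization operator; this paper uses the batch normalization method [32] to normalize the features. Here, the normalization $N(\cdot)$ is important, because it maintains the stability of the L2-norm of the feature vectors. The parts absence vector $PA = [\alpha_1, \alpha_2, \alpha_3]$ is normalized too, by being divided by its L1 norm. After this, the norm of the feature fusion vector $F_{concate}$ is stable.
Then, the metric learning is:

$$F_{fuse} = W F_{concate} + b, \qquad (15)$$

where $W$ and $b$ are the weights and biases of the fully connected layer.
Part-missing samples are infrequent, which leads to an imbalanced sample problem. To solve this problem, during training, we randomly drop bins of the absence vector $PA$ and normalize it:

$$\widehat{PA} = \frac{r \odot PA}{|r \odot PA|}, \qquad (16)$$

where $r \in \{0, 1\}^3$ is a random binary mask.
Part missing can be regarded as an instance of DropParts during training. So, at test time, the fusion feature extractor uses the same parameters $W$ and $b$:

$$F_{fuse} = W F_{concate} + b. \qquad (17)$$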
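To make the DropParts fusion more concrete, here is a minimal NumPy sketch under stated assumptions: the absence vector PA holds 1 for a present part and 0 for a missing one, bins are dropped independently with a hypothetical probability `drop_prob` during training, and plain L2 normalization stands in for the batch normalization used in the paper. The function names are illustrative and do not come from the authors' code.

```python
import numpy as np

def l2_normalize(v, eps=1e-12):
    """Stand-in for the paper's batch normalization (assumption)."""
    return v / max(float(np.linalg.norm(v)), eps)

def dropparts_weights(pa, drop_prob=0.0, rng=None):
    """Optionally drop bins of PA at random, then divide by the L1 norm."""
    pa = np.asarray(pa, dtype=np.float64).copy()
    if drop_prob > 0.0:
        rng = np.random.default_rng() if rng is None else rng
        pa *= rng.random(pa.shape) >= drop_prob  # zero out dropped bins
    l1 = np.abs(pa).sum()
    return pa / l1 if l1 > 0 else pa  # all parts gone: weights stay zero

def fuse(global_feat, part_feats, weights):
    """Scale normalized part features and concatenate with the global feature."""
    scaled = [w * l2_normalize(f) for w, f in zip(weights, part_feats)]
    return np.concatenate([global_feat] + scaled)

# Torso is missing: PA = [1, 0, 1], so the two present parts get weight 0.5.
weights = dropparts_weights([1, 0, 1])
parts = [np.array([3.0, 4.0]), np.zeros(2), np.array([0.0, 2.0])]
fused = fuse(np.array([1.0, 0.0]), parts, weights)
```

Because the weights always sum to one when at least one part is present, the contribution of the part branches to the fused vector stays bounded whether or not a part is missing, which is the stability property the text argues for.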
4. Experiment
4.1. Network Structure and Experiment Settings
Any network can be used as the baseline of our proposed network. Taking the 34-layer ResNet [33] as an example, the architecture of our four-branch network and the feature map sizes (on the Market-1501 dataset [8]) of its input, hidden and output layers are shown in Table 1.
The person image size of the input layer (Branch01) is determined by the average aspect ratio of all images in the dataset, and the input sizes of Branch02, Branch03 and Branch04 are then set to width/2 × width/2, height/2 × width/2 and height/2 × width/2, respectively. The person image and part patches are resized to the input sizes of the corresponding CNN branches. In consideration of the small feature size of res4, we remove the res5 module in Branch02, Branch03 and Branch04. The pool5 layer is the result of global pooling of the preceding feature map. We apply our DropParts method to the pool5 feature maps of Branch02, Branch03 and Branch04 to get their scaled_pool5 feature maps, then concatenate them with pool5 of Branch01 to get the F_concate feature map. An inner product operator is used to map the 1280-dimensional F_concate layer to the 512-dimensional F_fuse layer. Finally, we use the Softmax loss function to train the model. When testing, we use the F_concate feature map, normalized by its L2-norm, as the feature of a person for person re-identification experiments. Euclidean distance is employed to measure the differences between person features.
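The test-time matching step (L2-normalized features compared by Euclidean distance) can be sketched as follows. This is an illustrative NumPy example with toy 2-dimensional features; `rank_gallery` is a hypothetical helper, not part of the authors' code.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Normalize each feature vector (last axis) to unit L2 norm."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)

def rank_gallery(query, gallery):
    """Return gallery indices sorted by Euclidean distance to the query."""
    q = l2_normalize(query)
    g = l2_normalize(gallery)
    dists = np.linalg.norm(g - q, axis=1)
    order = np.argsort(dists, kind="stable")
    return order, dists

query = np.array([1.0, 0.0])
gallery = np.array([[0.0, 2.0], [3.0, 0.0], [1.0, 1.0]])
order, dists = rank_gallery(query, gallery)
```

After normalization the second gallery vector points in the same direction as the query, so it is ranked first regardless of its original magnitude.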
Our CNN networks are trained on the Caffe framework [34] with a TITAN X GPU. We use stochastic gradient descent (SGD) [35] to perform weight updates. We start with a base learning rate of $\eta_0 = 0.01$ and gradually decrease it as the training progresses using a step policy: $\eta_i = \eta_0 \gamma^{\lfloor i / step\_size \rfloor}$, where γ = 0.0001, step_size = 10,000, and i is the current mini-batch iteration. We use a momentum of µ = 0.1 and a weight decay of λ = 0.0005.
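As a quick check of the schedule, the sketch below evaluates a step policy with the hyperparameters quoted above. It assumes the schedule follows Caffe's standard step rule, in which the base learning rate is multiplied by γ once every step_size iterations; the function name `step_learning_rate` is hypothetical.

```python
def step_learning_rate(iteration, base_lr=0.01, gamma=0.0001, step_size=10_000):
    """Step policy: base_lr * gamma ** floor(iteration / step_size)."""
    return base_lr * gamma ** (iteration // step_size)

lr_start = step_learning_rate(0)
lr_before_step = step_learning_rate(9_999)
lr_after_step = step_learning_rate(10_000)
```

Under this rule the rate stays at 0.01 for the first 10,000 iterations and then falls by a factor of γ.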
Augmenting the training data often leads to better generalization. We apply several basic kinds of data augmentation when training our networks: rotation, shifting, blurring, color jittering and flipping. For rotation, we rotate the image by a random angle between −30° and 30°. For shifting, we shift the image to the left, right, top or bottom by at most 5% of its width or height. For blurring, we blur the image with a 3 × 3, 5 × 5 or 7 × 7 Gaussian kernel. For color jittering, we change the brightness, saturation and contrast by at most 5% of their original values. For flipping, we flip the images horizontally with probability 0.5.
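Two of these augmentations, flipping and shifting, are simple enough to sketch with NumPy alone. The example below is illustrative rather than the authors' pipeline: the helper names are hypothetical, and filling the uncovered border with zeros after a shift is an assumption, since the text does not say how borders are handled.

```python
import numpy as np

def random_flip(img, rng, prob=0.5):
    """Flip the image horizontally with the given probability."""
    return img[:, ::-1] if rng.random() < prob else img

def random_shift(img, rng, max_frac=0.05):
    """Shift by at most max_frac of height/width; uncovered pixels become 0."""
    h, w = img.shape[:2]
    max_dy, max_dx = int(h * max_frac), int(w * max_frac)
    dy = int(rng.integers(-max_dy, max_dy + 1))
    dx = int(rng.integers(-max_dx, max_dx + 1))
    out = np.zeros_like(img)
    # Region of the source image that remains visible after the shift.
    src = img[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
    top, left = max(0, dy), max(0, dx)
    out[top:top + src.shape[0], left:left + src.shape[1]] = src
    return out

rng = np.random.default_rng(0)
img = np.arange(40 * 20).reshape(40, 20)
flipped = random_flip(img, rng, prob=1.0)
shifted = random_shift(img, rng, max_frac=0.05)
unshifted = random_shift(img, rng, max_frac=0.0)
```

Rotation, blurring and color jittering would normally be delegated to an image processing library and are omitted here.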
4.2. Modified U_Net Performance
First, we perform experiments on the public LIP dataset [36]. There are 20 semantic labels in the LIP dataset: background, hat, hair, glove, sunglasses, upper clothes, dress, coat, socks, pants, jumpsuits, scarf, skirt, face, left-arm, right-arm, left-leg, right-leg, left-shoe, and right-shoe. We change the number of outputs of the last layer in the U_Net architecture (Figure 3) from 4 to 20 to adapt it to the semantic segmentation task on the LIP dataset. Our proposed method is compared with current state-of-the-art methods, including SegNet [37], FCN-8s [38], DeepLabV2 [39], Attention [40], DeepLabV2 + SSL [36], Attention + SSL [36] and the standard U_Net [12]. From Table 2 it can be observed that the standard U_Net network [12] outperforms the state-of-the-art networks on the human semantic segmentation dataset LIP [36], and that our modified U_Net network outperforms the standard U_Net network by 0.23% in overall accuracy, 0.26% in mean accuracy and 0.35% in mean IoU.
We group the 19 non-background semantic labels of the LIP dataset into 3 labels: head (hat, hair, sunglasses, scarf, face), torso (glove, upper clothes, dress, coat, left-arm, right-arm) and lower-body (socks, pants, jumpsuits, skirt, left-leg, right-leg, left-shoe, right-shoe), and first train the modified U_Net network on the LIP dataset with the grouped labels. We then randomly chose 300 images of people from the training sets of Market-1501 [8], CUHK03 [17] and DukeMTMC-reID [9], and labelled them with the 4 semantic labels. Finally, we fine-tuned the modified U_Net model trained on the LIP dataset with these labelled data. We use the fine-tuned model for part segmentation in the proposed person re-identification method.
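The label grouping amounts to a lookup table from LIP label indices to the four part labels. In the sketch below the index of each LIP label is assumed to follow the order in which the labels are listed in the text; `LIP_LABELS`, `GROUPS`, `LUT` and `group_lip_labels` are hypothetical names used only for illustration.

```python
import numpy as np

# LIP label names in the order listed in the text (index order is an assumption).
LIP_LABELS = [
    "background", "hat", "hair", "glove", "sunglasses", "upper clothes",
    "dress", "coat", "socks", "pants", "jumpsuits", "scarf", "skirt", "face",
    "left-arm", "right-arm", "left-leg", "right-leg", "left-shoe", "right-shoe",
]

# Grouped labels: 1 = head, 2 = torso, 3 = lower-body (0 = background).
GROUPS = {
    1: {"hat", "hair", "sunglasses", "scarf", "face"},
    2: {"glove", "upper clothes", "dress", "coat", "left-arm", "right-arm"},
    3: {"socks", "pants", "jumpsuits", "skirt", "left-leg", "right-leg",
        "left-shoe", "right-shoe"},
}

LUT = np.zeros(len(LIP_LABELS), dtype=np.int64)
for part_id, names in GROUPS.items():
    for name in names:
        LUT[LIP_LABELS.index(name)] = part_id

def group_lip_labels(lip_map):
    """Map a LIP label map (values 0..19) to the 4-label part map."""
    return LUT[np.asarray(lip_map)]

grouped = group_lip_labels(np.array([[0, 1], [5, 9]]))
```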
Figure 5 illustrates some examples of parts semantic segmentation maps and the corresponding bounding boxes of human parts. The top row shows person images, and the bottom row shows their part segmentation by super-pixels, with the bounding boxes of person parts drawn as red rectangles. The figure illustrates the results of parts localization in different situations, including the normal situation (1st column), leg occlusion (2nd and 3rd columns), head occlusion (4th and 5th columns), detection mistakes (6th to 9th columns) and crowds (10th and 11th columns). As seen from the localization results in these situations, semantic segmentation-based part localization is stable and accurate. There are also some mistakes. As seen from the 6th and 10th columns, there are some segmentation mistakes in the torso part, which reduce the width of the torso bounding box by 7.14% in the 6th column and increase the height of the lower-body bounding box by 8.26% in the 10th column. We then randomly chose another set of images of people from the training sets of Market-1501 [8], CUHK03 [17] and DukeMTMC-reID [9], and labelled their part bounding boxes to evaluate the performance of part localization with the modified U_Net. The mean IoU between the labelled bounding boxes and the inferred ones is 69.15% for the head, 82.57% for the torso and 76.78% for the lower-body, respectively. This is acceptable for part localization, and the remaining error can be treated with data augmentation.
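For reference, the IoU between a labelled box and an inferred box can be computed as below. This is a generic sketch of the standard definition, with boxes given as `(x1, y1, x2, y2)` corner coordinates; that box format and the name `box_iou` are assumptions made for illustration.

```python
def box_iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

overlap_iou = box_iou((0, 0, 2, 2), (1, 1, 3, 3))
```

Averaging this value over all labelled boxes of one part gives a per-part mean IoU of the kind reported above.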
4.3. Person Re-Identification Performance
We performed experiments on three public person re-identification datasets: Market-1501 [8], CUHK03 [17], and DukeMTMC-reID [9].
The Market-1501 dataset [8] consists of 32,668 images of 1,501 persons, cropped with bounding boxes predicted by the DPM detector [41]. These images were captured by six different cameras: five high-resolution cameras and one low-resolution camera. Overlap exists among different cameras. The whole dataset is divided into a training set with 12,936 images of 751 persons and a testing set with 3,368 query images and 19,732 gallery images of 750 persons.
The CUHK03 dataset [17] includes 13,164 images of 1,360 people captured by six cameras. Each identity appears in two disjoint camera views (i.e., 4.8 images in each view on average). The dataset is partitioned into a training set (1,160 persons), a validation set (100 persons) and a test set (100 persons).
The DukeMTMC-reID dataset [9] consists of 1,404 identities appearing in more than two cameras and 408 identities (distractor IDs) who appear in only one camera. There are 16,522 training images of 702 identities, 2,228 query images of the other 702 identities and 17,661 gallery images (702 IDs + 408 distractor IDs).
Our proposed method is compared with current state-of-the-art methods, including IDLA [23], Part-Aligned [25], PersonNet [26], HydraPlus-Net [27], BoW+kissme [8], Basel. + LSRO [9], DCSL [42], PDC [5], PSE [11], SVDNet [43], PAN [44] and ATWL [45], to show its performance advantage over existing competitors. In our experiments, we report the cumulative matching characteristics (CMC) at rank-1, rank-5 and rank-10 and the mean average precision (mAP) to evaluate the performance of person re-identification methods. We also use state-of-the-art network architectures, such as VGG [46], ResNet [33], DenseNet [47] and Inception v3 [48], as our baseline networks to test the performance of different networks in our approach. The results are summarized in Table 3, Table 4 and Table 5, where we denote our method as PADP (Parts Alignments with DropParts).
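The CMC and average precision for a single query can be computed from its ranked gallery list as sketched below. This is a simplified illustration: it omits the removal of same-camera and junk gallery images that the standard benchmark protocols apply, and `cmc_and_ap` is a hypothetical helper name.

```python
import numpy as np

def cmc_and_ap(ranked_ids, query_id, max_rank=10):
    """CMC curve (up to max_rank) and average precision for one query."""
    matches = (np.asarray(ranked_ids) == query_id).astype(np.float64)
    hits = np.cumsum(matches)
    # CMC at rank k is 1 if a correct match appears within the top k results.
    cmc = (hits > 0).astype(np.float64)[:max_rank]
    n_relevant = matches.sum()
    if n_relevant == 0:
        return cmc, 0.0
    # Precision at each rank, averaged over the ranks of the correct matches.
    precision_at_k = hits / (np.arange(matches.size) + 1)
    ap = float((precision_at_k * matches).sum() / n_relevant)
    return cmc, ap

# Gallery identities sorted by ascending distance to a query of identity 3.
cmc, ap = cmc_and_ap([7, 3, 3, 5, 3], query_id=3, max_rank=5)
```

Averaging the CMC curves over all queries gives the rank-1, rank-5 and rank-10 scores, and averaging the per-query AP values gives the mAP.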
From Table 3, Table 4 and Table 5, it can be seen that the proposed algorithm with different network architectures outperforms the current state-of-the-art person re-identification methods on average. Due to its simpler structure, the 18-layer ResNet variant performs worse than the other networks, by 0.9% on the Market-1501 dataset, 1.14% on the CUHK03 dataset and 0.76% on the DukeMTMC-reID dataset at rank-1. Nevertheless, it still outperforms the current state-of-the-art person re-identification methods on the three datasets. On the Market-1501 dataset, our method with the 34-layer ResNet outperforms the second-best method by 1.25% at rank-1, and the DenseNet121-based method outperforms the second-best method by 2.33% in mAP. On the CUHK03 dataset, our method with the 34-layer ResNet outperforms the second-best method by 2.13% at rank-1. On the DukeMTMC-reID dataset, our method with DenseNet121 outperforms the second-best method by 1.41% at rank-1 and by 0.75% in mAP.
6. Conclusions
In this paper, we present a new deep architecture to deal with parts misalignment and, for the first time, propose a DropParts method to solve the parts missing problem. Experiments on standard pedestrian datasets show the effectiveness of our proposed method.
For future work, we will continue to improve the models of part localization and matching, by:
(1) Dividing person images into more parts, and improving the performance of parts localization.
(2) Designing an end-to-end model that includes both parts segmentation and re-identification tasks.