*4.1. Evaluation Metrics*

Following conventional settings, we report five evaluation criteria: accuracy, precision, recall, F1 score, and mean accuracy (mA). The mA criterion is computed as follows:

$$\text{mA} = \frac{1}{2L} \sum\_{i=1}^{L} \left( \frac{\text{TP}\_{i}}{\text{P}\_{i}} + \frac{\text{TN}\_{i}}{\text{N}\_{i}} \right) \tag{9}$$

where L is the number of attributes, *TPi* and *TNi* are the numbers of correctly predicted positive and negative examples for the i-th attribute, and *Pi* and *Ni* are the numbers of positive and negative examples of that attribute, respectively.
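As a concrete illustration, Eq. (9) can be computed directly from binary prediction and label matrices. The following NumPy sketch is our own (the function name and array layout are not from the paper); it assumes one row per example, one column per attribute, and that every attribute has at least one positive and one negative example:

```python
import numpy as np

def mean_accuracy(preds, labels):
    """Label-based mean accuracy (mA) as in Eq. (9).

    preds, labels: (num_examples, L) binary arrays,
    one column per attribute.
    """
    preds = np.asarray(preds)
    labels = np.asarray(labels)
    pos = labels == 1
    neg = labels == 0
    tp = np.sum((preds == 1) & pos, axis=0)  # TP_i per attribute
    tn = np.sum((preds == 0) & neg, axis=0)  # TN_i per attribute
    p = np.sum(pos, axis=0)                  # P_i per attribute
    n = np.sum(neg, axis=0)                  # N_i per attribute
    L = labels.shape[1]
    return np.sum(tp / p + tn / n) / (2 * L)
```

With perfect predictions every ratio is 1, so mA evaluates to 1.0; each wrongly predicted example lowers the per-attribute positive or negative accuracy proportionally.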

Evaluating multi-label classification is considerably more complicated than evaluating traditional single-label classification, so researchers have proposed many metrics, which fall mainly into two groups: label-based metrics and example-based metrics. The criterion above is called label-based because mA considers each attribute independently and ignores the relationships between attributes. Some researchers therefore suggest also using example-based criteria, namely accuracy, precision, recall, and F1 score, defined as follows:

$$\text{Acc}\_{\text{exam}} = \frac{1}{N} \sum\_{i=1}^{N} \frac{\left| \mathbf{Y}\_i \cap \mathbf{f}(\mathbf{x}\_i) \right|}{\left| \mathbf{Y}\_i \cup \mathbf{f}(\mathbf{x}\_i) \right|} \tag{10}$$

$$\text{Prec}\_{\text{exam}} = \frac{1}{N} \sum\_{i=1}^{N} \frac{\left| \mathbf{Y}\_{i} \cap \mathbf{f}(\mathbf{x}\_{i}) \right|}{\left| \mathbf{f}(\mathbf{x}\_{i}) \right|}, \tag{11}$$

$$\text{Rec}\_{\text{exam}} = \frac{1}{N} \sum\_{i=1}^{N} \frac{\left| \mathbf{Y}\_{i} \cap \mathbf{f}(\mathbf{x}\_{i}) \right|}{\left| \mathbf{Y}\_{i} \right|}, \tag{12}$$

$$\text{F1} = \frac{2 \cdot \text{Prec}\_{\text{exam}} \cdot \text{Rec}\_{\text{exam}}}{\text{Prec}\_{\text{exam}} + \text{Rec}\_{\text{exam}}}, \tag{13}$$

where N is the number of examples, *xi* denotes the i-th example, f(*xi*) returns the set of predicted positive labels of *xi*, and *Yi* is the set of ground-truth positive labels of the corresponding example.
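Eqs. (10)-(13) can likewise be computed from binary matrices by treating each row's positive entries as a label set. The sketch below is our own illustration (names and layout are assumptions); it requires every example to have at least one predicted and one ground-truth positive label, so none of the denominators are zero:

```python
import numpy as np

def example_based_metrics(preds, labels):
    """Example-based Acc, Prec, Rec, and F1 (Eqs. 10-13).

    preds, labels: (N, L) binary arrays; a row's positive
    label set is the set of indices where the entry is 1.
    """
    preds = np.asarray(preds, dtype=bool)
    labels = np.asarray(labels, dtype=bool)
    inter = np.sum(preds & labels, axis=1).astype(float)  # |Y_i ∩ f(x_i)|
    union = np.sum(preds | labels, axis=1)                # |Y_i ∪ f(x_i)|
    acc = np.mean(inter / union)
    prec = np.mean(inter / np.sum(preds, axis=1))         # divide by |f(x_i)|
    rec = np.mean(inter / np.sum(labels, axis=1))         # divide by |Y_i|
    f1 = 2 * prec * rec / (prec + rec)
    return acc, prec, rec, f1
```

Note that precision and recall are averaged per example before F1 is formed, which is why F1 here is the harmonic mean of the two averaged quantities rather than an average of per-example F1 scores.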

#### *4.2. Dataset*

We use RAP-2.0 [32] as our benchmark pedestrian attribute recognition dataset. It contains 84,928 images divided into three parts: 50,957 for training, 16,986 for validation, and 16,985 for testing. In our experiments we merge the training and validation parts into a single training set. Each image is richly annotated, as shown in Table 1.

**Table 1.** Annotation attribute names in RAP.


#### *4.3. Implementation Details*

We use ResNet-101 to extract image features, obtaining 2048 × 14 × 14 feature maps from the "conv5\_x" layer. A 14 × 14 global max pooling then reduces them to a 2048-dim image feature. We select 60 of the 152 attributes in the benchmark dataset and use one-hot word embeddings to transform the attribute labels into a 60 × 300-dim embedding matrix. Since the graph convolutional network propagates information according to a correlation matrix, we count the total occurrences of each label in the training set to construct *N* ∈ *R*<sup>1×60</sup>, and the co-occurrences of each label pair *Li* and *Lj* to construct *M* ∈ *R*<sup>60×60</sup>; the probability matrix *P* = *M*/*N* ∈ *R*<sup>60×60</sup> then serves as the correlation matrix. The graph convolutional network maps the label representations to classifiers based on *P*, producing a 2048 × 60 classifier matrix. Multiplying the 2048-dim image feature by this matrix yields a 60-dim score vector, which we feed into a standard multi-label classification loss so that the whole framework is end-to-end trainable.
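The correlation-matrix construction described above can be sketched as follows. This is a minimal illustration under our own reading of *P* = *M*/*N*: broadcasting divides column j of the co-occurrence matrix by the occurrence count of label j, and the guard against division by zero is our addition, not from the paper:

```python
import numpy as np

def correlation_matrix(labels):
    """Build the label correlation matrix P = M / N.

    labels : (num_examples, L) binary array of training annotations.
    N[j]   : occurrence count of label j in the training set.
    M[i,j] : co-occurrence count of labels i and j.
    P[i,j] : M[i,j] / N[j], i.e. an estimate of the conditional
             probability of label i given label j.
    """
    labels = np.asarray(labels, dtype=float)
    N = labels.sum(axis=0)             # (L,) occurrence counts
    M = labels.T @ labels              # (L, L) co-occurrence counts
    P = M / np.maximum(N, 1)           # column j divided by N[j]
    return P
```

For three examples annotated with three labels, a label that always co-occurs with another produces a conditional probability of 1 in the corresponding entry, while labels that never co-occur produce 0.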

#### *4.4. Experiment Results*

As shown in Table 2, our model is competitive with most previous methods even before further improvement. To improve the final recognition performance, we optimize the correlation matrix; the experimental results show a significant improvement, and the improved method surpasses its competitive counterparts on several evaluation criteria. These improvements are discussed in detail in Section 4.5.


**Table 2.** Comparison against previous methods.
