#### *4.5. Ablation Study*

To further improve recognition performance, we propose to binarize the correlation matrix *P*, since an entry *Pij* with a very small value is likely to be a noisy edge. Specifically, we filter noisy edges with a threshold *t*:

$$A\_{ij} = \begin{cases} \ 0, & \text{if } P\_{ij} < t \\ \ 1, & \text{if } P\_{ij} \ge t \end{cases}. \tag{14}$$
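The thresholding in Equation (14) can be sketched in NumPy as follows; the function and variable names are illustrative and not taken from the paper's code:

```python
import numpy as np

def binarize_correlation(P, t=0.6):
    """Binarize the label correlation matrix P with threshold t (Eq. 14).

    Entries below t are treated as noisy edges and set to 0;
    the remaining entries become 1.
    """
    return (P >= t).astype(P.dtype)

# Toy 3-label correlation matrix (values are made up for illustration).
P = np.array([[1.0, 0.70, 0.20],
              [0.65, 1.0, 0.10],
              [0.30, 0.05, 1.0]])
A = binarize_correlation(P, t=0.6)
```

With *t* = 0.6, the weak edges (0.2, 0.1, 0.3, 0.05) are removed while the strong ones survive, which is exactly the filtering effect discussed below.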

As shown in Figure 4, the binarized model outperforms the initial model when the threshold is about 0.6, and it clearly achieves its best results when the threshold is about 0.9. However, a high threshold causes a sparsity problem: if *t* = 0.9, most of the edges are filtered out. We therefore introduce a proportion *p* that balances the weight a node assigns to itself against the weight it assigns to its neighbor nodes:

$$A'\_{ij} = \begin{cases} \ \dfrac{p}{\sum\_{j=1}^{C} A\_{ij} + \varepsilon} A\_{ij}, & \text{if } i \neq j \\ \ 1 - p, & \text{if } i = j \end{cases}, \tag{15}$$

where ε is a very small number that prevents division by zero. Note that if the proportion is too small, the model ignores the information from neighbor nodes; if the proportion is too large, the model ignores the feature of the node itself. As shown in Figure 5, our model performs better when the proportion is close to 0.5, but the proportion strategy does not improve overall performance.
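A minimal NumPy sketch of this re-weighting, under the assumption that the diagonal receives weight 1 − *p* and each row of neighbor edges is scaled by *p* as in Equation (15) (names are illustrative, not from the paper's code):

```python
import numpy as np

def reweight_adjacency(A, p=0.5, eps=1e-6):
    """Re-weight a binarized adjacency matrix A (Eq. 15 sketch).

    Off-diagonal entries are scaled by p over the row sum (plus eps to
    avoid division by zero); each node keeps weight 1 - p for itself.
    """
    A = A.astype(float)
    row_sum = A.sum(axis=1, keepdims=True)   # includes the diagonal, per Eq. 15
    A_hat = p * A / (row_sum + eps)          # scale neighbor edges
    np.fill_diagonal(A_hat, 1.0 - p)         # self-weight is 1 - p
    return A_hat

A = np.array([[1, 1, 0],
              [1, 1, 0],
              [0, 0, 1]])
A_hat = reweight_adjacency(A, p=0.5)
```

With *p* = 0.5, each node splits its weight roughly evenly between itself and its neighbors, matching the observation that performance peaks near 0.5.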

Furthermore, we examine whether a deeper GCN yields higher accuracy by stacking more graph convolution layers. As shown in Figure 6, our model performs best with 3 layers, and recognition performance drops quickly once the number of layers exceeds 4. A likely reason for this drop is that with more GCN layers, repeated propagation between nodes accumulates, which results in over-smoothing.
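The over-smoothing effect can be illustrated with a toy forward pass that stacks the standard GCN update H⁽ˡ⁺¹⁾ = ReLU(ÂH⁽ˡ⁾W⁽ˡ⁾). This is a sketch with random, untrained weights (not the paper's model); with a dense propagation matrix and many layers, all node representations collapse toward each other:

```python
import numpy as np

def gcn_forward(A_hat, X, weights):
    """Stacked GCN layers: H <- ReLU(A_hat @ H @ W) for each W."""
    H = X
    for W in weights:
        H = np.maximum(A_hat @ H @ W, 0.0)  # propagate, transform, ReLU
    return H

rng = np.random.default_rng(0)
A_hat = np.full((4, 4), 0.25)                  # dense, uniform propagation
X = rng.standard_normal((4, 8))                # 4 nodes, 8-dim features
weights = [rng.standard_normal((8, 8)) * 0.1 for _ in range(6)]  # 6 layers

H = gcn_forward(A_hat, X, weights)
# After repeated averaging, every node carries (almost) the same vector:
spread = np.abs(H - H.mean(axis=0)).max()
```

Here `spread` is essentially zero: after a few propagation steps the node features become indistinguishable, which is the over-smoothing behavior suspected to cause the accuracy drop beyond 4 layers.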

**Figure 4.** Accuracy comparison of different thresholds.

**Figure 5.** Accuracy comparisons of different proportions.

**Figure 6.** Accuracy comparisons of different layers.

#### *4.6. Image Retrieval*

For a more intuitive demonstration, we wrote a program that uses our model to retrieve pedestrian images from the test dataset given query words. As shown in Figure 7, the retrieved results are quite accurate.

**Figure 7.** Query results.

#### **5. Conclusions**

Departing from conventional approaches, we propose a novel framework based on the graph convolutional network for pedestrian attribute recognition and obtain satisfactory results. Given a pedestrian image and its attribute labels, we use ResNet-101 to extract image features and transform the attribute labels into word embeddings. We then construct a correlation matrix from the co-occurrence of labels in the training dataset and optimize this matrix for better performance. The graph convolutional network maps the information between nodes into classifiers, which are combined with the image features extracted by the convolutional neural network so that the whole model can be trained end-to-end; this yields an elegant network architecture and strong recognition performance. Experiments validate the superiority of our model.

**Author Contributions:** Conceptualization, C.Z.; software, C.Z.; validation, X.S., H.Y.; formal analysis, X.S.; writing—original draft preparation, X.S.; writing—review and editing, C.Z.

**Funding:** This research received no external funding.

**Acknowledgments:** The authors thank all the anonymous reviewers for their insightful comments and useful suggestions.

**Conflicts of Interest:** The authors declare no conflict of interest.
