**2. Related Work**

Pedestrian attribute recognition has attracted considerable attention from researchers, with many efforts dedicated to extending deep convolutional networks to this task. Li et al. [10] proposed and compared two algorithms, one using deep features only for binary classification and the other considering the correlations between human attributes, which demonstrated the importance of modeling attribute relationships. The PGDM (pose guided deep model) [11] exploited the structural knowledge of the pedestrian body for attribute recognition: a pre-trained pose estimation model extracts deep features from body-part regions, which are then fused with whole-image features to improve the final recognition; however, the model is not end-to-end trainable. In addition, many researchers have incorporated attention mechanisms and sequence models into pedestrian attribute recognition. Liu et al. [12] proposed HydraPlus-Net, with novel multi-directional attention modules that learn complex features for fine-grained pedestrian analysis; experiments showed significant improvements over prior methods, although the model is hard to interpret. Wang et al. [13] used a sequence-to-sequence model to assist attribute recognition: they split the whole image into horizontal strip regions and formed region sequences from top to bottom, which helps mine region dependencies, but the recognition accuracy was not satisfactory. Zhao et al. [14] exploited the recurrent neural network's capability of learning contextual correlations and the attention model's capability of highlighting regions of interest on the feature map, proposing two models: a recurrent convolutional model, which explores the correlations between different attribute groups with a convolutional LSTM (long short-term memory), and a recurrent attention model, which captures regions of interest; both significantly improved the results. Novel approaches continue to appear. Dong et al. [15] proposed a curriculum transfer network to handle the issue of scarce training data: they first trained the model on clean source images and their attribute labels, and then simultaneously appended harder target images to the training process to capture harder cross-domain knowledge; the resulting model is robust for recognizing attributes in unconstrained images taken in the wild. Fabbri et al. [16] used a deep generative model to reconstruct super-resolution pedestrian images to cope with occlusion and low resolution, yet the reconstructed images were not as good as expected.

Zhong et al. [17] proposed an image-attribute reciprocal guidance representation method. Because the relationship between image features and attributes had not been fully considered, the authors not only investigated image features and attribute features together, but also developed a fusion attention mechanism and an improved loss function to address the problem of imbalanced attributes. Tan et al. [18] proposed three attention mechanisms, namely parsing attention, label attention, and spatial attention, to highlight regions or pixels robustly against variations such as frequent pose changes, blurred images, and varying camera angles. Specifically, parsing attention mainly focuses on extracting image features, label attention pays more attention to attribute features, and spatial attention considers the problem from a global perspective; however, these mechanisms do not fully consider the correlation between attributes. Li et al. [19] proposed to recognize pedestrian attributes by joint visual-semantic reasoning and knowledge distillation, although the results remain open to discussion. Han et al. [20] proposed an attention-aware pooling method for pedestrian attribute recognition which can also exploit the correlations between attributes. Xiang et al. [21] proposed a meta-learning based method for pedestrian attribute recognition to handle newly added attributes; however, the semantic similarity and spatial neighborhood of attributes are not taken into account in this method. In [22], the authors theoretically showed that deeper networks generally take more information into consideration, which helps improve classification accuracy. Chen et al. [23] first proposed video-based pedestrian attribute recognition. Their model was divided into two channels: the spatial channel extracts image features, while the temporal channel takes image sequences as input to extract temporal features, attached with spatial pooling to integrate the spatial features. Finally, they combined the two channels to achieve attribute classification, but they did not consider spatial and temporal attention for attribute recognition in videos.

#### **3. Approach**

In this section, we first introduce some preliminary knowledge of multi-label classification and the graph convolutional network, and then discuss the model in depth.

#### *3.1. Preliminary*

#### 3.1.1. Multi-Label Learning

Traditional supervised learning is prevalent and successful, and it is one of the most studied machine learning paradigms: each object is represented by a single feature vector and associated with a single label. However, this setting has clear limitations. In the real world, one object can carry many labels, and many objects may co-occur in one scene. The task of multi-label learning is to learn a function that can predict the proper label set for an unseen instance. A naive way to deal with the multi-label recognition problem is to treat every possible label set as a separate class of a multi-class problem. However, if the label space contains 20 class labels, the number of possible label sets is 2<sup>20</sup>, which exceeds one million; such an exponentially sized output space is clearly unaffordable. It is therefore necessary and crucial to capture correlations or dependencies among labels. For example, the probability of an image being annotated with the label "Female" would be high if we knew it had the labels "Long hair" and "Skirt". For multi-label classification algorithms, three kinds of learning strategies are commonly distinguished, as noted in [24].
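To make the counting argument above concrete, the following Python sketch (our illustration, not part of any cited method) computes the size of the label power set for 20 labels and contrasts it with the binary relevance decomposition, in which each label gets its own independent binary scorer; the linear weights here are random placeholders, not a trained attribute model.

```python
import numpy as np

# Size of the label power set for q = 20 binary attributes:
# treating every label combination as one class yields 2^20 outputs.
q = 20
num_label_sets = 2 ** q  # over one million possible label sets

# Binary relevance: decompose the multi-label task into q independent
# binary classifiers. Each "classifier" here is a hypothetical linear
# scorer; sigmoid scores above 0.5 are predicted as positive.
rng = np.random.default_rng(0)
d = 8                        # feature dimension (illustrative)
W = rng.normal(size=(q, d))  # one weight vector per label

def predict_labels(x):
    """Predict the label set for one feature vector x of shape (d,)."""
    scores = 1.0 / (1.0 + np.exp(-W @ x))  # per-label sigmoid scores
    return (scores > 0.5).astype(int)      # independent thresholding

x = rng.normal(size=d)
y_hat = predict_labels(x)
print(num_label_sets, y_hat.shape)
```

Note that binary relevance only needs q scorers instead of 2<sup>20</sup> outputs, but it predicts each label in isolation, which is exactly why label correlations are ignored.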


#### 3.1.2. Graph Convolutional Network

The graph convolutional network is a branch of the graph neural network [25]. It is an emerging field in recent years, originating from the limitations of convolutional neural networks. CNNs have developed rapidly in recent years thanks to their translation invariance and weight sharing, which started a new era of deep learning [26]. However, convolutional neural networks normally operate on regular Euclidean data such as images and speech; in other words, they are not good at operating on non-Euclidean data such as graphs. Therefore, many researchers began to study how to define convolution on non-Euclidean structures and extract features for machine learning tasks. Advanced strategies for graph convolution are often categorized into spectral approaches and spatial approaches. This paper mainly focuses on spectral approaches.

The spectral network was proposed in [27]. The convolution operation is defined in the Fourier domain by computing the eigendecomposition of the normalized graph Laplacian:

$$L = I_N - D^{-\frac{1}{2}} A D^{-\frac{1}{2}} = U \Lambda U^T, \tag{1}$$

where *D* is the degree matrix, *A* is the adjacency matrix of the graph, and Λ is the diagonal matrix of the eigenvalues of *L*. Notice that *L* is a symmetric positive semi-definite matrix, so it can be eigendecomposed; the set of eigenvalues of *L* is called the spectrum of *L*. One can then define the traditional Fourier transform on the graph: the graph Fourier transform of a signal *x* ∈ *R<sup>N</sup>* is defined as *U<sup>T</sup>x*, where *U* is the matrix of eigenvectors of the normalized graph Laplacian.
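The construction above can be illustrated numerically. The following sketch, on a toy 4-node graph of our own choosing, builds the normalized Laplacian, eigendecomposes it, and applies the graph Fourier transform and its inverse:

```python
import numpy as np

# Toy 4-node cycle graph: A is the adjacency matrix, and the
# normalized Laplacian L = I - D^{-1/2} A D^{-1/2} is symmetric
# positive semi-definite, so it admits L = U Lambda U^T.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
deg = A.sum(axis=1)                      # node degrees
D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
L = np.eye(4) - D_inv_sqrt @ A @ D_inv_sqrt

# Eigendecomposition of the symmetric Laplacian; eigh returns
# ascending eigenvalues lam and orthonormal eigenvectors U.
lam, U = np.linalg.eigh(L)

x = np.array([1.0, -2.0, 0.5, 3.0])  # a signal on the 4 nodes
x_hat = U.T @ x                      # graph Fourier transform
x_back = U @ x_hat                   # inverse transform recovers x
```

Because *U* is orthonormal, the inverse transform *Ux̂* recovers *x* exactly, and all eigenvalues of the normalized Laplacian lie in [0, 2].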

The convolution on the graph can be defined as the multiplication of a signal *x* ∈ *R<sup>N</sup>* with a filter *g*<sub>θ</sub> = *diag*(θ) parameterized by θ ∈ *R<sup>N</sup>*:

$$g_\theta \star x = U g_\theta U^T x. \tag{2}$$

Researchers then focus on *g*<sub>θ</sub> and observe that the computational complexity of the above operation is *O*(*n*<sup>3</sup>) due to the eigendecomposition, which can be prohibitively expensive. One therefore hopes for a formulation with fewer parameters and lower complexity. In [28], the authors suggest that *g*<sub>θ</sub> can be approximated by a truncated expansion in terms of Chebyshev polynomials *T<sub>k</sub>*(*x*) up to the *K*-th order:

$$g_{\theta'} \star x \approx \sum_{k=0}^{K} \theta'_k T_k(\overline{L}) x, \tag{3}$$

where *L̄* = (2/λ<sub>max</sub>)*L* − *I<sub>N</sub>* and λ<sub>max</sub> denotes the largest eigenvalue of *L*. Notice that this approximate operation only requires complexity *O*(*K*|*E*|) and *K* + 1 parameters, where |*E*| is the number of edges of the graph. Next, [29] sets *K* = 1, approximates λ<sub>max</sub> ≈ 2, and constrains the parameters with θ = θ′<sub>0</sub> = −θ′<sub>1</sub> to simplify the operation, which yields the following expression:
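A minimal sketch of the Chebyshev filtering in Eq. (3) follows, using the standard recurrence *T*<sub>0</sub>(*x*) = 1, *T*<sub>1</sub>(*x*) = *x*, *T<sub>k</sub>*(*x*) = 2*x T*<sub>*k*−1</sub>(*x*) − *T*<sub>*k*−2</sub>(*x*); the toy graph and the coefficients θ′<sub>*k*</sub> are illustrative choices rather than learned values.

```python
import numpy as np

def cheb_filter(L, x, theta):
    """Apply the K-th order Chebyshev filter sum_k theta[k] T_k(L_bar) x,
    where L_bar = (2 / lambda_max) L - I. Each step is one matrix-vector
    product, which is why the cost is O(K|E|) for a sparse graph."""
    n = L.shape[0]
    lam_max = np.linalg.eigvalsh(L).max()
    L_bar = (2.0 / lam_max) * L - np.eye(n)
    T_prev, T_curr = x, L_bar @ x          # T_0(L_bar) x and T_1(L_bar) x
    out = theta[0] * T_prev
    if len(theta) > 1:
        out += theta[1] * T_curr
    for k in range(2, len(theta)):         # T_k = 2 L_bar T_{k-1} - T_{k-2}
        T_prev, T_curr = T_curr, 2.0 * L_bar @ T_curr - T_prev
        out += theta[k] * T_curr
    return out

# Toy 3-node path graph and signal (illustrative only).
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
d_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
L = np.eye(3) - d_inv_sqrt @ A @ d_inv_sqrt
x = np.array([1.0, 2.0, 3.0])
y = cheb_filter(L, x, theta=[0.5, 0.2, 0.1])
```

With θ′ = [1] the filter reduces to the identity, and with θ′ = [0, 1] it returns *L̄x*, matching the first two Chebyshev terms.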

$$g_{\theta'} \star x \approx \theta'_0 x + \theta'_1 (L - I_N) x = \theta \Big( I_N + D^{-\frac{1}{2}} A D^{-\frac{1}{2}} \Big) x. \tag{4}$$

Finally, using the renormalization trick *Ā* = *A* + *I<sub>N</sub>* and *D̄<sub>ii</sub>* = ∑<sub>j</sub> *Ā<sub>ij</sub>*, the layer-wise propagation rule can be expressed as follows:

$$H^{l+1} = \sigma(\overline{D}^{-\frac{1}{2}}\overline{A}\overline{D}^{-\frac{1}{2}}H^lW^l) \tag{5}$$

where *H<sup>l</sup>* denotes the node features of the *l*-th layer, *W<sup>l</sup>* denotes the learnable parameters of the *l*-th layer, and σ(·) denotes a non-linear activation function.
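Putting Eq. (5) into code, the following sketch implements a single graph convolutional layer with the renormalization trick; the adjacency matrix, features, and weights are random placeholders of our own, not the paper's trained model.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN layer, Eq. (5): H_next = sigma(D_bar^{-1/2} A_bar
    D_bar^{-1/2} H W), with A_bar = A + I (self-loops) and
    D_bar the degree matrix of A_bar. ReLU is used as sigma."""
    n = A.shape[0]
    A_bar = A + np.eye(n)                         # renormalization trick
    d_inv_sqrt = 1.0 / np.sqrt(A_bar.sum(axis=1))
    A_hat = A_bar * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(A_hat @ H @ W, 0.0)         # sigma = ReLU

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)  # toy 3-node graph
H = rng.normal(size=(3, 4))             # 3 nodes, 4 input features
W = rng.normal(size=(4, 2))             # project to 2 output features
H_next = gcn_layer(A, H, W)
```

Stacking such layers lets each node aggregate features from progressively larger neighborhoods, which is how label (node) dependencies are propagated in GCN-based attribute models.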
