#### 3.2.1. Image Feature Extraction

Convolutional neural networks have achieved excellent performance in image processing in recent years, and any CNN base model can be used to learn features from each pedestrian image. In our experiments, we use the deep residual network [30,31] as the base model. The deep residual network was motivated by a common fault of deep convolutional neural networks: researchers used to stack ever-deeper layers to enrich the learned features, expecting better recognition performance. However, is learning better networks as easy as stacking more layers? The answer is no. In the early stage, accuracy rises steadily as layers are added, but as the depth continues to increase, the network tends to saturate; accuracy then degrades rapidly, leading to a higher training error. Researchers proposed a solution to this situation, shown in Figure 2: the added layers perform identity mapping, while the other layers are copied from the learned shallower model.

**Figure 2.** Identity mapping.
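The identity mapping in Figure 2 can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation; the two weight layers and all names are assumptions made for the example:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    # F(x): two weight layers with a ReLU in between (the residual branch).
    fx = relu(x @ w1) @ w2
    # Identity shortcut: the input is added back before the final activation.
    return relu(fx + x)

# If the residual weights are zero, F(x) = 0 and the block reduces to the
# identity on a non-negative input, so stacking such blocks cannot make the
# network worse than the learned shallower model.
x = relu(np.array([1.0, -2.0, 3.0, 0.5]))   # non-negative activations
w_zero = np.zeros((4, 4))
out = residual_block(x, w_zero, w_zero)
assert np.allclose(out, x)                   # block behaves as identity
```

This is why the added layers in Figure 2 can safely default to identity mapping: the optimizer only needs to push the residual branch toward zero.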

There are many explanations for why this works. Some researchers suppose that once the derivative at an earlier step is less than one, 32-bit floating-point numbers can no longer represent the gradient change after backpropagation continues for many layers, so the shallower layers stop learning. Deep residual networks, in contrast, carry the pre-derivation signal forward through the shortcut, which reduces the probability of gradient dispersion. From our perspective, each convolution operation inevitably discards some information, due to random kernel parameters, the inhibition of the activation function, and so on; the residual network recovers the once-processed information through the shortcut, reducing this loss. The authors of the deep residual network evaluated models ranging from 18 layers to 152 layers, including 34-, 50-, and 101-layer variants. Our model uses ResNet-101 to extract pedestrian image features. We randomly crop each training image, resize it to 224 × 224 resolution, and apply horizontal flips for data augmentation. Finally, we obtain 2048 × 14 × 14 feature maps from the "conv5\_x" layer.

#### 3.2.2. Correlation Matrix

The graph convolutional network works by propagating information between nodes based on the correlation matrix, so constructing this matrix is crucial. Normally, an adjacency matrix is used to represent the topology of a graph; however, no such matrix is provided in any benchmark pedestrian attribute recognition dataset. In this paper, to construct the correlation matrix, we first define the vector *N*:

$$N = \begin{bmatrix} N_1 & N_2 & \cdots & N_{n-1} & N_n \end{bmatrix} \tag{6}$$

where *Ni* denotes the number of occurrences of each label in the training set. We then count the co-occurrences of label pairs in the training set and obtain the matrix *M*:

$$M = \begin{bmatrix} M_{11} & M_{12} & \cdots & M_{1,n-1} & M_{1n} \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ M_{n1} & M_{n2} & \cdots & M_{n,n-1} & M_{nn} \end{bmatrix} \tag{7}$$

where *Mij* denotes the number of times *Labeli* and *Labelj* co-occur. We can then obtain the conditional probability matrix using the formula *Pij* = *Mij*/*Ni*:

$$\begin{bmatrix} P_{11} & P_{12} & \cdots & P_{1,n-1} & P_{1n} \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ P_{n1} & P_{n2} & \cdots & P_{n,n-1} & P_{nn} \end{bmatrix} = \begin{bmatrix} M_{11}/N_1 & M_{12}/N_1 & \cdots & M_{1,n-1}/N_1 & M_{1n}/N_1 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ M_{n1}/N_n & M_{n2}/N_n & \cdots & M_{n,n-1}/N_n & M_{nn}/N_n \end{bmatrix} \tag{8}$$

where *Pij* denotes the probability that label *Lj* occurs given that label *Li* appears. Notice that the formula is not strict matrix division: each row *i* of *M* is divided by the scalar *Ni*. The relationship between label pairs is shown in Figure 3. For example, given an image, the probability of label "Female" is very high if label "Skirt" exists; however, the probability of label "Skirt" is not necessarily high if label "Female" exists, because those who wear skirts are female, but females do not necessarily wear skirts. In contrast, labels "Female" and "Skirt" have little relationship with label "Bald", as we seldom see bald females.
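The construction of Equations (6)–(8) can be sketched in NumPy. The toy labels ("Female", "Skirt", "Bald") and the binary annotation matrix below are illustrative, not drawn from any benchmark dataset:

```python
import numpy as np

# Rows are images, columns are labels: [Female, Skirt, Bald].
labels = np.array([
    [1, 1, 0],
    [1, 1, 0],
    [1, 0, 0],
    [1, 0, 0],
    [0, 0, 1],
])

N = labels.sum(axis=0)    # occurrence count of each label, Eq. (6)
M = labels.T @ labels     # co-occurrence counts M_ij, Eq. (7)
P = M / N[:, None]        # P_ij = M_ij / N_i (row-wise division), Eq. (8)

# The asymmetry discussed in the text:
print(P[1, 0])   # 1.0 -> P(Female | Skirt): every skirt-wearer here is female
print(P[0, 1])   # 0.5 -> P(Skirt | Female): only half of the females wear skirts
```

Note that `P` is not symmetric, which is exactly why a conditional probability matrix, rather than a plain symmetric adjacency matrix, is used to encode label dependencies.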

**Figure 3.** Conditional probability between labels.

## **4. Experiments**

In this section, we first introduce the evaluation metrics and dataset, as well as the implementation details. Then, we report the results on the benchmark pedestrian attribute recognition dataset.
