*3.2. Architecture of Our Model*

The overall framework of our approach is shown in Figure 1.

**Figure 1.** Overall framework of our model for pedestrian attribute recognition.

Our model adopts ResNet-101 to extract features of each pedestrian image since ResNet-101 is a common paradigm in image classification, meanwhile, we transform the corresponding attribute labels into word embedding and feed them to our data-driven matrix. The directed line between ellipse pairs represents the dependency of label pairs. Graph convolutional network maps label into D × C-dim classifiers, where D denotes the dimensionality of the parameter to be learned and C denotes the categories of labels. Obviously, our model takes both image and word embedding as input and with multiplication of the corresponding two outputs to produce C-dim scores, and finally, this paper uses traditional multi-label loss to train our network architecture. The detail of image feature extraction and the data-driven matrix is discussed in Sections 3.2.1 and 3.2.2.
