#### *3.2.2. Graph Representation*

As a critical part of the graph structure, edges determine the weights with which node information is propagated and aggregated. In other fields, some researchers regard the semantic relationship or co-occurrence frequency of objects as edges [1,26]. However, object semantics is a basic feature, and the gap between semantics and sentiment makes it hard to describe sentimental relationships accurately. Further, the "affective gap" between low-level visual features and high-level sentiment makes it challenging to label abstract sentiments without manual annotation. To solve this problem, we use the semantic relationship of objects in the sentimental space as the edges of the graph structure. Given the object categories, we employ SentiWordNet to label each category with sentimental information. SentiWordNet is a lexical resource for opinion mining that assigns positive and negative scores in the range [0, 1] to words.

As shown in Equations (1) and (2), we retrieve the words related to the object category in SentiWordNet and estimate the sentimental strength of the current word *W* as the average over its *n* related words *W'*, where *Wp* is the positive sentimental strength and *Wn* is the negative sentimental strength.

$$W_n = \frac{\sum_{i=1}^{n} W'_{in}}{n} \tag{1}$$

$$W_p = \frac{\sum_{i=1}^{n} W'_{ip}}{n} \tag{2}$$

In particular, we stipulate that the sentimental polarity of a word is determined by its positive and negative strengths. As shown in Equation (3), the sentiment value *S* is the difference between the two sentimental intensities of the word. In this way, positive words have a positive sentiment value, and negative words the opposite. *S* lies in [−1, 1] because each intensity in SentiWordNet is in [0, 1].

$$S = W_p - W_n \tag{3}$$
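
For concreteness, the following is a minimal sketch of Equations (1)–(3) using NLTK's SentiWordNet corpus; the lookup tooling is not fixed by our method, and treating the SentiWordNet synsets of the category word as its related words is one plausible reading.

```python
# Minimal sketch of Equations (1)-(3) via NLTK's SentiWordNet corpus.
# Requires: nltk.download("sentiwordnet") and nltk.download("wordnet").
from nltk.corpus import sentiwordnet as swn

def sentiment_value(word: str) -> float:
    """Average the positive/negative scores of the SentiWordNet entries
    related to `word` (Eqs. 1-2) and return their difference S (Eq. 3)."""
    synsets = list(swn.senti_synsets(word))
    if not synsets:                # word not covered by SentiWordNet
        return 0.0
    w_p = sum(s.pos_score() for s in synsets) / len(synsets)  # Eq. (2)
    w_n = sum(s.neg_score() for s in synsets) / len(synsets)  # Eq. (1)
    return w_p - w_n               # Eq. (3): S lies in [-1, 1]

print(sentiment_value("happy"))    # usage example
```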

Based on this, we design the method described in Equation (4), which uses the sentimental tendency of objects to measure the sentimental distance *Lij* between words *Wi* and *Wj*. When two words have the same sentimental tendency, we define the difference between the two sentiment values *Si* and *Sj* as their distance in the sentimental space. On the contrary, when two words have opposite sentimental tendencies, we add one to this difference to enhance the sentimental gap; when either word is neutral (its sentiment value is zero), the distance is fixed at 0.5. Further, we build the graph over the sentimental values and the object information. Figure 3c shows the relationship between the node "person" and its adjacent nodes, where the length of an edge reflects the distance between nodes.

$$L_{ij} = \begin{cases} \left||S_i| - |S_j|\right|, & \text{if } S_i \cdot S_j > 0 \\ 0.5, & \text{if } S_i \cdot S_j = 0 \\ \left||S_i| - |S_j|\right| + 1, & \text{otherwise} \end{cases} \tag{4}$$
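
A compact sketch of this distance computation follows, where the 0.5 branch is assumed to cover the case in which either word is neutral:

```python
# Sketch of the sentimental distance in Equation (4).
def sentimental_distance(s_i: float, s_j: float) -> float:
    base = abs(abs(s_i) - abs(s_j))
    if s_i * s_j > 0:        # same sentimental tendency
        return base
    if s_i * s_j == 0:       # at least one neutral word (assumption)
        return 0.5
    return base + 1.0        # opposite tendencies: enlarge the gap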

#### *3.2.3. Feature Representation*

The graph structure describes the relationships between objects, while the nodes of the graph describe the features of each object; we select hand-crafted features, namely the intensity distribution and a texture feature, as the representation of objects. Inspired by Machajdik [14], we calculate and analyze the image intensity characteristics on the EmotionROI and FI datasets. In detail, we quantize the intensity of each pixel into bins 0–10 and build histograms of the intensity distribution. As shown in Figure 4, the pixel counts of positive emotions (joy, surprise, etc.) are higher than those of negative emotions (anger, sadness, etc.) in bins 4–6, while negative emotions dominate in bins 1–2.

**Figure 4.** Distribution curves of the number of pixels per brightness bin for different emotion categories in the EmotionROI and Flickr and Instagram (FI) datasets.

The result shows that the intensity distribution can distinguish the sentimental polarity of images to some extent. At the same time, we use the Gray-Level Co-occurrence Matrix (GLCM) to describe the texture of each object in the image as a supplement to the image detail features. Specifically, we quantize the luminance values to 0–255 and calculate a 256-dimensional feature vector with a 45-degree offset as the GLCM parameter. The node feature in the final graph model is a 512-dimensional vector.
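
A sketch of the node feature construction is given below. It assumes a 256-bin intensity histogram concatenated with a 256-dimensional reduction of the GLCM so that the result is 512-dimensional; the exact reduction from the 256 × 256 co-occurrence matrix is not detailed here, and summing over matrix rows is one plausible choice.

```python
# Sketch of the 512-d node feature: intensity histogram + GLCM texture.
import numpy as np
from skimage.feature import graycomatrix

def node_feature(gray_patch: np.ndarray) -> np.ndarray:
    """gray_patch: uint8 grayscale crop of one detected object."""
    hist, _ = np.histogram(gray_patch, bins=256, range=(0, 256))
    hist = hist / max(hist.sum(), 1)          # normalized intensity distribution
    # GLCM at a 45-degree offset (pi/4), 256 gray levels, distance 1
    glcm = graycomatrix(gray_patch, distances=[1], angles=[np.pi / 4],
                        levels=256, symmetric=True, normed=True)
    texture = glcm[:, :, 0, 0].sum(axis=1)    # 256-d row reduction (assumption)
    return np.concatenate([hist, texture])    # 512-d node feature
```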

#### *3.3. Interaction Graph Inference*

Sentiment involves implicit relationships among objects. The graph structure expresses low-level visual features and the relationships among objects, which are the source of interaction features, and inference is the process of generating these interaction features. To simulate the interaction process, we employ a GCN to propagate and aggregate the low-level features of objects under the supervision of sentimental distances. We stack GCN layers, where each layer takes the output *H<sup>l</sup>* of the previous layer as input and generates the new node features *H<sup>l+1</sup>*.

The feature update process of layer *l* is shown in Equation (5), where *A*˜ is obtained by adding the identity matrix to the adjacency matrix of the graph model, i.e., adding self-loops. *H<sup>l</sup>* is the output feature of the previous layer, *H<sup>l+1</sup>* is the output feature of the current layer, *W<sup>l</sup>* is the weight matrix of the current layer, and *σ* is the nonlinear activation function. *D*˜ is the degree matrix of *A*˜, obtained by Equation (6). The first layer's input is the initial 512-dimensional node feature *H*<sup>0</sup> generated from the brightness histogram and GLCM introduced above, and the final output of the model is a 2048-dimensional feature vector.

$$H^{l+1} = \sigma\left(\tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}}H^{l}W^{l}\right) \tag{5}$$

$$\tilde{D}_{ii} = \sum_{j} \tilde{A}_{ij} \tag{6}$$
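
A minimal PyTorch sketch of this propagation rule follows; ReLU is assumed for the activation *σ*, and the layer sizes are illustrative rather than our exact configuration.

```python
# Sketch of one GCN layer implementing Equations (5)-(6).
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.weight = nn.Linear(in_dim, out_dim, bias=False)  # W^l

    def forward(self, h: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        a_tilde = adj + torch.eye(adj.size(0))                # A~ = A + I
        d_tilde = a_tilde.sum(dim=1)                          # Eq. (6)
        d_inv_sqrt = torch.diag(d_tilde.pow(-0.5))            # D~^{-1/2}
        norm_adj = d_inv_sqrt @ a_tilde @ d_inv_sqrt
        return torch.relu(norm_adj @ self.weight(h))          # Eq. (5), sigma = ReLU
```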

#### *3.4. Visual Feature Representation*

As a branch of machine learning, deep learning has been widely used in many fields, including sentiment image classification. Previous studies have shown that CNNs can effectively extract visual features from images, such as appearance and position, and map them to the sentimental space. In this work, we utilize a CNN to represent the visual features of images. To make a fair comparison with previous works, we select the widely used VGGNet [27] as the backbone to verify the effectiveness of our method. For VGGNet, we adopt a fine-tuning strategy based on a model pre-trained on ImageNet and change the output dimension of the last fully connected layer from 4096 to 2048.
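
A sketch of this fine-tuning setup with torchvision is given below; the VGG-16 variant is an assumption, since only "VGGNet" is specified.

```python
# Sketch: load an ImageNet-pretrained VGG and replace the final fully
# connected layer so the backbone emits a 2048-d visual feature F_v.
import torch.nn as nn
from torchvision import models

vgg = models.vgg16(weights="IMAGENET1K_V1")        # pre-trained on ImageNet
in_features = vgg.classifier[-1].in_features       # 4096 for VGG-16
vgg.classifier[-1] = nn.Linear(in_features, 2048)  # 2048-d output
```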

#### *3.5. GCN-Based Classifier Learning*

In the training process, we adopt the widely used concatenation method for feature fusion. In the visual branch, we change the output of the last fully connected layer of the VGG model to 2048 dimensions to describe the visual features extracted by the deep learning model. For the other branch, we average the graph model features. In detail, Equation (7) is used to calculate the interaction feature *Fg*, where *n* is the number of nodes in the graph model and *F'i* is the feature of node *i* after graph convolution.

$$F_g = \frac{\sum_{i=1}^{n} F'_i}{n} \tag{7}$$
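
This mean pooling can be sketched in a single operation:

```python
# Sketch of Equation (7): mean-pool post-convolution node features.
import torch

def interaction_feature(node_feats: torch.Tensor) -> torch.Tensor:
    """node_feats: (n, 2048) tensor of node features after the GCN."""
    return node_feats.mean(dim=0)   # F_g, Eq. (7)
```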

After the above processing, we employ the fusion method described in Equation (8) to concatenate the visual feature *Fv* and the interaction feature *Fg*; the fused feature is fed into the fully connected layer, which realizes the mapping between features and sentimental polarity. The traditional cross-entropy function in Equation (9) is taken as the loss function, where *N* is the number of training images, *yi* is the label of image *i*, and *Pi* is the predicted probability, with 1 representing a positive sentiment and 0 a negative one.

$$F = \left[ F_v ; F_g \right] \tag{8}$$

$$L = -\frac{1}{N} \sum_{i=1}^{N} \left( y_i \log P_i + (1 - y_i) \log(1 - P_i) \right) \tag{9}$$

Specifically, *Pi* is defined in Equation (10), where *c* is the number of classes. In this work, *c* is set to 2, and *fj* is the *j*-th output of the last fully connected layer.

$$P_i = \frac{e^{f_i}}{\sum_{j=1}^{c} e^{f_j}} \tag{10}$$
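
A sketch of the fused classifier and loss is shown below; `nn.CrossEntropyLoss` combines the softmax of Equation (10) with the cross entropy of Equation (9), and the 2048 + 2048 input size reflects the two branches described above.

```python
# Sketch of Equations (8)-(10): fuse F_v and F_g, map to c = 2 logits,
# and train with cross entropy (softmax applied internally).
import torch
import torch.nn as nn

classifier = nn.Linear(2048 + 2048, 2)             # fused feature -> 2 classes
loss_fn = nn.CrossEntropyLoss()                    # Eqs. (9)-(10) for c = 2

def step(f_v: torch.Tensor, f_g: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    fused = torch.cat([f_v, f_g], dim=1)           # F = [F_v ; F_g], Eq. (8)
    logits = classifier(fused)                     # f_j of the last FC layer
    return loss_fn(logits, labels)                 # Eq. (9)
```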
