*Article* **SGA-Net: Self-Constructing Graph Attention Neural Network for Semantic Segmentation of Remote Sensing Images**

**Wenjie Zi †, Wei Xiong †, Hao Chen \*,†, Jun Li and Ning Jing**

Department of Cognitive Communication, College of Electronic Science and Technology, National University of Defense Technology, Changsha 410000, China; ziwenjie@nudt.edu.cn (W.Z.); xiongwei@nudt.edu.cn (W.X.); junli@nudt.edu.cn (J.L.); ningjing@nudt.edu.cn (N.J.)

**\*** Correspondence: hchen@nudt.edu.cn

† These authors contributed equally to this work.

**Abstract:** Semantic segmentation of remote sensing images remains a critical and challenging task. Graph neural networks, which capture global contextual representations, can exploit long-range pixel dependency and thereby improve semantic segmentation performance. In this paper, a novel self-constructing graph attention neural network is proposed for this purpose. Firstly, ResNet50 is employed as the backbone of a feature extraction network to acquire feature maps of remote sensing images. Secondly, pixel-wise dependency graphs are constructed from the feature maps, and a graph attention network is designed to extract the correlations between pixels. Thirdly, a channel linear attention mechanism captures the channel dependency of images, further improving the segmentation prediction. Lastly, we conducted comprehensive experiments and found that the proposed model consistently outperformed state-of-the-art methods on two widely used remote sensing image datasets.

**Keywords:** self-constructing graph; semantic segmentation; remote sensing

#### **1. Introduction**

Semantic segmentation of remote sensing images aims to assign a definite object category to each pixel in an image [1], which is an urgent issue in ground object interpretation [2]. It has become one of the most crucial methods for traffic monitoring [3], environmental protection [4], vehicle detection [5], and land use assessment [6]. Remote sensing images usually contain various objects, highly imbalanced ground classes, and intricate variations in color and texture, all of which make semantic segmentation challenging. Before the era of deep learning, superpixels were often used to extract features from multi-spectral images in order to display the distribution of vegetation and land cover. However, hand-crafted descriptors limit the flexibility of these indices.

The convolutional neural network (CNN) [7] is widely used for the semantic segmentation of images. To achieve a better performance, CNN-based models regularly use multi-scale and deep CNN architectures to acquire information from multi-scale receptive fields and derive local patterns as much as possible. Owing to the restriction of the convolutional kernel, CNN-based models can only capture the dependency of pixels from the limited receptive field rather than the entire image.

CNN-based models cannot model the global dependency between every pair of pixels. A graph, however, encodes the connection between any two nodes, so a graph neural network-based (GNN-based) model can capture the long-range global spatial correlation of pixels. An image in its traditional form can be converted to a graph structure [8]. In this way, the graph can model the spatial relationship between every pair of pixels, whereas a CNN can only obtain information from its limited receptive field. The adjacency matrix of GNNs can represent the global relationship of images, which can contain more information than CNN-based models. Hence, we adopted a GNN to carry out semantic segmentation.

**Citation:** Zi, W.; Xiong, W.; Chen, H.; Li, J.; Jing, N. SGA-Net: Self-Constructing Graph Attention Neural Network for Semantic Segmentation of Remote Sensing Images. *Remote Sens.* **2021**, *13*, 4201. https://doi.org/10.3390/rs13214201

Academic Editor: Filiberto Pla

Received: 5 September 2021 Accepted: 15 October 2021 Published: 20 October 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Nevertheless, GNNs have not yet shown their full strength and are seldom used for dense prediction tasks because prior knowledge of the adjacency matrix is lacking. Previous attempts [9–11] used static graphs generated manually from prior knowledge, which did not fit each image well. A graph obtained by a neural network is called a "self-constructing graph". Compared with these methods, a self-constructing graph can adjust itself and reflect the features of each remote sensing image.

Attention mechanisms [12] are added to convolutional frameworks to improve semantic segmentation performance on remote sensing images. Every true color image has RGB channels, and the RGB channels of objects have a potential correlation that can be exploited to obtain a better segmentation. The convolutional block attention module (CBAM) [13] places two kinds of non-local attention modules on top of the atrous convolutional neural network: channel attention and spatial attention. CBAM achieves a competitive segmentation performance on the corresponding dataset. The channel attention mechanism can acquire the correlation among channels, improving the performance of semantic segmentation in remote sensing images. Every pixel has several channels, and each channel has a different importance for different kinds of pixels. Our channel attention mechanism models the correlation among channels, inhibiting or enhancing the corresponding channel depending on the task.

In this paper, we propose a self-constructing graph attention neural network (SGA-Net) to implement the semantic segmentation of remote sensing images to model global dependency and meticulous spatial relationships between long-range pixels. The main contributions of this paper are as follows:


The rest of this paper is organized as follows. Related work is reviewed in Section 2. Section 3 presents the details of our SGA-Net architecture. The experiments and corresponding analyses are given in Section 4, and Section 5 presents the conclusion.

#### **2. Related Work**

#### *2.1. Semantic Segmentation*

The rise of convolutional neural networks (CNNs) marks a significant improvement in semantic segmentation. The fully convolutional network (FCN), which typically consists of an encoder–decoder module, has dominated pixel-to-pixel semantic segmentation [14]. With its encoder–decoder module, the FCN can segment images at the pixel level through deconvolutional and upsampling layers, promoting the development of semantic segmentation. Compared with the FCN, the U-Net [15] applies multi-scale strategies to extract contextual patterns and, owing to these multi-scale context patterns, derives a better prediction result. Segnet [16] proposes max-pooling indices to enhance location information, which can improve segmentation performance. Deeplab V1 [17] proposes atrous convolutions, which can enlarge the receptive field without increasing the number of parameters. Compared with Deeplab V1, Deeplab V2 [18] presents atrous spatial pyramid pooling (ASPP) modules that consist of atrous convolutions with different sampling rates. Because it uses information from receptive fields at multiple sampling rates, Deeplab V2 predicts better than Deeplab V1. The above methods are all supervised models. FESTA [19] is a semi-supervised CNN-based model that encodes and regularizes image features and spatial relations. Compared to FESTA, our proposed method is a GNN-based model that extracts long-range spatial dependency and channel correlation to perform segmentation.

There are also models that use non-grid convolutions for semantic segmentation. Deformable convolution [20] adds 2D offsets to the regular grid sampling locations of the standard convolution, which enhances the geometric transformation modeling capability of a CNN. Deformable convolution is still limited in capturing long-range structured relationships. DGMN [21] obtains long-range structured relationships by constructing a dynamic graph. Our proposed model also adopts the idea of a dynamic graph to obtain the global long-range correlation of remote sensing images. HG-CNNs [22] is a heterogeneous grid convolutional neural network that constructs a data-adaptive graph structure from the convolutional layer by microclustering and assembling features into the graph. Our proposed model also constructs a data-adaptive graph, but the graph structure is extracted by a convolutional operation from the high-level feature map.

#### *2.2. Graph Neural Network*

Recently, GNNs have become popular owing to their success in many fields, such as natural language processing [23], social networks [24], reinforcement learning [25] and computer vision [26]. Many datasets naturally have a graph structure, for example recommender systems [27], protein networks [28] and knowledge graphs [29]. More and more GNN variants are being produced and applied to various fields. In the beginning, only datasets in the form of graphs [10,30] were fed into graph neural networks. However, data in neatly arranged matrix form, such as remote sensing images, can also be extracted and transformed into graph structures [8]. GNNs come in different kinds: graph convolutional networks, graph auto-encoders, graph attention networks (GATs) and graph isomorphism networks [31]. The GAT [32] and the GCN are crucial branches of the GNN. Gao et al. [33] performed action recognition by using structured prior knowledge in the form of knowledge graphs. Yan et al. [34] completed skeleton-based action recognition with spatial-temporal graph convolutional networks (STGCNs) that automatically learn spatial and temporal patterns. Wang et al. [35] proposed a graph-based, language-guided attention mechanism that can clearly reveal inter-object properties and relationships with flexibility. A GNN-based model (ASTGCN) [36] is used to predict traffic flow. Liu et al. [8] adopted a GCN to conduct semantic segmentation experiments on remote sensing images, with the GCN adjacency matrix built by neural networks. A GCN can simultaneously perform end-to-end learning of node feature information and structure information. In comparison, a GAT computes a weighted summation of neighboring node features using an attention mechanism. The weights of neighboring node features depend entirely on the node features and are independent of the graph structure. Graph-SAGE [37] solves the memory explosion problem of the GCN and GAT through neighbor sampling on large-scale graphs. GNN-based models are thus used in a variety of applications.

#### *2.3. Attention Mechanisms*

With the publication of the paper in [12], attention mechanisms became increasingly popular. Fu et al. [38] propose a dual attention network (DANet) that can adaptively learn local and global dependency to conduct semantic segmentation. Huang et al. [39] propose channelized axial attention (CAA) to integrate channel and axial attention seamlessly. CAA resembles DANet in using a double-attention mechanism, and both models achieve competitive results on the corresponding datasets: CAA attends to channel and axial attention, whereas DANet focuses on local and global attention. In contrast to these multi-attention mechanisms, Tao et al. [40] propose a multi-scale attention mechanism that improves the accuracy of semantic segmentation. The Transformer [12], which is entirely based on the multi-head self-attention mechanism, was introduced for natural language processing. Dosovitskiy et al. [41] apply a transformer to image classification, achieving excellent prediction results on many small- and medium-sized image recognition benchmarks.

#### **3. Methods**

In this section, we introduce the details of SGA-Net. An overview of the framework is presented in Figure 1; it consists of a feature map extraction network, a self-constructing graph attention network and a channel linear attention mechanism. The four SGA-Nets share weights. First, ResNet50 was employed as the backbone of the feature extraction network to acquire feature maps of remote sensing images, denoted as *X*. Second, to ensure geometric consistency, the feature maps were rotated by 90, 180 and 270 degrees; *X*<sub>90</sub>, *X*<sub>180</sub> and *X*<sub>270</sub> denote these multi-view feature maps, where the index is the rotation angle. Third, the multi-view feature maps were each passed through a convolutional neural network to obtain the self-constructing graphs *A*<sub>0</sub>, *A*<sub>1</sub>, *A*<sub>2</sub> and *A*<sub>3</sub>. Fourth, these self-constructing graphs were fed into a neural network based on a GAT to extract the long-range dependency of pixels; this network is called the self-constructing graph attention network. Fifth, its outputs were used as inputs to the channel linear attention, the outputs of which were added to predict the final results. The adjacency matrix *A* is a high-level feature map of the corresponding remote sensing image feature map, and the remote sensing feature maps projected to a specific dimension are defined as nodes. Therefore, the feature maps *X* are defined as the features of nodes, and *A<sub>ij</sub>* indicates the weight of the edge between node *i* and node *j*. We focus on the SGA-Net below.
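The multi-view step described above can be illustrated with a minimal NumPy sketch. It assumes a feature map laid out as (height, width, channels); the function names `multi_view` and `undo_rotation` are hypothetical, and rotating each view back before the branch outputs are added is an assumption of this sketch, since the text does not state how the views are aligned.

```python
import numpy as np

def multi_view(x):
    """Return the feature map and its 90/180/270-degree rotations.

    x has shape (H, W, C); only the two spatial axes are rotated.
    """
    return [np.rot90(x, k=k, axes=(0, 1)) for k in range(4)]

def undo_rotation(view, k):
    """Rotate a view back by k quarter turns (assumed alignment step)."""
    return np.rot90(view, k=-k, axes=(0, 1))

# Toy feature map: H=2, W=3, C=4.
x = np.arange(2 * 3 * 4, dtype=float).reshape(2, 3, 4)
views = multi_view(x)
restored = [undo_rotation(v, k) for k, v in enumerate(views)]
```

A 90-degree rotation swaps the spatial dimensions, so non-square feature maps yield views of shape (W, H, C) for the 90- and 270-degree cases.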

#### *3.1. Self-Constructing Graph Attention Network*

The self-constructing graph is an undirected graph that shows the spatial similarity relationship of feature maps in remote sensing images. The self-constructing graph is extracted by a neural network instead of being derived from prior knowledge. Every image is unique; thus, models based on a self-constructing graph can fit each remote sensing image very well.

The input image is denoted as *I*, where *I* ∈ ℝ<sup>*C*×*H*×*W*</sup>; *H* and *W* are the height and width of the image, respectively, and *C* denotes the number of channels. The high-level feature maps are denoted as *X*, where *X* ∈ ℝ<sup>*H*′×*W*′×*C*′</sup>, and *H*′, *W*′ and *C*′ indicate the height, width and number of channels, respectively. Next, we applied a convolutional neural network and a dropout layer to extract the latent embedding space *S* of every remote sensing image, where *S* ∈ ℝ<sup>*N*×*E*</sup>, *N* = *H*′ × *W*′, and *E* is the number of classes.

Figure 2 shows the latent embedding space *S* of buildings, cars, roads, trees and grass, respectively. In the *S* of buildings, the buildings are brighter than other objects: the higher the gray value, the greater the spatial similarity. In general, features of the same kind have the greatest spatial similarity. The adjacency matrix was defined as *A* = ReLU(matmul(*S*, *S*<sup>T</sup>)), which highlights and enhances the differences between the target class and other categories. Since it arises not from prior knowledge but directly from the output of a neural network, the adjacency matrix is called the "self-constructing adjacency matrix"; it captures the distributions of the features in remote sensing images. Our model followed the convention of the variational auto-encoder [42] to learn the mean matrix *M* and the standard deviation matrix *D*, where *M* ∈ ℝ<sup>*N*×*E*</sup>, *D* ∈ ℝ<sup>*N*×*E*</sup>, and *E* denotes the number of classes. The mean matrix *M* and the logarithm of the standard deviation matrix *D* are computed as follows:

$$\begin{aligned} M' &= \text{Flatten}\left(\text{Conv}_{3 \times 3,\, \text{padding}=1}(X)\right) \\ M &= \text{Dropout}(p = 0.2)(M') \end{aligned} \tag{1}$$

$$\begin{aligned} D' &= \text{Flatten}\left(\text{Conv}_{1 \times 1}(X)\right) \\ \log(D) &= \text{Dropout}(p = 0.2)(D') \end{aligned} \tag{2}$$

**Figure 1.** Flow chart of our model for semantic segmentation. ResNet50 was selected as the feature map extraction network; Conv<sub>3×3</sub> means the convolution operation with kernel size 3; SGA-Net denotes the self-constructing graph attention network and channel linear attention mechanism; GAT is the graph attention network, and *Q*, *K*, *V* of the channel linear attention mechanism indicate query, key and value, respectively. *X* denotes the feature input; *X*<sub>90</sub>, *X*<sub>180</sub> and *X*<sub>270</sub> indicate the multi-view feature maps, where the index is the rotation degree; and *A*<sub>0</sub>, *A*<sub>1</sub>, *A*<sub>2</sub> and *A*<sub>3</sub> are the adjacency matrices of the self-constructing graphs of the corresponding feature maps. *h<sub>i</sub>* is the initial feature vector of each node, where *i* ∈ [1, 3]; *α* represents the correlation coefficient; Concat denotes a concatenating operation; *P* indicates the number of channels, and *h*′<sub>*i*</sub> indicates the output of the self-constructing graph attention neural network.

The latent embedding space is *S* = *M* + log(*D*) · *α*, where *α* ∈ ℝ<sup>*N*×*E*</sup> is an auxiliary noise variable that obeys the standard normal distribution (*α* ∼ 𝒩<sub>*N*×*E*</sub>(**0**, **I**)). The adjacency matrix *A* was generated by an inner product between the latent embedding *S* and its transpose *S*<sup>T</sup>, where *A* ∈ ℝ<sup>*N*×*N*</sup> and *A<sub>ij</sub>* denotes the spatial similarity between nodes *i* and *j*.

$$A = \text{ReLU}(\text{matmul}(S, S^{\top})) \tag{3}$$

*A* can therefore indicate the spatial similarity between every pair of nodes of the latent embedding space *S*. By contrast, the receptive field of a CNN is restricted by the kernel size, so a CNN cannot represent the spatial similarity between every pair of nodes. *A* in our model is not a traditional binary adjacency matrix but a weighted, undirected one.
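As a rough illustration of Equation (3), the following NumPy sketch samples a latent embedding and builds the adjacency matrix from it. It follows the expression for *S* exactly as written in the text (noise scaled by log(*D*)); the random stand-ins for the two convolutional branches, the toy sizes and the function name `self_constructing_adjacency` are assumptions made only for this example.

```python
import numpy as np

def self_constructing_adjacency(M, log_D, rng):
    """Sample S = M + log(D) * alpha and return (S, A) with A = ReLU(S S^T)."""
    alpha = rng.standard_normal(M.shape)  # auxiliary noise, alpha ~ N(0, I)
    S = M + log_D * alpha                 # latent embedding, shape (N, E)
    A = np.maximum(S @ S.T, 0.0)          # ReLU of pairwise inner products, (N, N)
    return S, A

rng = np.random.default_rng(0)
N, E = 6, 3                                # toy sizes: 6 nodes, 3 latent dimensions
M = rng.standard_normal((N, E))            # stand-in for the mean branch output
log_D = 0.1 * rng.standard_normal((N, E))  # stand-in for the log-std branch output
S, A = self_constructing_adjacency(M, log_D, rng)
```

Because *A* is built from *S* *S*<sup>T</sup>, it is symmetric by construction, and its diagonal holds the squared norm of each node embedding, which is why the graph is weighted and undirected.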

The calculation of the SGA-Net follows the same steps as other attention mechanisms: the first step computes the attention coefficients, and the last aggregates the weighted sum of features [12]. For node *i*, the similarity coefficient between itself and each of its neighbour nodes *j* was calculated, where *i* ∈ 𝒩 and *j* ∈ 𝒩. The similarity coefficient is defined as follows:

$$e_{ij} = \mathbf{a}\left(\left[U h_i \,\Vert\, U h_j\right]\right) \tag{4}$$

where *U* is the learnable weight matrix; *h<sub>i</sub>* indicates the feature of node *i*, with *h* = (*h*<sub>1</sub>, *h*<sub>2</sub>, ···, *h<sub>N</sub>*) ∈ ℝ<sup>*N*×*F*</sup>, where *F* denotes the number of features in each node and *h* = *X*; **a** indicates the self-attention operation, which is an inner product; and the self-constructing adjacency matrix *A* is used as a mask. Thus, the coefficients *e<sub>ij</sub>* form a matrix in ℝ<sup>*N*×*N*</sup>. Next, we computed the attention coefficient *α<sub>ij</sub>* as follows:

$$\alpha_{ij} = \frac{\exp\left(\text{LeakyReLU}\left(e_{ij}\right)\right)}{\sum_{k \in \mathcal{N}} \exp\left(\text{LeakyReLU}\left(e_{ik}\right)\right)} \tag{5}$$

We applied an 8-head graph attention network to enhance the predictive capability of the model and to stabilize training, thereby improving the framework performance.

$$h'_i = \Big\Vert_{k=1}^{L} \sigma\left(\sum_{j \in \mathcal{N}_i} \alpha^{k}_{ij} U^{k} h_j\right) \tag{6}$$

where ‖ indicates the concatenation operation; *L* is the number of attention heads; *σ* is the sigmoid activation function; 𝒩<sub>*i*</sub> indicates the neighbourhood nodes of node *i* in the graph; *α*<sup>*k*</sup><sub>*ij*</sub> is the normalized attention coefficient computed by the *k*th attention mechanism **a**<sup>*k*</sup>; and *U*<sup>*k*</sup> is the corresponding *k*th input weight matrix. Specifically, *L* = 8, i.e., we use an 8-head graph attention network in this work.
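The attention steps above (similarity score, masked softmax, weighted aggregation, multi-head concatenation) can be sketched in NumPy as follows. This is an illustrative reading, not the authors' implementation: it assumes the score is the inner product of the projected features, that an edge exists wherever *A<sub>ij</sub>* > 0, that self-loops are always kept so that every row has at least one neighbour, and a LeakyReLU slope of 0.2. The names `gat_head` and `multi_head_gat` are hypothetical.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def gat_head(h, A, U):
    """One attention head. h: (N, F) node features, A: (N, N), U: (F, F_out)."""
    z = h @ U                                        # projected features U h_i
    mask = (A > 0) | np.eye(len(A), dtype=bool)      # edges from A plus self-loops (assumption)
    e = leaky_relu(z @ z.T)                          # inner-product similarity e_ij
    e = np.where(mask, e, -np.inf)                   # non-edges receive zero attention
    e = e - e.max(axis=1, keepdims=True)             # stabilise the softmax
    w = np.exp(e)
    alpha = w / w.sum(axis=1, keepdims=True)         # each row sums to 1
    out = 1.0 / (1.0 + np.exp(-(alpha @ z)))         # sigmoid of the weighted sum
    return out, alpha

def multi_head_gat(h, A, Us):
    """Concatenate the outputs of all heads along the feature axis."""
    return np.concatenate([gat_head(h, A, U)[0] for U in Us], axis=1)

rng = np.random.default_rng(0)
N, F, F_out, L = 5, 4, 2, 8                          # toy sizes, 8 heads as in the text
h = rng.standard_normal((N, F))
S = rng.standard_normal((N, 3))
A = np.maximum(S @ S.T, 0.0)                         # toy self-constructing adjacency
Us = [rng.standard_normal((F, F_out)) for _ in range(L)]
h_out = multi_head_gat(h, A, Us)                     # shape (N, L * F_out)
```

With eight heads of two output features each, every node ends up with a 16-dimensional representation.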

**Figure 2.** Latent embedding space of buildings, cars, roads, trees and low-vegetation, shown separately for each category.

#### *3.2. Channel Linear Attention*

Each channel of the high level features could be regarded as the special response of a category, and different responses have intrinsic independencies. The channels of each category had their own distinctive feature and correlations. Exploiting the inter-correlations among channels of images can improve the performance of specific semantic features. Therefore, we adopted a channel attention module to explore correlations among channels.

Suppose the query matrix is *Q*, the key matrix is *K* and the value matrix is *V*, with *Q*, *K*, *V* ∈ ℝ<sup>*K*×*P*</sup>, where *P* = *H* × *W*; these are learnable parameters. In addition, suppose the output of SGA-Net is *Ĥ*, where *Ĥ* ∈ ℝ<sup>*K*×*P*</sup>. The channel linear attention is defined as follows:

$$D(Q, K, V) = \hat{H} + \frac{V + \left(\frac{Q}{\|Q\|_2}\right)\left(\left(\frac{K}{\|K\|_2}\right)^T V\right)}{N + \left(\frac{Q}{\|Q\|_2}\right)\left(\frac{K}{\|K\|_2}\right)^T} \tag{7}$$

where *N* denotes the number of nodes and *D*(*Q*, *K*, *V*) ∈ ℝ<sup>*K*×*P*</sup>. The equation highlights the input of a GAT, and emphasizes the importance of *K*, *Q* and *V* at the same time. The channel linear attention can model the importance of different channels for different tasks.

#### *3.3. Loss Function*

There is no doubt that *A<sub>ii</sub>* ought to be greater than 0 and close to 1; hence, we introduced a diagonal log regularization term to improve the prediction, defined as:

$$\gamma = \sqrt{1 + \frac{n}{\sum_{i=1}^{n} A_{ii} + \epsilon}} \tag{8}$$

$$\mathcal{L}_{dl} = -\frac{\gamma}{n^2} \sum_{i=1}^n \log \left( |A_{ii}|_{[0,1]} + \epsilon \right) \tag{9}$$

where the subscript [0, 1] indicates that *A<sub>ii</sub>* is clamped to [0, 1], and *ε* is a small fixed positive parameter (*ε* = 10<sup>−5</sup>). We adopted the Kullback–Leibler divergence, which measures the difference between the distribution of latent variables and the unit Gaussian distribution [42], as part of the loss function. The Kullback–Leibler divergence is defined as follows:

$$\mathcal{L}_{kl} = -\frac{1}{2NK} \sum_{i=1}^{N} \sum_{j=1}^{K} \left( 1 + \log \left( D_{ij} \right)^2 - M_{ij}^2 - \left( D_{ij} \right)^2 \right) \tag{10}$$

where *D* is the standard deviation matrix. In addition, we adopted an adaptive multi-class weighting (ACW) loss function [26] to address the highly imbalanced distribution of the classes. ℒ<sub>*acw*</sub> is defined as follows:

$$\mathcal{L}_{acw} = \frac{1}{|Y|} \sum_{i \in Y} \sum_{j \in C} \tilde{w}_{ij} \cdot p_{ij} - \log \left( \text{MEAN} \{ d_j \mid j \in C \} \right) \tag{11}$$

where *Y* includes all the labeled pixels and *d<sub>j</sub>* denotes the dice coefficient:

$$d_j = \frac{2\sum_{i \in Y} y_{ij}\tilde{y}_{ij}}{\sum_{i \in Y} y_{ij} + \sum_{i \in Y} \tilde{y}_{ij}} \tag{12}$$

where *y<sub>ij</sub>* and *ỹ<sub>ij</sub>* denote the ground truth and the prediction of pixel *i* for class *j*, respectively. *p<sub>ij</sub>* is the positive and negative balance factor of pixel *i* and class *j*, defined as follows:

$$p = (y - \tilde{y})^2 - \log\left(\frac{1 - (y - \tilde{y})^2}{1 + (y - \tilde{y})^2}\right) \tag{13}$$

*w̃<sub>ij</sub>* is a weight based on the frequency of all categories, defined as follows:

$$\tilde{w}_{ij} = \frac{w_j^t}{\sum_{j \in C} w_j^t} \cdot \left(1 + y_{ij} + \tilde{y}_{ij}\right) \tag{14}$$

$$w_j^t = \frac{\text{MEDIAN}\left(\left\{f_j^t \mid j \in C\right\}\right)}{f_j^t + \epsilon} \tag{15}$$

$$f_j^t = \frac{\hat{f}_j^t + (t - 1) \cdot f_j^{t-1}}{t} \tag{16}$$

where *ε* is a fixed parameter with *ε* = 10<sup>−5</sup>; *C* indicates the set of classes; *t* is the iteration number; *f̂*<sup>*t*</sup><sub>*j*</sub> represents the pixel frequency of class *j* at the *t*th training step, which can be computed as SUM(*y<sub>j</sub>*) / ∑<sub>*j*∈*C*</sub> SUM(*y<sub>j</sub>*); and when *t* = 0, *f*<sup>*t*</sup><sub>*j*</sub> = 0.
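Equations (11) to (16) can be traced step by step with a small NumPy sketch. It assumes one-hot ground truth and per-class probabilities stored as (pixels, classes) arrays, with predictions kept strictly between 0 and 1 so that the logarithm in Equation (13) stays finite; the function name `acw_loss` and the toy data are illustrative and are not taken from the authors' code.

```python
import numpy as np

def acw_loss(y, y_pred, f_prev, t, eps=1e-5):
    """y: one-hot labels (n, C); y_pred: probabilities (n, C); f_prev: running frequency (C,)."""
    counts = y.sum(axis=0)
    f_hat = counts / counts.sum()                     # class frequency at this step
    f = (f_hat + (t - 1) * f_prev) / t                # Equation (16), running average
    w = np.median(f) / (f + eps)                      # Equation (15)
    w_tilde = w / w.sum() * (1.0 + y + y_pred)        # Equation (14), shape (n, C)
    sq = (y - y_pred) ** 2
    p = sq - np.log((1.0 - sq) / (1.0 + sq))          # Equation (13)
    dice = 2.0 * (y * y_pred).sum(axis=0) / (y.sum(axis=0) + y_pred.sum(axis=0))  # Equation (12)
    loss = (w_tilde * p).sum() / y.shape[0] - np.log(dice.mean())  # Equation (11)
    return loss, f

labels = np.array([0, 1, 2, 0])                       # four toy pixels, three classes
y = np.eye(3)[labels]
good = np.where(y == 1, 0.9, 0.05)                    # confident, mostly correct predictions
bad = np.where(y == 1, 0.1, 0.45)                     # mostly wrong predictions
f0 = np.zeros(3)                                      # f_j^0 = 0
loss_good, f_run = acw_loss(y, good, f0, t=1)
loss_bad, _ = acw_loss(y, bad, f0, t=1)
```

The returned running frequency `f_run` would be passed back as `f_prev` at the next iteration, so rare classes keep a larger weight throughout training.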

To refine the final prediction, we adopted the sum of three loss functions, ℒ<sub>*kl*</sub>, ℒ<sub>*dl*</sub> and ℒ<sub>*acw*</sub>, as the final loss function of our framework:

$$\text{Loss} = \mathcal{L}_{kl} + \mathcal{L}_{dl} + \mathcal{L}_{acw} \tag{17}$$
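The two graph-related terms of the total loss, Equations (9) and (10), are short enough to sketch directly in NumPy. Two assumptions are made here: the term log(*D<sub>ij</sub>*)² in Equation (10) is read in the usual variational auto-encoder sense as log(*D<sub>ij</sub>*²) = 2 log *D<sub>ij</sub>*, and the network is taken to output log(*D*) as in Equation (2). The function names are hypothetical.

```python
import numpy as np

def diagonal_log_loss(A, eps=1e-5):
    """Diagonal log regularization, Equations (8) and (9)."""
    n = A.shape[0]
    diag = np.diag(A)
    gamma = np.sqrt(1.0 + n / (diag.sum() + eps))      # Equation (8)
    clamped = np.clip(diag, 0.0, 1.0)                  # A_ii clamped to [0, 1]
    return -gamma / n**2 * np.log(clamped + eps).sum() # Equation (9)

def kl_loss(M, log_D):
    """KL divergence to the unit Gaussian, Equation (10), with log(D^2) = 2 log D."""
    N, K = M.shape
    D = np.exp(log_D)
    return -0.5 / (N * K) * np.sum(1.0 + 2.0 * log_D - M**2 - D**2)

def graph_regularization(M, log_D, A):
    """Sum of the two graph-related terms of Equation (17)."""
    return kl_loss(M, log_D) + diagonal_log_loss(A)

reg_unit = graph_regularization(np.zeros((4, 3)), np.zeros((4, 3)), np.eye(4))
reg_weak = graph_regularization(np.ones((4, 3)), np.zeros((4, 3)), 0.1 * np.eye(4))
```

Under these assumptions both terms are close to zero when the latent variables match a unit Gaussian and every *A<sub>ii</sub>* equals 1, and they grow as the diagonal shrinks or the mean drifts away from zero.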

#### **4. Experiments**

*4.1. Datasets*

We used the two public benchmark datasets of the ISPRS 2D semantic labeling contest. The ISPRS datasets consist of aerial images of two German cities, Potsdam and Vaihingen, and are labeled with six common land cover classes: impervious surfaces, buildings, low vegetation, trees, cars and clutter.


#### *4.2. Evaluation Metrics*

To acquire reasonable and impartial results, we adopted the mean Intersection over Union (mIoU), the F1 score (F1) and accuracy (Acc) to evaluate performance, all of which are widely applied in semantic segmentation. Based on the accumulated confusion matrix, these evaluation indicators were computed as:

$$\text{mIoU} = \frac{1}{N} \sum_{k=1}^{N} \frac{TP_k}{TP_k + FP_k + FN_k} \tag{18}$$

$$\text{F1} = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \tag{19}$$

$$\text{Acc} = \frac{\sum_{k=1}^{N} \left(TP_k + TN_k\right)}{\sum_{k=1}^{N} \left(TP_k + FP_k + TN_k + FN_k\right)} \tag{20}$$

where *TP<sub>k</sub>*, *FP<sub>k</sub>*, *TN<sub>k</sub>* and *FN<sub>k</sub>* are the numbers of true positives, false positives, true negatives and false negatives for class *k*, respectively; *k* is the index of the object class and *N* is the number of classes. Acc was computed for all categories except for clutter.
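As a worked example of Equations (18) to (20), the sketch below derives the three metrics from a confusion matrix with NumPy. It assumes rows index the true class and columns the predicted class, and it averages the per-class F1 scores to obtain a mean F1; the function name `segmentation_metrics` and the 2 × 2 matrix are illustrative only.

```python
import numpy as np

def segmentation_metrics(conf):
    """conf[i, j] = number of pixels of true class i predicted as class j."""
    conf = np.asarray(conf, dtype=float)
    tp = np.diag(conf)
    fp = conf.sum(axis=0) - tp                 # predicted as k but belonging elsewhere
    fn = conf.sum(axis=1) - tp                 # belonging to k but predicted elsewhere
    tn = conf.sum() - tp - fp - fn
    iou = tp / (tp + fp + fn)                  # per-class term of Equation (18)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)     # Equation (19), per class
    acc = (tp + tn).sum() / (tp + fp + tn + fn).sum()      # Equation (20)
    return iou.mean(), f1.mean(), acc

conf = [[5, 1],
        [2, 4]]
miou, mean_f1, acc = segmentation_metrics(conf)
```

For this toy matrix the per-class IoU values are 5/8 and 4/7, the per-class F1 scores are 10/13 and 8/11, and Acc is 0.75.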

#### *4.3. Experimental Setting*

We implemented the proposed SGA-Net as well as all baselines in PyTorch on a Linux cluster. Models were trained on a single Nvidia GeForce RTX 3090 with a batch size of 5. We used Adam with AMSGrad [43] as the optimizer, with a weight decay of 2 × 10<sup>−5</sup>. The weight decay was applied to all learnable parameters except batch-norm and bias parameters. A polynomial learning rate (LR) decay of (1 − *cur_iter*/*max_iter*)<sup>0.9</sup> was used with a maximum of 10<sup>8</sup> iterations, i.e., the decay power was set to 0.9. The learning rate of the bias parameters was 2 × LR. The initial learning rate was set to 1.5 × 10<sup>−4</sup>/√3. We sampled patches of size 512 × 512 as input, and set the node size of the graph to 1024 × 1024.
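The polynomial schedule can be written out in a few lines of plain Python. The sketch reads the initial learning rate as 1.5 × 10⁻⁴ divided by √3, which is an interpretation of the setting above rather than a confirmed value, and the function name `poly_lr` is hypothetical.

```python
import math

def poly_lr(base_lr, cur_iter, max_iter, power=0.9):
    """Polynomial decay: base_lr * (1 - cur_iter / max_iter) ** power."""
    return base_lr * (1.0 - cur_iter / max_iter) ** power

BASE_LR = 1.5e-4 / math.sqrt(3)   # assumed reading of the initial learning rate
MAX_ITER = 10**8

lr_start = poly_lr(BASE_LR, 0, MAX_ITER)
lr_mid = poly_lr(BASE_LR, MAX_ITER // 2, MAX_ITER)
lr_end = poly_lr(BASE_LR, MAX_ITER, MAX_ITER)
bias_lr_start = 2 * lr_start      # bias parameters use twice the global LR
```

With such a large maximum iteration count, the rate decays very slowly over any realistic training run, so the schedule stays close to the initial value for most of training.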

#### *4.4. Baselines and Comparison*

Our model was compared with several works as follows:


#### 4.4.1. Prediction on Potsdam Dataset

We compared our model with five baselines on the Potsdam dataset. Table 1 presents the evaluation metrics of the semantic segmentation predictions and shows that the proposed SGA-Net outperformed the other models.

The SGA-Net was 3.4% higher than the MSCG-Net in mean F1 score, because the self-constructing graph attention network can acquire the long-range global spatial dependency of images and the channel linear attention can obtain the correlation among all channels. In addition, the proposed framework outperformed the other models, which shows that the self-constructing graph can extract the spatial dependency of images well. In fact, we applied a self-constructing graph, obtained by a neural network rather than prior knowledge, to a GAT. Our model performed better than DANet in all categories, indicating that a self-constructing graph attention neural network can mine the global long-range spatial correlation of nodes for the channel linear attention. Moreover, the multi-views of feature maps can ensure the geometric consistency of spatial patterns. The 3% improvement in average F1 score and the 2.6% improvement in mIoU of SGA-Net over Deeplab V3 arose because the self-constructing graph neural network obtained the spatial similarity between every pair of nodes, and the channel linear attention mechanism captured the correlation among the channel outputs of the graph neural network. The GAT modeled the dependencies between every pair of nodes, thereby increasing the information entropy about spatial correlation. The channel linear attention mechanism enhanced or inhibited the corresponding channel in different tasks. Furthermore, multi-views provide more information about the initial images, which supports prediction on remote sensing images.


**Table 1.** The experimental results on the Potsdam dataset (bold: best; underline: runner-up).

Figure 3 shows the ground truth and the predictions of all methods on tile5\_15, and that the SGA-Net outperformed all baselines on the Potsdam dataset. The figure shows the overall prediction capability of our method on remote sensing images. For example, our model predicted surfaces better than MSCG-Net, and it outperformed all baselines in predicting buildings. These observations illustrate that our framework models regularly shaped ground objects well. Figure 4 shows the prediction details of all baselines and the SGA-Net. The black boxes highlight the differences among the ground truth, the baselines and the SGA-Net. The first row shows that the proposed framework predicted buildings much better than the other models, demonstrating that the SGA-Net can model the global spatial dependency and channel correlation of remote sensing images.

The second row shows that the SGA-Net outperformed all baselines in predicting trees and buildings, which indicates that the SGA-Net can extract channel correlation in images well. The third row shows that the SGA-Net surpassed the other frameworks in predicting surfaces and low-vegetation. In addition, the last row shows that our model was superior to the other models in predicting trees and low-vegetation. These observations illustrate that the self-constructing graph attention network can capture the long-range global spatial dependency of images, and that the channel linear attention mechanism can acquire the correlation among channels. In addition, multi-view feature maps can ensure geometric consistency, improving semantic segmentation performance on remote sensing images.

In conclusion, Figure 4 shows that the SGA-Net predicted buildings, trees, low-vegetation, cars and surfaces better in detail, demonstrating that the SGA-Net has powerful prediction capability in the semantic segmentation of remote sensing images.

#### 4.4.2. Prediction on Vaihingen Dataset

We compared our framework with the same five baselines on the Vaihingen dataset; Table 2 presents the evaluation metrics of all models. The results show that the mean F1 score of the SGA-Net was higher than that of the other methods, indicating its powerful prediction ability on remote sensing images.

To be specific, the F1 scores of our model for road surfaces, buildings and cars exceeded those of all baselines, and its accuracy was higher than that of the other models. Because the SGA-Net contains a self-constructing graph attention neural network and a channel linear attention mechanism, the framework can model the spatial dependency and channel correlation of remote sensing images. Furthermore, because the self-constructing graph attention neural network can obtain the long-range global spatial correlation of regularly shaped ground objects, the predictions of buildings and cars from the SGA-Net surpassed all baselines. The weaker performance on low-vegetation and trees arises because these two kinds of ground objects are surrounded by many others, leading to poor extraction of spatial dependency by the self-constructing graph. Because tree colors are similar to those of low-vegetation and the SGA-Net captures long-range dependencies, the segmentation performance for trees is slightly worse than that of some other methods. The distribution of low-vegetation is more scattered than that of other objects, and the proposed model cannot extract the very complex spatial relationship of low-vegetation, leading to a poorer performance than DDCM in semantic segmentation.

**Table 2.** The experimental results on the Vaihingen dataset (bold: best; underlined: runner-up).


**Figure 3.** Visualization of tile5\_15 in the Potsdam dataset.

**Figure 4.** Visualization of prediction detail in the Potsdam dataset.

In addition, Figure 5 shows that the proposed model had a good overall prediction performance. In particular, the figure clearly indicates that the predictions of the SGA-Net for buildings and cars surpassed those of all other models, showing that multi-view feature maps can enhance prediction capability and that a self-constructing graph can mine the long-range spatial dependency within each image. Additionally, Figure 6 shows details of the prediction results on the Vaihingen dataset. Because the self-constructing graph attention network can capture the spatial dependency between every pair of nodes, the top three rows of Figure 6 indicate that the SGA-Net predicted buildings better than all baselines, and the last row shows that our model predicted trees much better than the other frameworks.

**Figure 5.** Visualization of tile35 in the Vaihingen dataset.

#### *4.5. Ablation Studies*

We conducted extensive ablation experiments to verify the effectiveness of the self-constructing graph neural network and the channel linear attention mechanism in the proposed framework (SGA-Net). Following the main experiments as closely as possible, ResNet50 was selected as the baseline and as the feature extraction layers in our framework. To examine the effectiveness of each model component further, we compared the SGA-Net with ResNet50 and with two variants, SGA-Net-ncl and SGA-Net-one.


As can be seen from Table 3, the SGA-Net-ncl clearly outperformed the ResNet50 baseline, showing that a self-constructing graph can effectively model the long-range global spatial correlation of images and achieve a competitive result. The SGA-Net outperformed both ResNet50 and SGA-Net-ncl on the two datasets, which shows that channel linear attention can capture the correlation among the channel outputs of the graph neural network and further improve the performance of the proposed model. The SGA-Net also surpassed SGA-Net-one, showing that rotating the images helps to preserve geometric consistency, which improves prediction performance.
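The role of rotation can be illustrated with a small sketch. Here the multi-view idea is applied at the prediction level: the input is rotated by multiples of 90°, each view is scored, the scores are rotated back, and the results are averaged. This is an assumption made for illustration only, since the SGA-Net fuses multi-view feature maps inside the network; `predict_fn` is a hypothetical stand-in for any per-pixel scoring model.

```python
import numpy as np

def multi_view_average(image, predict_fn, ks=(0, 1, 2, 3)):
    """Average per-pixel class scores over rotated views of one image.

    image: (H, W, C) array.
    predict_fn: hypothetical callable mapping (H, W, C) -> (H, W, K) scores.
    ks: numbers of 90-degree rotations to use as views.
    """
    outputs = []
    for k in ks:
        rotated = np.rot90(image, k=k, axes=(0, 1))            # rotate the view
        scores = predict_fn(rotated)                           # score the view
        outputs.append(np.rot90(scores, k=-k, axes=(0, 1)))    # undo the rotation
    return np.mean(outputs, axis=0)
```

If the scoring model were perfectly rotation-equivariant, all views would agree and averaging would change nothing; in practice the views differ, and combining them reduces orientation-specific errors.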

**Figure 6.** Visualization of prediction detail in the Vaihingen dataset.


**Table 3.** The ablation study about SGA-Net.

Figures 7 and 8 show that the SGA-Net-ncl surpassed ResNet50 and that the SGA-Net outperformed the other models in the ablation study on the two real-world datasets. Owing to the long-range global spatial dependency extracted by the self-constructing graph attention network, the SGA-Net-ncl produced better predictions than ResNet50. Moreover, channel linear attention captured the correlation among the channel outputs of the graph neural network, which is why the SGA-Net was superior to the SGA-Net-ncl in semantic segmentation.
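As an illustration of how attention over channels can be computed without a softmax, the sketch below weights each channel by one plus its cosine similarity to every other channel and adds the result back through a residual connection. The weighting form, the `gamma` scale and the function name are assumptions chosen for this example; they are not taken from the SGA-Net implementation.

```python
import numpy as np

def channel_attention(fmap, gamma=1.0, eps=1e-12):
    """Softmax-free channel attention (illustrative form).

    fmap: (C, H, W) feature map.
    Returns the re-weighted map of the same shape and the (C, C) weights.
    """
    c, h, w = fmap.shape
    x = fmap.reshape(c, h * w)                          # one row per channel
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    x_hat = x / np.maximum(norms, eps)                  # unit-length channels
    # 1 + cosine similarity lies in [0, 2], so the weights are non-negative.
    weights = np.maximum(1.0 + x_hat @ x_hat.T, 0.0)
    weights = weights / weights.sum(axis=1, keepdims=True)
    out = gamma * (weights @ x) + x                     # residual connection
    return out.reshape(c, h, w), weights
```

Setting `gamma` to zero recovers the input unchanged, so a block of this kind can only add information to the features it receives, which is consistent with the SGA-Net not falling below the SGA-Net-ncl in the ablation.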

Figure 9 shows that the target pixel is highly similar to other pixels of the same object. On the right of Figure 9, the target object is a building, and the building region is colored red, meaning that the target pixel has a strong similarity with the pixels of the building region. On the left of Figure 9, the target objects are low-vegetation and road, and all cars are colored blue, indicating a low similarity. This figure shows that our attention mechanism works as intended.
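A map of this kind can be approximated from any feature map by comparing the feature vector of the target pixel with that of every other pixel. The sketch below uses cosine similarity as a stand-in for the learned attention weights, which is an assumption; the function name and arguments are hypothetical.

```python
import numpy as np

def similarity_map(fmap, row, col, eps=1e-12):
    """Cosine similarity between one target pixel and all pixels.

    fmap: (C, H, W) feature map; (row, col) is the target pixel.
    Returns an (H, W) array with values in [-1, 1].
    """
    c, h, w = fmap.shape
    x = fmap.reshape(c, h * w)                          # one column per pixel
    norms = np.linalg.norm(x, axis=0, keepdims=True)
    x_hat = x / np.maximum(norms, eps)                  # unit-length pixel features
    target = x_hat[:, row * w + col]                    # feature of the target pixel
    return (target @ x_hat).reshape(h, w)
```

Rendering the returned array with a diverging colormap (red for high values, blue for low values) gives a picture of the same type as Figure 9.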

**Figure 7.** Visualization in the ablation study of Potsdam dataset.

**Figure 9.** Visualization of the attention mechanism. The black dot is the target pixel or object. Red indicates that a pixel is very similar to the target pixel, and blue indicates that a pixel is strongly dissimilar from the target pixel.

#### **5. Conclusions**

In this paper, we proposed a novel model, SGA-Net, which includes a self-constructing graph attention network and a channel linear attention mechanism. The self-constructing graph was obtained from the feature maps of images rather than from prior knowledge or elaborately designed static graphs. In this way, the global dependency of pixels can be extracted efficiently from high-level feature maps to represent the pixel-wise relationships of remote sensing images. Then, a self-constructing graph attention network was proposed that uses the current node and its neighboring nodes, which is in line with the actual situation. After that, a channel linear attention mechanism was designed to obtain the channel dependency of images and further improve semantic segmentation performance. Comprehensive experiments were conducted on the ISPRS Potsdam and Vaihingen datasets to demonstrate the effectiveness of the whole framework. Ablation studies demonstrated the validity of the self-constructing graph attention network for extracting the spatial dependency of remote sensing images and the usefulness of the channel linear attention mechanism for mining correlations among channels. The SGA-Net achieved competitive performance for semantic segmentation on the ISPRS Potsdam and Vaihingen datasets.

In future research, we will re-evaluate the high-level feature map and the attention mechanism to improve the segmentation accuracy. Furthermore, we would like to apply our model to other remote sensing images.

**Author Contributions:** Conceptualization, W.Z. and W.X.; Methodology, W.Z. and H.C.; Software, W.Z.; Validation, H.C., W.X. and N.J.; Data Curation, N.J.; Writing—Original Draft Preparation, W.Z.; Writing—Review and Editing, W.Z. and J.L.; Supervision, W.X.; Project Administration, H.C. All authors have read and agreed to the published version of the manuscript.

**Funding:** The work in this paper is supported by the National Natural Science Foundation of China (41871248, 41971362, U19A2058) and the Natural Science Foundation of Hunan Province No. 2020JJ3042.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**

