**1. Introduction**

Semantic segmentation in a medium resolution (MR) image, e.g., a Landsat-8 (LS-8) image, and very high resolution (VHR) images, e.g., aerial images, is a long-standing issue and problem in the domains of remote sensing-based information. Natural objects such as roads, water, forests, urban, and agriculture fields regions are operated in various tasks such as route optimization to create imperative remotely sensed applications.

Deep learning, especially the Deep Convolutional Neural Network (CNN), is an acclaimed approach for automatic feature learning. In previous research, CNN-based segmentation approaches are proposed to perform semantic labeling [1–5]. To achieve such a challenging task, features from various levels are fused together [5–7]. Specifically, a lot of approaches fuse low-level and high-level features together [5–9]. In remote sensing corpora, ambiguous human-made objects need high-level features for a more well-defined recognition (e.g., roads, building roofs, and bicycle runways), while fine-structured objects (e.g., low vegetations, cars, and trees) could benefit from comprehensive low-level features [10]. Consequently, the performance will be affected by the different numbers of layers and/or different fusion techniques of the deep learning model.

In recent years, the Global Convolutional Network (GCN) [11], the modern CNN, has been introduced, in which the valid receptive field and large filter enable dense connections between pixel-based classifiers and activation maps, which enhances the capability to cope with different transformations. The GCN is aimed at addressing both the localization and segmentation problems for image labeling and presents Boundary Refinement (BR) to refine the object boundaries further as well. Our previous work [12] extended the GCN by enhancing three approaches as illustrated in Figures 1 and 2. First, "Transfer Learning" [13–15] was employed to relieve the shortage problem. Next, we varied the backbone network using ResNet152, ResNet101, and ResNet50. Last, "Channel Attention Block" [16,17] was applied to allocate CNN parameters for the output of each layer in the front-end of the deep learning architecture.

Nevertheless, our previous work still disregards the local context, such as low-level features in each stage. Moreover, most feature fusion methods are just a summation of the features from adjacent stages and they do not consider the representations of diversity (critical for the performance of the CNN). This leads to unpredictable results that suffer from measuring the performance such as the *F*1 score. This, in fact, is the inspiration for this work.

In summary, although the current enhanced Global Convolutional Network (GCN152-TL-A) method [12] has achieved significant breakthroughs in semantic segmentation on remote sensing corpora, it is still laborious to manually label the MR images in river and pineapple areas and the VHR images in low vegetation and car areas. The two reasons are as follows: (*i*) previous approaches are less efficient to recover low-level features for accurate labeling, and (*ii*) they ignore the low-level features learned by the backbone network's shallow layers with long-span connections, which is caused by semantic gaps in different-level contexts and features.

In this paper, motivated by the above observation, we propose a novel Global Convolutional Network ("HR-GCN-FF-DA") for segmenting multi-objects from satellite and aerial images, as illustrated in Figure 3. This paper aims to further improve the state-of-the-art on semantic segmentation in MR and VHR images. In this paper, there are three contributions, as follows:


**Figure 1.** An overview of enhanced GCN (Global Convolution Network [11]) with transfer learning and attention mechanism (GCN152-TL-A) [12].

**Figure 2.** An Attention Mechanism (A) block (left) and the Transfer Learning (TL) approach (right) transfer knowledge (from pre-trained weights) across two corpora—medium and very high resolution images from [12].

The experiments were conducted using the widespread aerial imagery, ISPRS (Stuttgart) Vaihingen [18] data set and GISTDA (Geo-Informatics and Space Technology Development Agency (Public Organization)), organized by the governmen<sup>t</sup> in our country, data sets (captured by the Landsat-8 satellite). The results revealed that our proposed method surpasses the two baselines: Deep Convolutional Encoder-Decoder Network (DCED) [19–21] and the enhanced Global Convolutional Network (GCN152-TL-A) method [12] in terms of *F*1 score.

The remainder of this paper is organized as follows: Section 2 discusses related work. Our proposed methods are detailed in Section 3. Next, Section 4 provides the details on remote sensing corpora. Section 5 presents our performance evaluation. Then, Section 6 reports the experimental results, and Section 7 is the discussion. Last, we close with the conclusions in Section 8.

**Figure 3.** The HR-GCN-FF-DA: an enhanced GCN architecture with feature fusion and depthwise atrous convolution.

#### **2. Related Work**

The CNN has been outstandingly utilized for the data analysis of remote sensing domains, in particular, land cover classification or segmentation of agriculture or forest districts [10,12,22–26]. It has rapidly become a successful method for accelerating the process of computer vision tasks, e.g., image classification, object detection, or semantic segmentation with high precision results [4,27–33] and is a fast-growing area.

It is separated into two subsections: (*i*) we demonstrate modern CNN architectures for semantic labeling on both traditional computer vision and remote sensing tasks and (*ii*) the novel techniques of deep learning, especially playing with images, are discussed.

#### *2.1. Modern CNN Architecture for Semantic Labeling*

In early research, several DCED-based approaches have obtained a high performance in the various baseline corpora [16,19–21,26,34–36]. Nevertheless, most of them also struggle with issues with performance accuracy. Consequently, much research on novel CNN architectures has been introduced, such as a high-resolution representation [37,38] network that supports high-resolution representations in all processes by connecting high-to-low and low-to-high-resolution convolutions to keep high and low-resolution representations. CSRNet [8] proposed an atrous (dilated) CNN to comprehend highly congested scenes through crowd counting and generating high-quality density maps. They deployed the first ten layers from VGG-16 as the backbone convolutional models and dilated convolution layers as the backend to enlarge receptive fields and extract deeper features without losing resolutions. SeENet [6] enhanced shallow features to alleviate the semantic gap between deep features and shallow features and presented feature attention, which involves discovering complementary information from low-level features to enhance high-level features for precise segmentation. It also was constructed with the parallel pyramid to implement precise semantic segmentation. ExFuse [7] proposed to boost the feature fusion by bridging the semantic and resolution gap between low-level and high-level feature maps. They proposed more semantic information into low-level features with three aspects: (*i*) semantic supervision, (*ii*) semantic embedding branch, and (*iii*) layer rearrangement. They also embeded spatial information into high-level features. In the remote sensing corpus, ResUNet [25] proposed a trustworthy structure for performance effects for the job of image labeling of aerial images. They used a VGG16 network as a backbone, combined with the pyramid scene parsing pooling and dilated deep neural network. They also proposed a new generalized dice loss for semantic segmentation. TreeUNet (also known as adaptive tree convolutional neural networks) [24] proposed a tree-cutting algorithm and an adequate deep neural network with inadequate binary links to increase the classification percentage at the pixel level for subdecimeter aerial imagery segmentation, by sending kernel maps within concatenating connections and fusing multi-scale features. From the ISPRS Vaihingen Challenge and Landsat-8 corpus, the enhanced Global Convolutional Network (also known as "GCN152-TL-A"), illustrated in Figure 1, Panboonyuen et al. (2019) [12] presented an enhanced GCN for semantic labeling with three main contributions. First, "Domain-Specific Transfer Learning" (TL) [13–15], illustrated in Figure 2 (right), aims to restate the weights obtained from distinct fields' inputs. It is currently prevalent in various tasks, such as Natural Language Processing (NLP), and has also become popular in Computer Vision (CV) in the past few years. It allows you to reach a deep learning model with comparatively inadequate data. They prefaced to relieve the lack of issue on the training set by appropriating other remote sensing data sets with various satellites with an essentially pre-trained weight. Next, "Channel Attention", shown in Figure 2 (left), proposed with their network to select the most discriminative kernels (feature maps). Finally, they enhanced the GCN network by improving its backbone by using "ResNet152". "GCN152-TL-A" has surpassed state-of-the-art (SOTA) approaches and become the new SOTA. Hence, "GCN152-TL-A" is selected as our baseline in this work.

#### *2.2. Modern Technique of Deep Learning*

A novel technique of deep learning is an essential agen<sup>t</sup> for improving the precision of deep learning, especially the CNN. While the most prevalent contemporary designs tick all the boxes for image labeling responsibilities, e.g., the atrous convolution (also known as dilated convolution), channel attention mechanism, refinement residual block, and feature fusion, and have been utilized to boost the performance of the deep learning model.

Atrous convolution [5,6,9,39,40], also known as multi-scale context aggregation, is proposed to regularly aggregate multi-scale contextual information devoid of losing resolution. In this paper, we use the technique of "Depthwise Atrous Convolution (DA)" [6] to extract complementary information from very shallow features and enhance the deep features for improving feature fusion from our feature fusion step.

The channel attention mechanism [16,17] generates a one-dimensional tensor for allowed feature maps, which is activated by the softmax function. It focuses on global features found in some feature maps and has attracted broad interest in extracting rich features in the computer vision domain and offers grea<sup>t</sup> potential in improving the performance of the CNN. In previous work, GCN152-TL-A [12], the self attention and utilize channel attention modules are applied to pick the features similar to [16].

Refinement residual block [16] is part of the enhanced CNN model with ResNet-backbones, e.g., ResNet101 or ResNet152. This block is used after the GCN module and during the deconvolution layer. It is used to refine the object boundaries further. In our previous work, GCN152-TL-A [12], we employed the boundary refinement block (BR) that is based on the"Refinement Residual Block" from [11].

Feature fusion [7,41–44] is regularly manipulated in semantic labeling for different purposes and concepts. It presents a concept that combines multiplied, added, or concatenate CNN layers for improving a process of dimensionality reduction to recover and/or prevent the loss of some important features such as low-level features (e.g., lines, dots, or gradient orientation with the content of an image scene). In another way, it can also recover high-level features by using the technique of "high-to-low and low-to-high" [37,38] to produce high-resolution representations.

#### **3. Proposed Method**

Our proposed deep learning architecture, "HR-GCN-FF-DA", is demonstrated in an overview architecture in Figure 3. The network, based on GCN152-TL-A [12], consists primarily of three parts: (*i*) changing the backbone architecture (the P1 block in Figure 3), (*ii*) implementing the "Feature Fusion" (the P2 block in Figure 3), and (*iii*) using the concept of "Depthwise Atrous Convolution" (the P3 block in Figure 3).

#### *3.1. Data Preprocessing and Augmentation*

In this work, three benchmarks were used with the experiments, these were the (*i*) Landsat-8w3c, (*ii*) Landsat-8w5c, and (*iii*) ISPRS Vaihingen (Stuttgart) Challenge data sets. Before a discussion about the model, it is important to deploy a data preprocessing, e.g., pixel standardization, scale pixel values (to have unit variance), and a zero mean into the data sets. In the image domain, the mean subtraction, calculated by the per-channel mean from the training set, is executed in order to improve the model convergence.

Furthermore, a data augmentation (also known as the "ImageDataGenerator" function in TensorFlow/Keras library) is employed, since it can help the model to avoid an overfitting issue and somewhat enlarge the training data—a strategy used to increase the amount of data. To augmen<sup>t</sup> the data, each image is width and height-shifted and flipped horizontally and vertically. Then, unwanted outer areas are removed into 512 × 512 pixels with a resolution of 81 cm2/pixel in the ISPRS and 900 m2/pixel in the Landsat-8 data set.

#### *3.2. The GCN with High-Resolution Representations (HR) Front-End*

The GCN152-TL-A [12], as shown in Figure 1, is our prior attempt that surpasses a traditional semantic segmentation model, e.g., deep convolutional encoder-decoder (DCED) networks [19–21]. By using GCN as our core model, our previous work was improved in three aspects. First, its backbone was revised by varying ResNet-50, ResNet-101, and ResNet-152 networks, as shown in M1 in Figure 1. Second, the "Channel Attention Mechanism" was employed (shown in M2 in Figure 1). Third, the "Domain-Specific Transfer Learning" (TL) was employed to reuse the pre-trained weights obtained from training on other data sets in the remote sensing domain. This strategy is important in the deep learning domain to overcome the limited amount of training data. In our work, there are two main data sets: Landsat-8 and ISPRS. To train the Landsat-8 model, the pre-trained network is obtained by utilizing the ISPRS data. This can also be explained conversely—the pre-trained network can be obtained by Landsat-8 data.

Although the GCN152-TL-A network has determined an encouraging forecast performance, it can still be possible to improve it further through changing the frontend using high-resolution representation (HR) [37,38] instead of ResNet-152 [25,45]. HR has surpassed all existing deep learning methods on semantic segmentation, multi-person pose estimation, object detection, and pose estimation tasks in the COCO, which is large-scale object detection, segmentation, and captioning corpora. It is a parallel structure to enable the deep learning model to link multi-resolution subnetworks in an effective and modern way. HR connects high-to-low subnetworks in parallel. It maintains high-resolution representations through the whole process for a spatially precise heatmap estimation. It creates reliable high-resolution representations through repeatedly fusing the representations generated by the high-to-low subnetworks. It introduces "exchange units" which shuttle across different subnetworks, enabling each one to receive information from other parallel subnetworks. Representations of HR can be obtained by repeating this process. There are four stages as the 2nd, 3rd, 4th, and 5th stages are formed by repeating modularized multi-resolution blocks. A multi-resolution block consists of a multi-resolution group convolution and a multi-resolution convolution, which is illustrated as P1 in Figure 3 (backbone model) and this proposed method is named the "HR-GCN" method.

#### *3.3. Feature Fusion*

Inspired by the idea of feature fusion [41–44] that integrates multiplication, additional, or concatenate layers. Convolution with 1 × 1 filters is used to transform features with different dimensions into the shape, which can be fused. The fusion method contains an addition process. Each layer of the backbone network such as VGG, Inception, ResNet, or HR creates the feature map for specific. We proposed to combine output with low-level features (front-end network) with the deep model and refine the feature information.

As shown in Figure 4, the kernel maps after fusing will be calculated as Equation (1) :

$$Z\_{\text{add}} = X\_1 \oplus X\_2 \oplus X\_3 \dots \oplus X\_{\hat{i}} \cdot \dots \cdot \oplus X\_{\hat{j}} \tag{1}$$

where *j* adverts to the index of the layer, *Xk* is a set of output activation maps of one layer and ⊕ adverts to element-wise addition.

Hence, the nature of the addition process encourages essential information to build classifiers to comprehend the feature details. It denotes all bands of *Zadd* to hold more feature information.

**Figure 4.** The framework of our feature fusion strategy.

Equation (2) shows the relationship between input and output. Thus, we take the fusion activation map into the model again, it can be performed as Equation (4):

$$\bar{y}^{i} = \text{Re} L \varU (w^{T}x^{i} + b) \tag{2}$$

where *x* is the input and output of layer of the convolution recorded as *yi*; *b* and *w* refer to bias and weight. The cost function in this work is demonstrated via Equation (3).

$$f(w, b) = -\frac{1}{m} \times \left[ (1 - y^{(i)}) \log(1 - y^{(i)} + (y^{(i)} \log(y^{(i)})) \right] \tag{3}$$

where *y* refers to segmentation target of input (each image) and *J*, *w*, and *b* are the loss, weight, and bias value, respectively.

$$Y\_{add} = f(\mathbb{W}\_k Z\_{add} + B\_k) \tag{4}$$

The feature fusion procedure always transforms into the same thing when using additional procedures. In this work, we use addition fusion elements, as shown in Figure 4.

#### *3.4. Depthwise Atrous Convolution (DA)*

Depthwise Atrous Convolution (DA) [6,9,39] is presented to settle the contradictory requirements between the larger region of the input space that affects a particular unit of the deep network (receptive fields) and activation map resolution.

DA is a robust operation to reduce the number of parameters (weights) in the layer of the CNN while maintaining a similar performance that includes the computation cost and tunes the kernel's field-of-view in order to capture a generalized standard convolution operation and multi-scale information. An atrous filter can be a dilated kernel in varied rates, e.g., rate = 1, 2, 4, 8, by inserting zeros into appropriate positions in the kernel mask.

Basically, the DA module uses atrous convolutions to aggregate multi-scale contextual information without dissipating resolution orderly in each layer. It generalizes "Kronecker-factored" convolutional kernels, and it allows for broad receptive fields, while only expanding the number of weights logarithmically. In other words, DA can apply the same kernel at distinct scales using various atrous factors.

Compared to the ordinary convolution operator, atrous (dilated) convolution is able to achieve a larger receptive field size without increasing the numbers of kernel parameters.

Our motivation is to apply DA to solve challenging scale variations and to trade off precision in aerial and satellite images, as shown in Figure 5.

In a one-dimensional (1D) case, let *x*[*i*] denote input signal, and *y*[*i*] denote output signal. The dilated convolution is formulated as Equation (5):

$$y[i] = \sum\_{j=1}^{J} \mathbf{x}[i + a \cdot k] \cdot w[j] \tag{5}$$

where *a* is the atrous (dilated) rate, *w*[*j*] denotes the *j*-th parameter of the kernel, and *J* is the filter size. This equation reduces to a standard convolution when d = 1, 2, 4, and 8, respectively.

In the cascading mode from DeepLabV3 [46,47] and Atrous Spatial Pyramid Pooling (ASPP) [9], multi-scale contextual information can be encoded by probing the incoming features with dilated convolution to capture sharper object boundaries by continuously recovering the spatial characteristic. DA has been applied to increase the computational ability and achieve the performance by factorizing a traditional convolution into a depth-wise convolution followed by a point-wise convolution, such as 1 × 1 convolution (it is often applied on the low-level attributes to decrease the whole of the bands (kernel maps)).

To simplify notations, *HJ*,*<sup>a</sup>*(*x*) is term of a dilated convolution, and ASPP can be performed as Equation (6).

$$y = H\_{3,1}(\mathbf{x}) + H\_{3,2}(\mathbf{x}) + H\_{3,4}(\mathbf{x}) + H\_{3,8}(\mathbf{x})\tag{6}$$

To improve the semantics of shallow features, we apply the idea of multiple dilated convolution with different sampling rates to the input kernel map before continuing with the decoder network and adjusting the dilation rates (1, 2, 4, and 8) to configure the whole process of our proposed method called "HR-GCN-FF-DA", shown in P3 in Figures 3 and 5.

**Figure 5.** The Depthwise Atrous Convolution (DA) module in the proposed parallel pyramid method for improving feature fusion.

#### **4. Remote Sensing Corpora**

In our experiments, there are two main sources of data: public and private corpora. The private corpora is the medium resolution imagery received from the satellite "Landsat-8" used by the governmen<sup>t</sup> organization in Thailand called GISTDA. Since there are two variations of annotations, the Landsat-8 data is considered as two data sets: one with three classes and the other with five classes, as shown in Table 1. The public corpora is very high-resolution imagery from the standard benchmark called "ISPRS Vaihingen (Stuttgart)". Evaluations based on classification/segmentation metrics, e.g., *F*1 *Score*, *Precision*, *Recall* and *Average Accuracy* are deployed with all experiments.



#### *4.1. Landsat-8w3c Corpus*

For this corpus, there is a new benchmark that differs from our previous work. All images are taken in the area of the northern provinces (Changwat) of Thailand. The data set is made from the Landsat-8 satellite consisting of 1420 satellite images, some samples are shown in Figure 6. This data set contains a massive collection of medium resolution imagery of (20,921 × 17,472) pixels. There are three classes: para rubber (red), pineapple (green), and corn (yellow). From a total of 1390 images, the images are separated into 1000 training and 230 validation images, as well as 190 test images to compare with other baseline methods.

#### *4.2. Landsat-8w5c Corpus*

This data set is the same corpus from Landsat-8, but it is annotated with five class labels: agriculture, forest, miscellaneous (misc), urban, and water as shown in Figure 7. There are 1012 medium resolution satellite images of 17,200 × 16,300 pixels. From the total 1039 images, the images are separated into 700 training and 239 validation images, as well as 100 test images to comparison to other baseline methods.

**Figure 6.** The example of satellite images from the Landsat-8w3c corpus, northern province (left) and target image (right). The ground-truth of the medium resolution data set includes three classes: para rubber (red), pineapple (green), and corn (Yellow).

**Figure 7.** The example of satellite images from Landsat-8w5c corpus, northern province (left) and target image (right). The ground-truth of medium resolution data set includes five classes: urban (red), forest (green), water (blue), agriculture or harvested area (yellow), and miscellaneous or misc (brown).

#### *4.3. ISPRS Vaihingen Corpus*

The challenge of ISPRS semantic segmentation at Vaihingen (Stuttgart) [18] (Figures 8 and 9) is used to be our standard corpus. They were captured over Vaihingen in Germany. The data set is a subset of the data used for the test of digital aerial cameras carried out by the German Association of Photogrammetry and Remote Sensing (DGPF).

**Figure 8.** ISPRS 2D Vaihingen segmentation corpus (33 scenes).

**Figure 9.** Sample of input scene from Figure 8 (left) and target image (right). The annotated Vaihingen corpus has five categories: tree (green), building (blue), clutter/background (red), low vegetation or LV (greenish-blue), and impervious surface or imp surf (white)

It consists of three spectral bands such as NDSM, DSM, near-infrared bands, red, and green data. For our work, NDSM and DSM data are not used in this corpus. They provide 33 images of about 2500 × 2000 pixels of about 9 cm of resolution. Following other methods, four scenes such as scene 5, 7, 23, and 30 are removed from the training set as a validation set. All experimental results are announced on the validation set if not specified.

#### **5. Performance Evaluation**

The performance of "HR-GCN-FF-DA" is evaluated in all corpora for *F*1 and *AverageAccuracy*. To assess class-specific performance, the *F*1, *precision*, *recall*, and *AverageAccuracy* metric are used. It is computed as the symphonious average between recall and precision. We carry *precision*, *recall*, and *F*1 as fundamental metrics and also incorporate the *AverageAccuracy*, which calculates the number of correctly classified positions and divides it by the total number of the reference positions. The *AverageAccuracy* and *F*1 metrics can be assessed using Equations (7)–(10).

The confusion matrix for pixel-level classification [18] and the false positive (denoted as FP) are computed from the summation of the column. In contrast, the false negative (denoted as FN) is the summation of the horizontal axis, excluding the principal diagonal factor. Next, the true positive (denoted as TP) is the value of the identical oblique elements, and the true negative (denote as TN) contrasts TP.

$$Precision = \frac{TP}{TP + FP} \tag{7}$$

$$Recall = \frac{TP}{TP + FN} \tag{8}$$

$$F1 = \frac{2 \times Precision \times Recall}{Precision + Recall} \tag{9}$$

$$AverageAccuracy = \frac{TP + TN}{TP + FP + FN + TN} \tag{10}$$

#### **6. Experimental Results**

For a python deep learning framework, we use "Tensorflow (TF)" [48], an end-to-end open source platform for deep learning. The whole experiment was implemented on servers with Intel® 2066 core I9-10900X, 128 GB of memory, and the NVIDIA RTX™ 2080Ti (11 GB) x 4 cards.

For the training phrase, the adaptive learning rate optimization algorithm (extension to the stochastic gradient descent (SGD)) [49] and batch normalization [50], a technique for improving the performance, stability, and speed of deep learning, were applied and standardized to ease the training in every experiment.

For the learning rate schedules tasks, [9,16,32], we selected the polylearning rate policy. As shown in Equation (11), the learning rate is scheduled by multiplying a decaying factor to the initial learning rate (4 × <sup>10</sup>−3).

$$\text{learning rate} = \text{init\\_learning rate} \times (1 - \frac{epoch}{\text{MaxEpoch}})^{0.9} \tag{11}$$

All deep CNN models are trained for 30 epochs on the Landsat-8w3c corpus and ISPRS Vaihingen data sets. It is increased to be 50 epochs for the Landsat-8w5c data set. Each image is resized to 521 × 521 pixels along with augmented data using a randomly cropping strategy. Weights are updated using the mini-batch of 4.

This section explains the elements of our experiments. The proposed CNN architecture is based on the victor from our previous work called "GCN152-TL-A" [12]. In our work, there are three proposed improvements: (*i*) adjusting backbones using high-resolution representations, (*ii*) the feature fusion module, and (*iii*) depthwise atrous convolution. From all proposed policies, there are four acronyms of procedures, as shown in Table 2.


**Table 2.** Acronyms of our proposed deep learning approaches.

There are three subsections to discuss the experimental results of each data set: (*i*) Landsat-8w3c, (*ii*) Landsat-8w5c, and (*iii*) ISPRS Vaihingen data sets.

There are two baseline models of the semantic labeling task in the domains of remote sensing-based information on the computer vision. The first baseline is DCED, which is commonly used in much segmentation work [19–21]. The second baseline is the winner of our previous work called "GCN152-A-TL" [12]. Note that "GCN-A-TL" is abbreviated using just "GCN", since we always employ the attention and transfer-learning strategies into our proposed models.

Each proposed tactic can elevate the completion of the baseline approach shown via the whole experiment. First, the effect of our new backbone (HRNET) is investigated by using HRNET on the GCN framework called "HR-GCN". Second, the effect of our feature fusion is shown by adding it into our model, called "HR-GCN-FF". Third, the effect of the depthwise atrous convolution is explained by using it on top of a traditional convolution mechanism, called "HR-GCN-FF-DA".

#### *6.1. The Results of Landsat-8w3c Data Set*

The Landsat-8w3c corpus was used in all experiments. We distinguished between the alterations of the proposed approaches and CNN baselines. "HR-GCN-FF-DA", the full proposed method, is the winner with *F*1 of 0.9114. Furthermore, it is also the winner of all classes. More detailed results are given in the next subsection. Presented in Tables 3 and 4 are the results of this corpus, Landsat-8w3c.

#### 6.1.1. HR-GCN Model: Effect of Heightened GCN with High-Resolution Representations on Landsat-8w3c

The previous enhanced GCN network is improved to increase the *F*1 score by using the High-Resolution Representations (HR) backbone instead of the ResNet-152 backbone (best frontend network from our previous work). *F*1 of HR-GCN (0.8763) outperforms that of the baseline methods. DCED (0.8114) and GCN152-TL-A (0.8727) refer to Tables 3 and 4. The result returns a higher *F*1 at 6.50% and 0.36%, respectively. Hence, it means the features extracted from HRNET are better than those from ResNet-152.

For the analysis of each class, HR-GCN achieved an average accuracy on para rubber, pineapple, and corn for 0.8371, 0.8147, and 0.8621, consecutively. Compared to DCED, it won in two classes: para rubber and corn. However, it won against our previous work (GCN152-TL-A) only in the pineapple class.

#### 6.1.2. HR-GCN-FF Model: Effect of Using "Feature Fusion" on Landsat-8w3c

Next, we apply "Feature Fusion" to capture low-level features to decorate the feature information of CNN. HR-GCN-FF (0.8852) is higher than that of HR-GCN (0.8763), GCN152-TL-A (0.8727), and DCED (0.8113), shown in Tables 3 and 4. It gives a higher *F*1 score at 0.89%, 1.26%, and 7.39%, consecutively.

It is interesting that the FF module can really improve the performance in all classes, especially in the para rubber and pineapple classes. It outperforms both HR-GCN and all baselines in all classes. To further investigate the results, Figures 10e and 11e show that the model with FF can capture pineapple (green area) surrounded in para rubber (red area).


**Table 3.** Effects on the testing set of the Landsat-8w3c data set.

**Table 4.** Effects on the testing set of the Landsat-8w3c data set among each class with our proposed procedures in terms of *AverageAccuracy*.


**Figure 10.** Comparisons between "HR-GCN-FF-DA" and other published methods of the Landsat-8w3c corpus testing set.

**Figure 11.** Comparisons between "HR-GCN-FF-DA" and beyond baseline methods of the Landsat-8w3c corpus testing set.

6.1.3. HR-GCN-FF-DA Model: Effect of Using "Depthwise Atrous Convolution" on Landsat-8w3c

The last strategy aims to use an approach of "Depthwise Atrous Convolution" (details in Section 3.4) by extracting complementary information from very shallow features and enhancing the deep features for improving feature fusion of the Landsat-8w3c corpus. The "HR-GCN-FF-DA" method is the victor. *F*1 is obviously more distinguished than DCED at 10.00% and GCN152-TL-A (the best benchmark) at 3.87%, as shown in Tables 3 and 4.

For an analysis of each class, our model is clearly the winner in all classes with an accuracy beyond 90% in two classes: para rubber and corn. Figures 10 and 11 show twelve sample outputs from our proposed methods (column (*d to f*)) compared to the baseline (column (*c*)) to expose improvements in its results. From our investigation, we found that the dilated convolutional concept can make our model have better overview information, so it can capture larger areas of data.

There is a lower discrepancy (peak) in the validation data of "HR-GCN-FF-DA", Figure 12a, than that in the baseline, Figure 13a. Moreover, Figures 13b and 12b show three learning graphs such as precision, recall, and F1 lines. The loss graph of the "HR-GCN-FF-DA" model seems flatter (very smooth) than the baseline in Figure 13a. The epoch at number 27 was selected to be a pre-trained model for testing and transfer learning procedures.

**Figure 12.** Graph (learning curves) of the Landsat-8w3c data set of the proposed approach, "HR-GCN-FF-DA"; x refers to epochs, and y refers to different measures (**a**) Plot of model loss (cross-entropy) on training and validation corpora; (**b**) performance plot on the validation corpus.

**Figure 13.** Graph (learning curves) of the Landsat-8w3c data set of the baseline approach, GCN152-TL-A; x refers to epochs, and y refers to different measures (**a**) Plot of model loss (cross-entropy) on training and validation corpora; (**b**) performance plot on the validation corpus. [12].

#### *6.2. The Results on Landsat-8w5c Data Set*

In this subsection, the Landsat-8w5c corpus was conducted on all experiments. We compare "HR-GCN-FF-DA" network (column (*f*)) to CNN baselines via Tables 5 and 6. "HR-GCN-FF-DA" is the winner with a *F*1 of 0.9111. Furthermore, it is also the winner in all classes especially water and urban class that are composed with low-level features. More detailed results are described in the next subsection and are presented in Tables 5 and 6 for the results of this data set, Landsat-8w5c.

6.2.1. HR-GCN Model: Effect of Heightened GCN with High-Resolution Representations on Landsat-8w5c

The *F*1 score of HR-GCN (0.8897) outperforms that of baseline methods: DCED (0.8505) and GCN152-TL-A (0.8791); *F*1 at 3.92% and 1.07% respectively. The main reason is due to both higher recall and precision. This can imply that features extracted from HRNET are also better than those from ResNet-152 on Landsat-8 images as well, shown in Tables 5 and 6.


**Table 5.** Effects on the testing set of Landsat-8w5c data set.

**Table 6.** Effects on the testing set of Landsat-8w5c data set among each class with our proposed procedures in terms of *AverageAccuracy*.


For the analysis on each class, HR-GCN achieved an averaging accuracy in agriculture, forest, miscellaneous, urban, and water for 0.9755, 0.9501, 0.8231, 0.9133, and 0.7972, consecutively. Compared to DCED, it won in three classes: forest, miscellaneous and water. However, it won against our previous work (GCN152-TL-A) in the pineapple class, and it showed about the same performance in the agriculture class.

6.2.2. HR-GCN-FF Model: Effect of Using "Feature Fusion" on Landsat-8w5c

The second mechanism focuses on utilizing 'Feature Fusion" to fuse each level feature for enriching the feature information. From Tables 5 and 6, the *F*1 of HR-GCN-FF (0.9195) is greater than that of HR-GCN (0.8897), GCN152-TL-A (0.8791), and DCED (0.8505). It produces a more precise *F*1 score at 2.97%, 4.04%, and 6.89%.

To further analyze the results, Figures 14e and 15e show that the FF module can better capture low-level details. Especially in the water class, it can recover the missing water area, resulting in an improvement of accuracy from 0.7972 to 0.8282 (3.1%).

6.2.3. HR-GCN-FF-DA Model: Effect of Using "Depthwise Atrous Convolution" on Landsat-8w5c

The last policy points to the performance of the method of "Depthwise Atrous Convolution" by enhancing the features of CNN for improving the previous step. The *F*1 score of the "HR-GCN-FF-DA" approach is the conqueror. It is more eminent than DCED and GCN152-TL at 8.56% and 5.71%, consecutively, shown in Tables 5 and 6.

In the dilated convolution, filters are boarder, which can capture better overview details resulting in (*i*) larger coverage areas and (*ii*) connected small areas together.

For an analysis of each class, our final model is clearly the winner in all classes with an accuracy beyond 95% in two classes: agriculture and urban classes. Figures 14 and 15 show twelve sample outputs from our proposed methods (column (*d to f*)) compared to the baseline (column (*c*)) to expose improvements in its results and that founds that Figures 14f and 15f are likewise to the ground images. From our investigation, we found that the dilated convolutional concept can make our model have better overview information, so it can capture larger areas of data.

**Figure 14.** Comparisons between "HR-GCN-FF-DA" and beyond baseline methods on the Landsat-8w5c corpus testing set.

Considering the loss graphs, our model in Figure 16a can learn smoother than the baseline (our previous work) in Figure 17a, since the discrepancy (peak) in the validation error (green line) is lower in our model. There is a lower discrepancy (peak) in the validation data of "HR-GCN-FF-DA", Figure 16a, than that in the baseline, Figure 17a. Moreover, Figures 17b and 16b show three learning graphs such as precision, recall, and F1 lines. The loss graph of the "HR-GCN-FF-DA" model seems flatter (very smooth) than the baseline in Figure 17a and the epoch at number 40 out of 50 was selected to be a pre-trained model for testing and transfer learning procedures.

**Figure 15.** Comparisons between "HR-GCN-FF-DA" and beyond baseline methods on the Landsat-8w5c corpus testing set.

**Figure 16.** Graph (learning curves) on Landsat-8w5c data set of the proposed approach, "HR-GCN-FF-DA"; x refers to epochs, and y refers to different measures (**a**) Plot of model loss (cross-entropy) on training and validation corpora; (**b**) performance plot on the validation corpus.

**Figure 17.** Graph (learning curves) on Landsat-8w5c data set of the baseline approach, GCN152-TL-A [12]; x refers to epochs, and y refers to different measures (**a**) Plot of model loss (cross-entropy) on training and validation corpora; (**b**) performance plot on the validation corpus.

#### *6.3. The Results in ISPRS Vaihingen Challenge Data Set*

In this subsection, the ISPRS Vaihingen (Stuttgart) Challenge corpus was used in all experiments. The "HR-GCN-FF-DA" is the winner with *F*1 of 0.9111. Furthermore, it is also the winner of all classes. More detailed results will be provided in the next subsection, and the consequences of our proposed method with CNN baselines for this data set are shown in Tables 7 and 8.

6.3.1. HR-GCN Model: Effect of Heightened GCN with High-Resolution Representations in ISPRS Vaihingen

The *F*1 score of HR-GCN (0.8701) exceeds that of the baseline methods: DCED (0.8580) and GCN152-TL-A (0.8620). It complies a higher *F*1 at 1.21% and 0.81%, respectively. This shows that the enhanced GCN with HR backbone is also more significantly streamlined than the GCN152-TL-A style, shown in Tables 7 and 8.

The goal of the HR module is to help prevent the loss of some important features, such as low-level features, so it can significantly improve the accuracy of the car class from 0.8034 to 0.8202 (1.68%) and the building class from 0.8725 to 0.9282 (5.57%).


**Table 7.** Effects on the testing set of ISPRS Vaihingen (Stuttgart) challenge data set.

**Table 8.** Effects on the testing set of ISPRS Vaihingen (Stuttgart) challenge data set among each class with our proposed procedures in terms of *AverageAccuracy*.


6.3.2. HR-GCN-FF Model: Effect of Using "Feature Fusion" on ISPRS Vaihingen

Next, we propose "Feature Fusion" to fuse each level feature for enriching the feature information. From Tables 7 and 8, the *F*1 of HR-GCN-FF (0.8895) is greater than that of HR-GCN (0.8701), GCN152-TL-A (0.8620), and DCED (0.8580). It also returns a higher *F*1 score at 1.95%, 2.76%, and 3.16%, respectively.

The goal of the FF module is to capture low-level features, so it can significantly improve the accuracy of the low vegetation class (LV) from 0.8114 to 0.9264 (11.5%), the accuracy of the tree class from 0.8945 to 0.9475 (5.3%), and the accuracy of the car class from 0.8202 to 0.8502 (3%). This finding is shown in Figures 18e and 19e.

6.3.3. HR-GCN-FF-DA Model: Effect of Using "Depthwise Atrous Convolution" on ISPRS Vaihingen

Finally, our last approach is to apply "Depthwise Atrous Convolution" to intensify the deep features from the previous step. From Tables 7 and 8 we see that the *F*1 of the "HR-GCN-FF-DA" method is also the conqueror in this data set. The *F*1 score of "HR-GCN-FF-DA" is also more precise than the DCED and GCN152-TL-A at 5.31% and 4.96%, consecutively.

**Figure 18.** Comparisons between "HR-GCN-FF-DA" and beyond baseline methods on the ISPRS Vaihingen (Stuttgart) challenge corpus testing set.

**Figure 19.** Comparisons between "HR-GCN-FF-DA" and beyond baseline methods on the ISPRS Vaihingen (Stuttgart) challenge corpus testing set.

It is very impressive that our model with all its strategies can improve the accuracy in almost all classes to be greater than 90%. Although the accuracy of car is 0.8710, it improves on the baseline (0.8034) by 6.66%.
