1. Introduction
Remote sensing image processing is a core research area within remote sensing. Dynamic information about a region, such as changes in vegetation, land use, and urban development, can be obtained from remote sensing images acquired in different periods. In computer vision, one of the most significant challenges is the precise and efficient segmentation of images according to their semantic content, i.e., assigning a category label to every pixel of the target image. Semantic segmentation is widely used in areas including medical imaging [
1], geographic information systems [
2], remote sensing [
3], and unmanned driving [
4].
In the early stages, semantic segmentation was mainly implemented with traditional image segmentation methods or with machine learning techniques such as random forests (RF) [
5], support vector machine (SVM) [
6], and other technologies. There are two main categories of traditional image segmentation methods: edge detection-based methods and region-based methods. Image segmentation methods based on edge detection detect the target boundary points through calculation with local differential operators. Common edge detection operators include gradient operators, the Roberts operator, and the Canny edge detector. Region-based image segmentation methods perform image segmentation using the differences in attributes between the target region and the background region. Common image segmentation methods based on region include the region splitting and merging technique, as well as the region growing technique [
7]. While these techniques can address the issue of discontinuous image segmentation, they tend to over-segment the image. Machine-learning approaches such as RF and SVM instead extract hand-crafted image features and classify pixels on that basis.
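As a concrete illustration of the region growing technique described above, the following minimal NumPy sketch (with a hypothetical intensity threshold) grows a region from a seed pixel by absorbing 4-connected neighbours whose intensity is close to the seed's:

```python
import numpy as np
from collections import deque

def region_grow(img, seed, thresh=10):
    """Grow a region from `seed`, adding 4-connected neighbours whose
    intensity differs from the seed pixel by at most `thresh`."""
    h, w = img.shape
    mask = np.zeros((h, w), dtype=bool)
    seed_val = float(img[seed])
    q = deque([seed])
    mask[seed] = True
    while q:
        y, x = q.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if 0 <= ny < h and 0 <= nx < w and not mask[ny, nx]:
                if abs(float(img[ny, nx]) - seed_val) <= thresh:
                    mask[ny, nx] = True
                    q.append((ny, nx))
    return mask

# Toy image: a bright 3x3 square on a dark background.
img = np.zeros((6, 6), dtype=np.uint8)
img[1:4, 1:4] = 200
region = region_grow(img, seed=(2, 2), thresh=10)
print(region.sum())  # 9 pixels: exactly the bright square
```

Varying `thresh` directly exposes the over-segmentation trade-off mentioned above: a threshold that is too small fragments a homogeneous region into many disconnected pieces.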
Deep learning has advanced semantic segmentation significantly owing to its rapid development. Deep-learning-based semantic segmentation has three typical architectures: the encoder–decoder structure, the pyramid structure, and the multi-branch structure. In 2015, Long et al. pioneered the use of deep learning for semantic segmentation by proposing the fully convolutional network (FCN) [8], extending semantic segmentation to the pixel level. The U-Net model [
9] was presented, which effectively integrated deep and shallow semantic information. Other examples of such networks include the PSPNet [
10], SegNet [
11], DecoupleNet [
12], and Deeplab [
13]. These architectures enhance the connection between pixels by increasing the receptive field and can better obtain contextual information. However, these methods often lead to the omission of detailed targets and produce discontinuous segmentation boundaries.
In the existing network structures, deep networks prioritize the extraction of semantic information, whereas shallow networks tend to capture detailed information [
14,
15]. The crucial factor for achieving precise image segmentation is the effective fusion of both shallow and deep networks. However, the fusion methods used in the existing semantic segmentation algorithms tend to cause a loss of details [
16]. Existing networks can segment a wide range of images well, but when ultra-high-resolution images must be segmented at the pixel level, they fail to achieve accurate segmentation because of overly small effective receptive fields or missing contextual information, leading to unsatisfactory accuracy.
We propose a model that shows high accuracy in semantic segmentation of remote sensing images, particularly those representing urban environments and natural landscapes, as depicted in
Figure 1.
These images contain intricate detail and cover diverse scenarios encountered in daily life. Our model aims to achieve high-precision segmentation of such high-resolution images, thereby contributing to applications spanning autonomous driving, urban planning, smart city development, and the construction of digital Earth representations.
This article presents a new model for image segmentation that combines local, surrounding, and global branches. The model consists of three branches that process the downsampled global image, the cropped local image, and the surrounding context of the local area. The branches are fused through a two-level fusion mechanism: first, the local and surrounding branches are fused; then, the result is fused with the features of the global branch. This structure effectively balances GPU memory usage and, most importantly, significantly raises semantic segmentation accuracy. Very small local areas of interest can be segmented precisely by the local and surrounding branches. This design enables the seamless integration of high-precision local detail and global contextual information, balanced through learning to maintain accurate segmentation. The main contributions are summarized below.
A high-precision multi-branch network structure is proposed for ultra-high-resolution image semantic segmentation.
This structure can effectively combine the global contextual information and fine local features, introduce surrounding branches to ensure high-precision local segmentation while preserving the spatial relationship, and reduce the influence of noise to a certain extent.
The network is designed with a two-level fusion mechanism and uses the SENet and transformer structure to further improve the accuracy of semantic segmentation.
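The three-branch, two-level fusion described above can be sketched as a minimal PyTorch forward pass. The encoders and fusion layers below are simplified placeholders (plain convolutions standing in for the ResNet101/FPN backbone, the SENet fusion, and the single-head transformer fusion of the actual model):

```python
import torch
import torch.nn as nn

class TwoLevelFusionNet(nn.Module):
    """Sketch of the three-branch, two-level fusion described above.
    The branch encoders and fusion modules are simplified placeholders,
    not the exact architecture of the paper."""
    def __init__(self, ch=16):
        super().__init__()
        enc = lambda: nn.Sequential(nn.Conv2d(3, ch, 3, padding=1), nn.ReLU())
        self.local_enc, self.surround_enc, self.global_enc = enc(), enc(), enc()
        self.low_fuse = nn.Conv2d(2 * ch, ch, 1)    # stands in for SENet fusion
        self.high_fuse = nn.Conv2d(2 * ch, ch, 1)   # stands in for transformer fusion
        self.head = nn.Conv2d(ch, 6, 1)             # 6 classes (ISPRS labels)

    def forward(self, local_patch, surround_patch, global_img):
        f_local = self.local_enc(local_patch)
        f_surround = self.surround_enc(surround_patch)
        # Level 1: fuse the local and surrounding branches.
        f_low = self.low_fuse(torch.cat([f_local, f_surround], dim=1))
        # Level 2: fuse the result with the downsampled global branch.
        f_global = self.global_enc(global_img)
        f_high = self.high_fuse(torch.cat([f_low, f_global], dim=1))
        return self.head(f_high)

net = TwoLevelFusionNet()
x = torch.randn(1, 3, 64, 64)
out = net(x, x, x)
print(out.shape)  # torch.Size([1, 6, 64, 64])
```

In the actual model the cropped local patch and the downsampled global image share the same spatial size, which is why the two fusion levels can operate by simple channel concatenation here.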
2. Related Work
2.1. Semantic Segmentation
An FCN [
8] can adapt to the input of images of any size. It substitutes convolutional layers for all of the fully connected layers in the convolutional neural network and uses the softmax function to classify pixels and to achieve semantic segmentation at the pixel level. The U-Net [
9] performs segmentation by cell superposition and improves the segmentation accuracy by connecting the feature images in the encoder and decoder. The PSPNet [
10] uses a pyramid pooling layer to connect global information at different scales, and it can integrate background information very effectively with an overall accuracy of around 90%. Based on FCN, Google has proposed a series of image semantic segmentation models known as Deeplab [
13]. These models can obtain image feature information based on multi-scale perception. The DeepLabV3+ [
17,
18] in the Deeplab series captures more contextual information by increasing the receptive field using atrous spatial pyramid pooling (ASPP [
19]), and its mean F1 score can reach 89.57. However, while the receptive field is increased, small-scale objects can be lost, resulting in discontinuous boundary segmentation. The semantic segmentation methods described above require many parameters and long computing times, and they neglect factors such as computational efficiency and memory consumption, which restricts their application to ultra-high-resolution image segmentation to a certain extent. Parallel asymmetric convolution modules, such as LEDNet [
20], DABNet [
21], RegSeg [
22], UHRSNet [
23], and Dense2Net [
24], have attracted wide attention. These modules prioritize context-based data while potentially overlooking the influence of global information on image segmentation.
2.2. Multi-Branch Networks
Multi-branch networks can perform calculations independently through different branches. Such networks are often used to learn multi-perspective and multi-scale information, and they can ensure the real-time performance and high efficiency of the network structure. They have been extensively utilized in computer vision [
25,
26]. Most of the existing two-branch networks use the combinations of “deep network and low-resolution input” and “shallow network and high-resolution input”, which can greatly reduce the computing cost. The characterization ability of a global multi-branch RNN can be improved by modeling the time delay in time series data [
16]. Wang et al. [
27] proposed a multi-branch network structure with joint channel attention, which mainly consists of global, target, and component branches and can obtain rich local feature information. Herzog et al. [
28] introduced a multi-branch network architecture that utilizes the OSNet [
29] as a foundation, which consists of global, local, top erase, and channel branches and can further extract finer features. Wu et al. [
30] presented a multi-branch network structure with local modules to improve the generalization ability and stability of the network, motivated by the uncertainty of channel attention.
Distinguished from the prevailing dual-branch structures, our model employs a three-branch architecture. By incorporating both local and surrounding branches, our approach enhances the model's ability to capture fine-grained image details, while the global branch establishes spatial relationships, addressing the loss of spatial contextual information. Furthermore, this design alleviates some of the computational burden associated with high-precision image segmentation.
2.3. Attention Mechanism
The attention mechanism uses computers to mimic human vision: weights are assigned at different levels so that the network automatically attends to the critical information in the input image, adaptively suppresses useless information, and efficiently minimizes the noise interference introduced by the background. The attention mechanism was first used in recurrent neural networks (RNNs) to encode input sentences. In convolutional neural networks, attention is employed to extract feature information from feature maps. The channel attention module SENet [
31] was proposed by Hu et al. in 2019; it uses the channel attention mechanism, learning through global pooling to highlight crucial information while de-emphasizing the rest. Based on the U-Net, Yang et al. added the SE channel attention module to optimize image segmentation. Woo et al. proposed CBAM [
32], which integrates both channel and spatial attention mechanisms to enhance the model's effectiveness and to capture more comprehensive attention information. The ECANet [
33] was designed with a local cross-channel interaction strategy that requires no dimensionality reduction and achieves remarkable results. In the GSoP-Net [
34], a GSoP module is incorporated into the backbone network to obtain high-order statistics efficiently. The CCNet [
35] captures dependencies between pixels through a cross-attention module. In the area of computer vision, the transformer [
36], originally developed for natural language processing, has also received extensive attention. Attention-based approaches have likewise been applied to incremental segmentation [
37,
38].
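As a point of reference for the discussion above, the squeeze-and-excitation channel attention of SENet [31] can be sketched in a few lines of PyTorch; the reduction ratio of 16 follows the original paper, and the rest is a minimal illustration rather than the exact module used in any of the cited networks:

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation channel attention: global average pooling
    ("squeeze"), a two-layer bottleneck ("excitation"), and channel-wise
    rescaling of the input feature map."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = x.mean(dim=(2, 3))           # squeeze: B x C channel descriptor
        w = self.fc(w).view(b, c, 1, 1)  # excitation: per-channel weights in (0, 1)
        return x * w                     # reweight each channel of the input

se = SEBlock(32)
y = se(torch.randn(2, 32, 8, 8))
print(y.shape)  # torch.Size([2, 32, 8, 8])
```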
Our model incorporates two attention mechanisms and achieves the fusion of features from different branches. The local and surrounding branches employ SENet to highlight detailed information and to facilitate feature map fusion. Moreover, we introduce a transformer structure with a single-head attention mechanism to capture the contextual information of the image, enabling high-precision segmentation of high-resolution images.
2.4. Comparison
In our proposed model, we emphasize both local and contextual information, using a multi-branch network as the overarching framework. Our approach explicitly considers the influence of the surroundings of target details on segmentation. To this end, we introduce attention mechanism modules and implement hierarchical fusion. On the one hand, the SENet adaptively computes weight coefficients; by adding the outputs of different SENet modules directly, noise is mitigated, enabling finer segmentation of local details. On the other hand, we integrate a transformer structure into the global branch. Rather than employing an encoder–decoder structure or multi-head attention, we reduce the transformer to a single-head attention module. By processing the inputs of the multi-branch structure, a significant amount of contextual information can be captured, which improves segmentation accuracy.
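The single-head attention module described above can be illustrated with the following PyTorch sketch, where a query feature map attends over a context feature map; the projection layout and scaling are standard scaled dot-product attention and are assumptions for illustration, not the exact design of our module:

```python
import torch
import torch.nn as nn

class SingleHeadAttention(nn.Module):
    """Single-head scaled dot-product attention over flattened feature
    maps; a simplified stand-in for a single-head transformer fusion."""
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, query_feat, context_feat):
        # query_feat, context_feat: B x C x H x W  ->  B x HW x C
        b, c, h, w = query_feat.shape
        q = self.q(query_feat.flatten(2).transpose(1, 2))
        k = self.k(context_feat.flatten(2).transpose(1, 2))
        v = self.v(context_feat.flatten(2).transpose(1, 2))
        attn = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)
        out = attn @ v                               # B x HW x C
        return out.transpose(1, 2).view(b, c, h, w)  # back to B x C x H x W

attn = SingleHeadAttention(16)
local = torch.randn(1, 16, 8, 8)
global_ctx = torch.randn(1, 16, 8, 8)
fused = attn(local, global_ctx)
print(fused.shape)  # torch.Size([1, 16, 8, 8])
```

Using a single head keeps one attention map per pixel pair instead of several, which is the main source of the computational savings over a multi-head design.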
4. Experiments
4.1. Details
4.1.1. Data Set
In this paper, the Vaihingen and Potsdam datasets are used. Images for both datasets [
43] were taken with a digital aerial camera by the German Society for Photogrammetry, Remote Sensing and Geoinformation (DGPF) [
44] and mosaicked using Trimble INPHO OrthoVista. The two datasets consist of images labeled into six categories (roads, buildings, low vegetation, trees, cars, and background) and are widely used in the field of remote sensing image recognition.
The Vaihingen dataset is a collection of aerial images of a town in Germany, with high-resolution orthophotos, a digital surface model (DSM), and accurate ground truth for urban areas. It consists of 33 aerial images of different sizes, 16 of which are labeled. The images have an average size of 2494 × 2064 pixels. The six categories in this dataset are unevenly distributed; the pixel count of each category is shown in
Table 2.
The Potsdam dataset is a collection of typical urban scenes with large buildings, narrow streets, and densely built-up areas. The dataset is distinctive in combining orthophoto imagery with a DSM derived from LiDAR data, offering a composite view of the urban landscape. This level of detail provides a robust foundation for advanced research, especially in machine learning and artificial intelligence for object detection, semantic segmentation, and change detection in urban areas. The dataset contains 38 aerial images, 24 of which are labeled, all with the same size of 6000 × 6000 pixels. The six categories in this dataset are also unevenly distributed; the pixel count of each class is depicted in
Table 2.
4.1.2. Evaluation Index
The overall accuracy (OA), F1 score (F1), and mean intersection over union (mIoU) are used as evaluation indices.
F1: the harmonic mean of precision and recall, as defined in Equation (
5)
OA: the ratio of correctly predicted pixels to the total number of pixels, as shown in Equation (
6)
mIoU: the IoU values of all categories are summed and divided by the number of categories, as shown in Equation (
7).
In the above formula, TP denotes instances where the model accurately forecasts a positive result. FP represents situations where the model predicts a positive outcome erroneously, as the actual result is negative. FN arises when the model incorrectly foresees a negative outcome while the true result is positive. TN signifies cases in which the model correctly anticipates a negative outcome.
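Using these TP/FP/FN definitions, all three indices can be computed from a class confusion matrix, for example:

```python
import numpy as np

def segmentation_metrics(pred, gt, num_classes):
    """Compute OA, mean F1, and mIoU from flattened label maps,
    following the TP/FP/FN definitions above."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    for p, g in zip(pred.ravel(), gt.ravel()):
        cm[g, p] += 1
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp   # predicted as class c, but actually another class
    fn = cm.sum(axis=1) - tp   # pixels of class c that were missed
    oa = tp.sum() / cm.sum()
    f1 = 2 * tp / (2 * tp + fp + fn)  # harmonic mean of precision and recall
    iou = tp / (tp + fp + fn)
    return oa, f1.mean(), iou.mean()

# Tiny worked example with 3 classes and 6 pixels.
pred = np.array([0, 0, 1, 1, 2, 2])
gt   = np.array([0, 0, 1, 2, 2, 2])
oa, mean_f1, miou = segmentation_metrics(pred, gt, num_classes=3)
print(round(oa, 3))  # 0.833 (5 of 6 pixels correct)
```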
4.1.3. Experimental Details
Based on the existing literature [
45,
46], in the Vaihingen dataset, we followed the benchmark organizer’s recommendation of utilizing 16 images for training and 17 images for testing. Similarly, in the Potsdam dataset, we employed a training set consisting of 24 tiles and a testing set comprising 14 tiles.
For our experiments, the backbone of the network is a feature pyramid network (FPN) with ResNet101. Feature map sharing was used for the top-down ResNet101 feature maps from conv2 to conv5 and for the smoothing stages of the FPN. Both the cropped local image and the downsampled global image are 500 × 500 pixels in size. An overlap of 50 pixels between adjacent patches was used to avoid vanishing boundaries in the convolutional layers. We used the command-line utility “gpustat”, set the batch size to 1, and disabled gradient computation when measuring the GPU memory usage of the model. All experiments were run on workstations with NVIDIA 1080Ti GPUs, with only one GPU used for training and inference. Our experimental framework is based on PyTorch, and the Adam optimizer was used. The global branch is trained at a learning rate of 1 ×, while the local and surrounding branches are trained at a learning rate of 2 ×. The learning rates of the global, local, and surrounding branches were determined through an exhaustive process of parameter tuning, ensuring an optimal training speed and strong final performance. During training, a batch size of six was used.
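The cropping scheme above (500 × 500 patches whose neighbours overlap by 50 pixels) can be sketched as follows; the handling of the final border patch is an assumption for illustration:

```python
def crop_with_overlap(height, width, patch=500, overlap=50):
    """Return (top, left) coordinates of patch x patch crops whose
    neighbours overlap by `overlap` pixels, so that boundary pixels of
    one patch also appear in the interior of an adjacent patch."""
    stride = patch - overlap
    tops = list(range(0, max(height - patch, 0) + 1, stride))
    lefts = list(range(0, max(width - patch, 0) + 1, stride))
    # Make sure the last row/column of patches reaches the image border.
    if tops[-1] + patch < height:
        tops.append(height - patch)
    if lefts[-1] + patch < width:
        lefts.append(width - patch)
    return [(t, l) for t in tops for l in lefts]

coords = crop_with_overlap(2064, 2494)  # average Vaihingen tile size
print(len(coords))  # 30
```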
4.2. Result
Table 3 and
Table 4 list the experimental results for the Vaihingen and Potsdam datasets. The quantitative indices verify the validity of our model. Specifically, the OA and mIoU results of the proposed model are 92.40% and 84.43% for the Vaihingen dataset and 92.36% and 87.73% for the Potsdam dataset, significantly better than those of most ResNet-based methods. Our model clearly outperforms current contextual information fusion techniques such as DeepLabV3+ and PSPNet; on average, its OA and mIoU are 1.54% and 2.77% higher than those of these aggregation methods. Meanwhile, our approach outperforms multi-scale feature fusion models such as the EaNet, especially in the recognition of buildings and low vegetation: the segmentation results for buildings and low vegetation are 98.90% and 89.32%, improvements of 3.15% and 4.87%, respectively. Our model's overall accuracy is also higher than that of transformer networks such as the BoTNet, with OA and mIoU 1.74% and 2.75% higher. However, for certain ground object types such as trees, the accuracy of our model is similar to that of the compared models. Overall, our model's accuracy is significantly better than that of the other models.
When the target information lies at the image periphery, our model's segmentation accuracy may drop slightly compared with the central regions of the image, though generally within acceptable margins. To mitigate this effect, we apply padding at the image periphery, which reduces the influence of image edges on semantic segmentation accuracy.
We stipulate that a pixel belongs to the edge region of an image if it lies within seven pixels of the image boundary. Experiments on the Vaihingen and Potsdam datasets yielded the comparative results shown in
Table 5. In the Vaihingen dataset, the mIoU for the edge of the image is 62.40, while the mIoU for the central region of the image is 89.17. In the Potsdam dataset, the values were 76.34 and 89.85.
In terms of computational cost, we measure floating-point operations (FLOPs). Our network requires 451.2 FLOPs, compared with 824.7 FLOPs for the concatenated transformer structure. This reduction eases resource constraints on the network while also reducing computation time. Furthermore, our model is not significantly affected by the large volume of data during training: compared to the concatenated transformer structure, our model's training time is merely two-thirds of its duration.
Our model aims to accomplish high-precision semantic segmentation of ultra-high-resolution remote sensing images. Reaching such accuracy inevitably involves trade-offs, particularly in computational complexity. Given the intricate computations required to process the extensive information in ultra-high-resolution images, it is understandable that our model demands considerable computational power and processing time. Consequently, its performance may be limited where computational resources are scarce or where rapid image processing is indispensable. Potential efficiency improvements remain an opportunity for future work.
4.3. Ablation Study
We propose a model that uses two-level fusion to improve image segmentation performance. It is therefore worthwhile to carry out an ablation study examining how each model component affects precision; the experimental results are shown in
Table 6. First, ResNet101 alone was evaluated as the baseline. The results for the two datasets show that ResNet101 alone yields lower mIoU. Next, low-level fusion was added, fusing the local and surrounding features through the SENet structure. Low-level fusion alone improves the mean F1 by 3.48% on average, has a positive effect on OA, and improves the mIoU by 4.99% on average. When only high-level fusion is applied, i.e., the transformer structure fuses the low-level features with the global branch, the model's mean F1 and mIoU improve slightly compared with low-level fusion, while the OA results are similar. Finally, the low-level and high-level fusion processes were combined: the SENet improves accuracy in local areas, and the transformer then connects the contextual information. The multi-level fusion approach outperforms the single low-level or high-level fusion methods in both OA and mIoU, which are 3.02% and 7.43% higher, respectively, than those of ResNet101.
The ablation study indicates that both the low-level and high-level fusion processes are essential and cannot be removed without degrading the performance of the model, and accurate semantic segmentation can be achieved only when these two fusion processes are combined to process images.
5. Conclusions and Future Work
In semantic segmentation of high-resolution images, achieving accurate results requires a comprehensive analysis of both detailed features and contextual information. This paper presents a novel model specifically designed for semantic segmentation of ultra-high-resolution remote sensing imagery. The model uses attention mechanisms and a multi-branch structure to perform feature fusion at two levels: with the added SENet module and transformer, it performs fine image processing through the local and surrounding branches and enhances segmentation precision using the global contextual information captured by the global branch. We conducted extensive testing on the Vaihingen and Potsdam datasets, which cover a diverse range of urban scenes and natural landscapes, ensuring the robustness of our approach. Compared with the majority of ResNet-based models, our model achieves higher segmentation precision.
Our research holds considerable practical significance, offering insights applicable across various real-world domains. In fields such as unmanned driving and smart cities, accurate semantic segmentation can greatly improve the reliability of target identification and the accuracy of information systems. Additionally, our approach enables the processing of remote sensing images acquired over a specific time frame, allowing the extraction of intricate information. This makes it possible to discern patterns of development and change in ground objects within a designated area, providing essential decision support for relevant stakeholders.
Our future work will focus on improving computational efficiency and making the model more lightweight. We aim to optimize the model architecture to reduce computational burdens while maintaining accuracy. This includes replacing standard convolutions with depthwise and pointwise convolutions, as well as integrating atrous (dilated) convolution into the SE module to reduce the number of parameters. Additionally, we will adjust the multi-branch structure and introduce a dynamic weight adjustment mechanism to accelerate processing without significantly impacting performance.