Article

AP-PointRend: An Improved Network for Building Extraction via High-Resolution Remote Sensing Images

1 School of Remote Sensing and Information Engineering, Wuhan University, Wuhan 430072, China
2 State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing (LIESMARS), Wuhan University, Wuhan 430072, China
3 Key Laboratory of the Ministry of Education on Application of Artificial Intelligence in Equipment, Xi’an Research Institute of High Technology, Xi’an 710025, China
4 School of Computer Science, Wuhan University, Wuhan 430072, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Remote Sens. 2025, 17(9), 1481; https://doi.org/10.3390/rs17091481
Submission received: 28 February 2025 / Revised: 19 April 2025 / Accepted: 21 April 2025 / Published: 22 April 2025

Abstract

The automatic extraction of buildings from remote sensing images is crucial for various applications such as urban planning and management, emergency response, and map making and updating. In recent years, deep learning (DL) methods have made significant progress in this field. However, due to the complex and diverse structures of buildings and their interconnections, the accuracy of extracted buildings remains insufficient for high-precision applications such as maps and navigation. To address the problem of inaccurate building boundaries, we propose a modified instance segmentation model, AP-PointRend (Adaptive Parameter-PointRend), to improve the performance of building instance extraction. Specifically, the model can adaptively select the number of iterations and points based on the size of buildings to improve the segmentation accuracy of large buildings. By introducing regularization constraints, discrete small patches are removed, preserving boundaries better during the segmentation process. We also designed an image merging method to eliminate seams, ensure the recall rate, and improve the extraction accuracy. The Vaihingen and WHU benchmark datasets were used to evaluate the performance of the AP-PointRend method. The experimental results showed that the proposed AP-PointRend approach generated better building extraction results compared with other state-of-the-art methods.

1. Introduction

Automatically extracting buildings from remote sensing images is extremely important in various applications such as urban planning, environmental management, risk assessment, and emergency response [1,2,3,4]. In particular, the location, size, and number of buildings are essential for building-oriented applications including building change detection [5,6], and digital city construction [7]. Traditionally, building outline extraction is conducted through manual sketching and vectorization, which can be time-consuming and laborious. With the rapid development of sensors, the resolution of remote-sensing images has gradually increased. High-resolution remote sensing images allow for the precise and automatic extraction of buildings. However, extracting building outlines from remote sensing images is still challenging. The diversity of geometric shapes between buildings and the high complexity of image backgrounds increase the difficulty of accurate extraction. Therefore, efficient automatic building extraction has become an increasingly popular research topic in remote sensing.
With the training of many samples to learn the high-level features representative of images, deep learning has also been introduced into building extraction research. Vakalopoulou et al. [8] achieved the detection of construction targets in remote sensing images based on convolutional neural networks in 2015. Chen et al. [9] designed a 27-layer deep convolutional neural network with convolutional and deconvolutional functions for the pixel-level extraction of buildings on high-resolution images, taking into account the characteristics of diverse building shapes, appearances, and complex distributions. Deep learning networks have powerful feature extraction capabilities. Following the proposal of fully convolutional networks (FCN) [10], Maggiori et al. [11] and Zuo et al. [12] alternately used FCN to model multiscale features and achieve the pixel-to-pixel prediction of building roofs. These FCN-based methods have achieved relatively accurate recognition and positioning in building extraction by restoring shallow spatial details through skip connections. Nevertheless, because the structure or contextual information of buildings has not been fully explored, there is still a lack of strong discriminative clues [13]. An improved direction is to encode more attributes of each building [14] such as boundaries, distance transformation masks, or roof rotations. Such methods still primarily focus on refining the local details of segmented regions.
Inspired by Mask R-CNN [10], advanced models derived from instance segmentation methods have been explored to extract objects from remote sensing images with remarkable results in the past few years [15,16,17,18,19]. However, the existing methods still have two main limitations. First, the complex background of remote sensing images and the diversity of building structures can significantly affect the extraction of building outlines by mask regression methods [20,21]. Second, some small building instances can be misidentified due to the scale differences of high-resolution remote sensing images. In discussing the limitations of the existing methods, we paid particular attention to the PointRend method, which is an effective method for instance segmentation, but its fixed parameters result in low boundary accuracy for large buildings and may produce many discrete patches. In this paper, we developed a new AP-PointRend network to address the limitations of the PointRend method. The model can adaptively choose the number of iterations and points based on the size of the building to improve the segmentation accuracy of large buildings. The introduction of regularization constraints removes discrete small blocks and preserves boundaries better during the segmentation process. We also designed an image merging method to eliminate seams, ensure recall, and improve the extraction accuracy.
The main contributions of this paper are as follows:
(1)
The proposed image annotation method can annotate and cut the entire image to any size and overlapping degree, which solves the problem of existing datasets being of fixed sizes and the difficulty in keeping up with the rapid development of hardware.
(2)
An improved deep learning network, AP-PointRend, is employed in the extraction of building roof outlines. This approach addresses the issue of discrete patches in building extraction using PointRend, and improves the edge effect for large buildings.
(3)
A merging method for the cropped test images was designed that performs dilation followed by erosion, eliminating seams during the stitching process. This approach ensures both accuracy and precision in building instance extraction.

2. Related Work

Building extraction methods include pixel-based and object-oriented deep learning methods [22] and can be further divided into semantic segmentation-based building extraction and instance segmentation-based building extraction.

2.1. Semantic Segmentation-Based Building Extraction

Pixel-based semantic segmentation methods have improved considerably in extracting buildings from remote sensing images. Numerous studies have explored different semantic segmentation methods to improve the accuracy of building extraction. For example, Xu et al. [23] designed a neural network for image segmentation based on a deep residual network using a guided filter to extract buildings. Chen et al. [24] enhanced building extraction from HRSI by using residual channel attention and multidilated convolution blocks. Awad et al. [25] applied a high-pass filter to emphasize high-frequency components within feature maps. However, this method has difficulties in extracting some of the blurry and irregular building boundaries.
With the development of fully convolutional networks (FCNs), many variants of FCNs have been successfully used for building extraction. Huang et al. [26] designed GRRNet, which fuses high-resolution aerial images and LiDAR point cloud data to learn multilevel features for building extraction. Using the original image and its downsampling image as input, Ji et al. [3] proposed a Siamese U-Net that shared weights in two branches and created an open-source, high-quality, multisource dataset WHU for building detection, significantly improving the segmentation accuracy. Xiao et al. [27] fused the U-Net with an encoding enhancer of Swin Transformer to achieve the feature-level fusion of local and large-scale semantics, resulting in better building extraction accuracy. Zhao et al. [28] improved multiscale building extraction by using a multiscale receptive field encoder and multipath decoder to enhance feature capture and edge accuracy. Zhu et al. [29] enhanced building footprint extraction by using an interactive dual-stream decoder to learn the semantic-contour correlations. Yuan et al. [30] combined the Lite Swin Transformer with convolutional neural networks (CNNs) to capture both the global and local features, improving the building extraction precision. Wang et al. [31] introduced an attentional feature fusion (AFF) module to bridge the semantic gap between high-level and low-level features, improving building extraction in complex backgrounds. Nie et al. [32] proposed a clustering-guided semantic decoupling module, consistency-based anti-interference feature extraction module, relevance-based anti-interference feature extraction module, and optional decoder module based on semantic category balance to improve the accuracy of semantic segmentation. Zuo et al. [33] proposed a cross-stage features fusion network (CFF-Net) for building extraction from remote sensing images. Ye et al. [34] introduced two feature fusion modules, spatial scale adaptive fusion and semantic guided fusion, enabling the network to adaptively extract and intentionally select multiscale features from multimodal data. Dai et al. [35] improved building segmentation by using location channel attention and multiscale fusion to enhance the edge details. Li et al. [36] improved building footprint extraction by decoupling body and boundary features using a multiscale fusion and feature decoupling-recoupling module. Zhang et al. [37] enhanced weakly supervised building extraction by using feature-level flipping and a slice-and-merge module to achieve superior accuracy and robustness, while Wu et al. [38] introduced terrain perception loss (TAL) to enhance the ability of deep CNNs to learn heterogeneous architectural features to better preserve boundaries during segmentation. Holail et al. [39] used an ensemble spatial channel attention fusion (ESCAF) module and a depth supervision (DS) module to enable satellite image-based building change detection. Holail et al. [40] achieved the high-precision extraction of building damage and automatic assessment of damage level, the extraction of farmland damage, and the accurate extraction of missile craters based on the intelligent extraction of features by change detection. These breakthroughs in semantic segmentation techniques for building extraction benefited largely from the rapid growth in available remote sensing data and computing power. However, many of these approaches failed to separate individual building instances, causing significant errors in building silhouettes.

2.2. Instance Segmentation-Based Building Extraction

Research on building instance extraction has progressed with the development of instance segmentation methods based on Mask R-CNN. For example, Zhao et al. [2] proposed a method that combined Mask R-CNN with building boundary regularization to generate better-regularized polygons. Wen et al. [6] proposed a modified Mask R-CNN method to detect rotated bounding boxes of buildings and simultaneously segment them from complex backgrounds. Zhu et al. [41] improved building extraction by generating accurate polygons and recovering missing vertices, achieving significant performance gains with reduced inference time. Liu et al. [16] designed a multiscale U-shaped CNN architectural instance extraction framework with edge constraints to extract accurate architectural masks. Chen et al. [42] used parallel contour and DCT branches to improve the boundary accuracy and segmentation of small targets. Xie et al. [43] proposed a stepwise urban building use identification framework that integrated remote sensing and social sensing data with spatial constraints based on the association of buildings with point of interest (POI), area of interest (AOI), and remote sensing data. Guo et al. [44] enhanced the deepest encoded features with a feature enhancement module and used skip connections and transposed convolutions to restore resolution and reduce information loss. Saleh et al. [45] proposed the graph neighbor module, dimension-wise interactive attention module, and attentive supervised learning module to achieve high accuracy detection. Zhang et al. [46] used multidomain style transfer, feature approximation, and cascaded instance extraction to bridge semantic gaps and improve performance across multiple domains. Saleh et al. [47] enhanced detection in SAR imagery by prioritizing significant changes and distinguishing genuine changes from noise. Chen et al. [48] proposed a method based on the foundational SAM model and incorporated semantic category information to achieve automatic instance segmentation of remote sensing images. The Swin Transformer can be used for building extraction, and is a novel self-attention mechanism model with a structure that includes multiple stages of feature extraction modules that can effectively improve the efficiency and accuracy of image processing. It can adapt to buildings of different sizes and shapes, but the extracted building boundaries may not be clear enough, especially for larger buildings. He et al. designed a novel instance segmentation method that used point sampling and hierarchical prediction to improve the accuracy and efficiency of instance segmentation. Wang et al. [49] enhanced cross-scene generalization and task universality through a multilevel feature sampler, cross-attention decoder, and federated training strategy. PointRend can accurately locate and segment building instances in building extraction, but due to fixed parameters, it has lower boundary accuracy for larger buildings and may produce many discrete patches. Fang et al. [50] proposed a coarse-to-fine contour optimization network, introducing channel attention into each layer of the original feature pyramid network (FPN) to improve the recognition of small buildings. Qiu et al. [51] optimized interaction features adaptively using click information, maximizing prior knowledge for building extraction, and reducing the annotation costs while enhancing model generalization. 
While these methods have contributed to the progress of building instance extraction, they are still insufficient in accurately extracting the building boundaries.

3. Materials and Methods

3.1. Datasets

The Vaihingen dataset [52] contains 33 remote sensing images of different sizes extracted from larger orthoimages. The spatial resolution of the orthoimages and the digital surface model (DSM) is 9 cm. The remote sensing images are provided as 8-bit TIFF files with three bands (near-infrared, red, and green), while the DSM images are single-band TIFF files with gray levels (corresponding to DSM heights) encoded as 32-bit floating-point values.
ISPRS provided two state-of-the-art airborne image datasets for urban classification and 3D building reconstruction, each including high-resolution orthoimages and a DSM derived from dense image matching. Both datasets covered urban scenes. The Vaihingen images included small villages with many individual buildings and small multistory buildings. As shown in Figure 1, the images were manually classified into six land cover types: impervious surface (RGB: 255, 255, 255), building (RGB: 0, 0, 255), low vegetation (RGB: 0, 255, 255), tree (RGB: 0, 255, 0), car (RGB: 255, 255, 0), and clutter/background (RGB: 255, 0, 0). The clutter/background class included water bodies (present in two images with part of a river) and other objects that looked very different from everything else (e.g., containers, tennis courts, swimming pools).
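For readers preparing these data for building extraction, the following minimal sketch (assuming NumPy and Pillow, with an illustrative file name) converts a Vaihingen RGB ground-truth tile into a binary building mask using the class colors listed above.

```python
import numpy as np
from PIL import Image

# Illustrative file name; the Vaihingen ground truth encodes classes as RGB colors.
label = np.array(Image.open("vaihingen_label_tile.tif").convert("RGB"))

# Building pixels are labeled (0, 0, 255) in the ISPRS color scheme described above.
building_mask = np.all(label == np.array([0, 0, 255]), axis=-1).astype(np.uint8)

# building_mask is a {0, 1} raster that can be polygonized into training labels;
# individual building instances still require manual annotation (Section 4.2.1).
```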

3.2. The Flow of DL-Based Buildings Outlines Extraction from Remote Sensing Images

The automatic extraction of buildings from remote sensing images using deep learning algorithms is shown in Figure 2. The processing steps are as follows:
(1)
The original images are manually annotated to generate single-scene training data for large scenes. All the large scene aerial images and satellite images in the dataset are cropped according to any overlap degree, and the training data are automatically generated based on the manual labeling data of the original images.
(2)
The building boundary annotation coordinates are recalculated in the cropped training data based on the mapping relationship between the coordinates of the cropped image and the original image. As shown in Figure 3, when a single building is cut into multiple images, the intersection point of the cutting line and the building boundary line must be calculated for coordinate insertion and reordering.
(3)
Based on the training data generated in the previous step, the network model is iteratively trained from the pre-trained model, and the model file is optimized through continuous iterations; training is stopped when a predefined number of epochs is reached. The network model includes the basic framework of building instance segmentation, which completes the multiscale feature extraction. At the same time, mask upsampling based on fine-grained pixel segmentation and adaptive parameter selection is carried out.
(4)
Buildings are extracted from remote sensing images using the model files generated by iterative training that meet the accuracy requirements on the training datasets.
(5)
The trained model is used to extract the two-dimensional outline of the building from multiple cropped test targets. The corresponding extraction results of the multiscene test data are obtained, and the extracted multiple images are merged to eliminate the seam.

3.3. Adaptive Cropping and Automatic Generation of Training Data

Since high-resolution aerial remote sensing images have the characteristics of large format and high spatial resolution, if the single-scene large-format aerial images are directly input into the network for training and testing, the computer memory would not be able to meet the requirements during the continuous iteration process. Meanwhile, in the process of building feature extraction, the images are extracted through a reshaping operation. If the large-size image of a single scene is directly input, most of the features will be lost, so most of the buildings cannot be detected. Thus, the original images must be cropped using the methods shown in Figure 3. However, after cropping the original image with a different overlapping ratio, the annotation coordinates of the buildings in the original single-scene training data change and would have to be re-marked manually, which does not meet the speed requirements for generating multiscale training data. The single-scene image labeling, image adaptive cropping, and training dataset generation strategy is as follows:
(1)
The original large-scale aerial images or satellite images are manually labeled to obtain the initial training data of the building (i.e., the building boundary point set [(x1, y1), (x2, y2), (x3, y3), (x4, y4), …, (xn, yn)]) and the attribute information (e.g., number, category, and range).
(2)
Then, cutting of the original image with overlap is completed, and the original building boundary point set and attribute value are recalculated according to the cutting results.
In the cropping of large scene aerial and satellite images, some buildings will be divided into two or more contours on the clipping line. As shown in Figure 4, the original image building had seven boundary points. The outline became two parts after cropping, and there were two intersection points with the cropping line, so the label data needed to be regenerated. Cropped images require the recalculation of new coordinates for building outlines.
The building is cut into two parts. Since the image attributes have changed, the building coordinates in the new image are calculated and updated. When the overlap ratios (overlap_X, overlap_Y) and the crop size (crop_X × crop_Y) are selected, the numbers of columns and rows of cropping frames can be calculated using the formula:
$$ i_{\max} = \frac{init\_X}{(1 - overlap\_X) \times crop\_X} + 1, \qquad j_{\max} = \frac{init\_Y}{(1 - overlap\_Y) \times crop\_Y} + 1 \tag{1} $$
Given the original image size init_X × init_Y, the initial coordinates (beg_x, beg_y) of the top-left corner of each cropping frame in the original single-scene image can be calculated using the following formula:
$$ beg\_x = i \times (1 - overlap\_X) \times crop\_X, \quad i \in \mathbb{N},\ i \in [0, i_{\max}); \qquad beg\_y = j \times (1 - overlap\_Y) \times crop\_Y, \quad j \in \mathbb{N},\ j \in [0, j_{\max}) \tag{2} $$
The cut images are obtained from the initial coordinates of the top-left corners in the original single-scene image. When the original image size is not evenly divisible by the crop size, the pixels that fall outside the original image are set to zero. The building coordinates (x′, y′) in the cropped image are calculated as follows:
$$ x' = x - beg\_x, \qquad y' = y - beg\_y \tag{3} $$
In cases where a single building is clipped in multiple frames, the coordinates [(m1, n1), (m2, n2), (m3, n3), (m4, n4), …, (mn, nn)] of the intersection between the clipping boundary and the building boundary have to be calculated.
When a single building occupies two cutting frames, the positional relationship between the clipping line and the polygon boundary points should be evaluated. The intersection coordinates (2N) of the cutting line and the building boundary are calculated, and the intersection points are inserted into the original building boundary point set according to the positional relationship. The building segmentation is completed according to the intersection relationship between the intersection points and the original points, and the single point set is divided into n + 1 independent point sets. Finally, the new coordinate of each point is calculated based on the translation relationship of the clipping. When a single building occupies three or more cutting frames, Equation (2) is executed repeatedly to clip the single building. Cropped images at the edges of the original image intersect two clipping lines, while the remaining cropped images intersect four.
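A minimal sketch of this cropping and relabeling procedure is shown below. It assumes the Shapely library for the polygon-frame intersection, and the function names are illustrative rather than taken from our implementation; the intersection automatically inserts the points where the building boundary crosses the cutting lines and splits a building spanning several frames into independent point sets.

```python
from shapely.geometry import Polygon, box

def crop_origins(init_x, init_y, crop_x, crop_y, overlap_x, overlap_y):
    """Top-left corners of all cropping frames, following Equations (1) and (2)."""
    step_x = (1 - overlap_x) * crop_x
    step_y = (1 - overlap_y) * crop_y
    i_max = int(init_x / step_x) + 1                      # Equation (1)
    j_max = int(init_y / step_y) + 1
    return [(int(i * step_x), int(j * step_y))            # Equation (2)
            for i in range(i_max) for j in range(j_max)]

def clip_building(points, beg_x, beg_y, crop_x, crop_y):
    """Clip one building polygon against a cropping frame and shift the result
    into the local coordinates of the cropped image (Equation (3))."""
    frame = box(beg_x, beg_y, beg_x + crop_x, beg_y + crop_y)
    clipped = Polygon(points).intersection(frame)
    parts = getattr(clipped, "geoms", [clipped])          # one or several parts
    return [[(x - beg_x, y - beg_y) for x, y in part.exterior.coords]
            for part in parts if part.geom_type == "Polygon" and not part.is_empty]

# Example: 896-pixel crops with 20% overlap on a 2000 x 2500 image and one building.
building = [(100, 120), (700, 120), (700, 600), (100, 600)]
for beg_x, beg_y in crop_origins(2000, 2500, 896, 896, 0.2, 0.2):
    local_polygons = clip_building(building, beg_x, beg_y, 896, 896)
```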

3.4. Automatic Extraction of Building Roof Contour

AP-PointRend, which was used in this study, is an improved deep learning instance segmentation algorithm based on PointRend [53]. With innovations such as the use of computer graphics technology to regularize extraction results and a self-adaptive method for parameter selection, AP-PointRend achieves more accurate and reliable segmentation. The algorithm structure flowchart is shown in Figure 5. With local feature expression based on fine-grain pixel segmentation, bilinear interpolation upsampling is performed on each building candidate frame mask to obtain the prediction results. The point with the most inaccurate prediction is selected to predict again in order to obtain a higher pixel-quality prediction mask. This step is repeated until the required resolution is reached. First, a lightweight coarse mask prediction head is utilized to generate a coarse mask prediction. Then, points on uncertain edges are selected as prediction points in the generated coarse mask. Finally, the fine-grain features from the coarse prediction of the points selected are extracted at each selected point to generate new mask predictions with higher pixel quality.
During testing, the image is rendered efficiently by explicitly computing only the points that differ from their surrounding pixels; the output values at the remaining positions are obtained by interpolation. The coarse-to-fine process refers to the iteration of the rendered mask at each step. Coarse predictions are obtained from the segmentation head. In each iteration, bilinear interpolation is used to upsample the previous segmentation predictions, and then the most uncertain N points are selected in a dense feature map. The feature representation of these N points is then calculated to predict their categories. This process is repeated k times until the upsampling results reach the required size. One of the iterations is shown in Figure 6.
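The following self-contained sketch (assuming PyTorch) illustrates only the point selection inside one such iteration: after bilinear upsampling, the positions whose foreground probability is closest to 0.5 are taken as the most uncertain points to re-predict. The function name is illustrative, and the lightweight point head that re-predicts those positions is omitted.

```python
import torch
import torch.nn.functional as F

def most_uncertain_points(mask_logits, n_points):
    """Return (row, col) coordinates of the n_points pixels whose foreground
    probability is closest to 0.5, i.e., the positions re-predicted next."""
    probs = mask_logits.sigmoid()
    uncertainty = -(probs - 0.5).abs()                 # larger value = more uncertain
    flat_idx = uncertainty.flatten().topk(n_points).indices
    rows = flat_idx // probs.shape[-1]
    cols = flat_idx % probs.shape[-1]
    return torch.stack([rows, cols], dim=1)            # (n_points, 2)

# Toy example: one bilinear upsampling step of a coarse 7 x 7 mask prediction.
coarse = torch.randn(1, 1, 7, 7)                       # coarse mask logits
upsampled = F.interpolate(coarse, scale_factor=2, mode="bilinear", align_corners=False)
points = most_uncertain_points(upsampled[0, 0], n_points=16)
```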
Given that some misjudgments can occur during the iterative process of point selection, computer graphics technology can be employed to regularize the extraction results for each iteration and remove the wrong discrete patches. If the extracted patches are smaller than the set threshold, the patches are not retained. In the PointRend method, the number of iterations k and the number of selected points N are set empirically. However, the fixed parameters k and N are unsuitable for extracting buildings from high-resolution remote sensing images since building areas can vary considerably. Large buildings can extend to tens of thousands of square meters, while small ones can cover an area as small as tens of square meters. In this case, the extraction results using the same test parameters would not be good. Therefore, a self-adaption method was developed to select parameters k and N based on the building size and image resolution.
$$ N = ax^3 + bx^2 + cx + d, \qquad \mathrm{Original\ resolution} > 28 \times 2^{\,k-1} > w \times h \tag{4} $$
where w and h are the length and width of the bounding rectangle of each iteration process to extract the patches, respectively. Original resolution refers to the resolution of the original image. x is the perimeter of the extracted contour of the building, which is obtained by regular calculation of the extracted features. N is the number of points selected on the contour, as shown in Figure 6 in red, and was counted using OpenCV 4.9.0. Figure 7 is a part of the extracted contour used to calculate the contour coefficient. We predefined an initial value of N and subsequently performed building contour extraction. From the results of the building contour extraction, we manually selected 30 well-extracted contours and randomly selected 20 contours. For these contours, we counted uncertain points selected on their boundaries during the upsampling process and the perimeter of the extracted contour of the building. Then, least squares regression was used to fit the coefficients. The fitted coefficients were used to infer and evaluate the accuracy of the dataset, which was performed 30 times, and the highest accuracy was selected and saved.
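A hedged sketch of this parameter selection is given below, assuming NumPy. The cubic fit follows Equation (4) directly; treating the accompanying inequality as a search for the largest admissible iteration count k is our reading of it, and the function names are illustrative.

```python
import numpy as np

def fit_point_coefficients(perimeters, point_counts):
    """Least-squares fit of a, b, c, d in N = a*x^3 + b*x^2 + c*x + d (Equation (4)),
    from contour perimeters x and the counted uncertain points N."""
    return np.polyfit(np.asarray(perimeters, float), np.asarray(point_counts, float), deg=3)

def adaptive_parameters(coeffs, perimeter, w, h, original_resolution, k_limit=15):
    """Adaptive point number N from the fitted polynomial, and the largest k
    satisfying original_resolution > 28 * 2**(k - 1) > w * h (our reading)."""
    n_points = max(int(round(np.polyval(coeffs, perimeter))), 1)
    admissible = [k for k in range(1, k_limit + 1)
                  if original_resolution > 28 * 2 ** (k - 1) > w * h]
    return n_points, max(admissible, default=1)

# Toy example: coefficients fitted from a few sampled contours, then applied once.
coeffs = fit_point_coefficients([200, 450, 800, 1500], [32, 64, 110, 190])
n_points, k = adaptive_parameters(coeffs, perimeter=950, w=60, h=40,
                                  original_resolution=2494 * 2064)
```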

3.5. Merging Method of Extraction Results of Multiple Remote Sensing Images

When executing model testing on large-format aerial and satellite images, the computer memory may overflow due to the large input size. Thus, the target image needs to be cropped without overlapping. The strategy for merging the recognition results (mask) of each image after cropping is as follows:
(1)
Use the trained model to extract the two-dimensional outline of the building from multiple cropped test targets and obtain the corresponding extraction results of the multiscene test data.
(2)
Perform grayscale conversion and binarization on the extraction result of each test image to obtain the mask, where the mask pixels are set to 255 and the rest to 0.
(3)
Calculate the pixel coordinates of each binary recognition result in the original unclipped image, merge the multiple cropped results (which are binary images), and perform the dilation operation on the merged result image with a specific convolution kernel size. Based on experience, the convolution kernel size is generally 5 × 5 or 7 × 7.
(4)
Perform the erosion operation on the image dilated in the previous step, using the same convolution kernel size as in the dilation. The resulting image after erosion is taken as the final two-dimensional outline of the DL-based building extraction. Since the dilation and erosion operations are applied symmetrically, they primarily refine the mask while enhancing the segmentation accuracy. These operations help maintain the integrity of object boundaries, reduce noise and small artifacts, and ensure improved extraction precision, particularly in the merging process of multiple remote sensing images, as sketched below.
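A minimal sketch of this merging step is shown below, assuming OpenCV and NumPy; the function and variable names are illustrative.

```python
import cv2
import numpy as np

def merge_tile_masks(tile_masks, tile_origins, full_shape, kernel_size=5):
    """Paste the binarized per-tile masks (255 = building, 0 = background) back into
    the full frame, then apply dilation followed by erosion to close the seams."""
    merged = np.zeros(full_shape, dtype=np.uint8)
    for mask, (beg_x, beg_y) in zip(tile_masks, tile_origins):
        h, w = mask.shape
        region = merged[beg_y:beg_y + h, beg_x:beg_x + w]
        np.maximum(region, mask, out=region)                 # union of tile results

    kernel = np.ones((kernel_size, kernel_size), np.uint8)   # 5 x 5 or 7 x 7, as above
    dilated = cv2.dilate(merged, kernel, iterations=1)
    return cv2.erode(dilated, kernel, iterations=1)          # same kernel as the dilation
```

Dilation followed by erosion with the same kernel amounts to a morphological closing, which fills the thin seams between tiles without systematically enlarging the building masks.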

4. Experiment and Analysis

4.1. Experiment Environment

In this study, the network was developed based on the PyTorch 1.6 deep learning framework, using Python 3.7 and CUDA 10.0. The training and testing experiments were implemented on an Ubuntu 16.04 operating system. The computer used in the experiments was configured with an E5-2650 v3 Intel(R) Xeon(R) CPU, 32 GB RAM (Produced in China), and two Nvidia RTX 2080 GPUs (Produced in China). All networks were trained with the Adam optimizer. The learning rate was initially set to 0.001, the momentum was 0.9, and the batch size in the training phase was fixed at 2. The cropped image sizes were set as 448 × 448, 896 × 896, and 1792 × 1792. The network was trained for 50 epochs using the training set, and models that performed well on the validation set were stored. All models were built using the MMDetection 3.2.0 deep learning framework [54]. Three deep learning methods were also used for comparative analysis: PointRend, Swin Transformer [55], and Mask R-CNN.
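The stated optimization settings can be reproduced in plain PyTorch roughly as follows; the placeholder module stands in for the detector, and interpreting the reported momentum of 0.9 as Adam's first-moment coefficient is an assumption on our part.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 1, kernel_size=3, padding=1)   # placeholder for the detector

# Adam with an initial learning rate of 0.001; the reported momentum of 0.9 is
# assumed to correspond to Adam's first-moment coefficient (beta1).
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))

batch_size = 2      # batch size fixed in the training phase
max_epochs = 50     # epochs trained, storing models that perform well on validation
```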

4.2. Experiment Results

4.2.1. Generate Datasets of Arbitrary Size and Overlap

The large-format remote sensing images of the Vaihingen dataset were used in the experiment and were manually labeled using LABELME 5.3.1; only the buildings were labeled when generating the MS COCO [56] format datasets. For the instance segmentation, the large-format RS images were cropped to different image sizes (i.e., 448 × 448, 896 × 896, and 1792 × 1792) and overlap degrees (i.e., 0%, 20%, and 50%). The annotation results and datasets generated in this paper are shown in Figure 8.

4.2.2. Visualization Results

The proposed method was evaluated on two building segmentation datasets: the Vaihingen dataset and the WHU dataset.

Building Extraction on Vaihingen Dataset

The results of the different building extraction methods are presented in Figure 9. Using visual interpretation, the proposed AP-PointRend method achieved the best global extraction results compared with the other methods. As shown in the third row of Figure 9, Mask R-CNN, Swin Transformer, and PointRend achieved acceptable results for a scene with relatively small buildings. However, as the complexity and building area increased, a significant drop in performance was observed (see the first and third rows of Figure 9), with some buildings having noticeable zigzag edges. The results suggest that traditional methods have difficulties in accurately identifying buildings with irregular structures and large scales, seriously affecting the visual effect.
In contrast, almost all buildings were correctly and completely identified by AP-PointRend, indicating that the proposed model could better iteratively refine the building boundaries. While Mask R-CNN, Swin Transformer, and PointRend could predict large buildings to a certain extent, they were unable to handle large buildings with complex shapes, as shown in the fourth row of Figure 9. Although AP-PointRend is an improved version of PointRend, it is more sensitive to building scale and structure changes.
To better illustrate the detailed results, Figure 10 and Figure 11 show close-up images of selected areas (from the red rectangles in Figure 9). From the close-up view, traditional building extraction methods exhibited limited capabilities for structurally complex and large buildings, resulting in inaccurate boundary predictions. However, due to the enhanced boundary delineation and regularization that removed discrete small patches, the proposed method generated only a small number of misclassified pixels in the boundary region, performing considerably better than the traditional approaches.

Building Extraction on WHU Dataset

Figure 12 presents the segmentation results of different models on buildings in the WHU dataset. As shown in the first row, the buildings were relatively small, and the segmentation results of each model were difficult to distinguish. In the second row, there was a large building with many corners, but the extraction results of Mask R-CNN and Swin Transformer were not accurate enough for the corner parts; in comparison, PointRend and AP-PointRend performed much better, especially in terms of details. In the third row, there was a building divided into two parts with significant differences in color and texture. The extraction results of Mask R-CNN and Swin Transformer could not distinguish the corner parts clearly. In the fourth row, there were several prominent details in a large building, but Mask R-CNN and Swin Transformer failed to extract them completely, and the extraction effect of PointRend was not good enough. In summary, AP-PointRend was better able to extract details and improve the completeness and accuracy of contours when segmenting large and structurally complex buildings.
To better compare the extraction results of different models, the segmentation results in Figure 13 were displayed on a black background rather than overlaid on the original image, highlighting the edge details of the extraction results of different models. It can be observed that Mask R-CNN and Swin Transformer performed poorly in extracting the corner and edge details and failed to extract these details completely. In comparison, PointRend and AP-PointRend performed much better. Figure 14 shows a close-up of the selected area in the test image (as indicated by the red rectangle in Figure 13).

4.3. Quantitative Analysis

To quantitatively evaluate the comprehensive performance of the remote sensing image extraction algorithms, the precision, recall, and average precision (AP) (calculated using Equations (5)–(8)) were used to evaluate the test results. These indices are widely used in building extraction [57].
A true positive (TP) is an extracted building outline whose IoU with the manually labeled outline is greater than the threshold, a false positive (FP) is an extracted outline whose IoU with the ground truth is smaller than the threshold, and a false negative (FN) is a labeled building that has not been detected.
$$ IoU = \frac{area(dt \cap gt)}{area(dt \cup gt)} \tag{5} $$
$$ precision = \frac{TP}{TP + FP} \tag{6} $$
$$ recall = \frac{TP}{TP + FN} \tag{7} $$
$$ AP = \frac{1}{11} \sum_{r \in \{0, 0.1, \ldots, 1.0\}} p(r) \tag{8} $$
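A compact sketch of these metrics, assuming NumPy, is given below; the 11-point interpolation in Equation (8) takes, for each recall level r, the maximum precision achieved at recall ≥ r.

```python
import numpy as np

def iou(dt_mask, gt_mask):
    """Equation (5): intersection over union of a detected and a ground-truth mask."""
    inter = np.logical_and(dt_mask, gt_mask).sum()
    union = np.logical_or(dt_mask, gt_mask).sum()
    return inter / union if union else 0.0

def precision_recall(tp, fp, fn):
    """Equations (6) and (7) from TP/FP/FN counts at a given IoU threshold."""
    return tp / (tp + fp), tp / (tp + fn)

def average_precision_11pt(recalls, precisions):
    """Equation (8): 11-point interpolated AP from a precision-recall curve."""
    recalls, precisions = np.asarray(recalls), np.asarray(precisions)
    ap = 0.0
    for r in np.linspace(0.0, 1.0, 11):
        above = recalls >= r
        ap += precisions[above].max() if above.any() else 0.0
    return ap / 11.0
```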
Table 1 presents the comparative results of the different methods, where the highest values are shown in bold. Segm and Box for extracting buildings were evaluated separately. The results indicate that AP-PointRend achieved excellent performance, outperforming other methods in all indices except ARBox.
While Swin Transformer had the highest ARBox value, AP-PointRend had a much better APsegm value of 0.646 compared with Swin Transformer’s 0.635. The quantitative comparison of the different models on the WHU dataset is shown in Table 2, where the best accuracy is highlighted in bold. We evaluated two types of building extraction, Segm and Box. AP-PointRend performed excellently in all indicators and outperformed the comparison methods; compared with the second-best method (Swin Transformer), AP-PointRend increased the APsegm value from 0.616 to 0.637 and the APBox value from 0.618 to 0.640.
The accuracy improvements in the proposed approach were most significant over PointRend. The results suggest that AP-PointRend is robust enough to handle building extraction in aerial images with complex urban scenes.

5. Discussion

5.1. Influence on Building Extraction with Different Sizes and Overlapping Rate

When training models on large-format aerial and satellite images, the images need to be clipped, because an oversized input causes memory overflow and many features are lost in the reshape operation during feature extraction. When the trained model was tested on the entire large-format image, the resolution gap between the training and test images was too large; even though the extraction parameters were adjusted (for example, by greatly increasing the number of candidate boxes), the extraction effect was still poor and many buildings were not detected, as shown in Figure 15a. When tested with the cropped image, the extraction accuracy was significantly improved, as shown in Figure 15b.
Using the Vaihingen datasets, the cropping methods using different slice sizes and overlapping rates were evaluated, and the extraction results are shown in Table 3. When Mask R-CNN had a slice size of 448 × 448, the accuracy of various indicators for extracting the building Mask and Box improved as the overlapping rate increased, albeit slightly. For the cropped image sizes of 896 × 896 and 1792 × 1792, the accuracy for the building Mask and Box improved considerably. The changing trend of PointRend was consistent with Mask R-CNN, and the changing trend of the Swin Transformer with respect to the overlapping rates was consistent with Mask R-CNN and PointRend. As the extent of the cropped image increased, the accuracy change of the Swin Transformer was not as large as that of Mask R-CNN and PointRend. This suggests that the generalization and robustness of the Swin Transformer algorithm are comparatively better.
When the cropped image only contained a part of the building, due to insufficient context information [58,59], distinguishing objects with similar texture features became more difficult, resulting in inaccurate building extraction results. The prediction results of the edge area were not accurate enough because the edge pixels lacked enough contextual information and had a limited field of view for reliable prediction.
With cropping and overlapping, each cropped image can be input into the network to increase the amount of data and augment the context information, thus improving building extraction in remote sensing images.

5.2. Accuracy Evaluation Analysis

In terms of extraction accuracy, PointRend had the worst results, but compared with Mask R-CNN and Swin Transformer, its visual edge was smoother and fit better with the true edge. The main reasons are as follows:
(1)
Some extraction results had discrete image patches, as shown in Figure 16d; the overall edge was smooth, but there were discrete small image spots locally.
(2)
The existing accuracy index IoU only measures the area coverage of the evaluation area, which cannot evaluate the accuracy of the boundary.
The proposed method, AP-PointRend, improves the building extraction accuracy, particularly for large-scale buildings, by adaptively selecting different parameters based on the building perimeter. Furthermore, by regularizing the extracted buildings and removing small, discrete image spots, the extraction accuracy is significantly improved. However, in some environments with more complex structures, the building outline may be incomplete due to occlusion.
In addition, compared with traditional segmentation methods, boundary-aware loss functions [60,61,62] have been shown to improve the extraction accuracy by explicitly directing the model to improve its edge predictions. However, this approach often relies on complex loss formulations and may require additional computational resources. At the same time, alternative network architectures such as the KAN (Kolmogorov–Arnold Network) [63,64,65,66,67,68,69,70] have also been shown to greatly improve the accuracy of building extraction. Offering a promising trade-off between accuracy and efficiency, such models provide a path for future exploration.

5.3. Merging Method of Extraction Results of Multiple RS Images

Figure 17 shows the extraction results and merging results. From the partial close-up of the extraction results (see the enlarged red box), the seams were eliminated after merging using the proposed approach.
The traditional block merging method resulted in cracks, with the blocks failing to fit together perfectly in either direction.
When the block merging method with overlap was adopted, some directions still had cracks (e.g., the Y-direction). Filling the cracks afterward caused a slight loss in the extraction accuracy of the building outline at the edges and even influenced the entire image. In contrast, the proposed method guaranteed the extraction precision and accuracy by ensuring the recall rate. While the proposed method ensures building integrity, the convolutional kernel is still empirically chosen, and in more complex environments where the separation between buildings is small, this may result in the erroneous merging of multiple buildings into a single building.

6. Conclusions

Large-sample, accurate, and multisource datasets are indispensable in developing and applying deep neural networks in remote sensing applications. In this study, we designed a method of adaptive cutting and automatic training data generation for large-scale remote sensing images with overlapping rates. This method can generate suitable datasets according to hardware performance and guarantee that more buildings can be detected, which is useful in developing and evaluating new methods. Experiments showed that the method is effective and necessary. We developed a new instance segmentation model AP-PointRend, based on PointRend, which can select parameters according to the size of the extracted building, significantly improving the accuracy, particularly in large-area buildings. By adding regularization constraints to remove discrete patches, the boundary accuracy is improved. To eliminate the seams, the block merging strategy is proposed, ensuring the recall rate and improving the extraction accuracy. This study provides a good reference for DL-based building extraction. Since the existing IoU-based evaluation method is not sensitive to boundary accuracy, an index that can evaluate the boundary accuracy will be designed in subsequent research.

Author Contributions

B.Z., D.Y., X.X., A.L., J.S. and D.L. proposed the idea and wrote the manuscript; D.Y., B.Z., X.X., J.S., Z.C. and Y.S. performed the data preprocessing; D.Y., B.Z., X.X. and J.S. designed and performed the experiments; X.X., Z.C., Y.S., A.L. and D.L. helped to revise the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

The work was supported by the National Natural Science Foundation of China (Grant No. 42101449), the National Key Research and Development Program of China (Grant No. 2023YFB3906102), the Open Foundation of Key Laboratory of the Ministry of Education on Application of Artificial Intelligence in Equipment (Grant No. AAIE-2023-0402), the Natural Science Foundation of Hubei Province, China (Grant No. 2022CFB773), the Science and Technology Program of Southwest China Research Institute of Electronic Equipment (Grant No. JS20200500114), the Chutian Scholar Program of Hubei Province, and the LIESMARS Special Research Funding. We sincerely thank the editors and anonymous reviewers for their outstanding comments.

Data Availability Statement

Vaihingen dataset: https://www.isprs.org/education/benchmarks/UrbanSemLab/2d-sem-label-vaihingen.aspx, accessed on 12 January 2023. WHU building dataset: http://gpcv.whu.edu.cn/data/building_dataset.html, accessed on 30 June 2023.

Acknowledgments

The authors thank ISPRS for providing the open-access and free aerial image dataset. The authors would also like to thank the anonymous reviewers and editors for their insightful comments and helpful suggestions.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Mayer, H. Automatic Object Extraction from Aerial Imagery—A Survey Focusing on Buildings. Comput. Vis. Image Underst. 1999, 74, 138–149. [Google Scholar] [CrossRef]
  2. Zhao, K.; Kang, J.; Jung, J.; Sohn, G. Building Extraction from Satellite Images Using Mask R-CNN with Building Boundary Regularization. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Salt Lake City, UT, USA, 18–22 June 2018; pp. 242–2424. [Google Scholar]
  3. Ji, S.; Wei, S.; Lu, M. Fully Convolutional Networks for Multisource Building Extraction From an Open Aerial and Satellite Imagery Data Set. IEEE Trans. Geosci. Remote Sens. 2019, 57, 574–586. [Google Scholar] [CrossRef]
  4. Shrestha, S.; Vanneschi, L. Improved Fully Convolutional Network with Conditional Random Fields for Building Extraction. Remote Sens. 2018, 10, 1135. [Google Scholar] [CrossRef]
  5. Bi, Q.; Qin, K.; Zhang, H.; Zhang, Y.; Li, Z.; Xu, K. A Multi-Scale Filtering Building Index for Building Extraction in Very High-Resolution Satellite Imagery. Remote Sens. 2019, 11, 482. [Google Scholar] [CrossRef]
  6. Li, W.; He, C.; Fang, J.; Zheng, J.; Fu, H.; Yu, L. Semantic Segmentation-Based Building Footprint Extraction Using Very High-Resolution Satellite Images and Multi-Source GIS Data. Remote Sens. 2019, 11, 403. [Google Scholar] [CrossRef]
  7. Zhang, B.; Wang, C.; Shen, Y.; Liu, Y. Fully Connected Conditional Random Fields for High-Resolution Remote Sensing Land Use/Land Cover Classification with Convolutional Neural Networks. Remote Sens. 2018, 10, 1889. [Google Scholar] [CrossRef]
  8. Vakalopoulou, M.; Karantzalos, K.; Komodakis, N.; Paragios, N. Building Detection in Very High Resolution Multispectral Data with Deep Learning Features. In Proceedings of the 2015 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Milan, Italy, 26–31 July 2015; pp. 1873–1876. [Google Scholar]
  9. Chen, K.; Fu, K.; Gao, X.; Yan, M.; Sun, X.; Zhang, H. Building Extraction from Remote Sensing Images with Deep Learning in a Supervised Manner. In Proceedings of the 2017 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Fort Worth, TX, USA, 23–28 July 2017; pp. 1672–1675. [Google Scholar]
  10. Shelhamer, E.; Long, J.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 640–651. [Google Scholar] [CrossRef]
  11. Maggiori, E.; Tarabalka, Y.; Charpiat, G.; Alliez, P. Convolutional Neural Networks for Large-Scale Remote-Sensing Image Classification. IEEE Trans. Geosci. Remote Sens. 2017, 55, 645–657. [Google Scholar] [CrossRef]
  12. Zuo, T.; Feng, J.; Chen, X. HF-FCN: Hierarchically Fused Fully Convolutional Network for Robust Building Extraction. In Proceedings of the Computer Vision—ACCV 2016, Taipei, Taiwan, 20–24 November 2016; Springer: Cham, Switzerland, 2017; pp. 291–302. [Google Scholar]
  13. Ji, S.; Wei, S.; Lu, M. A Scale Robust Convolutional Neural Network for Automatic Building Extraction from Aerial and Satellite Imagery. Int. J. Remote Sens. 2019, 40, 3308–3322. [Google Scholar] [CrossRef]
  14. Bischke, B.; Helber, P.; Folz, J.; Borth, D.; Dengel, A. Multi-Task Learning for Segmentation of Building Footprints with Deep Neural Networks. In Proceedings of the 2019 IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, 22–25 September 2019; pp. 1480–1484. [Google Scholar]
  15. Wen, Q.; Jiang, K.; Wang, W.; Liu, Q.; Guo, Q.; Li, L.; Wang, P. Automatic Building Extraction from Google Earth Images under Complex Backgrounds Based on Deep Instance Segmentation Network. Sensors 2019, 19, 333. [Google Scholar] [CrossRef]
  16. Liu, Y.; Chen, D.; Ma, A.; Zhong, Y.; Fang, F.; Xu, K. Multiscale U-Shaped CNN Building Instance Extraction Framework with Edge Constraint for High-Spatial-Resolution Remote Sensing Imagery. IEEE Trans. Geosci. Remote Sens. 2021, 59, 6106–6120. [Google Scholar] [CrossRef]
  17. Zhu, Y.; Huang, B.; Gao, J.; Huang, E.; Chen, H. Adaptive Polygon Generation Algorithm for Automatic Building Extraction. IEEE Trans. Geosci. Remote Sens. 2022, 60, 4702114. [Google Scholar] [CrossRef]
  18. He, S.; Jiang, W. Boundary-Assisted Learning for Building Extraction from Optical Remote Sensing Imagery. Remote Sens. 2021, 13, 760. [Google Scholar] [CrossRef]
  19. Jin, Y.; Xu, W.; Zhang, C.; Luo, X.; Jia, H. Boundary-Aware Refined Network for Automatic Building Extraction in Very High-Resolution Urban Aerial Images. Remote Sens. 2021, 13, 692. [Google Scholar] [CrossRef]
  20. Zhang, H.; Liao, Y.; Yang, H.; Yang, G.; Zhang, L. A Local–Global Dual-Stream Network for Building Extraction From Very-High-Resolution Remote Sensing Images. IEEE Trans. Neural Netw. Learn. Syst. 2022, 33, 1269–1283. [Google Scholar] [CrossRef]
  21. Yu, Y.; Ren, Y.; Guan, H.; Li, D.; Yu, C.; Jin, S.; Wang, L. Capsule Feature Pyramid Network for Building Footprint Extraction From High-Resolution Aerial Imagery. IEEE Geosci. Remote Sens. Lett. 2021, 18, 895–899. [Google Scholar] [CrossRef]
  22. Shao, Z.; Tang, P.; Wang, Z.; Saleem, N.; Yam, S.; Sommai, C. BRRNet: A Fully Convolutional Neural Network for Automatic Building Extraction From High-Resolution Remote Sensing Images. Remote Sens. 2020, 12, 1050. [Google Scholar] [CrossRef]
  23. Xu, Y.; Wu, L.; Xie, Z.; Chen, Z. Building Extraction in Very High Resolution Remote Sensing Imagery Using Deep Learning and Guided Filters. Remote Sens. 2018, 10, 144. [Google Scholar] [CrossRef]
  24. Chen, M.; Mao, T.; Wu, J.; Du, R.; Zhao, B.; Zhou, L. SAU-Net: A Novel Network for Building Extraction From High-Resolution Remote Sensing Images by Reconstructing Fine-Grained Semantic Features. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 6747–6761. [Google Scholar] [CrossRef]
  25. Awad, B.; Erer, I. FAUNet: Frequency Attention U-Net for Parcel Boundary Delineation in Satellite Images. Remote Sens. 2023, 15, 5123. [Google Scholar] [CrossRef]
  26. Huang, J.; Zhang, X.; Xin, Q.; Sun, Y.; Zhang, P. Automatic Building Extraction from High-Resolution Aerial Images and LiDAR Data Using Gated Residual Refinement Network. ISPRS J. Photogramm. Remote Sens. 2019, 151, 91–105. [Google Scholar] [CrossRef]
  27. Xiao, X.; Guo, W.; Chen, R.; Hui, Y.; Wang, J.; Zhao, H. A Swin Transformer-Based Encoding Booster Integrated in U-Shaped Network for Building Extraction. Remote Sens. 2022, 14, 2611. [Google Scholar] [CrossRef]
  28. Zhao, Y.; Sun, G.; Zhang, L.; Zhang, A.; Jia, X.; Han, Z. MSRF-Net: Multiscale Receptive Field Network for Building Detection From Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5515714. [Google Scholar] [CrossRef]
  29. Zhu, X.; Zhang, X.; Zhang, T.; Tang, X.; Chen, P.; Zhou, H.; Jiao, L. Semantics and Contour Based Interactive Learning Network for Building Footprint Extraction. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5623513. [Google Scholar] [CrossRef]
  30. Yuan, W.; Zhang, X.; Shi, J.; Wang, J. LiteST-Net: A Hybrid Model of Lite Swin Transformer and Convolution for Building Extraction from Remote Sensing Image. Remote Sens. 2023, 15, 1996. [Google Scholar] [CrossRef]
  31. Wang, Y.; Zhao, Q.; Wu, Y.; Tian, W.; Zhang, G. SCA-Net: Multiscale Contextual Information Network for Building Extraction Based on High-Resolution Remote Sensing Images. Remote Sens. 2023, 15, 4466. [Google Scholar] [CrossRef]
  32. Nie, J.; Wang, Z.; Liang, X.; Yang, C.; Zheng, C.; Wei, Z. Semantic Category Balance-Aware Involved Anti-Interference Network for Remote Sensing Semantic Segmentation. IEEE Trans. Geosci. Remote Sens. 2023, 61, 4409712. [Google Scholar] [CrossRef]
  33. Zuo, X.; Shao, Z.; Wang, J.; Huang, X.; Wang, Y. A cross-stage features fusion network for building extraction from remote sensing images. Geo-Spat. Inf. Sci. 2024, 27, 1–15. [Google Scholar] [CrossRef]
  34. Ye, Z.; Li, Y.; Li, Z.; Liu, H.; Zhang, Y.; Li, W. Attention Multiscale Network for Semantic Segmentation of Multimodal Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5610315. [Google Scholar] [CrossRef]
  35. Dai, X.; Xia, M.; Weng, L.; Hu, K.; Lin, H.; Qian, M. Multiscale Location Attention Network for Building and Water Segmentation of Remote Sensing Image. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5609519. [Google Scholar] [CrossRef]
  36. Li, Y.; Hong, D.; Li, C.; Yao, J.; Chanussot, J. HD-Net: High-resolution decoupled network for building footprint extraction via deeply supervised body and boundary decomposition. ISPRS J. Photogramm. Remote Sens. 2024, 209, 51–65. [Google Scholar] [CrossRef]
Figure 1. Subdivisions in the Vaihingen datasets: (a) original image and (b) annotation.
Figure 2. The flow of the DL-based building outline extraction from the remote sensing images.
Figure 3. Remote sensing image cutting method. (a) Cutting results without overlap and (b) cutting results with overlap.
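The cutting scheme in Figure 3 (and the 1792 × 1792 crops with a 50% overlapping rate shown later in Figure 8) amounts to sliding a fixed-size window over the scene with a stride smaller than the window size. The following is a minimal sketch of such tiling, assuming NumPy arrays; the function name, the placeholder scene, and the border handling are illustrative assumptions, not code from the paper.

```python
import numpy as np

def tile_image(image, tile=896, overlap=0.5):
    """Cut an H x W (x C) image into square tiles with a given overlap rate.

    Returns (row_offset, col_offset, tile_array) triples so that the offsets
    can later be used to place per-tile predictions back into the full scene.
    """
    h, w = image.shape[:2]
    stride = max(1, int(tile * (1.0 - overlap)))   # step between tile origins
    tiles = []
    for r in range(0, max(h - tile, 0) + 1, stride):
        for c in range(0, max(w - tile, 0) + 1, stride):
            tiles.append((r, c, image[r:r + tile, c:c + tile]))
    # Scenes whose size is not a multiple of the stride leave a right/bottom
    # border uncovered; in practice those border tiles are padded or shifted inward.
    return tiles

# Example: cut a scene into 1792 x 1792 blocks with a 50% overlapping rate.
scene = np.zeros((5000, 5000, 3), dtype=np.uint8)   # placeholder image
blocks = tile_image(scene, tile=1792, overlap=0.5)
print(len(blocks))                                   # 16 blocks for this scene size
```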
Figure 4. Example of the automatic generation of training data with cropping.
Figure 5. Building outline extraction based on the PointRend DL network: (a) coarse prediction and fine-grained feature extraction; (b) upsampling to generate new mask predictions with higher pixel quality.
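The two stages sketched in Figure 5 can be summarized as follows: predict a coarse mask, upsample it, locate the points whose foreground probability is closest to 0.5, and re-classify only those points from fine-grained features. The snippet below is a simplified PyTorch-style sketch of this PointRend-type refinement loop; the `point_head` module, the tensor shapes, and the default `num_points`/`steps` values are assumptions made for illustration, not the authors' implementation (the real PointRend also samples subpixel locations with `grid_sample`). In AP-PointRend, the number of iterations and the number of sampled points (the `steps` and `num_points` knobs here) are chosen adaptively according to building size rather than being fixed.

```python
import torch
import torch.nn.functional as F

def refine_mask(coarse_logits, fine_feats, point_head, num_points=1024, steps=3):
    """Iteratively upsample a coarse mask and re-predict its most uncertain points.

    coarse_logits: (B, 1, h, w) coarse mask logits from the segmentation head.
    fine_feats:    (B, C, H, W) fine-grained feature map.
    point_head:    small MLP mapping (B, C + 1, N) point features to (B, 1, N) logits.
    """
    logits = coarse_logits
    for _ in range(steps):
        # 1. Upsample the current prediction by a factor of 2 (bilinear "rendering").
        logits = F.interpolate(logits, scale_factor=2, mode="bilinear", align_corners=False)
        b, _, h, w = logits.shape
        # 2. Uncertainty: distance of the foreground probability from 0.5.
        prob = logits.sigmoid().view(b, -1)
        uncertainty = -(prob - 0.5).abs()
        n = min(num_points, h * w)
        _, idx = uncertainty.topk(n, dim=1)                  # (B, n) flat point indices
        # 3. Gather fine-grained features and current logits at those points.
        feats = F.interpolate(fine_feats, size=(h, w), mode="bilinear", align_corners=False)
        feats = feats.view(b, feats.shape[1], -1)            # (B, C, h*w)
        point_feats = torch.gather(feats, 2, idx.unsqueeze(1).expand(-1, feats.shape[1], -1))
        point_logits_in = torch.gather(logits.view(b, 1, -1), 2, idx.unsqueeze(1))
        # 4. Re-classify only the selected points and write them back into the mask.
        new_logits = point_head(torch.cat([point_feats, point_logits_in], dim=1))
        logits = logits.view(b, 1, -1).scatter(2, idx.unsqueeze(1), new_logits).view(b, 1, h, w)
    return logits
```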
Figure 6. Local feature representation of building boundaries based on fine-grained pixel segmentation.
Figure 7. Selected part of the extracted contours.
Figure 8. Adaptive cropping of a large-format remote sensing image with different overlapping rates: (a) annotation image; (b) cropped image with a size of 1792 × 1792 and a 50% overlapping rate.
Figure 9. Examples of the building extraction results obtained by different methods on the Vaihingen datasets: (a) original image, (b) Mask R-CNN, (c) Swin Transformer, (d) PointRend, and (e) AP-PointRend. The red rectangle in the first column of Figure 9 marks the building selected for the close-ups in Figure 10 and Figure 11.
Figure 10. Close-up of the individual building extraction results obtained by different methods: (a) original image, (b) Mask R-CNN, (c) Swin Transformer, (d) PointRend, and (e) AP-PointRend.
Figure 11. Close-up comparison of the individual building extraction results obtained by different methods and ground truth. The images shown in (ae) are from the selected buildings marked in Figure 10a: (a) ground truth, (b) Mask R-CNN, (c) Swin Transformer, (d) PointRend, and (e) AP-PointRend.
Figure 12. Examples of the building extraction results obtained by different methods on the WHU datasets: (a) original image, (b) Mask R-CNN, (c) Swin Transformer, (d) PointRend, and (e) AP-PointRend.
Figure 13. Building extraction results obtained by the various methods on the WHU dataset, shown without the original images as background: (a) ground truth, (b) Mask R-CNN, (c) Swin Transformer, (d) PointRend, and (e) AP-PointRend. The red rectangle in the first column of Figure 12 marks the building selected for the close-up in Figure 13.
Figure 14. Close-up comparison of the individual building extraction results obtained by different methods and ground truth. The images shown in (ae) are from the selected buildings marked in Figure 13a: (a) ground truth, (b) Mask R-CNN, (c) Swin Transformer, (d) PointRend, and (e) AP-PointRend; yellow, green, and red indicate the true positive, false negative, and false positive, respectively.
Figure 15. Close-up comparison of the building extraction results using whole and cropped images: (a) whole image testing, and (b) cropped image testing.
Figure 16. PointRend-based building extraction results: (a) original image; (b) extraction result on the original image; (c) extraction result on a black background; (d) partial close-up of the yellow rectangular frame.
Figure 17. Mosaic of image blocks: (a) merged image blocks without post-processing; (b) image blocks merged by our method.
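Figure 17 contrasts an unprocessed block mosaic with the seam-free result of the merging step. One common way to obtain such a result, and the assumption behind the sketch below (it is not necessarily the exact rule used by AP-PointRend), is to keep only the detections whose centers lie away from their tile borders, so that each building is contributed by the tile that sees it most completely, and then to suppress any remaining duplicates by IoU.

```python
def _iou(a, b):
    """Intersection over union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / (union + 1e-9)

def merge_tile_detections(tile_results, tile=1792, overlap=0.5, iou_thr=0.5):
    """Merge per-tile detections into full-scene coordinates without seams.

    tile_results: list of (row_off, col_off, detections), where each detection
    is a dict with 'bbox' = [x1, y1, x2, y2] in tile coordinates and 'score'.
    """
    margin = int(tile * overlap / 2)          # half of the overlapping band
    merged = []
    for row_off, col_off, dets in tile_results:
        for d in dets:
            x1, y1, x2, y2 = d["bbox"]
            cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
            # Keep a detection only if its center lies away from the tile border;
            # tiles on the scene boundary would need a relaxed margin in practice.
            if margin <= cx <= tile - margin and margin <= cy <= tile - margin:
                merged.append({"bbox": [x1 + col_off, y1 + row_off,
                                        x2 + col_off, y2 + row_off],
                               "score": d["score"]})
    # Greedy IoU suppression removes duplicates that survive the center test.
    merged.sort(key=lambda d: d["score"], reverse=True)
    kept = []
    for d in merged:
        if all(_iou(d["bbox"], k["bbox"]) < iou_thr for k in kept):
            kept.append(d)
    return kept
```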
Table 1. Comparison of different building extraction methods on the Vaihingen dataset. The best results are highlighted in bold.

Methods            APsegm   APsegm50   APsegm75   ARsegm   APbox    APbox50   APbox75   ARbox
PointRend          0.558    0.732      0.625      0.735    0.558    0.721     0.628     0.655
Swin Transformer   0.635    0.820      0.710      0.743    0.635    0.821     0.710     0.750
Mask R-CNN         0.603    0.799      0.690      0.741    0.603    0.796     0.695     0.730
AP-PointRend       0.646    0.836      0.724      0.750    0.648    0.837     0.725     0.735
Table 2. Comparison of different building extraction methods on the WHU dataset. The best results are highlighted in bold.

Methods            APsegm   APsegm50   APsegm75   ARsegm   APbox    APbox50   APbox75   ARbox
PointRend          0.544    0.736      0.641      0.591    0.559    0.726     0.625     0.632
Swin Transformer   0.616    0.794      0.710      0.662    0.618    0.791     0.699     0.665
Mask R-CNN         0.579    0.778      0.670      0.628    0.578    0.774     0.658     0.634
AP-PointRend       0.637    0.825      0.725      0.670    0.641    0.814     0.727     0.675
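The AP and AR values in Tables 1 and 2 follow the standard COCO-style evaluation protocol (AP averaged over IoU thresholds, plus AP50/AP75 and AR) for both masks and boxes. Assuming the ground truth and the predictions have been exported to COCO-format JSON files (the file names below are placeholders), such metrics can be reproduced with pycocotools:

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Placeholder paths: a COCO-format ground-truth file and a result file
# exported by the detection framework (e.g., MMDetection).
coco_gt = COCO("annotations/buildings_val.json")
coco_dt = coco_gt.loadRes("results/ap_pointrend_segm.json")

for iou_type in ("bbox", "segm"):            # box metrics, then mask metrics
    evaluator = COCOeval(coco_gt, coco_dt, iouType=iou_type)
    evaluator.evaluate()
    evaluator.accumulate()
    evaluator.summarize()                    # prints AP, AP50, AP75, AR, ...
```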
Table 3. Comparison of the accuracy of different building extraction methods on datasets with different cropping sizes and overlap ratios.

Algorithm          Cropping Size   Overlapping Rate (%)   Mask AP50   Mask AP70   Mask AR   Box AP50   Box AP70   Box AR
Mask R-CNN         448 × 448       0                      51.1        31.1        52.8      51.5       28.5       50.6
Mask R-CNN         448 × 448       20                     54.5        33.8        54.2      55.1       31.7       52.9
Mask R-CNN         448 × 448       50                     55.5        37.0        53.3      55.4       35.2       53.4
Mask R-CNN         896 × 896       0                      68.4        59.9        70.1      68.9       58.8       69.0
Mask R-CNN         896 × 896       20                     69.3        60.0        70.1      69.0       59.7       69.4
Mask R-CNN         896 × 896       50                     69.9        60.6        70.9      69.3       59.6       68.0
Mask R-CNN         1792 × 1792     0                      72.4        61.5        73.0      85.9       66.3       72.0
Mask R-CNN         1792 × 1792     20                     73.2        62.5        73.5      79.6       69.5       73.0
Mask R-CNN         1792 × 1792     50                     73.6        62.9        74.1      80.2       69.6       73.6
PointRend          448 × 448       0                      62.6        49.8        60.8      61.4       51.1       60.8
PointRend          448 × 448       20                     61.0        48.9        59.4      60.7       50.2       59.7
PointRend          448 × 448       50                     61.7        48.9        59.7      60.6       50.2       60.3
PointRend          896 × 896       0                      66.4        51.4        70.5      66.4       51.0       70.4
PointRend          896 × 896       20                     67.2        59.3        69.9      67.1       59.0       69.8
PointRend          896 × 896       50                     67.5        60.1        69.9      67.5       60.6       70.0
PointRend          1792 × 1792     0                      72.9        61.2        73.2      71.7       61.6       73.3
PointRend          1792 × 1792     20                     73.2        62.5        73.5      72.1       62.8       73.5
PointRend          1792 × 1792     50                     73.8        63.2        74.1      72.9       63.1       74.9
Swin Transformer   448 × 448       0                      77.7        65.3        73.7      77.7       65.5       73.6
Swin Transformer   448 × 448       20                     78.5        65.1        73.7      78.2       65.8       74.1
Swin Transformer   448 × 448       50                     78.8        65.2        73.8      79.6       65.9       74.5
Swin Transformer   896 × 896       0                      80.6        68.7        74.1      80.7       68.9       74.8
Swin Transformer   896 × 896       20                     80.9        68.9        74.8      80.9       69.1       74.9
Swin Transformer   896 × 896       50                     81.1        69.2        74.9      81.4       69.7       74.9
Swin Transformer   1792 × 1792     0                      81.7        69.5        73.8      81.8       69.6       74.6
Swin Transformer   1792 × 1792     20                     82.0        71.0        74.3      82.1       71.0       75.0
Swin Transformer   1792 × 1792     50                     82.3        71.7        74.9      82.4       71.9       76.6
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
