Article

Benchmark for Building Segmentation on Up-Scaled Sentinel-2 Imagery

1 Skolkovo Institute of Science and Technology, Bolshoy Boulevard 30, bld. 1, 121205 Moscow, Russia
2 Institute of Information Technology and Data Science, Irkutsk National Research Technical University, 664074 Irkutsk, Russia
3 Public Joint-Stock Company (PJSC) Sberbank of Russia, 127006 Moscow, Russia
4 Autonomous Non-Profit Organization Artificial Intelligence Research Institute (AIRI), 105064 Moscow, Russia
* Author to whom correspondence should be addressed.
Remote Sens. 2023, 15(9), 2347; https://doi.org/10.3390/rs15092347
Submission received: 24 February 2023 / Revised: 24 April 2023 / Accepted: 27 April 2023 / Published: 29 April 2023

Abstract

Currently, a wide range of tasks can be solved using computer vision algorithms, which reduce manual labor and enable rapid analysis of the environment. The remote sensing domain provides vast amounts of satellite data, but it also poses challenges associated with processing this data. Baseline solutions with intermediate results are available for various tasks, such as forest species classification, infrastructure recognition, and emergency situation analysis using satellite data. Despite these advances, two major issues with high-performing artificial intelligence algorithms remain in the current decade. The first is the availability of annotated data: training a robust algorithm requires a reasonable amount of well-annotated training samples. The second is the availability of satellite data itself: even though there are a number of data providers, high-resolution and up-to-date imagery is extremely expensive. This paper aims to address these challenges by proposing an effective pipeline for building segmentation that utilizes freely available Sentinel-2 data with 10 m spatial resolution. Our approach combines a super-resolution (SR) component with a semantic segmentation component. As a result, we simultaneously consider and analyze the SR and building segmentation tasks to improve the quality of infrastructure analysis based on medium-resolution satellite data. Additionally, we collected and made available a unique dataset for the Russian Federation covering an area of 1091.2 square kilometers. The dataset provides Sentinel-2 imagery adjusted to a spatial resolution of 2.5 m and is accompanied by semantic segmentation masks. The building footprints were created using OpenStreetMap data that was manually checked and verified. Several experiments were conducted for the SR task using advanced image SR methods such as the diffusion-based SR3 model, RCAN, SRGAN, and MCGR. The MCGR network produced the best result, with a PSNR of 27.54 and an SSIM of 0.79. The obtained SR images were then used to tackle the building segmentation task with different neural network models, including DeepLabV3 with different encoders and the SWIN and Twins transformers. The SWIN transformer achieved the best results, with an F1-score of 79.60.

1. Introduction

Automatic building recognition through remote sensing observations has numerous applications in the geographical and social sciences. These include collecting and updating data in Geographic Information System databases, detecting building damage related to disasters, monitoring urban settlements, mapping land-use/land-cover patterns, and managing environmental resources [1,2,3,4]. The auxiliary information obtained during building detection, such as the spatial distribution of buildings, their size, and quantity, plays a significant role in urban planning and demographic analysis [3,5].
Early research on automatic building detection was typically based on aerial imagery due to its spatial resolution of up to 0.05 m [6]. However, acquiring aerial imagery for large areas, e.g., a whole city, is time-consuming [4] and faces other limitations [7]. Moreover, it cannot be applied to damage assessment when imagery from before and after an event is required. In recent years, the increased availability of remote sensing imagery with a wide coverage area has helped to address this problem [8]. Remote sensing imagery is represented mainly by multi-spectral and synthetic aperture radar (SAR) data. Compared with multi-spectral data, the processing of SAR data is more complicated due to noise and blurry boundaries, particularly in urban areas with severe geometric distortions such as layover and shadowing [9]. As a result, multi-spectral satellite data are more commonly used for the development of building recognition algorithms.
For the discrimination of separate building blocks from remote sensing imagery, the spatial resolution plays a more significant role than the number of spectral bands or a narrower wavelength interval [4,10]. High-resolution (HR) imagery is typically more expensive than lower-resolution options, as it requires more advanced instruments and systems on earth observation satellites [11]. Middle-resolution images are often freely available but may not contain important details [12]. Therefore, it is highly important for a number of remote sensing tasks to ensure data with both high spatial and temporal resolution is available. To produce HR remote sensing images from middle- or low-resolution (LR) data, super-resolution (SR) methods can be used [13].
Although building segmentation tasks and super-resolution tasks for remote sensing data are often investigated as separate challenges [14,15], it is crucial to consider them as part of a common pipeline to achieve higher recognition results. Moreover, recent advances in computer vision have introduced powerful tools for object recognition, such as transformers and diffusion models, which hold great promise in the general computer vision domain. Their implementation in the remote sensing domain, specifically in super-resolution and building segmentation tasks, requires special attention. It is important to integrate these advanced methods into a unified pipeline to improve the accuracy of remote sensing data analysis. Another significant issue that arises in such tasks is the availability of relevant datasets for particular regions that support representability and enable researchers to test and compare their algorithms.
In this study, our aim is to address several challenges in the building segmentation task. The first challenge is the limited availability of annotated datasets for specific geographic regions. Although there are a number of open-access datasets for building recognition via satellite data, transfer from one study area to another location can be inapplicable due to infrastructure specificity. Therefore, we focus on collecting a unique dataset to support building assessment in several regions of Russia. Another important challenge in infrastructure analytics using remote sensing is the availability, cost, and temporal resolution of satellite data. To address this, we focus specifically on Sentinel-2 data, which is a valid choice for rapid remote sensing observations due to its high temporal resolution (approximately 5 days). However, the spatial resolution for the RGB bands is only 10 m. To provide more precise building segmentation at 2.5 m per pixel, we set an objective to develop a pipeline that comprises Sentinel-2 imagery adjustment by a factor of 4 and segmentation of the resulting HR imagery. To simultaneously analyze image super-resolution and building segmentation, we consider state-of-the-art models such as transformers and diffusion models and verify their performance on different spatial resolutions. One possible practical application of the proposed pipeline is to conduct a more detailed inventory of infrastructure objects. In our study, we demonstrate the possibility of segmenting medium-sized standalone buildings. The usage of satellite data with a spatial resolution of 10 m sometimes leads to two or more separate buildings being recognized as a single instance. Thus, it is difficult to make accurate quantitative assessments. Such assessments would be useful, for example, to match addresses with actual buildings or to accurately assess damage after disasters such as flooding. Our intention is to use only freely available medium-resolution data, such as that obtained from Sentinel-2, due to its easy accessibility and ability to cover large territories. It is possible to adapt our pipeline to other types of infrastructure if a specific labeled dataset is available. In summary, our goals are:
  • To create and share a unique dataset covering several regions in Russia;
  • To conduct a comprehensive overview of benchmarks for the building segmentation task;
  • To propose an efficient pipeline for building segmentation involving image SR to leverage Sentinel-2 data adjusted to 2.5 m;
  • To make a comparative study among different SR and segmentation algorithms for solving similar problems in the remote sensing domain.

2. Related Work

2.1. Super-Resolution Methods for Satellite Imagery

In the field of computer vision, the primary objective of SR methods is to enhance the resolution of images using various algorithms. In supervised artificial intelligence algorithms, HR images are used as a reference to validate the generated SR images. In this section, we provide a brief overview of the major approaches to the SR task.
SRCNN is one of the first CNN-based networks proposed for super-resolution in the general computer vision domain [16]. Several improvements of SRCNN have been proposed [17,18,19,20]. Another popular model is the Residual Channel Attention Network (RCAN) [21]. It utilizes a deep residual network architecture that incorporates channel attention mechanisms to selectively highlight important features in the image. The attention mechanism of the RCAN model learns to assign weights to the features in various channels, allowing it to produce high-quality images that are sharp and realistic.
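The channel attention idea behind RCAN-style networks can be illustrated with the following minimal PyTorch sketch: each channel is squeezed to a single statistic, per-channel weights are learned, and the feature maps are rescaled. The reduction factor and layer sizes below are illustrative assumptions rather than the exact RCAN configuration.
```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-rescale block in the spirit of RCAN's channel attention."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)          # global spatial average per channel
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),                            # per-channel weights in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = self.mlp(self.pool(x))             # shape (B, C, 1, 1)
        return x * weights                           # reweight feature maps channel-wise

# usage: attach to the output of a residual block
features = torch.randn(2, 64, 32, 32)
print(ChannelAttention(64)(features).shape)          # torch.Size([2, 64, 32, 32])
```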
Recent advances in the SR task are also connected with Generative Adversarial Networks (GANs) [22]. In the Super-Resolution Generative Adversarial Network (SRGAN), two models are trained—one to generate images and the other to distinguish between original and generated images [23]. The generative model is optimized using content and adversarial loss functions [24]. ESRGAN is an architecture improvement to SRGAN for realistic image SR, proposed in [25]. The authors introduce a residual-in-residual dense block, along with adversarial and perceptual loss functions. Further improvements are based on training ESRGAN with pure synthetic data using high-order degradation modeling, which is closer to real-world degradation [26].
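The adversarial training objective of SRGAN-type models can be sketched as below; the weighting coefficient and the use of a plain pixel-wise MSE content term (instead of a VGG-based perceptual term) are simplifying assumptions.
```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()
mse = nn.MSELoss()

def generator_loss(discriminator, sr_images, hr_images, adv_weight=1e-3):
    """Content loss on pixels plus an adversarial term that rewards fooling D."""
    content = mse(sr_images, hr_images)
    logits_fake = discriminator(sr_images)
    adversarial = bce(logits_fake, torch.ones_like(logits_fake))  # want D to answer "real"
    return content + adv_weight * adversarial

def discriminator_loss(discriminator, sr_images, hr_images):
    """Standard real/fake classification loss."""
    logits_real = discriminator(hr_images)
    logits_fake = discriminator(sr_images.detach())               # do not backprop into G
    return bce(logits_real, torch.ones_like(logits_real)) + \
           bce(logits_fake, torch.zeros_like(logits_fake))
```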
One of the most recent approaches is diffusion models, which have become state-of-the-art in many areas of computer vision. For instance, the SR3 (Super-Resolution via Repeated Refinement) model is used for SR tasks [27]. It uses a recursive refinement approach with residual learning, multi-scale features, and attention mechanisms to generate high-quality images with fine details. The model has a series of sub-networks that generate residual images at different resolutions, which are combined to produce the final HR image. The model also employs a perceptual loss function to optimize for HR and perceptually realistic images. The SR3 model has been implemented and tested on various benchmark datasets, where it has outperformed other state-of-the-art SR models in terms of both quantitative metrics and visual quality. Its promising performance makes it suitable for applications in image and video enhancement, medical imaging, and remote sensing.
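As a rough illustration of diffusion-based super-resolution sampling, the sketch below shows a generic DDPM-style ancestral sampling loop conditioned on the low-resolution input; it is not the exact SR3 procedure, and `eps_model` (a noise-prediction network) and the noise schedule `betas` are hypothetical.
```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sample_sr(eps_model, lr_image, betas, scale=4):
    """Schematic reverse diffusion: start from noise and iteratively denoise,
    conditioning every step on the bicubically up-sampled LR input."""
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    lr_up = F.interpolate(lr_image, scale_factor=scale, mode="bicubic", align_corners=False)
    x = torch.randn_like(lr_up)                        # start from pure Gaussian noise
    for t in reversed(range(len(betas))):
        eps = eps_model(x, lr_up, torch.tensor([t]))   # predicted noise at step t
        coef = (1 - alphas[t]) / torch.sqrt(1 - alpha_bar[t])
        x = (x - coef * eps) / torch.sqrt(alphas[t])   # posterior mean estimate
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)  # add sampling noise
    return x
```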
Generative approaches in the general domain are used for photo-enhancing SR tasks that rely on recreating high-frequency textures and objects. Tasks in the remote sensing domain, in contrast, also require a high signal-to-noise ratio (SNR): the data are expected to have sharp edges, and generative artifacts must be mitigated. To meet the demands of the remote sensing domain, GAN-based approaches have also been developed and improved [28,29]. Remarkable performance is also achieved by the Multi-class Cyclic super-resolution Generative adversarial network with Residual feature aggregation (MCGR) [30]. MCGR comprises two independent but interconnected GAN models: the first GAN translates images from one domain to another, while the second GAN performs the reverse translation. Generative models are applied to adjust various satellite data, particularly the freely available Sentinel-2 data, which has high temporal resolution. In [31], the authors demonstrate a Sentinel-2 SR approach with a scaling factor of 5. A scaling factor of 2 is presented in [32].

2.2. Datasets for Building Recognition on Remote Sensing Imagery

In addition to the need for HR and accurate satellite data, precise markup is also crucial for the development of artificial intelligence algorithms. Therefore, numerous studies are devoted to collecting datasets and using neural network approaches for benchmarking. Among the key features of remote sensing datasets, target object types, geographical areas, spatial resolution, and the type of computer vision tasks that can be performed (such as semantic or instance segmentation, object detection, etc.) can be identified. For each practical task, a specific dataset should be selected that is more suitable for the specific problem requirements. There are several well-known remote sensing datasets covering different regions, including building labels.
The Massachusetts Building dataset [33] is one of the popular benchmarks for building segmentation tasks. It includes 151 aerial images of Massachusetts residential areas with a 1 m spatial resolution. The Inria Aerial Image Labeling dataset covers urban settlements in the United States and Austria, with marked building and background classes [34]. The dataset contains 360 tiles with a size of 1500 × 1500 pixels and a spatial resolution of 0.3 m. The authors benchmarked the dataset using CNNs, including a fully convolutional network [35], multi-layer perceptron (MLP) [36], and a skip network [36]. The xView dataset [37] is a popular annotated collection of 1127 WorldView-3 satellite images, each with a spatial resolution of 0.3 m per pixel. The dataset includes 60 object detection classes and covers a total area of over 1400 km2. In 2018 and 2019, CosmiQ Works, Radiant Solutions and NVIDIA presented two datasets for building segmentation collected from WorldView-2 and WorldView-3 satellites: SpaceNet [38] and SpaceNet MVOI (MultiView Overhead Imagery) [39]. The first dataset includes three SpaceNet challenges: (1) the first challenge uses WorldView-2 satellite imagery with 0.5 m spatial resolution for 2544 km2 of Rio de Janeiro area collected between 2011 and 2014; (2) the second challenge aims at extracting building footprints from WorldView-3 satellite imagery and includes 24,586 images of size 650 × 650 pixels for Las Vegas, Paris, Shanghai, and Khartoum with total area of 3011 km2; (3) the third challenge uses the imagery from the second challenge tiled into 1300 × 1300 pixels chips. The SpaceNet dataset is benchmarked using several algorithms. The second dataset, SpaceNet MVOI [39], aims at detecting buildings on off-nadir views. In total, the dataset contains 60,000 images of the United States collected by the WorldView-2 satellite on 22 December 2009. The images have a spatial resolution ranging from 0.46 m to 1.67 m and a width of 900 pixels. The High Resolution Remote Sensing Detection (HRRSD) dataset [40] for object detection tasks is derived from Google Earth and Baidu Map. It consists of 26,722 images classified into 13 categories, with spatial resolutions ranging from 0.15 m to 1.2 m.
While high-resolution datasets may provide accurate spatial and textural properties for computer vision algorithms, freely available, medium-resolution satellite data such as Sentinel-2 is also being considered. Sentinel-2 data, with its spatial resolution of 10 m per pixel, is effectively applied for significant spatial feature extraction in various land cover segmentation and object recognition tasks. For instance, Sentinel2GlobalLULC [41] provides a global dataset for identifying 29 classes. Although Sentinel-2 images are typically used to identify built-up areas or settlements rather than individual buildings [42,43,44], specific practical tasks may require the localization of each building footprint.

3. Materials and Methods

3.1. Problem Statement

The building recognition task can be described as the classification of each image pixel into two categories: building or background. In terms of machine learning algorithms, this task is called semantic segmentation [5,45]. Semantic segmentation tasks can be solved using two approaches: (1) traditional classification methods or per-pixel classifiers (e.g., the support vector machine and random forest classifier) and (2) deep learning (DL)-based methods or object-based classifiers [10,45]. Classical approaches have been popular for several decades. However, most studies based on traditional methods focused on relatively small regions of interest. This decision was dictated by the necessity of feature extraction, covering aspects such as spectra, shape, texture, color, edges, and shadows. Along with the professional knowledge required for feature extraction, the features are heavily dependent on sensors, imaging conditions, image acquisition parameters, and location. This leads to instability, reduced accuracy, and limited usability. Recently, DL-based methods have been applied to overcome these drawbacks. DL algorithms allow researchers to combine low-level features with high-level features that represent more semantic information, which makes them robust [3,5,10].
The research consists of two parts, namely, developing an SR model and a semantic segmentation model (Figure 1). Such a decomposition allows us to solve specific, resolution-sensitive problems that could not be solved directly using LR data, such as images from the Sentinel-2 satellite. In the first stage, we adjust the images by a factor of 4. We use initially high-resolution images (the Mapbox basemap) to create pairs of low-resolution and high-resolution images, with resolutions of 10 m and 2.5 m, respectively. LR images are obtained from HR images using simple downscaling. The dataset with HR and LR images is used to train an SR model to upscale images and increase their resolution from 10 m to 2.5 m. We then apply the trained model to Sentinel-2 images, which have a spatial resolution of 10 m, to create a new dataset consisting of Sentinel-2 images with 2.5 m resolution, accompanied by the prepared OpenStreetMap (OSM)-based markup. In the next step, we use this newly collected dataset to train a semantic segmentation model. During the inference step, a 10 m RGB Sentinel-2 image is passed through both the SR and segmentation neural network models.
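The two-stage inference described above can be sketched as follows; `sr_model` and `seg_model` stand for the trained networks from the two stages, and the tensor layout and thresholding are illustrative assumptions.
```python
import torch

def predict_buildings(sentinel_rgb: torch.Tensor, sr_model, seg_model, threshold=0.5):
    """10 m Sentinel-2 RGB tile -> 2.5 m SR image -> binary building mask.
    sentinel_rgb: tensor of shape (1, 3, H, W) with values scaled to [0, 1]."""
    sr_model.eval()
    seg_model.eval()
    with torch.no_grad():
        sr_image = sr_model(sentinel_rgb)                 # x4 upscaling: (1, 3, 4H, 4W)
        logits = seg_model(sr_image)                      # (1, 2, 4H, 4W) class scores
        mask = logits.softmax(dim=1)[:, 1] > threshold    # probability of the building class
    return sr_image, mask
```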

3.2. Dataset for Building Segmentation and Super-Resolution Tasks

In this study, we create a dataset covering four regions in Russia: Moscow and its surrounding suburban area, Krasnoyarsk, Irkutsk, and Novosibirsk. These territories represent diverse geographical conditions and urban characteristics. The entire study area equals 1091.2 km2 and is composed of 30 individual sites, consisting mostly of multi-storey buildings, adjacent territories, and squares. For our research problem statement of SR image segmentation, we collected satellite images of high and medium spatial resolution for the same area. The first set is based on the Mapbox product with a spatial resolution of 1 m per pixel, collected during the summer period of 2022. The basemap product consists of RGB images with values in the standard range from 0 to 255. We also use remote sensing images derived from the Sentinel-2 satellite, considering the RGB bands with a spatial resolution of 10 m per pixel. A number of observations are available for the Sentinel-2 satellite. Therefore, in contrast to the Mapbox images, we collect several Sentinel-2 images acquired on different dates during the summer period of 2022 to extend the training dataset. We use a 10% cloud threshold to discard cloudy images. The SentinelHub service API [46] is used for filtering, downloading, and preprocessing the satellite data. In total, we acquired 124 images.
SR models are trained in a supervised manner and require both LR and HR images. However, such pairs of images cannot be obtained from the same satellite and the same sensor. Therefore, a commonly used approach is to collect images from two satellites with higher and lower spatial resolution and then to match them [32,47]. Although transferring from LR data to HR data shows remarkable performance, its main limitation is the mismatches that can occur between observations from different dates and under different natural conditions, such as atmospheric and lighting effects. Therefore, in this study, we investigate transferring between different resolutions rather than different sensors to train a model. For the SR task, we combine open-access datasets representing urban areas and the set of images collected for the study area. Images from the xView dataset and the Massachusetts Road and Building Detection datasets [33] are brought to a spatial resolution of 2.5 m by means of interpolation. Examples of images from these datasets are presented in Figure 2. We also down-scaled the Mapbox images to 2.5 m to perform further resolution-adjustment experiments. The open-access data for various regions support pattern diversity in remote sensing observations, while the selected Mapbox images are more specific to the target urban areas. We leverage these data to create HR and LR pairs of 2.5 m and 10 m, respectively. The study area amounts to 2940 km2 for the Massachusetts Road and Building Detection datasets, 1400 km2 for the xView dataset, and 1091.2 km2 for our dataset. The Sentinel-2 images collected for the same sites as the Mapbox observations are also involved in the SR task (see Figure 3). However, Sentinel-2 samples are considered only as LR images without corresponding HR pairs.
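A minimal sketch of producing such paired samples by simple downscaling is shown below; the interpolation choice and file handling are assumptions rather than the exact preprocessing used in the paper.
```python
import numpy as np
from PIL import Image

def make_pair(hr_path: str, scale: int = 4):
    """Read a 2.5 m HR tile and synthesize its 10 m LR counterpart by downscaling."""
    hr = Image.open(hr_path).convert("RGB")
    w, h = hr.size
    lr = hr.resize((w // scale, h // scale), resample=Image.BICUBIC)  # 2.5 m -> 10 m
    return np.asarray(hr), np.asarray(lr)
```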
We supplement the satellite imagery with building segmentation markup. It is acquired from the OpenStreetMap (OSM) database [48], which provides vector geographical data updated and maintained via open collaboration using data from surveys, aerial imagery, and freely licensed geographic data sources. We extract elements with non-empty building tags from the OSM database by querying the Overpass API server using the overpass Python library [49]. The obtained XML data is converted into GeoJSON format and manually corrected for inaccurate boundaries or missing objects. This study focuses on segmenting multi-storey buildings that can be visually identified on medium-resolution satellite data, such as Sentinel-2 imagery, with an original RGB band resolution of 10 m per pixel. Therefore, territories containing private garages (Figure 4) and low-rise buildings (Figure 5) with areas less than 5 m2 are beyond the scope of the present study, and they are removed from the resulting markup. The remaining polygons are rasterized into GeoTIFF binary masks using the Python GDAL package [50] based on the size, affine transformation coefficients, projection coordinate system, and spatial reference system derived from the original satellite images. Since the vector data from OSM and the satellite images are brought to a single coordinate system, we can assume that the obtained rasterized masks coincide with real buildings with high accuracy, which means they can be used as ground truth for the building segmentation experiments. In the markup, 0 represents background and 1 represents the target class. An example of the obtained markup is presented in Figure 6. We present the properties of the collected data in Figure 7; it includes a frequency plot of building localization on the scaled sites and shows the sizes of the buildings. For the experiments, we create two sets with markup for the same areas but with two different spatial resolutions of 2.5 m and 1 m per pixel. We aim at analyzing the importance of spatial features and the potential of data sources with a lower spatial resolution than 1 m per pixel. The 1 m resolution markup is used with Mapbox data, while the 2.5 m resolution markup is used both with Mapbox images of 2.5 m and with Sentinel-2 images up-scaled to 2.5 m. Details on the up-scaling approach are presented in Section 3.3.
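Rasterizing the corrected OSM polygons into binary masks aligned with the satellite tiles can be sketched with rasterio and geopandas (the paper itself uses the GDAL bindings); the file names are hypothetical.
```python
import geopandas as gpd
import numpy as np
import rasterio
from rasterio.features import rasterize

def burn_building_mask(geojson_path: str, image_path: str, out_path: str):
    """Burn OSM building polygons (GeoJSON) into a binary GeoTIFF mask
    that shares the grid of the reference satellite image."""
    buildings = gpd.read_file(geojson_path)
    with rasterio.open(image_path) as src:
        buildings = buildings.to_crs(src.crs)                 # match the coordinate system
        mask = rasterize(
            ((geom, 1) for geom in buildings.geometry),       # 1 = building class
            out_shape=(src.height, src.width),
            transform=src.transform,
            fill=0,                                           # 0 = background
            dtype=np.uint8,
        )
        profile = src.profile.copy()
    profile.update(count=1, dtype="uint8")
    with rasterio.open(out_path, "w", **profile) as dst:
        dst.write(mask, 1)
```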
The dataset is divided into three subsets: train, validation, and test. We select 15 territories as the train set, with a total area of 679.5 km2, including 119.4 km2 of the target class (17% of the train area). Another five territories represent the validation set, with an area of 116.6 km2 and 23.3 km2 of the target label (20%). We use eight territories as test data, comprising an area of 295.1 km2 with 74.77 km2 of the target class (27%). The final split has a ratio of 57/16/27 for train, validation, and test territories, respectively. To make sure that the validation and test images cover diverse urban topographies, at least one image from each region was chosen. Information and statistics about the collected data are given in Table 1.
We share the collected dataset comprising original Sentinel-2 RGB images of 10 m, SR Sentinel-2 images of 2.5 m, and markup of 2.5 m. It can be used independently as a benchmark for SR algorithms and for building segmentation models, or for evaluating pipelines combining both stages for robust building recognition through upscaled satellite data.

3.3. Experiments for SR

We consider two GAN-based approaches for the SR task that have already shown remarkable results for image resolution adjustment in both the general and remote sensing domains. We also conduct experiments with the attention-based RCAN model and the diffusion-based SR3 model.
For the SRGAN architecture, we use a two-stage strategy. First, we train only the generator for 10 epochs using a reduced loss function consisting solely of the mean squared error. After that, we continue training in the common GAN mode, in which the discriminator and generator are trained simultaneously: the discriminator weights are updated based on a binary cross-entropy loss, and the generator weights based on a combination of binary cross-entropy, mean squared error, total variation, and perceptual loss functions. This stage takes 100 epochs.
For the MCGR architecture, we do not pre-train the generator; instead, we train the whole model from scratch. We make several passes for each batch using the following approach. Denoting the LR domain as A and the HR domain as B, we consider $G_{AB}(A)$, $G_{BA}(B)$, $G_{AB}(G_{BA}(B))$, $G_{BA}(G_{AB}(A))$, $D_A(G_{BA}(B))$, $D_B(G_{AB}(A))$, $D_A(G_{BA}(G_{AB}(A)))$, and $D_B(G_{AB}(G_{BA}(B)))$ for calculating the different components of the composite loss function, where $G_{AB}$ denotes the generator from domain A to domain B and $D_A$ denotes the discriminator for domain A. Using these values, we calculate the following components of the loss function: binary cross-entropy, mean squared error, total variation, perceptual, and cycle losses, which are summed with certain coefficients and used to update the parameters of both generators at the same time. We update the parameters of the discriminators in the same way as in SRGAN, but independently. The whole training process lasts for 100 epochs.
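The generator-side loss components enumerated above can be sketched as follows; the weighting coefficient is illustrative, and the perceptual and total-variation terms used in the paper are omitted for brevity.
```python
import torch
import torch.nn as nn

bce, mse = nn.BCEWithLogitsLoss(), nn.MSELoss()

def generator_losses(G_ab, G_ba, D_a, D_b, lr_batch, hr_batch, cycle_weight=10.0):
    """Adversarial and cycle-consistency terms for both generators of a cyclic SR GAN."""
    fake_hr = G_ab(lr_batch)                 # LR domain A -> HR domain B
    fake_lr = G_ba(hr_batch)                 # HR domain B -> LR domain A
    rec_lr = G_ba(fake_hr)                   # A -> B -> A reconstruction
    rec_hr = G_ab(fake_lr)                   # B -> A -> B reconstruction

    logits_hr, logits_lr = D_b(fake_hr), D_a(fake_lr)
    adversarial = bce(logits_hr, torch.ones_like(logits_hr)) + \
                  bce(logits_lr, torch.ones_like(logits_lr))
    cycle = mse(rec_lr, lr_batch) + mse(rec_hr, hr_batch)
    return adversarial + cycle_weight * cycle   # one loss updating both generators jointly
```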
Furthermore, for both models, we utilize a cosine annealing schedule to adjust the learning rate during training. We set the initial learning rate to $10^{-3}$ with a restart period of 10, which increases by a factor of 2 after each restart. Additionally, we set the batch size to 6 for these experiments.
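This schedule corresponds to PyTorch's warm-restart cosine annealing; the placeholder model and optimizer choice in the sketch below are illustrative assumptions.
```python
import torch

model = torch.nn.Conv2d(3, 3, kernel_size=3, padding=1)      # placeholder for the SR generator
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)     # initial learning rate 1e-3
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
    optimizer, T_0=10, T_mult=2)                              # restart period 10, doubled after each restart

for epoch in range(100):
    # ... one training epoch over the LR/HR pairs would go here ...
    scheduler.step()
```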
The SR3 model is trained using a combination of loss functions, which includes a mean squared error (MSE) loss and a perceptual loss that optimizes for HR and perceptually realistic images. We use the Adam optimizer with a cosine annealing scheduler for adaptively adjusting the learning rate during training. The training duration is set to 1,000,000 iterations. For the diffusion process, we set the number of iterations to 2000.
We train the RCAN model in two stages. The first stage involves training for 100 epochs using a mean absolute error (MAE) loss function. In the second stage, we replace the loss function with a weighted sum of an SSIM loss and MAE. The second stage lasts for 30 epochs. For RCAN, we also use the Adam optimizer and a cosine annealing learning rate scheduler.
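A sketch of the second-stage objective is given below; it assumes an external SSIM implementation (here the pytorch-msssim package), images scaled to [0, 1], and an illustrative weighting coefficient.
```python
import torch.nn.functional as F
from pytorch_msssim import ssim   # assumed third-party SSIM implementation

def stage2_loss(pred, target, alpha=0.5):
    """Weighted sum of an SSIM term (as 1 - SSIM) and mean absolute error."""
    ssim_term = 1.0 - ssim(pred, target, data_range=1.0)   # pred/target: (N, C, H, W) in [0, 1]
    mae_term = F.l1_loss(pred, target)
    return alpha * ssim_term + (1.0 - alpha) * mae_term
```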
In Section 3.5.1, we describe evaluation metrics that are computed based on the test areas from the collected dataset within four Russian regions.
Further, we apply the developed algorithm to up-scale Sentinel-2 RGB images from 10 m to 2.5 m. The adjusted images are used to perform the building segmentation task.

3.4. Experiments for Building Segmentation

To assess the potential of different data-sources and spatial resolution in the building segmentation task, we select three state-of-the-art neural network architectures: DeepLabv3 [51], SWIN transformer [52], and Twins transformer [53].
DeepLabv3 is a semantic segmentation architecture that presents an improved version of its DeepLab-family predecessors. DeepLab models have already been successfully used in remote sensing tasks [54,55]. DeepLabv3 uses atrous convolutional networks at different scales to capture the features of objects that need to be segmented. The model shows good performance in the semantic segmentation of urban scenes. To ensure the diversification of our models, we train the DeepLabv3 network with three encoders of different sizes, namely, Resnet18, Resnet50, and Resnet101.
We also consider the SWIN Transformer architecture, which uses shifted windows with a hierarchical transformer to capture pixel-level visual entities in an image and generate segmentation masks. The Swin Transformer is a fast and effective model that has been applied to various segmentation tasks [56].
Another vision transformer that we use in the study is the Twins transformer. Two architecture modifications, PCPVT and SVT, are considered for image segmentation. PCPVT uses conditional positional encoding to tackle the problem of input data of different dimensions. SVT reduces computation complexity by employing spatially separable self-attention (SSSA). SSSA consists of locally-grouped self-attention to capture fine-grained and local features and globally grouped self-attention to capture global information.
The study comprises experiments with three sets of images. The first one is the Mapbox dataset with a spatial resolution of 1 m; the second one is Mapbox images with a spatial resolution of 2.5 m; the third one is Sentinel-2 images brought from 10 m to 2.5 m using the developed SR models. Therefore, each neural network model is trained and validated on each dataset.
Individual areas in the collected dataset are represented by large sites. Therefore, the image sizes are not suitable for training a model without data preprocessing. We crop patches with shapes of 512 × 512 and 256 × 256 pixels for the spatial resolutions of 1 m and 2.5 m, respectively. The number of patches for Mapbox with a 1 m spatial resolution amounts to 4318 samples, while splitting the Sentinel-2 data into smaller patches results in 10,164 samples.
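Patch extraction from the large site rasters can be sketched as a simple sliding crop; non-overlapping tiling and the discarding of incomplete border patches are assumptions.
```python
import numpy as np

def crop_patches(image: np.ndarray, mask: np.ndarray, size: int):
    """Split an (H, W, C) image and its (H, W) mask into non-overlapping size x size patches."""
    patches = []
    h, w = mask.shape
    for y in range(0, h - size + 1, size):
        for x in range(0, w - size + 1, size):
            patches.append((image[y:y + size, x:x + size], mask[y:y + size, x:x + size]))
    return patches

# 256 x 256 patches for 2.5 m data, 512 x 512 for 1 m data
pairs = crop_patches(np.zeros((1024, 1024, 3)), np.zeros((1024, 1024)), size=256)
print(len(pairs))   # 16
```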
To support a meaningful comparison of different architectures, we use the base implementations of each model from Open MMLab's MMSegmentation repository [57]. The models are trained using the weighted cross-entropy loss function, with the Adam optimizer for the Swin and Twins transformers and SGD for the DeepLabV3 models. The polynomial learning rate scheduler is set with a maximum of 300 epochs. We save the model with the highest mean IoU on the validation set for further analysis and performance assessment on the test set. The batch size ranges from 4 to 16 depending on the architecture. The computations are conducted on a Linux machine equipped with an Intel Xeon processor and a Tesla V100-SXM2 GPU with 16 GB of memory.
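The objective and schedule can be sketched in plain PyTorch (the actual runs use the MMSegmentation implementations); the class weights, momentum, and decay power below are illustrative assumptions.
```python
import torch
import torch.nn as nn

# higher weight on the under-represented building class (index 1); weights are illustrative
criterion = nn.CrossEntropyLoss(weight=torch.tensor([0.3, 0.7]))

model = nn.Conv2d(3, 2, kernel_size=1)                          # placeholder segmentation head
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

max_epochs, power = 300, 0.9
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda epoch: (1 - epoch / max_epochs) ** power)  # polynomial decay

for epoch in range(max_epochs):
    # ... forward pass, loss = criterion(logits, labels), backward, optimizer.step() ...
    scheduler.step()
```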

3.5. Evaluation Metrics

3.5.1. Super-Resolution

To evaluate the developed SR models, we compute the Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM), both commonly used metrics for image adjustment tasks. We also compute the Frechet Inception Distance (FID) [58]. FID is a widely used metric to evaluate the quality of generated images in GANs. It measures both the quality and diversity of the generated images by comparing their feature representations to those of real images. To handle large images efficiently, we split them into smaller patches with a size of 64 × 64 pixels. Then, we average the metrics over all patches to obtain the final value. The metrics are computed according to the following equations:
$$PSNR(X, Y) = 20 \log_{10} \frac{MAX_X}{\sqrt{MSE(X, Y)}},$$
where $MAX_X$ represents the maximum possible pixel value of the image (i.e., 255 for an 8-bit grayscale image), and $MSE(X, Y)$ represents the mean squared error between the original image and the compressed or distorted image.
$$SSIM(X, Y) = \frac{(2\mu_X \mu_Y + C_1)(2\sigma_{XY} + C_2)}{(\mu_X^2 + \mu_Y^2 + C_1)(\sigma_X^2 + \sigma_Y^2 + C_2)},$$
where $\mu_X$ and $\mu_Y$ are the pixel sample means, $\sigma_X^2$ and $\sigma_Y^2$ are the variances, $\sigma_{XY}$ is the covariance, and $C_1 = (0.01L)^2$ and $C_2 = (0.03L)^2$ are variables that stabilize the division with a weak denominator, with $L$ being the dynamic range of the pixel values.
$$FID = \lVert \mu_{real} - \mu_{fake} \rVert_2^2 + \mathrm{Tr}\left(\Sigma_{real} + \Sigma_{fake} - 2\,(\Sigma_{real}\Sigma_{fake})^{1/2}\right),$$
where $\mu_{real}$ and $\mu_{fake}$ are the mean vectors of the feature representations of real and generated images, respectively, $\Sigma_{real}$ and $\Sigma_{fake}$ are the corresponding covariance matrices, and $\mathrm{Tr}(\cdot)$ denotes the trace of a matrix.
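The patch-wise PSNR/SSIM evaluation can be sketched with scikit-image as below (FID is omitted here, since it additionally requires an Inception feature extractor); the 64-pixel patch size follows the text, and `channel_axis` assumes scikit-image 0.19 or newer.
```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def patchwise_scores(reference: np.ndarray, restored: np.ndarray, patch: int = 64):
    """Average PSNR/SSIM over non-overlapping 64 x 64 patches of 8-bit RGB images."""
    psnrs, ssims = [], []
    h, w = reference.shape[:2]
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            ref = reference[y:y + patch, x:x + patch]
            res = restored[y:y + patch, x:x + patch]
            psnrs.append(peak_signal_noise_ratio(ref, res, data_range=255))
            ssims.append(structural_similarity(ref, res, channel_axis=-1, data_range=255))
    return float(np.mean(psnrs)), float(np.mean(ssims))
```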

3.5.2. Building Segmentation

To evaluate the performance of building segmentation models, we utilize two metrics: Intersection over Union (IoU, also known as Jaccard index) and F1-score (also known as Dice Score). The equation for computing the IoU is the following:
$$IoU = \frac{TP}{TP + FP + FN}.$$
The equation for computing the F1-score is:
$$F1\text{-}score = \frac{TP}{TP + \frac{1}{2}(FP + FN)},$$
where $TP$, $FP$, and $FN$ denote true positives, false positives, and false negatives, respectively.
We also present scores for the building class only along with the mean of all classes to consider the issue of class imbalance (background and buildings).
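Both metrics follow directly from the per-class confusion counts, as in this short sketch:
```python
import numpy as np

def iou_f1(pred: np.ndarray, target: np.ndarray, cls: int = 1):
    """IoU and F1-score for one class of integer-labelled masks."""
    tp = np.sum((pred == cls) & (target == cls))
    fp = np.sum((pred == cls) & (target != cls))
    fn = np.sum((pred != cls) & (target == cls))
    iou = tp / (tp + fp + fn)
    f1 = tp / (tp + 0.5 * (fp + fn))
    return iou, f1

# mean over classes accounts for the background/building imbalance
pred = np.array([[1, 0], [1, 1]]); target = np.array([[1, 1], [0, 1]])
print(iou_f1(pred, target, cls=1), iou_f1(pred, target, cls=0))
```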

4. Results

As described in the previous sections, we use test images to evaluate our model on Mapbox data using the PSNR and SSIM metrics. We also run inference with our trained model on all collected Sentinel-2 images. It is impossible to draw conclusions about the performance on Sentinel-2 images due to the absence of HR ground-truth images; thus, only a visual evaluation is possible. Figure 8 depicts the additional spatial features that appear compared with the original 10 m spatial resolution image.
In Table 2, we highlight that the best results are achieved by the MCGR model, with evaluation metrics of 27.54 and 0.79 for PSNR and SSIM, respectively. However, these results are only marginally better than those achieved by SRGAN. On the other hand, SR3 performs worse in terms of the SSIM and PSNR metrics but achieves an FID value of 1.40, which is considerably better (lower) than the values achieved by the other models.
If we visually compare the results of the algorithms for Sentinel-2 images, it is clear that both SRGAN and MCGR models perform well in sharpening the boundaries of objects, even for objects with complex shapes. However, both of these models produce images with some artifacts, while the results for the SR3 model are much clearer.
We train the DeepLabv3, SWIN, and Twins models for binary semantic image segmentation on the Mapbox and Sentinel-2 datasets. For the Sentinel-2 images, we compare three SR models, MCGR, RCAN, and SR3, to upscale the images and bring them to 2.5 m resolution. We evaluate the models' performance using the IoU and F1-score metrics. The results of the evaluation for different scales on Mapbox images and enhanced Sentinel-2 images are reported in Table 3, Table 4, Table 5, Table 6 and Table 7. For the DeepLabv3 models, we achieve an average IoU of 76.2% and an F1-score of 85.2% for Mapbox images with 1 m resolution. The best result for the Mapbox dataset with 2.5 m resolution is 72.0% (IoU) and 81.7% (F1-score). For the Sentinel-2 dataset with image resolution brought to 2.5 m using the MCGR model, the IoU and F1-score are, on average, 68.0% and 78.3%, respectively. The results demonstrate that DeepLabv3 is capable of accurately segmenting objects in the images and achieves good performance in terms of both the IoU and F1-score metrics.
The SWIN models show an average IoU of 75.8% and an F1-score of 84.9% for the Mapbox dataset with 1 m resolution (see Figure 9), 71.5% and 81.4% for the Mapbox dataset with 2.5 m resolution (see Figure 10), and 69.4% and 79.6% for the Sentinel-2 dataset created using the MCGR model (see Figure 11 and Figure 12). The results indicate that the SWIN architecture is effective in capturing fine-grained details in the images, leading to improved performance in terms of the IoU metric.
The Twins model achieves an average IoU of 75.5% and an F1-score of 84.7% for the Mapbox dataset with 1 m resolution, 71.4% and 81.3% for the Mapbox dataset with 2.5 m resolution, and 69.2% and 79.4% for the Sentinel-2 dataset created using the MCGR model. The experiments demonstrate that the Twins architecture is effective in capturing both local and global features in the images, resulting in improved performance in terms of the F1-score metric.
Overall, the results show that the DeepLab architecture outperforms the other models in terms of the IoU and F1-score metrics for the Mapbox datasets with 1 m and 2.5 m resolution. Additionally, two cross-test experiments are conducted to evaluate the generalization capability of the models across different data sources. First, the best-performing model for the Mapbox dataset with a resolution of 2.5 m is applied to the Sentinel-2 dataset with the same resolution. The resulting F1-score for the building class drops drastically to 28.36. Second, the best-performing model for the Sentinel-2 dataset is tested on the Mapbox 2.5 m dataset, and the F1-score declines from 62.88 to 50.52 for the building class. This indicates that the model trained on the larger Sentinel-2 dataset is more robust to images from a previously unseen domain, namely Mapbox images.
To assess the need for collecting datasets from diverse geographical regions, we conducted an additional experiment using the Massachusetts Buildings Dataset [33]. We trained a DeepLab-v3 model with a ResNet50 encoder and validated the model on the test regions from our Mapbox dataset, with the same spatial resolution of 1 m. We achieved an IoU of 0.57 and an F1-score of 0.73 for the building class. However, the model fails at building recognition in new urban areas: the achieved IoU equals 11% and the F1-score equals 19% on the test set of the Massachusetts Buildings Dataset.

5. Discussion

Upon analysis of the quantitative and qualitative outcomes obtained from the SR process, it is observed that the numerical values and visual quality of the results derived from Mapbox images exceed those derived from Sentinel-2 images for the same region of interest (RoI). It is believed that this discrepancy can be attributed to the dissimilar domains of the two image sources. Accordingly, applying domain adaptation techniques to the Sentinel-2 images, alongside the training images used for the SR model, is anticipated to yield better visual quality of the output images and improve the effectiveness of the downstream segmentation models.
Additionally, it is worth highlighting certain characteristics of the SR3 diffusion model. Owing to the nature of this model, the produced images are not simply up-scaled versions of the originals but rather newly generated high-resolution images that closely resemble them. Certain small objects may be absent from the final images or slightly altered. Nevertheless, within the scope of the current task, such minor deviations do not significantly impact the quality of building segmentation, since the buildings of interest are much larger than the affected objects.
A number of studies aim at developing robust algorithms to reduce the differences between the real scene and its perception in various computer vision domains [59]. The results of the study indicate that the quality of satellite image data also has a significant impact on the accuracy of building segmentation. The Mapbox dataset, which has the highest spatial resolution of 1 m, produces the best results in terms of IoU and F1-scores. This suggests that higher-spatial-resolution data captures building features in images better and leads to more accurate segmentation results. Although Sentinel-2 data upscaled to 2.5 m shows lower results than Mapbox with 1 m spatial resolution (see Figure 13), freely available data with high temporal resolution is significant for practical applications and should be studied further. One of the key points in such studies is the spatial features that advanced SR architectures are expected to extract.
Another factor that affects semantic segmentation results is the presence of inaccurate and/or outdated labels in datasets. OSM data is used to create the markup in this study. Although it is a powerful tool for environmental geo-spatial studies, one can face inaccuracies in the data collected for neural network training. The main factors are off/on-nadir satellite observations, changes in building features over time, and new buildings that are not present in the polygons but appear in the remote sensing images. This highlights the importance of having accurate and up-to-date annotations for training deep learning models, as errors in the labels can lead to lower performance. Manual markup collection and updating is a time-consuming and labor-intensive process. Therefore, one of the promising study topics is weakly supervised learning, in which markup limitations are addressed automatically [60]. Such approaches have already shown significant results in the remote sensing domain [61].
We also compute metrics individually for each region to provide better insight into geographical and urban diversity (Figure 14). The best model for each dataset is presented. For Sentinel-2 with a resolution of 2.5 m, the results are almost the same for each region on the test subset. However, for the Mapbox datasets, IoU values vary slightly. This poses a relevant direction for future study of model transferability between these regions and of fine-tuning to achieve better results.
In terms of the choice of deep learning models for segmentation, the study evaluates several popular state-of-the-art models, including DeepLabv3 with different Resnet encoders, as well as the SWIN and Twins transformers. The performance of the models does not vary significantly across the different datasets, with similar IoU and F1-score metrics observed for each model. This observation suggests that the quality of the image data has a greater impact on segmentation accuracy than the choice of deep learning model. However, further experimentation with a wider range of models and architectures may be necessary to fully explore their potential for building segmentation.
Figure 15 demonstrates the ability of the proposed approach to separately identify buildings that are very close to each other (10–30 m apart). However, analyzing the performance of the algorithms on smaller buildings would require a different annotation approach, which is not feasible at the resolution considered in this article. One of the main objectives of the study is to demonstrate the potential of utilizing medium-resolution satellite imagery to obtain accurate results that enable quantitative assessments.
The presented study focuses on large multi-storey buildings, but practical applications may require the recognition of different types and sizes of buildings, such as small country houses. To extend the dataset for various tasks, masks of small buildings can be additionally included. Moreover, it is necessary to conduct further research to determine the minimum size of objects that can be accurately detected on different resolutions.
One limitation of the proposed building segmentation approach based on Sentinel-2 images is the use of only the RGB bands. This choice was made to comply with the RGB bands used in the Mapbox basemap. However, a wider spectral range can provide more information for building segmentation. To address this limitation, applying super-resolution techniques to the wider spectral range is a promising avenue for further investigation.
In this study, we consider two sequential tasks with independent models. To optimize the computationally intensive processes of image super-resolution and subsequent image segmentation, one can develop a neural network model that integrates both stages, which would facilitate both training and inference. The dataset we have collected provides valuable support for these types of studies.
It is challenging to compare results achieved on different datasets with diverse sensing properties and spatial resolutions. However, the numerical results obtained in our study are consistent with similar research. For instance, in [34], the authors achieve an IoU of 59.31 for the building segmentation task using the MLP model on the Inria Aerial Image Labeling dataset. Similarly, on the SpaceNet dataset [38], a reported F1-score of 0.69 is obtained for multiple cities. However, it is worth noting that this dataset comprises images with a spatial resolution of 1 m. In [13], the F1-score varies from 42.06 to 53.67 depending on the test regions for Sentinel-2 images up-sampled to 2.5 m.
We summarize the reviewed remote sensing datasets in Table 8. In addition to building semantic segmentation datasets, general datasets with other landcover classes and man-made objects are presented. The amount of training data is a common issue in satellite image analysis. Although larger datasets provide a more comprehensive evaluation of the performance of artificial intelligence algorithms, geographical characteristics and remote sensing data properties are also of high significance. Another avenue to explore is the application of image augmentation. In the present work, we use only basic color and geometrical transformations. More advanced techniques, such as object-based augmentation with various natural backgrounds, have been suggested [62]. This allows one to extend a training dataset significantly and to transfer samples from one geographical region to another [55]. Another approach to select more appropriate backgrounds for artificially generated samples is proposed in [63]. In addition, multispectral augmentation techniques for Sentinel-2 data can be used to boost model performance through data diversity [64].
Overall, the study highlights the importance of considering image quality and label accuracy when training deep learning models for building segmentation. It provides insights into the relative performance of different models and image datasets. The conducted experiments provide guidance for the design of more accurate and effective building segmentation models with potential applications in urban planning, disaster response, and environmental monitoring using satellite data. Alternative generic monitoring approaches typically involve installing a vast wireless monitoring system with sensors, which can be very costly and challenging to maintain [65], or even impossible for particular regions.

6. Conclusions

The present research addresses an important topic related to remote sensing data analysis using computer vision techniques and the availability of relevant HR datasets. In our study, we focus on the common problem of building segmentation. We provide an overview of the datasets used for solving this problem and propose our own dataset based on freely available Sentinel-2 and OSM data. In order to provide HR data, we train different SR models, MCGR, SR3, SRGAN, and RCAN, on a territory of 315.8 km2. The best achieved FID measure equals 1.40 for the diffusion-based SR3 model, while in terms of PSNR and SSIM, the MCGR model shows the best results (27.54 and 0.79, respectively). Using these models, we up-scale the resolution of Sentinel-2 images to 2.5 m over a large area of 1091.2 km2 and merge them with the labels obtained from OSM. We share this labeled dataset with the community. We also train and compare several NN models for the building segmentation task in order to establish a baseline for the accuracy achievable on the provided dataset. The considered state-of-the-art models are DeepLabv3 with different encoders, Twins transformers, and SWIN transformers. For the Sentinel-2 images up-scaled with the RCAN model, the highest F1-score of 79.69 is achieved by the SWIN Tiny model. The proposed pipeline facilitates the segmentation of similar infrastructure objects on a large scale by leveraging the availability and good coverage of Sentinel-2 data. Furthermore, the resulting dataset can be utilized for testing and comparing other computer vision approaches.

Author Contributions

Conceptualization, S.I. and D.S.; methodology, S.I., D.S. and N.S.; software, K.E.; validation, S.I., D.S. and K.E.; formal analysis, I.S. and G.P.; investigation, S.I., D.S., I.S., K.E. and G.P.; resources, I.O. and E.B.; data curation, K.E., I.S. and G.P.; writing—original draft preparation, S.I., D.S., I.S., K.E. and G.P.; writing—review and editing, all authors; visualization, I.S., G.P. and S.I.; supervision, I.O. and E.B.; project administration, N.S.; funding acquisition, N.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Analytical center under the RF Government (subsidy agreement 000000D730321P5Q0002, Grant No. 70-2021-00145 02.11.2021). The authors acknowledge the use of the Skoltech CDISE supercomputer Zhores [66] in obtaining the results presented in this paper.

Data Availability Statement

The dataset presented in this study can be found at: https://fb.appliedai.tech/share/hSTbJrD3 (accessed on 22 April 2023).

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AP: Average precision
CNN: Convolutional neural network
DL: Deep learning
DRCN: Deeply-Recursive Convolutional Network
ESPCN: Efficient Sub-Pixel Convolutional Neural Network
ESRGAN: Enhanced Super-Resolution Generative Adversarial Network
FID: Frechet Inception Distance
GAN: Generative adversarial network
GSD: Ground sampling distance
HBB: Horizontal bounding boxes
HR: High resolution
IoU: Intersection over Union
LR: Low resolution
MAE: Mean absolute error
mAP: Mean average precision
MARSGAN: Multi-Scale Adaptive Weighted Dense Residual SR GAN
MCGR: Multi-class Cyclic SR GAN with Residual feature aggregation
MSE: Mean squared error
NN: Neural network
OBB: Oriented bounding boxes
RCAN: Residual channel attention network
PSNR: Peak signal-to-noise ratio
RFAN: Residual Feature Aggregation Network
RoI: Region of interest
SAR: Synthetic aperture radar
SGD: Stochastic gradient descent
SNR: Signal-to-noise ratio
SOTA: State-of-the-art
SR: Super-Resolution
SR3: Super-Resolution via Repeated Refinement
SRCNN: Super-Resolution Convolutional Neural Network
SRGAN: Super-Resolution Generative Adversarial Network
SSIM: Structural similarity index measure
SSSA: Spatially separable self-attention
TARSGAN: Terrestrial image deblurring with Adaptive weighted dense Residual SR GAN
VDSR: Very Deep Super-Resolution

References

  1. Xu, J.Z.; Lu, W.; Li, Z.; Khaitan, P.; Zaytseva, V. Building damage detection in satellite imagery using convolutional neural networks. arXiv 2019, arXiv:1910.06444. [Google Scholar]
  2. Mayer, H. Automatic object extraction from aerial imagery—A survey focusing on buildings. Comput. Vis. Image Underst. 1999, 74, 138–149. [Google Scholar] [CrossRef]
  3. Li, W.; He, C.; Fang, J.; Zheng, J.; Fu, H.; Yu, L. Semantic segmentation-based building footprint extraction using very high-resolution satellite images and multi-source GIS data. Remote Sens. 2019, 11, 403. [Google Scholar] [CrossRef]
  4. Hu, Q.; Zhen, L.; Mao, Y.; Zhou, X.; Zhou, G. Automated building extraction using satellite remote sensing imagery. Autom. Constr. 2021, 123, 103509. [Google Scholar] [CrossRef]
  5. Liu, H.; Luo, J.; Huang, B.; Hu, X.; Sun, Y.; Yang, Y.; Xu, N.; Zhou, N. DE-Net: Deep encoding network for building extraction from high-resolution remote sensing imagery. Remote Sens. 2019, 11, 2380. [Google Scholar] [CrossRef]
  6. Ding, J.; Xue, N.; Xia, G.S.; Bai, X.; Yang, W.; Yang, M.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; et al. Object detection in aerial images: A large-scale benchmark and challenges. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 7778–7796. [Google Scholar] [CrossRef] [PubMed]
  7. Lindner, L.; Sergiyenko, O.; Rivas-López, M.; Ivanov, M.; Rodríguez-Quiñonez, J.C.; Hernández-Balbuena, D.; Flores-Fuentes, W.; Tyrsa, V.; Muerrieta-Rico, F.N.; Mercorelli, P. Machine vision system errors for unmanned aerial vehicle navigation. In Proceedings of the 2017 IEEE 26th International Symposium on Industrial Electronics (ISIE), Edinburgh, UK, 19–21 June 2017; pp. 1615–1620. [Google Scholar]
  8. Illarionova, S.; Nesteruk, S.; Shadrin, D.; Ignatiev, V.; Pukalchik, M.; Oseledets, I. Object-based augmentation for building semantic segmentation: Ventura and santa rosa case study. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 1659–1668. [Google Scholar]
  9. Sun, Y.; Hua, Y.; Mou, L.; Zhu, X.X. CG-Net: Conditional GIS-aware network for individual building segmentation in VHR SAR images. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–15. [Google Scholar] [CrossRef]
  10. Neupane, B.; Horanont, T.; Aryal, J. Deep learning-based semantic segmentation of urban features in satellite images: A review and meta-analysis. Remote Sens. 2021, 13, 808. [Google Scholar] [CrossRef]
  11. Kosari, A.; Sharifi, A.; Ahmadi, A.; Khoshsima, M. Remote sensing satellite’s attitude control system: Rapid performance sizing for passive scan imaging mode. Aircr. Eng. Aerosp. Technol. 2020, 92, 1073–1083. [Google Scholar] [CrossRef]
  12. Razzak, M.; Mateo-Garcia, G.; Gómez-Chova, L.; Gal, Y.; Kalaitzis, F. Multi-Spectral Multi-Image Super-Resolution of Sentinel-2 with Radiometric Consistency Losses and Its Effect on Building Delineation. arXiv 2021, arXiv:2111.03231. [Google Scholar] [CrossRef]
  13. Zhang, T.; Tang, H.; Ding, Y.; Li, P.; Ji, C.; Xu, P. FSRSS-Net: High-resolution mapping of buildings from middle-resolution satellite images using a super-resolution semantic segmentation network. Remote Sens. 2021, 13, 2290. [Google Scholar] [CrossRef]
  14. Khan, S.D.; Alarabi, L.; Basalamah, S. An encoder-decoder deep learning framework for building footprints extraction from aerial imagery. Arab. J. Sci. Eng. 2023, 48, 1273–1284. [Google Scholar] [CrossRef]
  15. Wang, P.; Bayram, B.; Sertel, E. A comprehensive review on deep learning based remote sensing image super-resolution methods. Earth-Sci.Rev. 2022, 232, 104110. [Google Scholar] [CrossRef]
  16. Dong, C.; Loy, C.C.; He, K.; Tang, X. Image super-resolution using deep convolutional networks. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 38, 295–307. [Google Scholar] [CrossRef] [PubMed]
  17. Kim, J.; Lee, J.K.; Lee, K.M. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1646–1654. [Google Scholar]
  18. Kim, J.; Lee, J.K.; Lee, K.M. Deeply-recursive convolutional network for image super-resolution. In Proceedings of the IEEE conference on computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1637–1645. [Google Scholar]
  19. Shi, W.; Caballero, J.; Huszár, F.; Totz, J.; Aitken, A.P.; Bishop, R.; Rueckert, D.; Wang, Z. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1874–1883. [Google Scholar]
  20. Liu, J.; Zhang, W.; Tang, Y.; Tang, J.; Wu, G. Residual feature aggregation network for image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 2359–2368. [Google Scholar]
  21. Zhang, Y.; Li, K.; Li, K.; Wang, L.; Zhong, B.; Fu, Y. Image Super-Resolution Using Very Deep Residual Channel Attention Networks. In Proceedings of the European Conference on Computer Vision, Berlin/Heidelberg, Germany, 8–14 September 2018; Springer: Berlin/Heidelberg, Germany, 2018; pp. 286–301. [Google Scholar]
  22. Chen, H.; He, X.; Qing, L.; Wu, Y.; Ren, C.; Sheriff, R.E.; Zhu, C. Real-world single image super-resolution: A brief review. Inf. Fusion 2022, 79, 124–145. [Google Scholar] [CrossRef]
  23. Ledig, C.; Theis, L.; Huszár, F.; Caballero, J.; Cunningham, A.; Acosta, A.; Aitken, A.; Tejani, A.; Totz, J.; Wang, Z.; et al. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4681–4690. [Google Scholar]
  24. Johnson, J.; Alahi, A.; Fei-Fei, L. Perceptual losses for real-time style transfer and super-resolution. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part II 14. Springer: Berlin/Heidelberg, Germany, 2016; pp. 694–711. [Google Scholar]
  25. Wang, X.; Yu, K.; Wu, S.; Gu, J.; Liu, Y.; Dong, C.; Qiao, Y.; Change Loy, C. Esrgan: Enhanced super-resolution generative adversarial networks. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, Munich, Germany, 8–14 September 2018. [Google Scholar]
  26. Wang, X.; Xie, L.; Dong, C.; Shan, Y. Real-esrgan: Training real-world blind super-resolution with pure synthetic data. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Nashville, TN, USA, 20–25 June 2021; pp. 1905–1914. [Google Scholar]
  27. Saharia, C.; Ho, J.; Chan, W.; Salimans, T.; Fleet, D.J.; Norouzi, M. Image Super-Resolution via Iterative Refinement. arXiv 2021, arXiv:2104.07636. [Google Scholar]
  28. Tao, Y.; Xiong, S.; Song, R.; Muller, J.P. Towards Streamlined Single-Image Super-Resolution: Demonstration with 10 m Sentinel-2 Colour and 10–60 m Multi-Spectral VNIR and SWIR Bands. Remote Sens. 2021, 13, 2614. [Google Scholar] [CrossRef]
  29. Tao, Y.; Conway, S.J.; Muller, J.P.; Putri, A.R.; Thomas, N.; Cremonese, G. Single image super-resolution restoration of TGO CaSSIS colour images: Demonstration with perseverance rover landing site and Mars science targets. Remote Sens. 2021, 13, 1777. [Google Scholar] [CrossRef]
  30. Wang, Y.; Bashir, S.M.A.; Khan, M.; Ullah, Q.; Wang, R.; Song, Y.; Guo, Z.; Niu, Y. Remote sensing image super-resolution and object detection: Benchmark and state of the art. Expert Syst. Appl. 2022, 197, 116793. [Google Scholar] [CrossRef]
  31. Salgueiro Romero, L.; Marcello, J.; Vilaplana, V. Super-resolution of sentinel-2 imagery using generative adversarial networks. Remote Sens. 2020, 12, 2424. [Google Scholar] [CrossRef]
  32. Michel, J.; Vinasco-Salinas, J.; Inglada, J.; Hagolle, O. SEN2VENμS, a dataset for the training of Sentinel-2 super-resolution algorithms. Data 2022, 7, 96. [Google Scholar] [CrossRef]
  33. Mnih, V. Machine Learning for Aerial Image Labeling. Ph.D. Thesis, University of Toronto, Toronto, ON, Canada, 2013. [Google Scholar]
  34. Maggiori, E.; Tarabalka, Y.; Charpiat, G.; Alliez, P. Can semantic labeling methods generalize to any city? The inria aerial image labeling benchmark. In Proceedings of the 2017 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Fort Worth, TX, USA, 23–28 July 2017; pp. 3226–3229. [Google Scholar]
  35. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  36. Maggiori, E.; Tarabalka, Y.; Charpiat, G.; Alliez, P. High-resolution semantic labeling with convolutional neural networks. arXiv 2016, arXiv:1611.01962. [Google Scholar]
  37. Lam, D.; Kuzma, R.; McGee, K.; Dooley, S.; Laielli, M.; Klaric, M.; Bulatov, Y.; McCord, B. xview: Objects in context in overhead imagery. arXiv 2018, arXiv:1802.07856. [Google Scholar]
  38. Van Etten, A.; Lindenbaum, D.; Bacastow, T.M. Spacenet: A remote sensing dataset and challenge series. arXiv 2018, arXiv:1807.01232. [Google Scholar]
  39. Weir, N.; Lindenbaum, D.; Bastidas, A.; Etten, A.V.; McPherson, S.; Shermeyer, J.; Kumar, V.; Tang, H. Spacenet mvoi: A multi-view overhead imagery dataset. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 26–27 October 2019; pp. 992–1001. [Google Scholar]
  40. Zhang, Y.; Yuan, Y.; Feng, Y.; Lu, X. Hierarchical and robust convolutional neural network for very high-resolution remote sensing object detection. IEEE Trans. Geosci. Remote Sens. 2019, 57, 5535–5548. [Google Scholar] [CrossRef]
  41. Benhammou, Y.; Alcaraz-Segura, D.; Guirado, E.; Khaldi, R.; Achchab, B.; Herrera, F.; Tabik, S. Sentinel2GlobalLULC: A Sentinel-2 RGB image tile dataset for global land use/cover mapping with deep learning. Sci. Data 2022, 9, 681. [Google Scholar] [CrossRef]
  42. Papoutsis, I.; Bountos, N.I.; Zavras, A.; Michail, D.; Tryfonopoulos, C. Benchmarking and scaling of deep learning models for land cover image classification. ISPRS J. Photogramm. Remote Sens. 2023, 195, 250–268. [Google Scholar] [CrossRef]
  43. Syrris, V.; Hasenohr, P.; Delipetrev, B.; Kotsev, A.; Kempeneers, P.; Soille, P. Evaluation of the potential of convolutional neural networks and random forests for multi-class segmentation of Sentinel-2 imagery. Remote Sens. 2019, 11, 907. [Google Scholar] [CrossRef]
  44. Corbane, C.; Syrris, V.; Sabo, F.; Politis, P.; Melchiorri, M.; Pesaresi, M.; Soille, P.; Kemper, T. Convolutional neural networks for global human settlements mapping from Sentinel-2 satellite imagery. Neural Comput. Appl. 2021, 33, 6697–6720. [Google Scholar] [CrossRef]
  45. Bulatitskiy, D.; Buyval, A.; Gavrilenkov, M. Building Recognition in Air and Satellite Photos. Development 2019, 7, 4. [Google Scholar]
  46. Sinergise Ltd. SentinelHub: Cloud-Based Processing and Analysis of Satellite Data. Available online: https://www.sentinel-hub.com/ (accessed on 17 December 2022).
  47. Wang, J.; Gao, K.; Zhang, Z.; Ni, C.; Hu, Z.; Chen, D.; Wu, Q. Multisensor Remote Sensing Imagery Super-Resolution with Conditional GAN. J. Remote Sens. 2021, 2021, 9829706. [Google Scholar] [CrossRef]
  48. OpenStreetMap. Available online: https://www.openstreetmap.org/ (accessed on 10 October 2022).
  49. Python Wrapper for the OpenStreetMap Overpass API. Available online: https://pypi.org/project/overpass/ (accessed on 10 October 2022).
  50. GDAL: Geospatial Data Abstraction Library. Available online: https://gdal.org/ (accessed on 10 October 2022).
  51. Chen, L.C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking Atrous Convolution for Semantic Image Segmentation. arXiv 2017, arXiv:1706.05587. [Google Scholar] [CrossRef]
  52. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. arXiv 2021, arXiv:2103.14030. [Google Scholar] [CrossRef]
  53. Chu, X.; Tian, Z.; Wang, Y.; Zhang, B.; Ren, H.; Wei, X.; Xia, H.; Shen, C. Twins: Revisiting the Design of Spatial Attention in Vision Transformers. arXiv 2021, arXiv:2104.13840. [Google Scholar] [CrossRef]
  54. Venugopal, N. Automatic semantic segmentation with DeepLab dilated learning network for change detection in remote sensing images. Neural Process. Lett. 2020, 51, 2355–2377. [Google Scholar] [CrossRef]
  55. Illarionova, S.; Shadrin, D.; Ignatiev, V.; Shayakhmetov, S.; Trekin, A.; Oseledets, I. Augmentation-Based Methodology for Enhancement of Trees Map Detalization on a Large Scale. Remote Sens. 2022, 14, 2281. [Google Scholar] [CrossRef]
  56. Zhou, B.; Zhao, H.; Puig, X.; Xiao, T.; Fidler, S.; Barriuso, A.; Torralba, A. Semantic Understanding of Scenes through the ADE20K Dataset. arXiv 2016, arXiv:1608.05442. [Google Scholar] [CrossRef]
  57. MMSegmentation Contributors. MMSegmentation: OpenMMLab Semantic Segmentation Toolbox and Benchmark. 2020. Available online: https://github.com/open-mmlab/mmsegmentation (accessed on 15 January 2023).
  58. Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 6626–6637. [Google Scholar]
  59. Sergiyenko, O.Y.; Tyrsa, V.V. 3D optical machine vision sensors with intelligent data management for robotic swarm navigation improvement. IEEE Sens. J. 2020, 21, 11262–11274. [Google Scholar] [CrossRef]
  60. Zhang, M.; Zhou, Y.; Zhao, J.; Man, Y.; Liu, B.; Yao, R. A survey of semi-and weakly supervised semantic segmentation of images. Artif. Intell. Rev. 2020, 53, 4259–4288. [Google Scholar] [CrossRef]
  61. Illarionova, S.; Trekin, A.; Ignatiev, V.; Oseledets, I. Tree species mapping on sentinel-2 satellite imagery with weakly supervised classification and object-wise sampling. Forests 2021, 12, 1413. [Google Scholar] [CrossRef]
  62. Nesteruk, S.; Illarionova, S.; Akhtyamov, T.; Shadrin, D.; Somov, A.; Pukalchik, M.; Oseledets, I. XtremeAugment: Getting More From Your Data Through Combination of Image Collection and Image Augmentation. IEEE Access 2022, 10, 24010–24028. [Google Scholar] [CrossRef]
  63. Nesteruk, S.; Zherebtsov, I.; Illarionova, S.; Shadrin, D.; Somov, A.; Bezzateev, S.V.; Yelina, T.; Denisenko, V.; Oseledets, I. CISA: Context Substitution for Image Semantics Augmentation. Mathematics 2023, 11, 1818. [Google Scholar] [CrossRef]
  64. Illarionova, S.; Nesteruk, S.; Shadrin, D.; Ignatiev, V.; Pukalchik, M.; Oseledets, I. MixChannel: Advanced augmentation for multispectral satellite images. Remote Sens. 2021, 13, 2181. [Google Scholar] [CrossRef]
  65. Nesteruk, S.; Bezzateev, S. Location-based protocol for the pairwise authentication in the networks without infrastructure. In Proceedings of the 2018 22nd Conference of Open Innovations Association (FRUCT), Jyvaskyla, Finland, 15–18 May 2018; pp. 190–197. [Google Scholar]
  66. Zacharov, I.; Arslanov, R.; Gunin, M.; Stefonishin, D.; Bykov, A.; Pavlov, S.; Panarin, O.; Maliutin, A.; Rykovanov, S.; Fedorov, M. “Zhores”—Petaflops supercomputer for data-driven modeling, machine learning and artificial intelligence installed in Skolkovo Institute of Science and Technology. Open Eng. 2019, 9, 512–520. [Google Scholar] [CrossRef]
Figure 1. The study workflow consists of two steps. Firstly, a super-resolution (SR) model is trained to upscale satellite images by a factor of 4. Secondly, a neural network model is developed to segment buildings using the adjusted images of the Sentinel-2 satellite, which have been enhanced to a spatial resolution of 2.5 m.
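To make the two-step workflow in Figure 1 concrete, the following minimal Python sketch traces the inference path only. The names `sr_model` and `seg_model` are placeholders for a trained ×4 super-resolution network (e.g., MCGR) and a trained building-segmentation network (e.g., a SWIN-based model); the tensor shapes and normalisation are assumptions, not the authors' implementation.

```python
# Illustrative sketch of the two-step inference in Figure 1; `sr_model` and
# `seg_model` are hypothetical placeholders for trained PyTorch modules.
import numpy as np
import torch


def building_mask_from_sentinel(rgb_patch: np.ndarray, sr_model, seg_model) -> np.ndarray:
    """rgb_patch: HxWx3 uint8 Sentinel-2 RGB tile at 10 m spatial resolution."""
    x = torch.from_numpy(rgb_patch).float().permute(2, 0, 1).unsqueeze(0) / 255.0
    with torch.no_grad():
        sr = sr_model(x)        # step 1: upscale by a factor of 4 (10 m -> 2.5 m)
        logits = seg_model(sr)  # step 2: per-pixel class logits, assumed shape 1x2x(4H)x(4W)
    # argmax over the class dimension yields a binary building mask at 2.5 m
    return logits.argmax(dim=1).squeeze(0).cpu().numpy().astype(np.uint8)
```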
Figure 2. Examples of images from the xView dataset (a) and the Massachusetts Road and Building Detection datasets (b,c).
Figure 3. An example of a Mapbox image with a spatial resolution of 1 m (a) and the corresponding Sentinel-2 image with a spatial resolution of 10 m (b).
Figure 4. An example of garage polygons excluded from the annotation. Original image (a), mask (b), and overlay of the image and mask (c).
Figure 5. An example of individual low-rise buildings excluded from the annotation. Original image (a), mask (b), and overlay of the image and mask (c).
Figure 6. Example of a Krasnoyarsk image (a) and the corresponding mask (b).
Figure 7. Frequency plots for building footprint location (a) and size (b); all masks are resized to a common size.
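Statistics such as those in Figure 7 can be gathered by resizing every mask to a common grid, accumulating the per-pixel building frequency for the location plot, and measuring connected components for the size histogram. The sketch below is an illustration under assumptions (a hypothetical `masks/` folder of single-band binary masks and a 512 × 512 common size); it is not the authors' script.

```python
# Sketch of building-footprint location/size statistics in the spirit of
# Figure 7; the "masks" folder and the 512x512 common size are assumptions,
# and the masks are assumed to be single-band binary images.
import numpy as np
from pathlib import Path
from PIL import Image
from scipy import ndimage

COMMON_SIZE = (512, 512)                     # assumed common mask size
location_freq = np.zeros(COMMON_SIZE, dtype=np.float64)
footprint_sizes = []                         # pixel area of each footprint

for mask_path in Path("masks").glob("*.tif"):
    mask = np.array(Image.open(mask_path).resize(COMMON_SIZE, Image.NEAREST)) > 0
    location_freq += mask                    # (a) where buildings tend to occur
    labels, n_objects = ndimage.label(mask)  # (b) individual connected footprints
    footprint_sizes.extend(np.bincount(labels.ravel())[1:])

location_freq /= max(location_freq.max(), 1.0)  # normalise for plotting
```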
Figure 8. Visual comparison and image quality assessment of the LR image (a) against the SRGAN (b), MCGR (c), RCAN (d), and SR3 (e) outputs at a scale factor of 4 for Sentinel-2 images.
Figure 9. Prediction results on Mapbox test images with a spatial resolution of 1 m using the SWIN Tiny transformer.
Figure 10. Prediction results on Mapbox test images with a spatial resolution of 2.5 m using the SWIN Tiny transformer.
Figure 11. Prediction results on Sentinel-2 test images using the SWIN Base transformer.
Figure 12. Prediction results on Sentinel-2 test images with a spatial resolution of 2.5 m using the SWIN Tiny transformer.
Figure 13. Comparison of MCGR inference on Mapbox and Sentinel-2 images of the same RoI: Mapbox at the original 1 m resolution (a), Mapbox up-scaled from 10 m to 2.5 m (b), Sentinel-2 at the original 10 m resolution (c), and Sentinel-2 up-scaled to 2.5 m (d).
Figure 14. Comparison of the metrics (mIoU) for prediction results on different study regions. Metrics are calculated for the best model for each dataset. Sentinel-2 images are upscaled using CycleGAN.
Figure 15. The red square highlights the ability of the proposed approach to distinguish between multi-storey buildings situated in close proximity to each other. The prediction is generated using an up-scaled Sentinel-2 image.
Table 1. Overall dataset statistics collected and used in this study.
| Split | Area (km²) | Target Class Area (km²) | Target Area (%) | Pixel Range (Image/Mask) |
| Train images | 679.5 | 119.4 | 17 | 0–255/Binary |
| Validation images | 116.6 | 23.3 | 20 | 0–255/Binary |
| Test images | 295.1 | 74.77 | 27 | 0–255/Binary |
| Total | 1091.2 | 217.4 | 20 | 0–255/Binary |
Table 2. Summary of metrics of SR results on Mapbox data.
| Model | PSNR | SSIM | FID |
| SRGAN | 27.29 | 0.78 | 3.85 |
| MCGR | 27.54 | 0.79 | 4.59 |
| RCAN | 27.24 | 0.76 | 12.07 |
| SR3 | 26.24 | 0.62 | 1.40 |
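The PSNR and SSIM values in Table 2 follow their standard definitions and can be reproduced for any reference/super-resolved image pair, for instance with scikit-image as sketched below. This is generic code, not the authors' evaluation script, and FID is omitted because it additionally requires Inception features computed over the whole image set.

```python
# Generic PSNR/SSIM computation in the spirit of Table 2 (not the authors'
# evaluation code); hr is the Mapbox reference, sr the super-resolved output.
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity


def psnr_ssim(hr: np.ndarray, sr: np.ndarray) -> tuple:
    """hr, sr: HxWx3 uint8 images of identical size."""
    psnr = peak_signal_noise_ratio(hr, sr, data_range=255)
    ssim = structural_similarity(hr, sr, channel_axis=-1, data_range=255)
    return psnr, ssim
```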
Table 3. Summary of segmentation metrics for different algorithms on Mapbox images with a spatial resolution of 1 m.
| Model | Type of Architecture/Encoder | mIoU | mF1-Score | IoU, Building Class | F1-Score, Building Class |
| DeepLabV3 | ResNet18 | 76.12 ± 0.1 | 85.18 ± 0.08 | 57.86 ± 0.18 | 73.26 ± 0.14 |
| DeepLabV3 | ResNet50 | 76.48 ± 0.25 | 85.43 ± 0.19 | 58.32 ± 0.4 | 73.63 ± 0.25 |
| DeepLabV3 | ResNet101 | 75.76 ± 0.31 | 84.90 ± 0.23 | 57.24 ± 0.47 | 72.76 ± 0.38 |
| SWIN | Tiny | 75.10 ± 0.32 | 84.35 ± 0.24 | 56.03 ± 0.52 | 71.72 ± 0.42 |
| SWIN | Base | 75.84 ± 0.32 | 84.94 ± 0.25 | 57.32 ± 0.54 | 72.51 ± 0.45 |
| Twins | PCPVT | 75.18 ± 0.09 | 84.44 ± 0.07 | 56.26 ± 0.15 | 71.92 ± 0.13 |
| Twins | SVT | 75.62 ± 0.17 | 84.77 ± 0.12 | 56.96 ± 0.24 | 72.50 ± 0.19 |
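The building-class IoU and F1-score in Tables 3–7 follow the usual confusion-matrix definitions, IoU = TP/(TP + FP + FN) and F1 = 2TP/(2TP + FP + FN), with mIoU and mF1 averaged over the building and background classes. The NumPy sketch below illustrates the per-class computation; it is not the authors' evaluation pipeline.

```python
# Minimal IoU/F1 computation for a binary building mask (illustrative only).
import numpy as np


def building_iou_f1(pred: np.ndarray, target: np.ndarray, eps: float = 1e-9):
    """pred, target: HxW binary masks where 1 marks the building class."""
    pred, target = pred.astype(bool), target.astype(bool)
    tp = np.logical_and(pred, target).sum()
    fp = np.logical_and(pred, ~target).sum()
    fn = np.logical_and(~pred, target).sum()
    iou = tp / (tp + fp + fn + eps)
    f1 = 2 * tp / (2 * tp + fp + fn + eps)  # equivalent to 2*IoU / (1 + IoU)
    return float(iou), float(f1)
```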
Table 4. Summary of segmentation metrics for different algorithms on Mapbox images with a spatial resolution of 2.5 m.
| Model | Type of Architecture/Encoder | mIoU | mF1-Score | IoU, Building Class | F1-Score, Building Class |
| DeepLabV3 | ResNet18 | 72.79 ± 0.28 | 82.45 ± 0.24 | 51.70 ± 0.55 | 68.07 ± 0.49 |
| DeepLabV3 | ResNet50 | 71.86 ± 0.36 | 81.67 ± 0.31 | 50.07 ± 0.65 | 66.63 ± 0.58 |
| DeepLabV3 | ResNet101 | 71.07 ± 0.34 | 80.99 ± 0.3 | 48.71 ± 0.61 | 65.38 ± 0.57 |
| SWIN | Tiny | 71.50 ± 0.2 | 81.36 ± 0.17 | 49.44 ± 0.35 | 66.06 ± 0.31 |
| SWIN | Base | 71.77 ± 0.27 | 81.59 ± 0.23 | 49.95 ± 0.49 | 66.50 ± 0.44 |
| Twins | PCPVT | 71.27 ± 0.31 | 81.19 ± 0.28 | 49.13 ± 0.56 | 65.79 ± 0.52 |
| Twins | SVT | 71.33 ± 0.16 | 81.23 ± 0.14 | 49.17 ± 0.27 | 65.85 ± 0.25 |
Table 5. Summary of segmentation metrics for different algorithms on Sentinel-2 data. Images are upscaled to 2.5 m using the CycleGAN model.
| Model | Type of Architecture/Encoder | mIoU | mF1-Score | IoU, Building Class | F1-Score, Building Class |
| DeepLabV3 | ResNet18 | 69.07 ± 0.11 | 79.29 ± 0.1 | 45.29 ± 0.21 | 62.29 ± 0.2 |
| DeepLabV3 | ResNet50 | 67.79 ± 0.23 | 78.11 ± 0.19 | 43.02 ± 0.35 | 60.10 ± 0.34 |
| DeepLabV3 | ResNet101 | 67.31 ± 0.15 | 77.70 ± 0.17 | 42.34 ± 0.38 | 59.43 ± 0.37 |
| SWIN | Tiny | 69.42 ± 0.05 | 79.60 ± 0.05 | 45.91 ± 0.14 | 62.88 ± 0.14 |
| SWIN | Base | 69.36 ± 0.16 | 79.53 ± 0.13 | 45.77 ± 0.24 | 62.74 ± 0.23 |
| Twins | PCPVT | 69.13 ± 0.15 | 79.35 ± 0.13 | 45.46 ± 0.25 | 62.44 ± 0.24 |
| Twins | SVT | 69.28 ± 0.02 | 79.48 ± 0.03 | 45.70 ± 0.1 | 62.67 ± 0.1 |
Table 6. Summary of segmentation metrics for different algorithms on Sentinel-2 data. Images are upscaled to 2.5 m using the RCAN model.
| Model | Type of Architecture/Encoder | mIoU | mF1-Score | IoU, Building Class | F1-Score, Building Class |
| DeepLabV3 | ResNet18 | 68.78 ± 0.11 | 79.04 ± 0.1 | 44.9 ± 0.26 | 61.91 ± 0.24 |
| DeepLabV3 | ResNet50 | 68.07 ± 0.06 | 78.37 ± 0.11 | 43.57 ± 0.36 | 60.63 ± 0.35 |
| DeepLabV3 | ResNet101 | 69.02 ± 0.16 | 79.21 ± 0.14 | 45.08 ± 0.27 | 62.07 ± 0.26 |
| SWIN | Tiny | 69.53 ± 0.11 | 79.69 ± 0.09 | 46.11 ± 0.16 | 63.06 ± 0.16 |
| SWIN | Base | 69.3 ± 0.03 | 79.49 ± 0.04 | 45.7 ± 0.09 | 62.67 ± 0.09 |
| Twins | PCPVT | 68.75 ± 0.13 | 78.98 ± 0.11 | 44.68 ± 0.22 | 61.69 ± 0.2 |
| Twins | SVT | 69.02 ± 0.16 | 79.21 ± 0.14 | 45.08 ± 0.27 | 62.07 ± 0.27 |
Table 7. Summary of segmentation metrics for different algorithms on Sentinel-2 data. Images are upscaled to 2.5 m using the diffusion-based SR3 model.
| Model | Type of Architecture/Encoder | mIoU | mF1-Score | IoU, Building Class | F1-Score, Building Class |
| DeepLabV3 | ResNet18 | 68.11 ± 0.17 | 78.43 ± 0.24 | 43.68 ± 0.68 | 60.75 ± 0.66 |
| DeepLabV3 | ResNet50 | 67.21 ± 0.23 | 77.64 ± 0.17 | 42.29 ± 0.29 | 59.39 ± 0.29 |
| DeepLabV3 | ResNet101 | 66.4 ± 0.37 | 76.89 ± 0.4 | 40.95 ± 0.9 | 58.06 ± 0.91 |
| SWIN | Tiny | 68.74 ± 0.11 | 79.05 ± 0.12 | 45.0 ± 0.3 | 62.03 ± 0.29 |
| SWIN | Base | 68.89 ± 0.08 | 79.17 ± 0.07 | 45.19 ± 0.12 | 62.2 ± 0.11 |
| Twins | PCPVT | 68.61 ± 0.09 | 78.93 ± 0.07 | 44.77 ± 0.08 | 61.8 ± 0.08 |
| Twins | SVT | 68.77 ± 0.07 | 79.05 ± 0.06 | 44.92 ± 0.14 | 61.94 ± 0.14 |
Table 8. Comparison of publicly available remote sensing datasets and the dataset collected in this study.
| Name | Data Source | Categories | Number of Images | Image Width (px) | Spatial Resolution (m) | Type of Annotation | Area (km²) | Image Format | Publication Year |
| Massachusetts Buildings Dataset | Aerial | 1 | 151 | 1500 | 1 | Segmentation mask | 340 | TIF | 2013 |
| Massachusetts Roads Dataset | Aerial | 1 | 1171 | 1500 | 1 | Segmentation mask | 2600 | TIF | 2013 |
| NWPU VHR-10 | Google Earth, Vaihingen dataset | 10 | 800 | ∼1000 | 0.08–2 | HBB | NA | JPEG | 2014 |
| RSOD | Google Earth, Tianditu | 4 | 976 | ∼1000 | 0.5–3.0 | HBB | NA | JPEG | 2017 |
| Inria | Aerial | 1 | 360 | 1500 | 0.3 | Segmentation mask | 810 | TIF | 2017 |
| xView | WorldView-3 | 60 | 1127 | 3000 | 0.3 | HBB | 1400 | TIF | 2018 |
| DOTA-v1 | Multiple sources | 15 | 2806 | 800–4000 | 1–4.5, 0.81, 0.72, 0.1 | OBB | NA | PNG | 2018 |
| iSAID | DOTA | 15 | 2806 | 800–13,000 | 1–4.5, 0.81, 0.72, 0.1 | Segmentation mask | NA | PNG | 2019 |
| SpaceNet | WorldView-2 & 3 | 1 | 26,586 | 650–1300 | 0.3, 0.5 | Segmentation mask | 5555 | TIF | 2018 |
| SpaceNet MVOI | WorldView-2 | 1 | 60,000 | 900 | 0.46–1.67 | Segmentation mask | 665 | TIF | 2019 |
| HRRSD | Google Earth, Baidu | 13 | 26,722 | 800–2000 | 0.15–1.2 | HBB | NA | JPEG | 2019 |
| FAIR1M | Google Earth, Gaofen | 5 | 15,266 | 1000–10,000 | 0.3–0.8 | OBB | NA | TIF | 2022 |
| RSSOD | Multiple sources | 5 | 1759 | ∼1000 | 0.05–0.8 | HBB | NA | TIF | 2022 |
| Ours | Sentinel-2 | 1 | 30 | 3000 | 2.5 | Segmentation mask | 1091.2 | TIF | 2023 |
NA—information is not available. The last row (Ours) is the dataset collected in this study.
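Because the collected dataset (last row of Table 8) is distributed as GeoTIFF tiles with binary segmentation masks, a standard geospatial stack such as rasterio is sufficient to read it. The file names below are hypothetical and the snippet is only a sketch of typical usage, not an official loader.

```python
# Sketch of reading one image tile and its mask with rasterio; the file
# names are hypothetical examples, not actual dataset paths.
import numpy as np
import rasterio

with rasterio.open("tile_0001_image.tif") as src:
    image = src.read([1, 2, 3])               # 3 x H x W RGB array (0-255)
    transform, crs = src.transform, src.crs   # georeferencing of the tile

with rasterio.open("tile_0001_mask.tif") as src:
    mask = src.read(1).astype(np.uint8)       # H x W binary mask (1 = building)

print(image.shape, mask.shape, crs)
```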
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
