Article

Rural Building Extraction Based on Joint U-Net and the Generalized Chinese Restaurant Franchise from Remote Sensing Images

1 School of Geographical Sciences, Hebei Normal University, Shijiazhuang 050024, China
2 Hebei Technology Innovation Center for Remote Sensing Identification of Environmental Change, Shijiazhuang 050024, China
* Author to whom correspondence should be addressed.
Sustainability 2023, 15(5), 4685; https://doi.org/10.3390/su15054685
Submission received: 28 December 2022 / Revised: 28 February 2023 / Accepted: 3 March 2023 / Published: 6 March 2023
(This article belongs to the Special Issue Application of Remote Sensing for Sustainable Development)

Abstract

The extraction of rural buildings from remote sensing images plays a critical role in the development of rural areas. However, automatic building extraction remains challenging because of the diverse building types and complex backgrounds. In this paper, we propose a two-layer clustering framework named gCRF_U-Net for the extraction of rural buildings. Before the building extraction, the potential built-up areas are first detected and taken as a constraint for building extraction. Then, the U-Net network is employed to obtain the prior probability of the potential buildings. After this, the calculated probability and the satellite image are put into the generalized Chinese restaurant franchise (gCRF) model to cluster buildings and non-buildings. In addition, it is worth noting that the hierarchical spatial relationship in the images is clarified for the building extraction. Experiments on satellite images and public building datasets show that the proposed method outperforms other methods based on the same unified hierarchical model in both quantitative and qualitative evaluation.

1. Introduction

Compared with urban buildings, rural buildings are often surrounded by trees, farmland and bare land, and thus have complex background information. In addition, rural buildings lack unified planning and management, which leads to diverse roof types. Moreover, most studies pay attention to urban buildings, and only a few focus on rural buildings. Automatically and accurately extracting rural buildings is of great significance in various fields, such as rural land resource allocation, rural population estimation, rural modernization and rural development and planning.
Methods of building extraction fall into two categories: traditional methods and deep learning methods. Traditional methods can be further divided into four types: object-oriented methods [1,2]; extraction based on low-level features such as morphology [3] and color [4]; methods combined with auxiliary information such as shadow [5] and elevation [6,7]; and statistical models [8,9] such as Markov models [10], conditional random fields [11] and probabilistic topic models [12]. The Chinese restaurant franchise (CRF) is a two-layer hierarchical clustering model belonging to the probabilistic topic models. Mao et al. proposed the generalized Chinese restaurant franchise (gCRF) [13] for unsupervised image classification by fusing panchromatic (PAN) and multispectral (MS) images. Li et al. presented gCRF_MBI, which introduced the morphological building index (MBI) [14] into the gCRF method to detect buildings from very high-resolution (VHR) satellite images [15]. Owing to its flexible hierarchical structure, the probabilistic model can extract buildings effectively by computing the probability distributions of buildings and non-buildings. However, the inference of these probabilities may be affected by the differing spectral information of different buildings in the satellite images.
In VHR satellite images, rural buildings are complex and often crisscrossed with vegetation, roads, farmland and so on. Even though high-resolution remote sensing images provide rich detail, complex land-cover types, shadows and related problems remain difficult to overcome, making the phenomena of different objects sharing the same spectrum and the same objects showing different spectra even more common. Although the traditional methods above are effective in some scenes, they are limited in solving building extraction against complex backgrounds and lack sufficient generalization ability.
With the successful application of convolutional neural networks (CNN) in image classification, various network models have been widely used for building extraction from remote sensing images. The fully convolutional network (FCN) [16] is a semantic segmentation network that can obtain pixel-level detection results and produce a binary building map of the same size as the input image. Compared with the FCN, U-Net [17] adds channel-wise feature fusion, concatenating low-level information with high-level information along the channel dimension, which better preserves boundary information. These characteristics make it well suited to building detection. For example, Pan et al. employed U-Net to extract urban village buildings from WorldView satellite images [18]. Li et al. proposed an attention-enhanced U-Net for building extraction in farmland using Google and WorldView-2 remote sensing images [19]. Compared with the FCN [16] and SegNet [20], experimental results showed that U-Net required fewer training samples and fewer computational resources while achieving better performance. However, most researchers have concentrated on modifying the network structure, as in Residual U-Net [21,22,23] and Attention U-Net [24,25], and few have combined U-Net with other frameworks.
Many studies have shown that the U-Net architecture is an effective way to obtain the prior probability of potential buildings from remote sensing images. However, the segmentation threshold needs to be set manually. The gCRF, in contrast, has a flexible hierarchical structure in which the number of clusters is determined automatically. To make building extraction more flexible and universal, we introduce the probability output of U-Net into the gCRF framework and propose the gCRF_U-Net method for rural building extraction. In the proposed method, the extraction of rural buildings is restricted to built-up area candidates (BACs) detected with the spectral residual (SR) method [26]. The proposed method consists of two steps: first, the U-Net network extracts feature maps from a small set of training samples and outputs a probability of potential buildings, which is used as an input to the gCRF model; then, the MS image and the calculated probability are put into the gCRF for hierarchical clustering to distinguish buildings from non-buildings. In addition, it is worth noting that the hierarchical spatial relationship from pixels to BACs, that is, “pixels—superpixels—structures—buildings—BACs”, is clarified in this paper. In the experiments, the proposed method was compared with other relevant methods based on the gCRF model and achieved better performance.
The remainder of this paper is organized as follows: Section 2 introduces the overall architecture of the proposed method. In Section 3, our experiments and results are shown, and in Section 4 the results are discussed. Finally, a conclusion is drawn in Section 5.

2. Methods

As shown in Figure 1, before the extraction of the buildings, the potential BACs were first detected using the SR method [26] and taken as a constraint for building extraction. Then, the U-Net network was employed to select high-level features from the small set of samples, and the results were used as the prior probability of the potential buildings. After that, the calculated probability and the MS image were put into the gCRF_U-Net model to cluster buildings and non-buildings. Finally, the clustering results were processed using a morphological filter. The details of the gCRF_U-Net method are described below.

2.1. SR

The spectral residual (SR) method is a visual attention model that can effectively detect salient regions in remote sensing images [27]. In this paper, the SR method is used to extract the BACs. First, the frequency-domain representation of the input image is obtained by the Fourier transform; then, the spectral residual R(f) of the image is calculated by Formula (1):
$$R(f) = L(f) - h(f) * L(f), \qquad (1)$$
where L(f) = log(A(f)), A(f) = abs(f) is the amplitude spectrum of the Fourier transform f, and h(f) is a local averaging filter convolved with the log-amplitude spectrum.
The spectral residual R(f) captures the salient regions, i.e., the BACs, of the input image. The final saliency map in the spatial domain is computed by the inverse Fourier transform. Finally, the Otsu threshold [28] is used to obtain the binary image of the BACs.
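A minimal sketch of this SR pipeline is given below; the single-band input, the 3 × 3 averaging filter (following Hou and Zhang [26]) and the scikit-image Otsu call are illustrative assumptions, not the authors' code.

```python
# Sketch of the spectral residual (SR) method used to detect BACs.
import numpy as np
from scipy.ndimage import uniform_filter
from skimage.filters import threshold_otsu

def spectral_residual_bacs(gray):
    """gray: 2-D float array (single-band image); returns a binary BAC mask."""
    f = np.fft.fft2(gray)
    log_amplitude = np.log(np.abs(f) + 1e-8)    # L(f) = log(A(f))
    phase = np.angle(f)
    # R(f) = L(f) - h(f) * L(f), with h(f) a 3 x 3 local averaging filter
    residual = log_amplitude - uniform_filter(log_amplitude, size=3)
    # Back to the spatial domain; the squared magnitude is the saliency map
    saliency = np.abs(np.fft.ifft2(np.exp(residual + 1j * phase))) ** 2
    # Otsu thresholding yields the binary built-up area candidates
    return saliency > threshold_otsu(saliency)
```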

2.2. gCRF_U-Net

As shown in Figure 2, the proposed gCRF_U-Net method consists of three steps: data processing, feature extraction and hierarchical clustering.
In the data processing step, the PAN images were first oversegmented into a set of superpixels; each superpixel, a group of adjacent pixels, was represented as a supercustomer in the gCRF_U-Net and used as the smallest unit in the later study. Before the building extraction, the MS images were fused with the PAN images to reach the same resolution as the PAN images using Gram-Schmidt pan sharpening in ENVI software with nearest-neighbor resampling. At the same time, the SR method was used to detect the BACs from the MS image. The BACs correspond to the restaurants in the gCRF_U-Net and were used as a constraint for feature extraction and classification.
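As an illustration of this step, the sketch below oversegments the PAN image and computes one mean feature vector per superpixel; SLIC is an assumed choice, since the paper does not name its oversegmentation algorithm.

```python
# Illustrative data-processing sketch: superpixels as supercustomers.
import numpy as np
from skimage.segmentation import slic

def oversegment(pan, n_segments=2000):
    """Oversegment the single-band PAN image into an integer label map."""
    return slic(pan, n_segments=n_segments, compactness=0.1,
                channel_axis=None, start_label=0)

def supercustomer_features(image, labels):
    """Mean spectral vector of each superpixel: one 'supercustomer' each.
    image: (H, W, C) pan-sharpened MS image; labels: (H, W) label map."""
    return {k: image[labels == k].mean(axis=0) for k in np.unique(labels)}
```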
In the second step, the U-Net architecture was used to obtain the probability of buildings from the images. As shown in Figure 3, U-Net consists of a contracting path and an expansive path. The contracting path extracts image features and reduces the spatial dimensions; the expansive path recovers image detail and the spatial dimensions. U-Net uses symmetric skip connections to concatenate the feature maps of the contracting path with the output of the expansive path. Thanks to this structure, U-Net combines deep semantic features with the detail of shallow features and thus performs well in semantic segmentation, which makes it suitable for building extraction in rural areas. Therefore, we used U-Net to obtain both detailed information and high-level semantic information, producing the prior probability of the potential buildings used as an input to the gCRF_U-Net model.
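For reference, a compact Keras sketch of such a U-Net follows; the depth, the 64–1024 filter counts and the three-band input are assumptions carried over from the original U-Net design [17], not necessarily the exact configuration used here.

```python
# Compact U-Net sketch: contracting path, expansive path, skip connections.
import tensorflow as tf
from tensorflow.keras import layers

def conv_block(x, filters):
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return layers.Conv2D(filters, 3, padding="same", activation="relu")(x)

def build_unet(input_shape=(512, 512, 3)):
    inputs = layers.Input(input_shape)
    skips, x = [], inputs
    for f in (64, 128, 256, 512):                   # contracting path
        x = conv_block(x, f)
        skips.append(x)
        x = layers.MaxPooling2D(2)(x)
    x = conv_block(x, 1024)                         # bottleneck
    for f, skip in zip((512, 256, 128, 64), reversed(skips)):
        x = layers.Conv2DTranspose(f, 2, strides=2, padding="same")(x)
        x = layers.Concatenate()([x, skip])         # symmetric skip connection
        x = conv_block(x, f)
    outputs = layers.Conv2D(1, 1, activation="sigmoid")(x)  # building probability
    return tf.keras.Model(inputs, outputs)
```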
In the third step, the features extracted by the U-Net and the MS image were input to the gCRF_U-Net model for hierarchical clustering to distinguish buildings from non-buildings. The gCRF_U-Net metaphor in this paper is explained as follows: suppose there are multiple restaurants that share the same menu. Two types of customers, white supercustomers from the U-Net probability image and color supercustomers from the MS image, go to the restaurants. There are two iterative processes in the hierarchical clustering model: table selection and dish selection, which correspond to local clustering and global clustering, respectively. In the table selection, each white supercustomer is allocated a local label using the currently inferred statistical models based on the prior probability from the U-Net. After all the white supercustomers have chosen their tables, in the dish selection, the corresponding color supercustomers first replace the white supercustomers, and then all the color supercustomers with the same local label are allocated global labels based on the MS image. The lexical correspondence between the gCRF_U-Net method and image structures is shown in Table 1.
In the algorithm of gCRF_U-Net, the superscripts U and M denote the white supercustomers from the U-Net and the color supercustomers from the MS image, respectively; i indexes the built-up area candidates (restaurants) and j indexes the superpixels within them. The two processes of table selection and dish selection are explained below, followed by an illustrative code sketch.
1. Table Selection
Assume $T_{\neg ij}$ is the set of tables excluding the jth white supercustomer $\theta^{U}_{ij}$ in the ith restaurant, and $K^{U}_{\neg ij}$ is the set of dishes excluding $\theta^{U}_{ij}$, where $\neg ij$ denotes all individuals in the statistic except individual ij. The white supercustomer $\theta^{U}_{ij}$ selects a table t with probability

$$p(t_{ij}=t \mid T_{\neg ij}, K_{\neg ij}) \propto \begin{cases} n^{\neg ij}_{it}\, p^{U}_{k_{it}}\!\left(x^{U}_{ij}\right), & \text{if table } t \text{ exists},\\ \alpha_{0}\, p^{U}_{k}\!\left(x^{U}_{ij}\right), & \text{if table } t \text{ is a new table}, \end{cases} \qquad (2)$$

where $p^{U}_{k_{it}}\!\left(x^{U}_{ij}\right)$ is the conditional likelihood of the superpixel $x^{U}_{ij}$ in the U-Net image given the multinomial distribution $k^{U}_{it}$. The white supercustomer $\theta^{U}_{ij}$ corresponds to the jth superpixel $x^{U}_{ij}$ in the ith restaurant, which is generated according to a pre-assumed probability model, e.g., a multinomial distribution. If table t is a new table, a dish k is selected with probability

$$p(k_{it^{\mathrm{new}}}=k \mid T_{\neg ij}, K_{\neg ij}) \propto \alpha\, p^{U}_{k}\!\left(x^{U}_{ij}\right), \qquad (3)$$
where the prior α is the same for either the existing dish or a newly added dish.
2. Dish Selection
After all the white supercustomers have been seated, they are directly replaced by color supercustomers at the same geographic locations. Then, the color supercustomers select a new dish for each table using the MS image.
Suppose that all of the tables except table t in the ith restaurant have been served with the dishes $K^{M}_{\neg it}$; then a dish k is selected for table t with probability

$$p(k^{M}_{it}=k \mid T_{\neg it}, K^{M}_{\neg it}) \propto \begin{cases} m_{k}\, p^{M}_{k}\!\left(x^{M}_{it}\right), & \text{if } k \text{ is a previously used dish},\\ \beta\, p^{M}_{k}\!\left(x^{M}_{it}\right), & \text{if } k \text{ is a new dish}, \end{cases} \qquad (4)$$

where $p^{M}_{k}\!\left(x^{M}_{it}\right)$ is the likelihood of the superpixel $x^{M}_{it}$ in the MS image given the Gaussian distribution of dish k, and $m_{k}$ is the number of tables currently serving dish k. The color supercustomer sitting at table t in the ith restaurant corresponds to the superpixel $x^{M}_{it}$, which is generated according to a pre-assumed probability model, e.g., a Gaussian model.
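The sketch below illustrates the two sampling steps just described. It is a deliberately simplified, assumption-laden rendering: the likelihood functions `lik_u` and `lik_m`, the count bookkeeping and the handling of newly created tables and dishes are placeholders for the full nonparametric sampler, not the authors' implementation.

```python
# Illustrative sketch of one sweep of table selection (local clustering)
# and dish selection (global clustering). All names are hypothetical.
import numpy as np

def sample_index(weights):
    """Draw an index proportionally to a list of non-negative weights."""
    w = np.asarray(weights, dtype=float)
    return np.random.choice(len(w), p=w / w.sum())

def select_table(x_u, table_counts, table_dishes, lik_u, alpha0):
    """Table selection for one white supercustomer with U-Net feature x_u.
    table_counts[t]: customers already seated at table t (n_it without ij);
    table_dishes[t]: the dish currently served at table t."""
    weights = [table_counts[t] * lik_u(x_u, table_dishes[t])
               for t in range(len(table_counts))]
    weights.append(alpha0 * lik_u(x_u, None))   # weight of opening a new table
    return sample_index(weights)                # == len(table_counts) -> new

def select_dish(x_m, dish_counts, dish_params, lik_m, beta):
    """Dish selection for one table using the MS feature x_m of its
    color supercustomers. dish_counts[k]: tables currently serving dish k."""
    weights = [dish_counts[k] * lik_m(x_m, dish_params[k])
               for k in range(len(dish_counts))]
    weights.append(beta * lik_m(x_m, None))     # weight of creating a new dish
    return sample_index(weights)
```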

2.3. Post-processing

Using superpixels as the basic units for building extraction can reduce the “salt and pepper” effect, but other problems remain in the results, such as hollow areas and jagged boundaries. Therefore, the extracted results are post-processed using morphological filtering. After comparing different morphological filters of different sizes, an opening operation with a 5 × 5 kernel is applied to improve the extracted results.
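A minimal sketch of this post-processing step, using OpenCV as an assumed implementation of the 5 × 5 opening:

```python
# Post-processing: a 5 x 5 morphological opening on the binary mask.
import cv2

def postprocess(binary_mask):
    """binary_mask: uint8 array, buildings = 255, background = 0."""
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (5, 5))
    return cv2.morphologyEx(binary_mask, cv2.MORPH_OPEN, kernel)
```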

3. Experiments and Results

In this section, we first describe the experimental setting and evaluation method; then, the performance of the proposed method is evaluated both quantitatively and qualitatively.

3.1. Experimental Setting

3.1.1. Experimental Data

To verify the proposed method, the experimental data included very high-resolution (VHR) satellite images and public building datasets, covering rural buildings in the north and south of China as well as two other areas around the world. The VHR satellite images and public building datasets are as follows.
1. VHR satellite images
Two VHR satellite images were used in our experiments. One image, covering the town of Hanwang in Sichuan province in the south of China, was taken by the QuickBird satellite. The other, covering Xiong’an New Area in Hebei province in the north of China, was acquired by the WorldView satellite. The spatial resolutions of the two images are 0.6 m and 0.5 m, respectively. The buildings in both towns are clustered into scattered blocks, and the single buildings in each block are dense. In terms of spectra, the buildings in the QuickBird image are relatively simple, while those in the WorldView image are more complex; for example, the roof materials differ. The ground truth for the VHR satellite images was constructed by manual visual interpretation using Labelme software.
2. Public building datasets
Part of the data from two public building datasets, the WHU Building Dataset [29] produced by Wuhan University and the INRIA aerial image dataset [30], was used in our experiment. The selected experimental data from the WHU Building Dataset were taken from East Asia with 0.45 m resolution. The aerial images of Kitsap County in the INRIA aerial image dataset were taken in Washington State with a spatial resolution of 0.3 m. All of the buildings in these two datasets are individual buildings, which are different from those in the VHR satellite images.

3.1.2. Experimental Environment and Setting

In the experiment, the Adam optimizer was used to optimize the U-Net, with the initial learning rate set to 0.0001. The model was trained using the TensorFlow framework with Python 3.7; the training batch size was set to 2 and the number of epochs to 50. The loss function used in the training stage was binary cross-entropy (binary_crossentropy).
Due to the architecture of the U-Net and the limited memory of the adopted graphics card, 30 images of size 512 × 512 were randomly cropped from the original satellite images and the INRIA aerial images as training samples, respectively; in the WHU Building Dataset, 30 images were randomly selected as training samples. The training images and their corresponding ground truth were then fed into the U-Net to train the parameters of the model.
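Under these settings, the training stage can be sketched as follows; `build_unet` refers to the illustrative U-Net sketch in Section 2.2, and `train_images`/`train_masks` are assumed arrays holding the 30 cropped 512 × 512 samples and their ground truth.

```python
# Training configuration reported above: Adam with learning rate 1e-4,
# binary cross-entropy loss, batch size 2, 50 epochs.
import tensorflow as tf

model = build_unet(input_shape=(512, 512, 3))   # assumed helper (Section 2.2)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.fit(train_images, train_masks, batch_size=2, epochs=50)
```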

3.2. Evaluation Method

In our experiments, we used four well-known quality measures (precision, recall, F-value and IoU) to evaluate the performance of the proposed method. Precision represents the proportion of correctly predicted building pixels among all pixels identified as buildings by the proposed method. Recall represents the proportion of correctly predicted building pixels among all building pixels in the ground truth. The F-value is the harmonic mean of recall and precision and represents their combined performance. The IoU is the quotient of the intersection and union of the predicted and ground-truth building areas.
The definitions of the evaluation indicators are given as follows:
$$\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad (5)$$

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad (6)$$

$$F = \frac{2 \times \mathrm{Recall} \times \mathrm{Precision}}{\mathrm{Recall} + \mathrm{Precision}}, \qquad (7)$$

$$\mathrm{IoU} = \frac{TP}{TP + FP + FN}, \qquad (8)$$
where TP represents the building pixels detected by the method and included in the ground-truth, FP represents the building pixels detected by the method but not included in the ground-truth and FN represents the building pixels not detected by the method but included in the ground-truth.
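For clarity, the four measures can be computed from a predicted binary mask and the ground truth as in the following sketch (it assumes both masks contain at least one building pixel, so no denominator is zero):

```python
# Pixel-level precision, recall, F-value and IoU from boolean masks.
import numpy as np

def evaluate(pred, truth):
    """pred, truth: boolean arrays of identical shape (True = building)."""
    tp = np.logical_and(pred, truth).sum()
    fp = np.logical_and(pred, ~truth).sum()
    fn = np.logical_and(~pred, truth).sum()
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_value = 2 * precision * recall / (precision + recall)
    iou = tp / (tp + fp + fn)
    return precision, recall, f_value, iou
```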

3.3. Experimental Results

As shown in Figure 4, seven test images, drawn from the VHR satellite images and the public building datasets, are used in the experiments. Among the VHR satellite images, the first image is taken from the QuickBird sensor, and the second and third images are obtained from the WorldView sensor. Among the public building datasets, the fourth and fifth images are from the WHU Building Dataset, and the last two images are from the INRIA aerial image dataset. The corresponding ground truths, produced with Labelme software, are also given in Figure 4 as a reference.
The performance of our method is validated on three VHR satellite images and four public building images; the qualitative and quantitative results are discussed in the following subsections.

3.3.1. VHR Satellite Images

In order to verify the accuracy and applicability of the proposed method, other related methods based on the gCRF model were used for comparison. The gCRF [13] is a two-layer clustering model used for image classification by fusing the PAN image and the MS image. The gCRF_MBI [15] method introduced the MBI into the second layer to replace the MS image in gCRF. The performance of gCRF_MBI and gCRF has previously been compared with SVM and K-means at both pixel and segment levels, from qualitative and quantitative perspectives [15]. Following gCRF_MBI, we introduced the MBI into the first layer of gCRF and used the MS image for global clustering in the second layer; this variant is called MBI_gCRF. Therefore, in the experiments, the proposed method is compared with gCRF, gCRF_MBI and MBI_gCRF.
Figure 5 shows the results for the three VHR satellite images. As shown in Figure 5, the buildings extracted by gCRF_U-Net are predicted more precisely and are correctly separated from the non-building areas. In the first image, the building boundaries extracted by the other methods are clearly fuzzy, and the buildings are not fully extracted, as in the purple box; redundant extraction is also more common with the other methods. In contrast, the results obtained with the gCRF_U-Net method are better. Specifically, the last image in the first row of Figure 5 shows that the buildings extracted with our method are closer to the ground truth, with more correct and complete boundaries. In the other two test images in Figure 5, the building roofs differ in their spectra and are surrounded by bare land and vegetation, so there are many mistakes in the results of the three other methods. Specifically, as shown in the purple box, some gray buildings are missed and some bare land or vegetation is extracted by mistake; this also occurs in other areas of the images. Compared with the other methods, the results of the gCRF_U-Net method contain the most true-positive areas, and the boundary of each building is more complete.
However, the proposed method did not perform well in some respects. For example, a courtyard enclosed by a building may have the same spectral information as some roofs, which makes it difficult to distinguish from buildings. Meanwhile, if buildings are adjacent and their boundaries are fuzzy, it is difficult to delineate them accurately.
In order to analyze the building extraction results of VHR satellite images quantitatively, four indicators are used to evaluate the extracted results. As is shown in Table 2, the proposed method achieves the best results based on the three test images from the VHR satellite images.
The proposed method achieves the best visual results of building extraction, and all quantitative indicators are better than those of the three other methods. With our method, the first image, taken from the QuickBird sensor, yields high values on the quantitative indicators: the IoU reached 69.21% and the F score 81.80%. The F score obtained by our model is 20.50%, 20.26% and 15.88% higher than those of the other three models, respectively, and the IoU is 25.01%, 24.77% and 20.04% higher, respectively. The other two images, taken from WorldView, produced results similar to those of the QuickBird image: their IoU values reach 70.77% and 72.95%, and their F scores reach 82.88% and 84.36%, respectively. All four quantitative indicators of the gCRF_U-Net method are better than those of the other three methods. However, due to the complex spectral features of the WorldView images, the other methods perform worse on them than on the QuickBird image.

3.3.2. Public Building Datasets

The experimental results of the public building datasets with different methods are shown in Figure 6. It can be seen that the building areas extracted by the gCRF_U-Net method are correctly separated from the non-building areas; across the test images, the gCRF_U-Net results contain the most true-positive areas, and the building boundaries are more complete. The first and second images are from the WHU Building Dataset. They show that omission of buildings and incorrect building extraction are more common with the other methods. As shown in the purple boxes of the first and second images, the other methods produce extraction errors and incomplete single buildings. In contrast, the building results from the gCRF_U-Net method are more complete, with fewer extraction errors. In the third and fourth images, from the INRIA aerial image dataset, the gCRF_U-Net method produces results closest to the ground truth and obtains the most true-positive areas. By comparison, as shown in the purple boxes of the last two images, some non-building areas, such as roads and bare land, are mistakenly extracted as buildings by the other methods, which gives their results higher recall but lower precision.
Table 3 shows the quantitative evaluation of the four test images from the two public building datasets. The buildings in the public building datasets are individual buildings, unlike those in the VHR satellite images. As shown in Table 3, all the F scores and IoUs computed by the gCRF_U-Net method are higher than those of the other methods. Specifically, the F scores of the four images reached 81.19%, 73.53%, 89.36% and 82.86%, and the IoUs reached 68.33%, 58.14%, 80.77% and 70.73%, respectively. Across the test images, the performance of the other methods is not stable: for example, gCRF and gCRF_MBI perform better than MBI_gCRF on the first and second images, taken from the WHU Building Dataset, whereas the two test images from the INRIA aerial image dataset give different rankings.

4. Discussion

In this section, we discuss the proposed method from two aspects: the semantic segmentation network and the hierarchical spatial relationship.

4.1. Semantic Segmentation Network

Deep learning has become the most popular approach for building extraction. However, most networks require complex training conditions and large numbers of training samples, and it is difficult to determine the probability threshold automatically. The gCRF model is a hierarchical clustering model in which the number of clusters is determined automatically. Therefore, the U-Net network and the gCRF model are combined in this paper. The proposed gCRF_U-Net method is a two-layer hierarchical clustering model comprising local clustering and global clustering. In the proposed method, the U-Net network is employed to select high-level features from the small set of samples, and the extracted results are used as the prior probability of the potential buildings. Then, the calculated probability and the MS image are input to the gCRF model to cluster buildings and non-buildings. In contrast with the original gCRF model, and in order to improve the accuracy of building extraction, the calculated probability from the U-Net replaces the PAN image and is input to the clustering model as a prior. The experiments show that, within the hierarchical clustering model, the probability from the U-Net can effectively extract buildings and reduce the phenomena of different objects sharing the same spectrum and the same object showing different spectra.

4.2. Hierarchical Spatial Relationship

The proposed method for building extraction follows a hierarchical spatial relationship from pixels to buildings, that is, “pixels—superpixels—structures—buildings”. In order to extract buildings more accurately, the built-up area candidates are detected using the SR method as a constraint for building extraction, so the hierarchical spatial relationship from pixels to BACs becomes “pixels—superpixels—structures—buildings—BACs”. In this hierarchy, the superpixels eliminate the salt and pepper effect and the SR produces the BACs that constrain the building extraction, while the structures and buildings are extracted by local clustering and global clustering, respectively.
In this paper, the BACs are detected as a constraint on the buildings, so their boundaries are relatively rough. In common scenes, the buildings in a built-up area are clustered, and the SR method performs very well in this case. However, if the buildings in an image are very scattered, which is rare, a small part of the BAC boundary may be incomplete, which can lead to the loss of one or two buildings.

5. Conclusions

In this paper, we proposed the gCRF_U-Net method for rural building extraction based on a semantic segmentation network and a unified Bayesian framework. Before the building extraction, the potential built-up areas were first detected using the SR method and taken as a constraint for building extraction. Then, the U-Net network was employed to obtain a prior probability of the rural buildings, and the calculated probability and the MS image were put into the gCRF model to cluster buildings and non-buildings. In addition, an important contribution of this paper is that the proposed method clarifies the hierarchical spatial relationship of the images in building extraction.
To verify the effectiveness of the proposed method, we experimented on several test images from different VHR satellite images and public building datasets. The experimental results show that our method outperforms other methods based on the same unified hierarchical model in terms of recall, precision, F-value and IoU.
However, the proposed method does not perform well in some cases. For example, if the buildings are scattered, part of the boundary of the built-up area may be incomplete, which can lead to the loss of one or two buildings. Meanwhile, if several buildings are adjacent and the boundary between them is fuzzy, our method has difficulty determining their adjacency and may mistakenly identify them as a single building. Therefore, in the future, we will focus on improving the spectral residual method and on the division of adjacent buildings. In addition, to further improve performance, rural buildings could be divided into different scenes.

Author Contributions

Z.W. and S.L. designed and performed the experiment. Z.W. wrote the article. S.L. revised the manuscript and provided the gCRF_MBI code. Z.Z. assisted in collating experiment data. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China under grant 41801240, and the Science and Technology Project of Hebei Education Department under grant BJK2022031.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original satellite images in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Wang, C.; Shen, Y.; Liu, H.; Zhao, K.; Xing, H.; Qiu, X. Building Extraction from High-Resolution Remote Sensing Images by Adaptive Morphological Attribute Profile under Object Boundary Constraint. Sensors 2019, 19, 3737.
2. Mohammadi, H.; Samadzadegan, F. An object based framework for building change analysis using 2D and 3D information of high resolution satellite images. Adv. Space Res. 2020, 66, 1386–1404.
3. Ma, W.; Wan, Y.; Li, J.; Zhu, S.; Wang, M. An Automatic Morphological Attribute Building Extraction Approach for Satellite High Spatial Resolution Imagery. Remote Sens. 2019, 11, 337.
4. Ghandour, A.J.; Jezzini, A.A. Autonomous Building Detection Using Edge Properties and Image Color Invariants. Buildings 2018, 8, 65.
5. Gao, X.; Wang, M.; Yang, Y.; Li, G. Building Extraction From RGB VHR Images Using Shifted Shadow Algorithm. IEEE Access 2018, 6, 22034–22045.
6. Chen, S.; Zhang, Y.; Nie, K.; Li, X.; Wang, W. Extracting Building Areas from Photogrammetric DSM and DOM by Automatically Selecting Training Samples from Historical DLG Data. ISPRS Int. J. Geo-Inf. 2020, 9, 18.
7. Guo, L.; Deng, X.; Liu, Y.; He, H.; Lin, H.; Qiu, G.; Yang, W. Extraction of Dense Urban Buildings from Photogrammetric and LiDAR Point Clouds. IEEE Access 2021, 9, 111823–111832.
8. Sadeq, H. Building Extraction from Lidar Data Using Statistical Methods. Photogramm. Eng. Remote Sens. 2021, 87, 33–42.
9. Adelipour, S.; Ghassemian, H. Building extraction from very high-resolution synthetic aperture radar images based on statistical and structural information fusion. Int. J. Remote Sens. 2019, 40, 7113–7126.
10. Zhao, W.; Yan, L.; Chang, Y.; Gong, L. High-Resolution Remote Sensing Image Building Extraction Based on Markov Model. ISPRS Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. 2018, XLII-3, 2411–2418.
11. Zhu, Q.; Li, Z.; Zhang, Y.; Guan, Q. Building Extraction from High Spatial Resolution Remote Sensing Images via Multiscale-Aware and Segmentation-Prior Conditional Random Fields. Remote Sens. 2020, 12, 3983.
12. Blei, D.M. Introduction to Probabilistic Topic Models. Commun. ACM 2012, 55, 77–84.
13. Mao, T.; Tang, H.; Wu, J.; Jiang, W.; He, S.; Shu, Y. A Generalized Metaphor of Chinese Restaurant Franchise to Fusing Both Panchromatic and Multispectral Images for Unsupervised Classification. IEEE Trans. Geosci. Remote Sens. 2016, 54, 4594–4604.
14. Huang, X.; Zhang, L. Morphological Building/Shadow Index for Building Extraction from High-Resolution Imagery Over Urban Areas. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2012, 5, 161–172.
15. Li, S.; Tang, H.; Huang, X.; Mao, T.; Niu, X. Automated Detection of Buildings from Heterogeneous VHR Satellite Images for Rapid Response to Natural Disasters. Remote Sens. 2017, 9, 1177.
16. Shelhamer, E.; Long, J.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 640–651.
17. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of Medical Image Computing and Computer-Assisted Intervention (MICCAI 2015), Munich, Germany, 5–9 October 2015; pp. 234–241.
18. Pan, Z.; Xu, J.; Guo, Y.; Hu, Y.; Wang, G. Deep Learning Segmentation and Classification for Urban Village Using a Worldview Satellite Image Based on U-Net. Remote Sens. 2020, 12, 1574.
19. Li, C.; Fu, L.; Zhu, Q.; Zhu, J.; Fang, Z.; Xie, Y.; Guo, Y.; Gong, Y. Attention Enhanced U-Net for Building Extraction from Farmland Based on Google and WorldView-2 Remote Sensing Images. Remote Sens. 2021, 13, 4411.
20. Badrinarayanan, V.; Kendall, A.; Cipolla, R. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495.
21. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
22. Wang, H.; Miao, F. Building extraction from remote sensing images using deep residual U-Net. Eur. J. Remote Sens. 2022, 55, 71–85.
23. Sariturk, B.; Seker, D.Z. A Residual-Inception U-Net (RIU-Net) Approach and Comparisons with U-Shaped CNN and Transformer Models for Building Segmentation from High-Resolution Satellite Images. Sensors 2022, 22, 7624.
24. Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141.
25. Guo, M.; Liu, H.; Xu, Y.; Huang, Y. Building Extraction Based on U-Net with an Attention Block and Multiple Losses. Remote Sens. 2020, 12, 1400.
26. Hou, X.; Zhang, L. Saliency Detection: A Spectral Residual Approach. In Proceedings of the 2007 IEEE Conference on Computer Vision and Pattern Recognition, Minneapolis, MN, USA, 17–22 June 2007; pp. 1–8.
27. Li, S.; Fu, S.; Zheng, D. Rural Built-Up Area Extraction from Remote Sensing Images Using Spectral Residual Methods with Embedded Deep Neural Network. Sustainability 2022, 14, 1272.
28. Otsu, N. A Threshold Selection Method from Gray-Level Histograms. IEEE Trans. Syst. Man Cybern. 1979, 9, 62–66.
29. Ji, S.; Wei, S.; Lu, M. Fully Convolutional Networks for Multisource Building Extraction From an Open Aerial and Satellite Imagery Data Set. IEEE Trans. Geosci. Remote Sens. 2019, 57, 574–586.
30. Maggiori, E.; Tarabalka, Y.; Charpiat, G.; Alliez, P. Can Semantic Labeling Methods Generalize to Any City? The INRIA Aerial Image Labeling Benchmark. In Proceedings of the 2017 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Fort Worth, TX, USA, 23–28 July 2017; pp. 3226–3229.
Figure 1. Three components in the gCRF_U-Net method.
Figure 2. gCRF_U-Net.
Figure 3. U-Net architecture.
Figure 4. Experimental results. (a–g) Test images; (h–n) ground truth; (o–u) our results.
Figure 5. Results of VHR satellite images using different methods. (a) Satellite images; (b) gCRF; (c) gCRF_MBI; (d) MBI_gCRF; (e) ours. Green indicates correct detection of buildings, red indicates missed detection, blue indicates false detection, and gray indicates background.
Figure 6. Results of public building datasets using different methods. (a) Satellite images; (b) gCRF; (c) gCRF_MBI; (d) MBI_gCRF; (e) ours. Green indicates correct detection of buildings, red indicates missed detection, blue indicates false detection, and gray indicates background.
Table 1. Lexical correspondence between the gCRF_U-Net and image structures.

gCRF_U-Net              Image Structures
Customers               Superpixels
White supercustomers    Superpixels from the U-Net
Color supercustomers    Superpixels from the MS image
Tables                  Structures
Dishes                  Buildings or non-buildings
Restaurants             Built-up area candidates
Table 2. Quantitative evaluation (%) of different methods with VHR satellite images.

The First Image
Method        Precision   Recall    F         IoU
gCRF          54.12%      70.68%    61.30%    44.20%
gCRF_MBI      54.44%      70.76%    61.54%    44.44%
MBI_gCRF      55.91%      80.30%    65.92%    49.17%
gCRF_U-Net    74.96%      90.02%    81.80%    69.21%

The Second Image
Method        Precision   Recall    F         IoU
gCRF          41.29%      57.22%    47.97%    31.55%
gCRF_MBI      39.40%      52.15%    44.89%    28.94%
MBI_gCRF      42.12%      47.95%    44.85%    28.90%
gCRF_U-Net    80.49%      85.41%    82.88%    70.77%

The Third Image
Method        Precision   Recall    F         IoU
gCRF          37.04%      64.06%    46.94%    30.67%
gCRF_MBI      38.70%      57.06%    46.12%    29.97%
MBI_gCRF      37.08%      68.69%    48.16%    31.72%
gCRF_U-Net    77.15%      93.06%    84.36%    72.95%
Table 3. Quantitative evaluation (%) of different methods with public building datasets.

The First Image
Method        Precision   Recall    F         IoU
gCRF          67.60%      83.06%    74.52%    59.39%
gCRF_MBI      63.82%      72.90%    68.06%    51.56%
MBI_gCRF      44.86%      80.10%    57.51%    40.36%
gCRF_U-Net    79.72%      82.71%    81.19%    68.33%

The Second Image
Method        Precision   Recall    F         IoU
gCRF          57.43%      78.88%    66.47%    49.78%
gCRF_MBI      61.50%      80.84%    69.85%    53.67%
MBI_gCRF      40.32%      75.56%    52.58%    35.67%
gCRF_U-Net    64.93%      84.76%    73.53%    58.14%

The Third Image
Method        Precision   Recall    F         IoU
gCRF          66.97%      75.85%    71.14%    55.20%
gCRF_MBI      66.71%      75.80%    70.96%    55.00%
MBI_gCRF      43.20%      80.89%    56.32%    39.20%
gCRF_U-Net    81.58%      98.79%    89.36%    80.77%

The Fourth Image
Method        Precision   Recall    F         IoU
gCRF          35.46%      73.85%    47.91%    31.51%
gCRF_MBI      31.28%      61.22%    41.41%    26.11%
MBI_gCRF      36.83%      79.01%    50.24%    33.55%
gCRF_U-Net    73.67%      94.66%    82.86%    70.73%