Article

A Lightweight Siamese Neural Network for Building Change Detection Using Remote Sensing Images

1 College of Computer Science and Technology, Zhejiang University of Technology, Hangzhou 310024, China
2 Hangzhou Hikvision Digital Technology Co., Ltd., Hangzhou 310051, China
3 State Key Laboratory of Remote Sensing Science, Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100101, China
* Author to whom correspondence should be addressed.
Remote Sens. 2023, 15(4), 928; https://doi.org/10.3390/rs15040928
Submission received: 31 December 2022 / Revised: 30 January 2023 / Accepted: 30 January 2023 / Published: 8 February 2023
(This article belongs to the Section Remote Sensing Image Processing)

Abstract

Building change detection (BuCD) can offer fundamental data for applications such as urban planning and identifying illegally built new buildings. With the development of deep neural network-based approaches, BuCD using high-spatial-resolution remote sensing images (RSIs) has significantly advanced. These deep neural network-based methods, nevertheless, typically demand considerable computational resources. Moreover, there remains room to improve their accuracy. Hence, LightCDNet, a lightweight Siamese neural network for BuCD, is introduced in this paper. Specifically, LightCDNet comprises three components: a Siamese encoder, a multi-temporal feature fusion module (MultiTFFM), and a decoder. In the Siamese encoder, MobileNetV2 is chosen as the feature extractor to decrease computational costs. Afterward, the multi-temporal features from the two branches are concatenated level by level. Subsequently, multiscale features computed from higher levels are up-sampled and fused with the lower-level ones. In the decoder, deconvolutional layers are adopted to gradually recover the changed buildings. The proposed network, LightCDNet, was assessed using two public datasets: the LEVIR BuCD dataset (LEVIRCD) and the WHU BuCD dataset (WHUCD). The F1 scores of LightCDNet on the LEVIRCD and WHUCD datasets were 89.6% and 91.5%, respectively. The results of the comparative experiments demonstrate that LightCDNet outperforms several state-of-the-art methods in accuracy and efficiency.

Graphical Abstract

1. Introduction

Buildings are a particularly significant type of human-made feature on Earth and can be inventoried using remote sensing techniques. The characteristics of buildings change over time under the influence of human activities. Such alterations can also be captured from remotely sensed data, especially high-spatial-resolution RSIs. Thus, physical and social processes such as urban sprawl, urban heat islands, and disaster impacts [1,2,3,4] can be better understood. It is not surprising, then, that BuCD using RSIs has become a hot topic in remote sensing [2,5,6,7,8,9,10].
According to the method of feature extraction, BuCD methods using RSIs can be divided into two categories: traditional BuCD and deep-learning-based BuCD. Traditional BuCD approaches utilize hand-crafted features, such as textures and spectral features [11,12,13]. Conversely, deep-learning-based BuCD algorithms generally learn features from large amounts of high-spatial-resolution RSIs [14,15,16,17,18,19].
Traditional BuCD methods can be split into two categories according to their processing units, namely pixel-based and object-oriented BuCD. Pixel-based BuCD methods usually begin with a difference computation on multi-date RSIs, followed by change detectors. Simple differences, such as the absolute value of the spectral subtraction between multi-temporal images, are likely to lead to inconsistent results. To characterize the spectral difference of multi-date RSIs, various algorithms were devised, such as the computation of ratios, the Euclidean distance between multi-date RSIs, and the synthesis of several difference results [20]. Nevertheless, as the spatial resolution of RSIs increases, the spatial information in high-spatial-resolution RSIs is difficult to capture using spectral difference features alone. Hence, BuCD algorithms employ carefully designed features, for instance spectral–spatial integration and morphological operators, to collect more geometric and contextual information [12,21]. Afterward, the images containing varied features are further processed by feature-level or decision-level fusion to generate the final change map. Despite these efforts, the lack of contextual information affects the performance of traditional pixel-based BuCD methods on high-spatial-resolution RSIs.
In contrast to pixel-based BuCD, the processing units of object-oriented BuCD methods are image objects, which are usually generated by segmentation algorithms from high-spatial-resolution RSIs. Accordingly, the generation of image objects is fundamental in object-oriented BuCD algorithms. A common solution for object-oriented BuCD is to detect changed information after image segmentation. The input data used in segmentation algorithms typically comprise (1) bitemporal composite images [22,23,24], (2) auxiliary data, e.g., digital surface models [25], and (3) separate multi-date RSIs [26]. Approaches using the first two kinds of input data provide one segmentation result for the subsequent change-detection phase. Consequently, these approaches have difficulty identifying geometric changes in the multi-date RSIs. Approaches that utilize separate multi-date RSIs usually generate multi-temporal image objects according to the dates of the RSIs. Zhang et al. [26] reported that the change detection method using multi-temporal image objects from separate segmentations performed better than that using one segmentation result. Among BuCD algorithms employing multi-temporal image objects, co-segmentation with decision-level fusion performs better [26,27]. Overall, object-based BuCD can incorporate more spatial features into the algorithm compared with pixel-based BuCD. Nevertheless, the dependence on manually designed features remains in object-oriented BuCD.
With the burgeoning of deep-learning technology, deep neural networks (DNNs) have contributed much to various remote sensing applications, such as disaster detection/assessment, land cover/land use classification, and BuCD [7,28,29,30,31,32,33,34,35]. Specifically, the DNNs frequently used to extract features from RSIs are convolutional neural networks (CNNs). CNNs naturally compute multilevel features on account of their network structure. Among these multilevel features, the lower levels primarily describe the characteristics of edges and locations, while the higher-level ones amplify the semantics [36]. To efficiently carry out pixel-wise training and prediction, CNNs were extended into fully convolutional neural networks (FCNs), in which the fully connected layers are replaced by convolutional layers [37]. Most of the existing BuCD research takes advantage of the structure of FCNs, thus accomplishing pixel-wise prediction directly. From the perspective of network structures, early studies of BuCD took inspiration from tasks such as semantic segmentation. Unlike semantic segmentation, which usually takes a single-temporal image as input, the BuCD input data are multi-temporal RSIs. As a result, for BuCD, it is necessary to modify the existing networks to match the multi-temporal RSI input. For instance, Daudt et al. [38] presented the fully convolutional early fusion network (FCEFN). They adjusted the structure of U-Net [39] to FCEFN by using the input from the multi-date RSIs [38]. Additionally, Peng et al. [29] combined multi-date RSIs as one input in their improved U-Net++ architecture. They adopted the multiple side-output fusion strategy to combine multilevel features [29]. In summary, the input for the above networks is the concatenation of multi-date RSIs. Such architectures often lack the direct expression of differences between multi-date RSIs.
In contrast, networks with the Siamese architecture can process two input images with two separate branches [40,41]. This architecture naturally fits the multi-temporal input for BuCD. For instance, Daudt et al. [38] proposed the fully convolutional Siamese concatenation network (FCSiCN) and the fully convolutional Siamese difference network (FCSiDN). Both FCSiCN and FCSiDN are modified from U-Net. In the encoder, they both contain two streams with shared weights but differ in how to design the skip connection. To be specific, FCSiCN directly connects the encoding features from two branches, whereas FCSiDN joins the absolute value of their difference [38]. FCSiDN generally outperforms FCSiCN, according to [38]. In this type of network architecture, effectively extracting and combining features from multi-date RSIs is crucial. To achieve this aim, attention modules, feature pyramid modules, and transformers have all been included in recent BuCD studies [7,17,42]. For example, Zhang et al. [14] developed the image fusion network (IFN). In IFN, deep features were separately extracted by two branches from multi-temporal images. They were further fused with difference features by attention modules. Liu et al. [17] incorporated a dual attention module into the Siamese network for BuCD, thereby improving feature discrimination. Shen et al. [43] presented a semantic feature-constrained change detection network (SFCCDN). SFCCDN combined high-level and low-level features (LoLeFs) using a global channel attention module. Additionally, multiscale and multi-temporal features were fused using the attention mechanisms [43]. Along with attention modules, BuCD networks frequently embed the feature pyramid module. For example, Dong et al. [5] devised a multiscale context aggregation network (MCAN). To combine multiscale features, they integrated a feature pyramid module and a channel-spatial attention module in MCAN. Liu et al. [7] designed local and global feature extraction modules (LGFEM) in the BuCD network on the basis of U-Net. They combined the pyramid and attention modules in LGFEM to learn local and global representation [7]. In addition, transformers have been incorporated into BuCD networks to manage long-range feature dependencies. For instance, Feng et al. [42] introduced the intra-scale cross-interaction and inter-scale feature fusion network (ICIFNet). ICIFNet extracted local and global features simultaneously using CNNs and a transformer. Li et al. [44] presented a hybrid transformer model, TransUNetCD, which embedded a CNN-Transformer encoder to compute global context features.
Although the above approaches have made significant progress, issues remain in BuCD using high-spatial-resolution RSIs. First, because of sensor-related factors, such as varied viewing angles, and environmental variables, such as atmospheric conditions, existing methods produce spurious predictions of building changes. Second, because of the detailed and complex backgrounds in high-spatial-resolution RSIs, the boundaries of changed buildings may be mispredicted. Third, most prior DNNs for BuCD involve very deep convolutional layers and various feature extraction modules, thus demanding heavy computing resources.
Considering the above issues, in this paper, we propose LightCDNet, in which multilevel and multiscale features can be efficiently exploited and fused. The Siamese encoder, based on MobileNetV2 [45], a memory-efficient network, extracts multilevel features. Then, the MultiTFFM combines the two-branch multilevel features to gather localization and semantic information. Finally, the decoder receives the fused features to gradually recover the changed buildings. The main contributions of this study are as follows:
  • A lightweight Siamese network LightCDNet using RSIs is proposed for BuCD. The LightCDNet consists of a memory-efficient encoder to compute multilevel deep features. Correspondingly, it utilizes a decoder with deconvolutional layers to recover the changed buildings.
  • The MultiTFFM is designed to exploit multilevel and multiscale building change features. First, it fuses low-level and high-level features (HiLeFs) separately. Subsequently, the fused HiLeFs are processed by the atrous spatial pyramid pooling (ASPP) module to extract multiscale features. Lastly, the fused low-level features and the multiscale HiLeFs are concatenated to produce the final feature maps containing the localizations and semantics of changed buildings.

2. Materials and Methods

2.1. Datasets for BuCD

The proposed network, LightCDNet, was tested using two public BuCD datasets, LEVIRCD [46] and WHUCD [47].

2.1.1. LEVIRCD

The public BuCD dataset LEVIRCD comprises 0.5 m images from Google Earth. It covers several regions in Texas, United States of America, such as Austin and Lakeway. Additionally, the satellite images in the LEVIRCD dataset were taken between 2002 and 2018. Hence, seasonal variations and illumination differences can be observed in the LEVIRCD dataset, which may result in the diverse appearance of buildings. Additionally, various types of building changes that have occurred over time can be observed in the LEVIRCD dataset (Figure 1). More specifically, both newly constructed and demolished buildings can be seen in the multi-date images (Figure 1a–e). The LEVIRCD dataset also includes multi-temporal RSIs that display unchanged regions with different colors (Figure 1f). Furthermore, there are 637 multi-date image pairs with 1024 × 1024 pixels in the original LEVIRCD dataset [46]. The training, validation, and testing sets adhere to the settings in [46]. Moreover, the original 1024 × 1024 pixel multi-date RSIs were all cropped to 256 × 256 pixels (Table 1) because of the configuration of the graphics processing unit (GPU) employed.

2.1.2. WHUCD

The WHUCD dataset covers part of a region in New Zealand that was struck by a 6.3-magnitude earthquake in 2011. The dataset includes 0.2 m bitemporal aerial images acquired in 2012 and 2016. As a result, building variations, particularly reconstructions, can be discerned between the images collected in 2012 and those captured in 2016 (Figure 2). Specifically, there are 12,796 buildings in the 2012 images and 16,077 buildings in the 2016 images [47]. The majority of buildings in the WHUCD dataset are low-rise structures (Figure 2). In addition, the original bitemporal images provided by [47] are 32,507 pixels wide and 15,354 pixels high. Considering the GPU used, the 2012 and 2016 images were also cropped to 256 × 256 pixels, like those in the LEVIRCD dataset. Additionally, a 7:1:2 ratio of training, validation, and testing sets was created by randomly dividing the cropped bitemporal image pairs (Table 1).
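For concreteness, the following sketch shows one way the tiling and 7:1:2 split described above could be performed; the file paths, tile naming, and the use of Pillow are illustrative assumptions, not the authors' preprocessing code.

```python
# Sketch of the dataset preparation: tile an aligned bitemporal pair (and its change label)
# into 256 x 256 patches, then split the tiles 7:1:2 at random. Paths/filenames are hypothetical.
import random
from pathlib import Path

from PIL import Image

TILE = 256

def crop_pair(img_2012_path, img_2016_path, label_path, out_dir):
    """Cut the aligned bitemporal images and the change label into TILE x TILE patches."""
    imgs = [Image.open(p) for p in (img_2012_path, img_2016_path, label_path)]
    w, h = imgs[0].size
    out_dir = Path(out_dir)
    for name in ("t1", "t2", "label"):
        (out_dir / name).mkdir(parents=True, exist_ok=True)
    tiles = []
    for top in range(0, h - TILE + 1, TILE):
        for left in range(0, w - TILE + 1, TILE):
            box = (left, top, left + TILE, top + TILE)
            tile_id = f"{top}_{left}.png"
            for name, img in zip(("t1", "t2", "label"), imgs):
                img.crop(box).save(out_dir / name / tile_id)
            tiles.append(tile_id)
    return tiles

def split_7_1_2(tile_ids, seed=0):
    """Randomly assign tiles to training/validation/testing sets with a 7:1:2 ratio."""
    ids = list(tile_ids)
    random.Random(seed).shuffle(ids)
    n = len(ids)
    n_train, n_val = int(0.7 * n), int(0.1 * n)
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]
```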

2.2. Methods

The proposed LightCDNet consists of a lightweight Siamese encoder, a MultiTFFM, and a decoder that predicts the changed buildings (Figure 3). Specifically, LightCDNet utilizes a pair of RSIs collected on different dates, T1 and T2, as the input. Using two parallel branches, multilevel features are computed from bitemporal RSIs in the encoder (Figure 3a). Afterward, the MultiTFFM (Figure 3b) fuses multilevel features from two branches, depending on their level. LoLeFs FL1 and FL2 and HiLeFs FH1 and FH2 are taken as separate groups, and each is fed into the MultiTFFM for further processing. Finally, the decoder (Figure 3c) receives the combined feature maps from the MultiTFFM to reconstruct the changed buildings.
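As a reading aid, the minimal sketch below wires the three components together; the class names SiameseEncoder, MultiTFFM, and Decoder are placeholders for the sketches given in the following subsections, not the authors' released code.

```python
# Minimal wiring of the three components in Figure 3 (placeholder class names).
import torch.nn as nn

class LightCDNetSketch(nn.Module):
    def __init__(self, encoder, fusion, decoder):
        super().__init__()
        self.encoder = encoder   # Siamese encoder (Section 2.2.1)
        self.fusion = fusion     # MultiTFFM (Section 2.2.2)
        self.decoder = decoder   # decoder with deconvolutional layers (Section 2.2.3)

    def forward(self, img_t1, img_t2):
        # Two weight-shared passes over the bitemporal images T1 and T2.
        f_l1, f_h1 = self.encoder(img_t1)
        f_l2, f_h2 = self.encoder(img_t2)
        fused = self.fusion(f_l1, f_l2, f_h1, f_h2)   # F_MC at h/4 x w/4
        return self.decoder(fused)                     # score map and change map at h x w
```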

2.2.1. Siamese Encoder

On the one hand, the structure of the Siamese network corresponds precisely to the bitemporal RSI input of BuCD. As a result, we utilized two weight-shared branches in the encoder module. On the other hand, DNNs have demonstrated superiority in feature learning, usually at the cost of large amounts of computational resources [48,49,50]. To decrease the computing resources needed for BuCD while maintaining accuracy, we chose MobileNetV2 [45] as the backbone of the Siamese encoder. MobileNetV2 is designed for environments with limited resources, such as mobile phones. The fundamental component of MobileNetV2 is the inverted residual bottleneck (InvResB) [45]. Thanks to this basic structure, MobileNetV2 can outperform many state-of-the-art architectures [45]. Additionally, the InvResB module can significantly lower the memory occupancy required during inference, making it well suited for mobile applications [45]. Please see [45] for further details on MobileNetV2. It is worth noting that a pretrained MobileNetV2 (Appendix A) was utilized in the training process.
Suppose the input of the encoder is the bitemporal RSIs $I(T_1)$ and $I(T_2)$, with $I(T_1), I(T_2) \in \mathbb{R}^{h \times w \times b}$, where h, w, and b represent the height, width, and number of bands of the input RSIs, respectively. In other words, the bitemporal images $I(T_1)$ and $I(T_2)$ are fed into the dual-branch encoder separately (Figure 3a). Subsequently, $I(T_1)$ and $I(T_2)$ are, respectively, processed by a convolutional layer and 17 InvResB modules (Figure 3a). For BuCD feature extraction, dense feature maps are required, analogous to the task of semantic segmentation. For clarity, the output stride (OtS) is defined first. The OtS describes the ratio of the spatial resolution of the input images to that of the output feature maps. To obtain dense feature maps, the OtS is commonly set to 16 or 8 for semantic segmentation [45]. Setting the OtS to 16 is more efficient when using MobileNetV2 as the feature extractor for semantic segmentation [45]. Therefore, the OtS is set to 16 in the Siamese encoder of LightCDNet. According to this OtS setting, the first convolutional layer has a stride of 2. Additionally, the stride of the second, fourth, and seventh InvResB modules is set to 2, and that of the rest is set to 1 (Figure 3a).
To account for both localization and semantic information, we describe how the HiLeFs ($F_{H1}$ and $F_{H2}$) and the LoLeFs ($F_{L1}$ and $F_{L2}$) are selected. According to the application of MobileNetV2 to semantic segmentation [45], the penultimate feature maps are chosen as the HiLeFs to balance efficiency and accuracy. Hence, we utilized the penultimate layers of MobileNetV2 as the HiLeFs ($F_{H1}$ and $F_{H2}$) in the dual branches (Figure 3a), i.e., $F_{H1}, F_{H2} \in \mathbb{R}^{\frac{h}{16} \times \frac{w}{16} \times 320}$. LoLeFs, whose spatial size is four times that of the HiLeFs, can be used to capture precise localization information [51,52]. As a result, we chose the output of the third InvResB module of each branch as $F_{L1}$ and $F_{L2}$ (Figure 3a), i.e., $F_{L1}, F_{L2} \in \mathbb{R}^{\frac{h}{4} \times \frac{w}{4} \times 24}$.
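A minimal sketch of the weight-shared encoder is given below, assuming torchvision's MobileNetV2 layout, in which features[:4] ends at the third inverted residual bottleneck (24 channels at 1/4 resolution) and features[4:18] ends at the penultimate 320-channel stage. Note that torchvision's default strides give an output stride of 32 for the high-level features, whereas the paper adjusts the strides to reach an OtS of 16; the fusion sketch in the next subsection therefore up-samples to the low-level spatial size rather than by a fixed factor.

```python
# Sketch of the Siamese encoder built on torchvision's MobileNetV2 feature extractor.
import torch.nn as nn
import torchvision

class SiameseEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        # Pretrained ImageNet weights were used in the paper (see Appendix A).
        backbone = torchvision.models.mobilenet_v2().features
        self.low_stage = backbone[:4]     # up to the third InvResB: F_L, 24 channels at h/4 x w/4
        self.high_stage = backbone[4:18]  # up to the penultimate stage: F_H, 320 channels

    def forward(self, x):
        f_low = self.low_stage(x)
        f_high = self.high_stage(f_low)
        return f_low, f_high

# The same module is applied to both temporal images, so the two branches share weights:
# f_l1, f_h1 = encoder(img_t1)
# f_l2, f_h2 = encoder(img_t2)
```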

2.2.2. Multi-Temporal Feature Fusion Module

In the Siamese encoder (Figure 3a), features related to localization ($F_{L1}$ and $F_{L2}$) and semantics ($F_{H1}$ and $F_{H2}$) are extracted from the original bitemporal RSIs $I(T_1)$ and $I(T_2)$. The MultiTFFM fuses the two-stream features to improve the representation of building change information (Figure 3b). Initially, the multi-temporal LoLeFs, $F_{L1}$ and $F_{L2}$, and the HiLeFs, $F_{H1}$ and $F_{H2}$, are concatenated independently on the basis of their corresponding levels (Equations (1) and (2)):
$F_{LC} = \mathrm{Concat}(F_{L1}, F_{L2})$,  (1)
$F_{HC} = \mathrm{Concat}(F_{H1}, F_{H2})$,  (2)
where $F_{LC} \in \mathbb{R}^{\frac{h}{4} \times \frac{w}{4} \times 48}$ is the concatenated LoLeFs, $F_{HC} \in \mathbb{R}^{\frac{h}{16} \times \frac{w}{16} \times 640}$ is the concatenated HiLeFs, and Concat is the concatenation operation. Subsequently, two 3 × 3 convolutions are applied individually to $F_{HC}$ and $F_{LC}$ to halve the number of channels, thus reducing the amount of redundant information. Accordingly, denoting the intermediate outputs after these two 3 × 3 convolutions as $F_{HC1}$ and $F_{LC1}$, the channels of $F_{HC1}$ and $F_{LC1}$ are 320 and 24, respectively. Afterward, $F_{HC1}$ is processed by the ASPP module [53,54] to compute multiscale contextual features, yielding $F_{HASPP} \in \mathbb{R}^{\frac{h}{16} \times \frac{w}{16} \times 256}$. Inspired by [52], $F_{HASPP}$ is fourfold up-sampled and joined with the features that contain abundant localization information. To this end, a 1 × 1 convolution is applied to $F_{LC1}$ to balance its channel number against that of $F_{HASPP}$, obtaining $F_{LC2} \in \mathbb{R}^{\frac{h}{4} \times \frac{w}{4} \times 48}$. Finally, $F_{LC2}$ and the fourfold up-sampled $F_{HASPP}$ are concatenated into $F_{MC} \in \mathbb{R}^{\frac{h}{4} \times \frac{w}{4} \times 304}$.
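The sketch below reproduces the fusion steps above under the stated channel sizes (24 and 320 per branch); the ASPP here is a simplified stand-in with parallel dilated convolutions rather than the exact module of [53,54].

```python
# Sketch of the MultiTFFM: concatenate per level, halve channels, ASPP on high-level features,
# up-sample to the low-level size, and concatenate into the 304-channel F_MC.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleASPP(nn.Module):
    """Simplified ASPP stand-in: a 1x1 branch plus dilated 3x3 branches, projected to out_ch."""
    def __init__(self, in_ch, out_ch=256, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, 1) if r == 1 else
            nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r)
            for r in rates
        ])
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, 1)

    def forward(self, x):
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))

class MultiTFFM(nn.Module):
    def __init__(self, low_ch=24, high_ch=320):
        super().__init__()
        self.reduce_low = nn.Conv2d(2 * low_ch, low_ch, 3, padding=1)     # F_LC (48) -> F_LC1 (24)
        self.reduce_high = nn.Conv2d(2 * high_ch, high_ch, 3, padding=1)  # F_HC (640) -> F_HC1 (320)
        self.aspp = SimpleASPP(high_ch, 256)                              # F_HC1 -> F_HASPP (256)
        self.expand_low = nn.Conv2d(low_ch, 48, 1)                        # F_LC1 -> F_LC2 (48)

    def forward(self, f_l1, f_l2, f_h1, f_h2):
        f_lc1 = self.reduce_low(torch.cat([f_l1, f_l2], dim=1))
        f_hc1 = self.reduce_high(torch.cat([f_h1, f_h2], dim=1))
        f_haspp = self.aspp(f_hc1)
        f_haspp = F.interpolate(f_haspp, size=f_lc1.shape[-2:], mode='bilinear',
                                align_corners=False)                      # up-sample to h/4 x w/4
        return torch.cat([self.expand_low(f_lc1), f_haspp], dim=1)        # F_MC: 48 + 256 = 304
```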

2.2.3. Decoder

The decoder's purpose is to reconstruct the appearance of changed buildings from coarse to fine (Figure 3c). Initially, two 3 × 3 convolutions are performed on the fused features $F_{MC}$ from the MultiTFFM, yielding $F_{D1} \in \mathbb{R}^{\frac{h}{4} \times \frac{w}{4} \times 256}$. Subsequently, repeated deconvolutional layers [55] are employed to recover the feature maps from the spatial dimension of $F_{D1}$, $\frac{h}{4} \times \frac{w}{4}$, to that of the bitemporal RSIs $I(T_1)$ and $I(T_2)$, $h \times w$. A 1 × 1 convolution is then applied to compute the final score map, $F \in \mathbb{R}^{h \times w \times 2}$. Assuming that 1 and 0 represent the changed buildings and the background, respectively, the BuCD result $I_{cd}$ can be computed as follows (Equation (3)):
$I_{cd}^{i} = \mathop{\arg\max}_{c \in \{0,1\}} F^{i,c}, \quad 1 \le i \le h \times w$,  (3)
where $I_{cd}^{i}$ denotes the ith pixel of $I_{cd}$.
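A decoder sketch consistent with the description above is shown below; the exact number of deconvolutional layers and their channel widths beyond what the text states are assumptions.

```python
# Sketch of the decoder: two 3x3 convolutions, two stride-2 deconvolutions (x4 overall),
# a 1x1 convolution for the 2-channel score map, and a per-pixel argmax for the change map.
import torch.nn as nn

class Decoder(nn.Module):
    def __init__(self, in_ch=304, mid_ch=256, n_classes=2):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 3, padding=1), nn.ReLU(inplace=True),
        )
        # Two x2 deconvolutions recover the original spatial size from h/4 x w/4.
        self.upsample = nn.Sequential(
            nn.ConvTranspose2d(mid_ch, 128, kernel_size=4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.classify = nn.Conv2d(64, n_classes, 1)   # score map F with 2 channels

    def forward(self, f_mc):
        scores = self.classify(self.upsample(self.refine(f_mc)))
        change_map = scores.argmax(dim=1)             # 1 = changed building, 0 = background
        return scores, change_map
```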

2.2.4. Loss Function

In LightCDNet, BuCD is regarded as a binary classification problem. Hence, the cross-entropy (CE) loss, $L_{CE}$, is utilized (Equations (4) and (5)):
$L_{CE} = \frac{1}{HW} \sum_{h=1, w=1}^{H, W} L_{ce}\left(P_{h,w}, Y_{h,w}\right)$,  (4)
$L_{ce}\left(P_{h,w}, Y_{h,w}\right) = -\sum_{i=1}^{c} Y_{i}^{h,w} \log\left(P_{i}^{h,w}\right)$,  (5)
where H and W are the image height and width, respectively, c is the class number, which is 2 here, and Y is the ground truth. Additionally, P represents the output of the Sigmoid function applied to the score map F.
On the other hand, the altered buildings often only make up a small part of the RSIs. This imbalance problem may increase the difficulty of identifying changed buildings using the BuCD network. To tackle this problem, we selected the Dice loss [56], LDice (Equation (6)):
$L_{Dice} = 1 - \frac{2 \times Y \times \hat{Y}}{Y + \hat{Y}}$,  (6)
where $Y$ is the ground truth, and $\hat{Y}$ denotes the predicted BuCD result.
Therefore, the loss function used in LightCDNet, $L_{BuCD}$, is defined as follows (Equation (7)):
$L_{BuCD} = L_{CE} + L_{Dice}$.  (7)
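A compact sketch of the combined loss is given below; it uses a softmax over the two channels of the score map and a smoothing constant, which are implementation assumptions rather than details stated above.

```python
# Sketch of L_BuCD = L_CE + L_Dice for a 2-channel score map.
import torch
import torch.nn.functional as F

def bucd_loss(scores, target, eps=1e-6):
    """scores: (B, 2, H, W) raw score map F; target: (B, H, W) integer labels (0/1)."""
    ce = F.cross_entropy(scores, target)                       # L_CE over the two classes
    prob_changed = torch.softmax(scores, dim=1)[:, 1]          # probability of the change class
    target_f = target.float()
    intersection = (prob_changed * target_f).sum()
    dice = 1.0 - (2.0 * intersection + eps) / (prob_changed.sum() + target_f.sum() + eps)
    return ce + dice                                           # L_BuCD = L_CE + L_Dice
```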

2.2.5. Implementation Details

The code for the BuCD experiments was written in Python 3.6 and run on PyTorch v1.9. (Our code is available for potentially interested users; please see Appendix A for more information.) Additionally, we chose Adam [57] for stochastic optimization because of its low memory requirement. The learning rate (LR) was initially set to $10^{-4}$. When no improvement in the loss $L_{BuCD}$ was observed for 5 epochs, the LR was reduced by a factor of 0.9 during training. Furthermore, data augmentation strategies were utilized since they help prevent DNNs from overfitting. The data augmentation techniques include random flips (vertical or horizontal), random rotation (by 90, 180, or 270 degrees), random resizing (between 0.8 and 1.2), random cropping, random Gaussian blur, and random color jitter (brightness = 0.3, contrast = 0.3, saturation = 0.3, and hue = 0.3).
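The training setup described above can be sketched as follows, assuming the hypothetical modules from the previous subsections are in scope; the rotation, resizing, and cropping augmentations are omitted for brevity, and the (commented) training loop is illustrative only.

```python
# Sketch of the optimizer, LR schedule, and a subset of the augmentations listed above.
import torch
from torch.optim.lr_scheduler import ReduceLROnPlateau
from torchvision import transforms

model = LightCDNetSketch(SiameseEncoder(), MultiTFFM(), Decoder())   # hypothetical modules above
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = ReduceLROnPlateau(optimizer, mode='min', factor=0.9, patience=5)

# In practice the same geometric transform must be applied to both images and the label.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3, hue=0.3),
    transforms.GaussianBlur(kernel_size=3),
])

for epoch in range(100):                               # 100 epochs, batch size 8 in the paper
    epoch_loss = 0.0
    # for img_t1, img_t2, label in train_loader:       # hypothetical data loader
    #     scores, _ = model(img_t1, img_t2)
    #     loss = bucd_loss(scores, label)
    #     optimizer.zero_grad(); loss.backward(); optimizer.step()
    #     epoch_loss += loss.item()
    scheduler.step(epoch_loss)                         # LR x 0.9 after 5 epochs without improvement
```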
Moreover, our LightCDNet was trained on a computer with an Intel Xeon X5570 processor and one NVIDIA GeForce GTX 1080 Ti GPU. The batch size was set to 8 in consideration of the GPU memory limitation. In addition, the network was trained for 100 epochs. Information about the computation time is given in Section 4.3.

2.3. Accuracy Assessment

To evaluate the accuracy of LightCDNet, we utilized five indicators: Precision, Recall, F1 score (F1), overall accuracy (OA), and Intersection over Union (IoU). Assuming that TP, TN, FP, and FN stand for changed-building pixels correctly predicted as changed, background pixels correctly predicted as background, background pixels falsely predicted as changed buildings, and changed-building pixels falsely predicted as background, respectively, the five metrics can be computed as follows:
$Precision = \frac{TP}{TP + FP}$,  (8)
$Recall = \frac{TP}{TP + FN}$,  (9)
$F1 = \frac{2 \times Precision \times Recall}{Precision + Recall}$,  (10)
$OA = \frac{TP + TN}{TP + FN + FP + TN}$,  (11)
$IoU = \frac{Precision \times Recall}{Precision + Recall - Precision \times Recall}$.  (12)
Note that Precision quantifies the proportion of predicted building changes that are correct (Equation (8)), whereas Recall measures the proportion of actual building changes that are detected (Equation (9)). Compared with Precision and Recall, F1 (Equation (10)), OA (Equation (11)), and IoU (Equation (12)) are more comprehensive.
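For reference, the five indicators can be computed from binary prediction and ground-truth masks as sketched below (1 denotes changed buildings and 0 the background).

```python
# Sketch of the five accuracy indicators computed from a pixel-wise confusion matrix.
import numpy as np

def bucd_metrics(pred, gt):
    """pred, gt: binary arrays of the same shape; returns the five indicators as fractions."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    tn = np.logical_and(~pred, ~gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    oa = (tp + tn) / (tp + tn + fp + fn)
    iou = tp / (tp + fp + fn)   # equivalent to P*R / (P + R - P*R)
    return {'Precision': precision, 'Recall': recall, 'F1': f1, 'OA': oa, 'IoU': iou}
```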

2.4. Comparative Methods

To demonstrate the effectiveness of LightCDNet, four state-of-the-art BuCD approaches were chosen for comparison. These methods are briefly described below.
  • Cross-layer convolutional neural network (CLNet) [18]: CLNet was adapted from U-Net. The key component of CLNet is the cross-layer block (CLB), which can combine multilevel and multiscale features. Specifically, CLNet contains two CLBs in the encoder. It is worth noting that the input of CLNet is the concatenation of the multi-date RSIs, i.e., an image with six bands.
  • Deeply supervised attention metric-based network (DSAMNet) [16]: DSAMNet utilized two weight-sharing branches for feature extraction. Hence, DSAMNet can separately receive multi-date images. Moreover, DSAMNet included the change decision module (CDM) and deeply supervised module (DSM). CDM adopted attention modules to extract discriminative characteristics and produced output change maps. Additionally, the DSM can enhance the capacity of feature learning for DSAMNet.
  • Deeply supervised image fusion network (IFNet) [14]: IFNet adopted an encoder-decoder structure. In the encoder, IFNet utilized dual branches based on VGG16. The decoder consists of attention modules and deep supervision. Specifically, the attention modules were embedded to fuse raw convolutional and difference features.
  • ICIFNet [42]: ICIFNet, unlike the previous three, contained two asymmetric branches for feature extraction. Specifically, ResNet-18 [49] and PVT v2-B1 [58] are employed separately by two branches in ICIFNet. Four groups of features brimming with local and global information are produced from the dual branches. The attention mechanism was then applied to fuse multiscale features. Finally, the output was generated from the combination of three score maps.
To provide a fair comparison, data augmentation and hyperparameter settings utilized in the four comparative networks are consistent with the original sources and their published codes (Appendix A).

3. Results

The proposed LightCDNet and the comparative approaches, CLNet [18], DSAMNet [16], IFNet [14], and ICIFNet [42], were evaluated quantitatively and qualitatively using the LEVIRCD and WHUCD datasets.

3.1. Results on LEVIRCD Dataset

Six groups of bitemporal images were picked from the testing set of the LEVIRCD dataset to graphically demonstrate the results (Figure 4). The multi-date images (Rows Image 1 and Image 2 in Figure 4) display regions transformed from shrubs into buildings (Figure 4a,b) or from bare land into buildings (Figure 4c–f). In general, ICIFNet [42] and our LightCDNet performed better visually than CLNet [18], DSAMNet [16], and IFNet [14]. We observed that CLNet [18] failed to detect the newly built structures (Figure 4e,f). In addition, the newly constructed circular roads were falsely flagged by CLNet [18] (Figure 4a,b). Furthermore, we found that the unchanged land was falsely flagged by DSAMNet [16] and IFNet [14] because of the color variation (Figure 4c–e). Notably, DSAMNet [16] mistakenly identified most of the shadows as altered buildings. Moreover, by comparing the predicted maps of our LightCDNet and ICIFNet [42], we observed that our method produced fewer false detections and more complete buildings. It is also worth mentioning that none of the five networks, including ours, could identify the changed rooftop completely in the yellow boxes of Figure 4e. From Rows Image 1 and Image 2, Column (e) of Figure 4, we can see that trees covered the upper corner of the rooftop. Thus, the omission in the upper corner of the rooftops is understandable and consistent with the input images. However, the missed parts on the right indicate that the five approaches are all affected to some degree by differences in rooftop color.
Let us briefly recall that there are 10,192 multi-date image pairs with 256 × 256 pixels in the LEVIRCD dataset, including 2048 pairs in the test set (Table 1). The quantitative analysis of the LEVIRCD dataset (Table 2) demonstrates that our LightCDNet outperforms CLNet [18], DSAMNet [16], IFNet [14], and ICIFNet [42] in the overall metrics F1, IoU, and OA. ICIFNet [42] ranks second among the five methods; its F1, IoU, and OA are 2.8%, 4.4%, and 0.3% lower than those of our network, respectively. This is consistent with the observation in Figure 4. Additionally, we found that CLNet [18] had the lowest F1, IoU, and OA among the five approaches. These findings imply that networks with two branches receiving multi-date RSIs separately perform better than networks with one branch processing the combination of multi-temporal RSIs. Moreover, DSAMNet [16] has the highest Recall (90.6%) but the lowest Precision (80.0%). From the perspective of the overall evaluation indicators, our LightCDNet achieves 4.6%, 7.3%, and 0.6% improvements in F1, IoU, and OA, respectively, compared with DSAMNet [16].

3.2. Results on WHUCD Dataset

Six pairs of multi-temporal RSIs from the WHUCD dataset were chosen to display the final output maps of the five networks (Figure 5). As introduced in Section 2.1.2, we can see the reconstruction of various buildings from 2012 to 2016 (Rows Image 1 and Image 2 of Figure 5). For example, Figure 5a,d,f depict the ground's transformation from bare soil to buildings. Figure 5b illustrates that the land was converted from a mixture of shrubs, grass, and human-made facilities to buildings. From the predicted maps in Figure 5, we found that our LightCDNet produced more complete and regular rooftops than CLNet [18], DSAMNet [16], IFNet [14], and ICIFNet [42]. Additionally, conspicuous green regions were observed in the results of CLNet [18], IFNet [14], and ICIFNet [42] (Figure 5b,e,f), indicating that these models miss the new, brightly colored rooftops. Moreover, falsely flagged altered buildings (red regions) appear in the results of DSAMNet [16], mainly caused by shadows in the multi-temporal RSIs. This observation is consistent with the visual results on the LEVIRCD dataset. Furthermore, from Column (e) of Figure 5, we can notice that omissions of altered buildings (green regions) appear at the top and left of the images for all five approaches. This is probably because of the absence of contextual information in the original multi-temporal RSIs caused by the image cropping performed during dataset preparation.
From the quantitative evaluation of the WHUCD dataset (Table 3), we find that our method shows higher F1 (91.5%), IoU (84.3%), and OA (99.3%) compared with CLNet [18], DSAMNet [16], IFNet [14], and ICIFNet [42]. Additionally, CLNet [18] has the lowest F1 (72.7%), IoU (57.1%), and OA (98.0%) among the five approaches. These numerical results are consistent with the LEVIRCD dataset (Table 2). In addition, IFNet [14] ranks second among the five networks. It has the highest Precision (94.9%), but its Recall is relatively low (77.0%). Our LightCDNet obtained 6.5%, 10.3%, and 0.5% improvements in F1, IoU, and OA compared with IFNet [14]. The third place among the five approaches belongs to DSAMNet [16]. Although DSAMNet [16] gains the highest Recall (91.9%), it has the lowest Precision (79.0%). Additionally, its F1 is 6.5% lower than that of our LightCDNet. The above quantitative analysis is consistent with the visualization in Figure 5.

4. Discussion

4.1. Ablation Study for Siamese Encoder

To demonstrate the effectiveness of the lightweight feature extractor used in our LightCDNet, architectures using different backbones in the Siamese encoder were tested on the WHUCD dataset. In detail, we employed two more complex models, ResNet-50 [49] and Xception [59], as the backbones. The predicted maps (Figure 6) illustrate that our LightCDNet with MobileNetV2 as the encoder outperforms networks with ResNet-50 and Xception as the encoder, respectively. The quantitative results (Table 4) also confirm the above observation. In terms of F1, IoU, and OA, our network with MobileNetV2 is improved by 2.1%, 3.5%, and 0.2%, respectively, compared with the network using ResNet-50 as the backbone. Additionally, compared with the network with Xception as the backbone, LightCDNet exceeded it by 7.8%, 12.3%, and 0.6% based on F1, IoU, and OA. The above visual and numerical results imply that the designed encoder of our LightCDNet can effectively compute multilevel features for the following MultiTFFM and decoder.

4.2. Ablation Study for Decoder

The precise masks of targets of interest can be progressively recovered using a series of deconvolutional layers [55]. To verify the deconvolutional operations embedded in the decoder, the architecture employing bilinear up-sampling operations instead of deconvolution in the decoder was implemented on the WHUCD dataset. From the predicted maps (Figure 7), we can observe that our LightCDNet with the decoder using deconvolutional operations illustrates more precise varied building boundaries compared with the network with the decoder using the up-sampling strategy (red boxes in Figure 7). In addition, numerical evaluation (Table 5) further demonstrates that the decoder with deconvolutional operations performs better than that with up-sampling operations, achieving 1.4%, 2.3%, and 0.1% improvements based on F1, IoU, and OA. These findings indicate the applicability of the decoder in our LightCDNet.

4.3. Efficiency Test

In this section, we compare the computational efficiency of CLNet [18], DSAMNet [16], IFNet [14], ICIFNet [42], the networks used in the ablation experiments, and our LightCDNet. To be specific, four indicators, floating point operations (FLOPs), the number of parameters of the DNNs (NOPs), average training time (ATrT), and average testing time (ATeT), are computed for comparison. The ATrT and ATeT of DNNs depend heavily on the hyperparameter settings and the running environment. Therefore, the four indicators were all computed on the computer mentioned in Section 2.2.5. The hyperparameter settings followed the original papers mentioned in Section 2.4. It should be noted that, unlike the setting in Section 2.2.5, the batch size was set to 4 in all the efficiency tests because the network using an encoder with Xception [59] in the ablation study could not run on the above computer with a batch size of 8 due to the GPU memory limitation. Additionally, the four indicators were calculated on the WHUCD dataset. All networks were trained for five epochs, and the average time per epoch was taken as the ATrT. Furthermore, we predicted the testing set five times and took the average as the ATeT.
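As an illustration, NOPs and ATeT can be measured as sketched below for the hypothetical LightCDNetSketch model defined earlier; FLOPs counting requires an external profiler (e.g., thop or ptflops) and is omitted here.

```python
# Sketch of how NOPs (millions of parameters) and ATeT (seconds per image) can be measured.
import time
import torch

def count_parameters_m(model):
    """NOPs: number of trainable parameters, in millions."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

def average_test_time(model, n_images=100):
    """ATeT: average inference time per 256 x 256 bitemporal pair, in seconds."""
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    model = model.to(device).eval()
    img = torch.randn(1, 3, 256, 256, device=device)
    with torch.no_grad():
        model(img, img)                      # warm-up pass
        if device == 'cuda':
            torch.cuda.synchronize()
        start = time.time()
        for _ in range(n_images):
            model(img, img)
        if device == 'cuda':
            torch.cuda.synchronize()
    return (time.time() - start) / n_images

# Example (hypothetical model from the sketches above):
# model = LightCDNetSketch(SiameseEncoder(), MultiTFFM(), Decoder())
# print(count_parameters_m(model), average_test_time(model))
```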
Table 6 shows that the FLOPs, NOPs, ATrT, and ATeT of CLNet [18] were less than all the other methods. Nevertheless, CLNet [18] provided the lowest accuracy among these methods on the WHUCD dataset (Table 3, Table 4 and Table 5). Compared with our LightCDNet, F1 of CLNet [18] is 18.8% less (Table 3). In addition, the FLOPs and NOPs of the proposed LightCDNet are lower than those of DSAMNet [16], IFNet [14], and ICIFNet [42]. Moreover, we can observe that the lightweight encoder of LightCDNet dramatically reduces the FLOPs, NOPs, ATrT, and ATeT, proving the encoder’s efficiency. Lastly, the up-sampling strategy in the decoder reduces the FLOPs, NOPs, ATrT, and ATeT. However, the Precision, Recall, F1, IoU, and OA (%) of the network using the up-sampling strategy in the decoder decreased by 1.0%, 1.7%, 1.4%, 2.3%, and 0.1%, respectively, compared with those of LightCDNet (Table 5). These findings indicate that our LightCDNet considers both efficiency and accuracy for BuCD.

4.4. Perspectives

To improve the accuracy and reduce the computation cost of BuCD, we designed a lightweight Siamese neural network, LightCDNet. Results (Table 2, Table 3, Table 4, Table 5 and Table 6) show that LightCDNet reduces the model complexity and the computation cost while improving accuracy. This indicates that LightCDNet befits resource-limited environments.
In future research, we will extend our method in two aspects. First, we will directly integrate boundary information into our network to generate outlines of the changed buildings as vector files, because the vectors of altered buildings can be further used in geographic information systems. Second, we will extend our network from binary BuCD to multi-class change detection using high-spatial-resolution RSIs.

5. Conclusions

In this paper, a lightweight Siamese neural network, LightCDNet, is proposed for BuCD using high-spatial-resolution RSIs. In detail, a Siamese encoder, the MultiTFFM, and a decoder comprise the LightCDNet. Firstly, the multi-temporal images are processed by the Siamese encoder, with the lightweight MobileNetV2 separately extracting multilevel features. Subsequently, the multi-temporal features are fused progressively on the basis of corresponding levels in the MultiTFFM, thus producing multilevel and multiscale fusion features. Finally, the fused features are fed into the decoder to gradually recover the changed buildings using the deconvolutional layers. LightCDNet and comparative approaches were tested on the LEVIRCD and WHUCD datasets. The proposed LightCDNet demonstrated its superiority over comparative networks in terms of both accuracy and efficiency.

Author Contributions

Conceptualization, H.Y., Y.C. and W.W.; methodology, H.Y. and Y.C.; software, Y.C.; investigation, Y.C. and W.D.; writing—original draft preparation, H.Y.; writing—review and editing, H.Y., W.W. and S.P.; visualization, H.Y., Y.C., X.W. and Q.W.; supervision, W.W.; project administration, W.W., X.W. and Q.W.; funding acquisition, H.Y., S.P. and W.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant number 42001276; the National Key R&D Program of China, grant number 2018YFD1100301; the Zhejiang Provincial Natural Science Foundation of China, grant number LQ19D010006; the Strategic Priority Research Program of Chinese Academy of Sciences, grant number XDA 20030302.

Data Availability Statement

The LEVIRCD dataset can be found at https://justchenhao.github.io/LEVIR (accessed on 1 December 2022). Additionally, we downloaded the WHUCD dataset from http://gpcv.whu.edu.cn/data/building_dataset.html (accessed on 1 December 2022).

Acknowledgments

The authors thank the providers of the public BuCD datasets.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

BuCD: Building change detection
RSIs: Remote sensing images
MultiTFFM: Multi-temporal feature fusion module
LEVIRCD: LEVIR Building Change Detection dataset
WHUCD: WHU Building Change Detection dataset
DNNs: Deep neural networks
CNNs: Convolutional neural networks
FCNs: Fully convolutional neural networks
FCEFN: Fully convolutional early fusion network
FCSiCN: Fully convolutional Siamese concatenation network
FCSiDN: Fully convolutional Siamese difference network
IFN: Image fusion network
SFCCDN: Semantic feature-constrained change detection network
LoLeFs: Low-level features
MCAN: Multiscale context aggregation network
LGFEM: Local and global feature extraction modules
ICIFNet: Intra-scale cross-interaction and inter-scale feature fusion network
HiLeFs: High-level features
ASPP: Atrous spatial pyramid pooling
GPU: Graphics processing unit
ImgPs: Image pairs
InvResB: Inverted residual bottleneck
OtS: Output stride
CE: Cross entropy
LR: Learning rate
F1: F1 score
OA: Overall accuracy
IoU: Intersection over union
CLNet: Cross-layer convolutional neural network
CLB: Cross-layer block
DSAMNet: Deeply supervised attention metric-based network
CDM: Change decision module
DSM: Deeply supervised module
IFNet: Deeply supervised image fusion network
FLOPs: Floating point operations
NOPs: Number of parameters
ATrT: Average training time
ATeT: Average testing time

Appendix A

The code used in this article is available from the links below.
Additionally, the pretrained MobilenetV2 can be found at http://jeff95.me/models/mobilenet_v2-6a65762b.pth (accessed on 1 December 2022).

References

  1. de Alwis Pitts, D.A.; So, E. Enhanced change detection index for disaster response, recovery assessment and monitoring of buildings and critical facilities—A case study for Muzzaffarabad, Pakistan. Int. J. Appl. Earth Obs. Geoinf. 2017, 63, 167–177. [Google Scholar] [CrossRef]
  2. Wang, N.; Li, W.; Tao, R.; Du, Q. Graph-based block-level urban change detection using Sentinel-2 time series. Remote Sens. Environ. 2022, 274, 112993. [Google Scholar] [CrossRef]
  3. Jimenez-Sierra, D.A.; Quintero-Olaya, D.A.; Alvear-Muñoz, J.C.; Benítez-Restrepo, H.D.; Florez-Ospina, J.F.; Chanussot, J. Graph Learning Based on Signal Smoothness Representation for Homogeneous and Heterogeneous Change Detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–16. [Google Scholar] [CrossRef]
  4. Reba, M.; Seto, K.C. A systematic review and assessment of algorithms to detect, characterize, and monitor urban land change. Remote Sens. Environ. 2020, 242, 111739. [Google Scholar] [CrossRef]
  5. Dong, J.; Zhao, W.; Wang, S. Multiscale Context Aggregation Network for Building Change Detection Using High Resolution Remote Sensing Images. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5. [Google Scholar] [CrossRef]
  6. Lei, J.; Gu, Y.; Xie, W.; Li, Y.; Du, Q. Boundary Extraction Constrained Siamese Network for Remote Sensing Image Change Detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–13. [Google Scholar] [CrossRef]
  7. Liu, T.; Gong, M.; Lu, D.; Zhang, Q.; Zheng, H.; Jiang, F.; Zhang, M. Building Change Detection for VHR Remote Sensing Images via Local–Global Pyramid Network and Cross-Task Transfer Learning Strategy. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–17. [Google Scholar] [CrossRef]
  8. Xia, H.; Tian, Y.; Zhang, L.; Li, S. A Deep Siamese PostClassification Fusion Network for Semantic Change Detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–16. [Google Scholar] [CrossRef]
  9. Chen, H.; Qi, Z.; Shi, Z. Remote Sensing Image Change Detection with Transformers. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–14. [Google Scholar] [CrossRef]
  10. Shi, W.; Zhang, M.; Zhang, R.; Chen, S.; Zhan, Z. Change Detection Based on Artificial Intelligence: State-of-the-Art and Challenges. Remote Sens. 2020, 12, 1688. [Google Scholar] [CrossRef]
  11. Xiao, P.; Zhang, X.; Wang, D.; Yuan, M.; Feng, X.; Kelly, M. Change detection of built-up land: A framework of combining pixel-based detection and object-based recognition. ISPRS J. Photogramm. Remote Sens. 2016, 119, 402–414. [Google Scholar] [CrossRef]
  12. Huang, X.; Zhang, L.; Zhu, T. Building Change Detection from Multitemporal High-Resolution Remotely Sensed Images Based on a Morphological Building Index. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2014, 7, 105–115. [Google Scholar] [CrossRef]
  13. Sofina, N.; Ehlers, M. Building Change Detection Using High Resolution Remotely Sensed Data and GIS. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2016, 9, 3430–3438. [Google Scholar] [CrossRef]
  14. Zhang, C.; Yue, P.; Tapete, D.; Jiang, L.; Shangguan, B.; Huang, L.; Liu, G. A deeply supervised image fusion network for change detection in high resolution bi-temporal remote sensing images. ISPRS J. Photogramm. Remote Sens. 2020, 166, 183–200. [Google Scholar] [CrossRef]
  15. Hu, M.; Lu, M.; Ji, S. Cascaded Deep Neural Networks for Predicting Biases between Building Polygons in Vector Maps and New Remote Sensing Images. In Proceedings of the 2021 IEEE International Geoscience and Remote Sensing Symposium, Brussels, Belgium, 11–16 July 2021; pp. 4051–4054. [Google Scholar]
  16. Liu, M.; Shi, Q. DSAMNet: A Deeply Supervised Attention Metric Based Network for Change Detection of High-Resolution Images. In Proceedings of the 2021 IEEE International Geoscience and Remote Sensing Symposium, Brussels, Belgium, 11–16 July 2021; pp. 6159–6162. [Google Scholar]
  17. Liu, Y.; Pang, C.; Zhan, Z.; Zhang, X.; Yang, X. Building Change Detection for Remote Sensing Images Using a Dual-Task Constrained Deep Siamese Convolutional Network Model. IEEE Geosci. Remote Sens. Lett. 2021, 18, 811–815. [Google Scholar] [CrossRef]
  18. Zheng, Z.; Wan, Y.; Zhang, Y.; Xiang, S.; Peng, D.; Zhang, B. CLNet: Cross-layer convolutional neural network for change detection in optical remote sensing imagery. ISPRS J. Photogramm. Remote Sens. 2021, 175, 247–267. [Google Scholar] [CrossRef]
  19. Chen, P.; Zhang, B.; Hong, D.; Chen, Z.; Yang, X.; Li, B. FCCDN: Feature constraint network for VHR image change detection. ISPRS J. Photogramm. Remote Sens. 2022, 187, 101–119. [Google Scholar] [CrossRef]
  20. Du, P.; Liu, S.; Gamba, P.; Tan, K.; Xia, J. Fusion of Difference Images for Change Detection Over Urban Areas. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2012, 5, 1076–1086. [Google Scholar] [CrossRef]
  21. Liu, S.; Du, Q.; Tong, X.; Samat, A.; Bruzzone, L.; Bovolo, F. Multiscale Morphological Compressed Change Vector Analysis for Unsupervised Multiple Change Detection. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2017, 10, 4124–4137. [Google Scholar] [CrossRef]
  22. Desclée, B.; Bogaert, P.; Defourny, P. Forest change detection by statistical object-based method. Remote Sens. Environ. 2006, 102, 1–11. [Google Scholar] [CrossRef]
  23. Stow, D.; Hamada, Y.; Coulter, L.; Anguelova, Z. Monitoring shrubland habitat changes through object-based change identification with airborne multispectral imagery. Remote Sens. Environ. 2008, 112, 1051–1061. [Google Scholar] [CrossRef]
  24. Bouziani, M.; Goïta, K.; He, D.-C. Automatic change detection of buildings in urban environment from very high spatial resolution images using existing geodatabase and prior knowledge. ISPRS J. Photogramm. Remote Sens. 2010, 65, 143–153. [Google Scholar] [CrossRef]
  25. Leichtle, T.; Geiß, C.; Wurm, M.; Lakes, T.; Taubenböck, H. Unsupervised change detection in VHR remote sensing imagery—An object-based clustering approach in a dynamic urban environment. Int. J. Appl. Earth Obs. Geoinf. 2017, 54, 15–27. [Google Scholar] [CrossRef]
  26. Zhang, X.; Xiao, P.; Feng, X.; Yuan, M. Separate segmentation of multi-temporal high-resolution remote sensing images for object-based change detection in urban area. Remote Sens. Environ. 2017, 201, 243–255. [Google Scholar] [CrossRef]
  27. Xiao, P.; Yuan, M.; Zhang, X.; Feng, X.; Guo, Y. Cosegmentation for Object-Based Building Change Detection from High-Resolution Remotely Sensed Images. IEEE Trans. Geosci. Remote Sens. 2017, 55, 1587–1603. [Google Scholar] [CrossRef]
  28. Ji, S.; Shen, Y.; Lu, M.; Zhang, Y. Building Instance Change Detection from Large-Scale Aerial Images using Convolutional Neural Networks and Simulated Samples. Remote Sens. 2019, 11, 1343. [Google Scholar] [CrossRef]
  29. Peng, D.; Zhang, Y.; Guan, H. End-to-End Change Detection for High Resolution Satellite Images Using Improved UNet++. Remote Sens. 2019, 11, 1382. [Google Scholar] [CrossRef]
  30. Huang, X.; Cao, Y.; Li, J. An automatic change detection method for monitoring newly constructed building areas using time-series multi-view high-resolution optical satellite images. Remote Sens. Environ. 2020, 244, 111802. [Google Scholar] [CrossRef]
  31. Chen, H.; Li, W.; Shi, Z. Adversarial Instance Augmentation for Building Change Detection in Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–16. [Google Scholar] [CrossRef]
  32. Zheng, Z.; Zhong, Y.; Tian, S.; Ma, A.; Zhang, L. ChangeMask: Deep multi-task encoder-transformer-decoder architecture for semantic change detection. ISPRS J. Photogramm. Remote Sens. 2022, 183, 228–239. [Google Scholar] [CrossRef]
  33. Bai, B.; Fu, W.; Lu, T.; Li, S. Edge-Guided Recurrent Convolutional Neural Network for Multitemporal Remote Sensing Image Building Change Detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–13. [Google Scholar] [CrossRef]
  34. Yu, B.; Chen, F.; Xu, C. Landslide detection based on contour-based deep learning framework in case of national scale of Nepal in 2015. Comput. Geosci. 2020, 135, 104388. [Google Scholar] [CrossRef]
  35. Anniballe, R.; Noto, F.; Scalia, T.; Bignami, C.; Stramondo, S.; Chini, M.; Pierdicca, N. Earthquake damage mapping: An overall assessment of ground surveys and VHR image change detection after L’Aquila 2009 earthquake. Remote Sens. Environ. 2018, 210, 166–178. [Google Scholar] [CrossRef]
  36. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444. [Google Scholar] [CrossRef]
  37. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  38. Daudt, R.C.; Saux, B.L.; Boulch, A. Fully Convolutional Siamese Networks for Change Detection. In Proceedings of the 2018 25th IEEE International Conference on Image Processing (ICIP), Athens, Greece, 7–10 October 2018; pp. 4063–4067. [Google Scholar]
  39. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015, Munich, Germany, 5–9 October 2015; pp. 234–241. [Google Scholar]
  40. Chopra, S.; Hadsell, R.; LeCun, Y. Learning a similarity metric discriminatively, with application to face verification. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–25 June 2005; pp. 539–546. [Google Scholar]
  41. Bromley, J.; Guyon, I.; LeCun, Y.; Säckinger, E.; Shah, R. Signature verification using a “Siamese” time delay neural network. In Proceedings of the 6th International Conference on Neural Information Processing Systems, Denver, CO, USA, 29 November—2 December 1993; pp. 737–744. [Google Scholar]
  42. Feng, Y.; Xu, H.; Jiang, J.; Liu, H.; Zheng, J. ICIF-Net: Intra-Scale Cross-Interaction and Inter-Scale Feature Fusion Network for Bitemporal Remote Sensing Images Change Detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–13. [Google Scholar] [CrossRef]
  43. Shen, Q.; Huang, J.; Wang, M.; Tao, S.; Yang, R.; Zhang, X. Semantic feature-constrained multitask siamese network for building change detection in high-spatial-resolution remote sensing imagery. ISPRS J. Photogramm. Remote Sens. 2022, 189, 78–94. [Google Scholar] [CrossRef]
  44. Li, Q.; Zhong, R.; Du, X.; Du, Y. TransUNetCD: A Hybrid Transformer Network for Change Detection in Optical Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–19. [Google Scholar] [CrossRef]
  45. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4510–4520. [Google Scholar]
  46. Chen, H.; Shi, Z. A Spatial-Temporal Attention-Based Method and a New Dataset for Remote Sensing Image Change Detection. Remote Sens. 2020, 12, 1662. [Google Scholar] [CrossRef]
  47. Ji, S.; Wei, S.; Lu, M. Fully Convolutional Networks for Multisource Building Extraction from an Open Aerial and Satellite Imagery Data Set. IEEE Trans. Geosci. Remote Sens. 2019, 57, 574–586. [Google Scholar] [CrossRef]
  48. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid Scene Parsing Network. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6230–6239. [Google Scholar]
  49. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  50. He, K.; Zhang, X.; Ren, S.; Sun, J. Identity Mappings in Deep Residual Networks. In Proceedings of the European Conference on Computer Vision 2016, Amsterdam, The Netherlands, 11–14 October 2016; pp. 630–645. [Google Scholar]
  51. Hariharan, B.; Arbeláez, P.; Girshick, R.; Malik, J. Hypercolumns for object segmentation and fine-grained localization. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 447–456. [Google Scholar]
  52. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In Proceedings of the European Conference on Computer Vision 2018, Munich, Germany, 8–14 September 2018; pp. 833–851. [Google Scholar]
  53. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 834–848. [Google Scholar] [CrossRef]
  54. Chen, L.C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking atrous convolution for semantic image segmentation. arXiv 2017, arXiv:1706.05587. [Google Scholar]
  55. Noh, H.; Hong, S.; Han, B. Learning Deconvolution Network for Semantic Segmentation. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1520–1528. [Google Scholar]
  56. Milletari, F.; Navab, N.; Ahmadi, S.A. V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation. In Proceedings of the 2016 Fourth International Conference on 3D Vision (3DV), Stanford, CA, USA, 25–28 October 2016; pp. 565–571. [Google Scholar]
  57. Kingma, D.P.; Ba, L.J. Adam: A Method for Stochastic Optimization. In Proceedings of the International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
  58. Wang, W.; Xie, E.; Li, X.; Fan, D.-P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. PVT v2: Improved baselines with Pyramid Vision Transformer. Comput. Vis. Media 2022, 8, 415–424. [Google Scholar] [CrossRef]
  59. Chollet, F. Xception: Deep Learning with Depthwise Separable Convolutions. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1800–1807. [Google Scholar]
Figure 1. Examples of multi-date images from the LEVIRCD dataset. Newly constructed buildings are shown in (a–d). A demolished building is shown in (e). (f) displays unchanged regions with different colors.
Figure 2. Illustrations of bitemporal aerial images from the WHUCD dataset.
Figure 3. The architecture of the proposed lightweight Siamese network. (a–c) show the Siamese encoder, the multi-temporal feature fusion module, and the decoder, respectively. 's' denotes the stride, 'k' denotes the kernel size, 'Concat' is the concatenation operation, 'Conv' denotes convolution, and 'DeConv' stands for deconvolution. The term 'ASPP' refers to the atrous spatial pyramid pooling. 'InvResB' represents the inverted residual bottleneck.
Figure 4. Illustrations of BuCD results on the LEVIRCD dataset. Six pairs of multi-temporal images are shown in (a–f). White represents correctly identified changed buildings. Additionally, red and green denote the false and missed predictions of changed buildings, respectively. The yellow boxes in (e) show that the five networks could not identify the changed rooftop completely.
Figure 5. Illustrations of BuCD results on the WHUCD dataset. Six pairs of multi-temporal images are shown in (a–f). White represents correctly identified changed buildings. Additionally, red and green denote the false and missed predictions of changed buildings, respectively.
Figure 6. Illustrations of BuCD results on the WHUCD dataset using different backbones in the Siamese encoder. Six pairs of bitemporal images are shown in (a–f). White represents correctly identified changed buildings. Additionally, red and green denote the false and missed predictions of changed buildings, respectively.
Figure 7. Illustrations of BuCD results for networks with different decoders on the WHUCD dataset. The red boxes show that the network with the decoder using deconvolutional operations yields more precise boundaries compared with the one using the up-sampling strategy.
Table 1. An overview of multi-date image pairs (ImgPs) with 256 × 256 pixels in the LEVIRCD and WHUCD datasets.

Dataset | Number of Training ImgPs | Number of Validation ImgPs | Number of Testing ImgPs
LEVIRCD | 7120 | 1024 | 2048
WHUCD | 5376 | 768 | 1536
Table 2. Quantitative analysis of the LEVIRCD dataset. The highest metrics are in bold.

Methods | Precision (%) | Recall (%) | F1 (%) | IoU (%) | OA (%)
CLNet | 83.2 | 75.4 | 79.1 | 65.4 | 98.0
DSAMNet | 80.0 | 90.6 | 85.0 | 73.9 | 98.4
IFNet | 84.7 | 85.5 | 85.1 | 74.0 | 98.5
ICIFNet | 89.6 | 84.3 | 86.8 | 76.8 | 98.7
Ours | 91.3 | 88.0 | 89.6 | 81.2 | 99.0
Table 3. Quantitative analysis of the WHUCD dataset. The highest metrics are in bold.

Methods | Precision (%) | Recall (%) | F1 (%) | IoU (%) | OA (%)
CLNet | 88.8 | 61.6 | 72.7 | 57.1 | 98.0
DSAMNet | 79.0 | 91.9 | 85.0 | 73.9 | 98.6
IFNet | 94.9 | 77.0 | 85.0 | 74.0 | 98.8
ICIFNet | 88.0 | 74.8 | 80.9 | 67.9 | 98.5
Ours | 92.0 | 91.0 | 91.5 | 84.3 | 99.3
Table 4. Quantitative analysis of LightCDNet with different settings in the Siamese encoder on the WHUCD dataset. The highest numbers are in bold.

Siamese Encoder | Precision (%) | Recall (%) | F1 (%) | IoU (%) | OA (%)
ResNet-50 | 90.2 | 88.6 | 89.4 | 80.8 | 99.1
Xception | 89.5 | 78.7 | 83.7 | 72.0 | 98.7
Ours (MobileNetV2) | 92.0 | 91.0 | 91.5 | 84.3 | 99.3
Table 5. Quantitative evaluation of networks with different settings in the decoder on the WHUCD dataset. The highest numbers are in bold.

Decoder | Precision (%) | Recall (%) | F1 (%) | IoU (%) | OA (%)
Ours (Deconvolution) | 92.0 | 91.0 | 91.5 | 84.3 | 99.3
Upsampling | 91.0 | 89.3 | 90.1 | 82.0 | 99.2
Table 6. Comparison of the complexity of networks and computation cost on the WHUCD dataset.

Methods | FLOPs (M) | NOPs (M) | ATeT (s/Image) | ATrT (s/Epoch)
CLNet | 16.202 | 8.103 | 0.031 | 204.12
DSAMNet | 301,161.792 | 16.951 | 0.045 | 842.20
IFNet | 329,055.161 | 50.442 | 0.060 | 604.36
ICIFNet | 101,474.886 | 23.828 | 0.129 | 1041.75
Siamese Encoder (ResNet50) | 283,461.036 | 155.725 | 0.077 | 738.29
Siamese Encoder (Xception) | 289,171.335 | 168.750 | 0.118 | 725.05
Decoder (Upsampling) | 48,801.779 | 10.405 | 0.052 | 294.18
Ours | 85,410.156 | 10.754 | 0.056 | 303.57