1. Introduction
Remote sensing image change detection (CD) is the process of obtaining semantic change information, such as changes in vegetation and buildings, by analyzing multitemporal remote sensing images taken of the same location at different times. Lately, owing to the increasing availability of high-resolution remote sensing images, CD has been broadly employed for disaster monitoring [1,2], where CD can delineate the scope of the damage so that rescue and relief personnel can be reasonably arranged and dispatched; urban expansion [3], where CD can identify alterations in and demolitions of urban buildings and detect unauthorized buildings; forest and vegetation change [4], where CD can effectively identify the growth and change areas of forests and vegetation; and many other applications. These uses have attracted more and more scholars to this task and have produced a large body of work. The process of CD is shown in Figure 1.
Deep learning also has a very promising future in the domain of CD. Owing to the excellent feature extraction ability of convolutional neural networks (CNNs), some early CD algorithms used CNNs to extract bitemporal features [5,6,7,8,9,10,11,12]. Zhan et al. [6] established a dual-attention convolutional siamese network, which processed the two input images with shared weights and first introduced the siamese construction, composed of two identical structures, into the CD task. However, its loss function only alleviated the data imbalance and did not effectively solve the problems of pseudo-change and the difficult detection of boundary regions. Daudt et al. [7] first proposed encoder-decoder-based fully convolutional neural networks (FCNNs), which replaced the fully connected layer with a convolutional layer and could accept inputs of arbitrary size. As none of these methods could obtain the global information of the images, due to the local limitations of traditional convolutional feature extraction, some subsequent works made improvements in this regard. Peng et al. [8] put forward the UNet++ Multiple Side-Outputs Fusion network based on UNet++ [13], which fused deep supervision and dense connection mechanisms to optimize the edge details of change regions. In addition, UNet++ consists of UNets of different depths, providing improved segmentation performance for objects of different sizes. Yet, this method ignored the effect of season, light, etc., on change detection. Chen and Shi [10] proposed a pyramid spatial–temporal attention module that modeled the spatial–temporal relationships during the feature extraction phase and captured multiscale spatial–temporal relationships to extract more discriminative features. In [11], a deeply supervised image fusion algorithm was presented to optimize the boundary integrity and the compactness inside the target by merging multilayer depth features and differential image features. In [12], channel and spatial attention were used when processing the image at each moment to obtain more discriminative features. Zhang et al. [9] utilized dilated convolution to enlarge the receptive field, where the dilated convolution fills the conventional convolution kernel with zeros according to the dilation rate. Fang et al. [14] adopted UNet++, where features of different levels were closely interconnected in a bottom-up manner to yield fine-grained change maps. These methods somewhat mitigated the disadvantages of traditional convolution, but they still could not fully extract global information, accurately identify large-scale objects, or adequately acquire the correlations between a given surface object and the rest of the objects in the entire image.
The Transformer [15] has gradually been applied in the domain of computer vision [16,17,18,19] due to its superior ability to capture long-range dependencies. Similarly, for the purpose of overcoming the limitations of CNNs mentioned above, the Transformer has achieved considerable success in CD tasks [20,21]. Chen et al. [20] presented a method in which the Transformer was first brought into the CD task to enhance the extraction of spatial–temporal contextual information through a Transformer module, and Bandara et al. [21] proposed a transformer-based siamese network architecture (abbreviated as ChangeFormer) for CD that united a hierarchical Transformer encoder, which generates ConvNet-like multilevel features, with a multilayer perceptron (MLP) decoder to effectively extract multiscale long-range relationships. However, these algorithms lacked the capture of local information, and their highly compressed semantic features led to the loss of information such as contours.
Based on the above problems, we recognize that both local and global information are important and that the extraction of multiscale features is also urgently needed. SPNet [22] put forward a feature enhancement and fusion module to fully explore the feature interaction between multimodal information and strengthen the feature communication between different scales, which has made good progress in salient object detection, the task of detecting the most salient object in an image.
Inspired by SPNet [22], we decided to extend this structure to the field of CD and designed a New Fusion network with a Dual-branch Encoder and Triple-branch Decoder (DETDNet). DETDNet adopts an encoder-decoder architecture. The encoder is a dual-branch siamese structure, and the bitemporal image features are fused by a concise but effective module consisting of concatenation and a convolution operation (CAC). Moreover, the decoder uses a triple-branch structure, employing a refined Receptive Field Block (RFB), improved from [23,24], to extract the multiscale contextual characteristics of the three branches, denoted as the multiscale feature extraction (MFE) module, and fusing the features of each layer in the middle branch with those of the next layer via the triple-branch aggregation (TA) module. Finally, the resolutions of the three change maps are recovered to be in accordance with the raw images after upsampling, and these three change maps are fused to build the final change map.
The key contributions of this article are threefold:
(1) DETDNet is proposed, in which the encoder is a dual-branch structure that captures the local features of the images, and, for the first time, a triple-branch structure is used in the decoder to obtain multiscale features via the MFE.
(2) We use different feature fusion methods for the encoder and decoder, respectively. The encoder applies the CAC to fuse the bitemporal images taken of the same location at different times, and the decoder uses the TA module to fuse the triple-branch features. Furthermore, a cascade operation is adopted to fuse the features from the same stage in the encoder and decoder.
(3) Experiments implemented on three publicly available datasets demonstrate that our approach exceeds some recent approaches in terms of the F1 score, IoU, and OA.
The rest of this paper is structured as follows. Section 2 lists related works. Section 3 shows the whole DETDNet structure and its details. Section 4 discusses the experiments conducted to provide evidence of the superiority of our approach, and Section 5 draws together the work of this paper.
4. Experiments
To confirm the superiority of this method, we executed experiments on the BCDD [32], LEVIR-CD [10], and SYSU-CD [33] datasets, and a sequence of comparative experiments was designed to compare this model with some classical models from recent years. For fairness, the experimental settings followed those of the original articles.
4.1. Datasets
The BCDD dataset was collected in New Zealand and covers Christchurch. It contains two high-resolution remote sensing images with a registration error of 1.6 pixels. The imaging dates were 2012 and 2016, the resolution is 0.3 m/pixel, and the size is 32,507 × 15,354 pixels. To make training more convenient, we cropped the images into non-overlapping 256 × 256 image pairs, for a total of 7434 pairs, and divided them randomly in a ratio of 8:1:1 into a training set, a validation set, and a testing set.
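The non-overlapping 256 × 256 cropping and the random 8:1:1 split described above can be sketched as follows; file handling is omitted, and the function names are our own, not the paper's preprocessing code:

```python
import random

def tile_grid(height, width, tile=256):
    """Top-left corners of all non-overlapping tile x tile crops
    that fit entirely inside a height x width image."""
    return [(r, c)
            for r in range(0, height - tile + 1, tile)
            for c in range(0, width - tile + 1, tile)]

def split_8_1_1(items, seed=0):
    """Shuffle reproducibly, then split 80% / 10% / 10%."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = int(len(items) * 0.8)
    n_val = int(len(items) * 0.1)
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

# BCDD: 32,507 x 15,354 px -> 126 x 59 full tiles = 7434 pairs
corners = tile_grid(height=15354, width=32507)
print(len(corners))  # 7434
train, val, test = split_8_1_1(corners)
```

Note that the grid yields exactly the 7434 pairs stated above: 32,507 // 256 = 126 columns and 15,354 // 256 = 59 rows of full tiles, with the remainder at the right and bottom edges discarded.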
The LEVIR-CD dataset originated from the Beihang LEVIR team, and the imaging locations were 20 different areas in several cities in Texas, USA. The imaging time varied from 2002 to 2018. Over 31,000 individual instances of change were fully labeled in 637 image pairs of 1024 × 1024 pixels at a resolution of 0.5 m, among which changes in land use, such as urban expansion, were the most significant. For the convenience of training, we cropped the images into small non-overlapping blocks of 256 × 256 pixels, and the dataset was randomly partitioned into 7120 image pairs for the training set, 2048 image pairs for the validation set, and 1024 image pairs for the testing set.
The SYSU-CD dataset includes 20,000 pairs of 0.5 m aerial images collected in Hong Kong in 2007 and 2014. The primary change types in the dataset comprised suburban sprawl, new urban construction, pre-construction groundwork, road expansion, vegetation changes, and marine construction. In this experiment, we partitioned the whole dataset into a training set, a validation set, and a testing set in the proportion of 6:2:2.
Table 3 shows the sizes of the three datasets used.
4.2. Implementation Details
For this experiment, PyTorch was used as the training framework. To accelerate the convergence of the model, Res2Net-50 [27], pretrained on the ImageNet [28] dataset, was used to initialize the parameters of DETDNet. The training batch size was set to 16, the optimizer was Adam, the initial learning rate was set to 0.001, and the model was trained for 100 epochs, with the learning rate decaying by a factor of 0.5 every eight epochs. The hardware was an NVIDIA TITAN RTX (24 GB) GPU.
4.3. Evaluation Metrics
Considering that the remote sensing CD task can be seen as a binary classification task, the precision, recall, F1 score, intersection over union (IoU), and overall accuracy (OA) were selected as the evaluation metrics to quantitatively validate the efficiency of the algorithm presented in this article. These evaluation metrics are commonly used to measure binary classification models in machine learning. Their expressions are listed below:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 × Precision × Recall / (Precision + Recall)
IoU = TP / (TP + FP + FN)
OA = (TP + TN) / (TP + FP + TN + FN)

where TP denotes the total number of changed pixels predicted to be changed, FP denotes the total number of unchanged pixels predicted to be changed, TN denotes the total number of unchanged pixels predicted to be unchanged, and FN denotes the total number of changed pixels predicted to be unchanged.
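As a concrete check of these definitions, the five metrics can be computed directly from the four pixel counts; this is a generic sketch, not the paper's evaluation code, and the counts passed in are arbitrary example values:

```python
def cd_metrics(tp, fp, tn, fn):
    """Precision, recall, F1, IoU and OA from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    iou = tp / (tp + fp + fn)
    oa = (tp + tn) / (tp + fp + tn + fn)
    return precision, recall, f1, iou, oa

# example counts: 80 true changes found, 10 false alarms,
# 900 correct non-changes, 10 missed changes
p, r, f1, iou, oa = cd_metrics(tp=80, fp=10, tn=900, fn=10)
print(f"{p:.3f} {r:.3f} {f1:.3f} {iou:.3f} {oa:.3f}")
# -> 0.889 0.889 0.889 0.800 0.980
```

Note that the IoU is always the strictest of the overlap metrics (it penalizes both FP and FN in the denominator), while the OA can stay high even for poor change maps when unchanged pixels dominate, which is why the F1 score and IoU are treated as the main metrics in this paper.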
4.4. Performance Comparison
To prove the superiority of the method put forward in this article, DETDNet was compared with some advanced methods in current CD tasks, including FC-EF [7], FC-Siam-conc [7], FC-Siam-diff [7], CDNet [34], STANet [10], BiT [20], SNUNet [14], and ChangeFormer [21]. For fairness, we conducted the comparative experiments in the same environment, that is, with the same software environment, hardware environment, and dataset processing methods.
4.4.1. Comparative Experiments on the BCDD Dataset
We display the quantitative experimental results of the various algorithms on the BCDD dataset in Table 4. The results in the table reveal that our algorithm achieved 93.84%, 91.59%, 92.70%, 86.40%, and 99.32% for the precision, recall, F1 score, IoU, and OA, respectively, which were higher than those of all the other methods and exceeded the second-best method by 3.3%, 5.56%, and 0.2% on the main metrics of the F1 score, IoU, and OA, respectively. The highest precision and recall also indicate that our model is more robust than the other methods. The above results prove that the method in this article surpasses these comparative methods.
Figure 7 illustrates the visualization results of the comparative experiments performed on the BCDD dataset. For easier observation, the TP, TN, FP, and FN pixels are marked in the figure with white, black, red, and green, respectively. It is obvious that our model can avoid the FP and FN more effectively than the other methods. The first and third rows show that the comparison methods had poorer detection accuracy for the change edges, leading to boundary misses or misdetections and thus blurrier boundaries, while our method detected clearer boundaries, probably owing to our TA module, which can efficiently extract the spatial relationships between bitemporal features. As seen from the second and fourth rows, the influence of the land cover and color around the buildings in the bitemporal images caused the comparison methods to easily detect non-changing areas as changing areas. By contrast, the proposed method circumvented this drawback. This is mainly due to the MFE module. By increasing the receptive field, the MFE module can obtain more global feature relationships and enhance the extraction of semantic information, thus reducing the influence of pseudo-changes on the CD results. Moreover, due to the use of dilated convolution and strip convolution, our method is superior to the second-best model in terms of the number of parameters.
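The parameter saving from strip convolution mentioned above follows from counting weights: a k × k kernel costs k² · Cin · Cout parameters, while a 1 × k convolution followed by a k × 1 convolution costs only 2k · Cin · Cout. A quick check in PyTorch (the channel sizes and k = 7 are arbitrary illustrative choices, not DETDNet's actual configuration):

```python
import torch.nn as nn

def n_params(module):
    """Total number of learnable parameters in a module."""
    return sum(p.numel() for p in module.parameters())

cin, cout, k = 64, 64, 7
# ordinary square kernel: k*k*cin*cout weights
square = nn.Conv2d(cin, cout, k, padding=k // 2, bias=False)
# strip pair: 1xk then kx1, covering the same k x k footprint
strip = nn.Sequential(
    nn.Conv2d(cin, cout, (1, k), padding=(0, k // 2), bias=False),
    nn.Conv2d(cout, cout, (k, 1), padding=(k // 2, 0), bias=False),
)
print(n_params(square))  # 200704 = 7*7*64*64
print(n_params(strip))   # 57344  = 2*7*64*64
```

For k = 7 the strip pair needs only 2/7 of the square kernel's parameters while retaining the same receptive-field footprint, which is the source of the parameter advantage claimed above.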
4.4.2. Comparative Experiments on the LEVIR-CD Dataset
The quantitative results of the comparison experiments conducted on another public dataset, LEVIR-CD, are exhibited in Table 5. As Table 5 shows, our algorithm was significantly better than the other algorithms on the main performance metrics, the F1 score, IoU, and OA, exceeding the second-best algorithm by 0.82%, 1.37%, and 0.06%, respectively.
Figure 8 depicts the visualization results of the comparison experiments. As the spectral information of images taken at different times may differ, misdetections or missed detections may occur; as shown in the first row, for the changed area in the bottom right corner, because of the influence of the spectral information, some of the other methods showed red and some showed green, while our method detected the changed area more accurately. As seen in rows 2 and 7, our method detected both large target regions and small change regions relatively accurately, owing to the proposed idea of combining the local features from the encoder with the more global features from the decoder. In rows 3, 4, 5, and 6, the shadowed parts of the images caused the comparison methods to easily detect non-changing regions as changing regions; however, our method successfully avoided such pseudo-changes.
4.4.3. Comparative Experiments on the SYSU-CD Dataset
The quantitative results of the various comparison methods and DETDNet on the SYSU-CD dataset are exhibited in Table 6. As displayed in Table 6, our model outperformed the second-highest model in the F1 score, IoU, and OA by 1.41%, 1.98%, and 0.71%, respectively. The visualization results are presented in Figure 9. The main changes in rows 1, 2, 3, 4, and 6 were building expansions. It can be seen that whether for the new buildings around the vegetation in rows 1 and 2, the expansion of the seaside buildings in row 3, or the building expansions around the highway in rows 4 and 6, our model was better able to handle the change boundaries and obtain more accurate boundaries. For the change area in the middle of row 2 and the color change of the building roof in row 4, our model better recognized the pseudo-change caused by color. For the vegetation changes in rows 2 and 5, we can see that our model also produced better predictions. From a comprehensive point of view, since the scenes in the SYSU-CD dataset are relatively complex, a single image may contain various types of changes, which makes detection more difficult, and our model was comparatively better at extracting the features of the different changes and arriving at a more accurate change map.
4.5. Ablation Experiments
We conducted ablation experiments, mainly on the BCDD and LEVIR-CD datasets, to determine the contribution of each part of the model; these experiments were divided into the following four aspects.
4.5.1. Effectiveness of the Pretraining
Before training, we first initialized the weight parameters using Res2Net-50 [27] pretrained on the ImageNet [28] dataset, with the pretrained model provided by the authors of Res2Net-50 [27]. To demonstrate the necessity of pretraining, an ablation experiment was conducted on the BCDD dataset. The results are tabulated in Table 7, where × indicates no pretraining and √ indicates pretraining. From the results and the visualization in Figure 10, it is noticeable that the metrics were significantly lower without pretraining than with pretraining, which confirms the necessity of pretraining.
4.5.2. The Selection of the Feature Fusion Method
As described above, different feature fusion methods were used between the branches of the encoder and decoder: the encoder used the CAC to fuse the dual-branch features, and the decoder used the TA to aggregate the triple-branch features. To confirm the suitability of the two fusion methods for the encoder and decoder, we tried using the TA in the encoder and the CAC in the decoder and implemented the corresponding ablation experiments on the BCDD dataset. Table 8 includes the results, which show that using the TA in the encoder was somewhat worse than the original fusion mode in the F1 score, IoU, and OA. In addition, we changed the TA in the decoder to the CAC, and Table 9 lists the results, whose performance was also reduced compared with the original TA fusion method. In the encoder stage, the CAC can simply and effectively fuse the bitemporal features, whereas using the TA module would cause feature redundancy. In the decoder stage, on account of the integration of local and multiscale contextual features, the features are relatively more complex, and for the fusion of the left and right branches with the middle branch, the TA can extract the change features more accurately.
4.5.3. Impact of the MFE
The model in this paper uses a modified RFB module, which we refer to as the MFE. According to [30], the MFE has a stronger feature representation and makes the model more robust compared with the original RFB [24] and RFB-s [24]. The related ablation experimental results on the LEVIR-CD dataset are provided in Table 10. In the specific ablation setup, the MFE module in the decoder was replaced by the RFB and RFB-s, respectively, which are also used for extracting multiscale contextual features. The metrics of the MFE module were remarkably higher than those of the RFB and RFB-s, except for the recall, which was slightly lower.
4.5.4. The Importance of Each Branch of the MFE
We performed a variety of experiments on the BCDD dataset to determine the implications of each branch of the MFE for the model as a whole. The MFE contains a total of five branches; in the experimental setup, four of them were kept unchanged while branches 0, 1, 2, and 3 were removed in turn. This was done to demonstrate the importance of extracting multiscale contextual features. The resulting data in Table 11 imply that the model yielded better performance with the full MFE than in the other cases. Moreover, Figure 11 shows the trend of the F1 score under the various settings during training.
5. Conclusions
The model in this paper is designed for remote sensing image change detection. It uses a dual-branch structure in the encoder to extract local features, a triple-branch structure in the decoder to extract more global contextual information, and a TA module to effectively fuse the left and right branches with the middle branch. We validated the performance of DETDNet on the SYSU-CD, LEVIR-CD, and BCDD datasets. On all three datasets, our model reached the optimal values of the F1 score, OA, and IoU. On the BCDD dataset, our F1 score, OA, and IoU were 3.3%, 0.2%, and 5.56% higher than those of the second-best method, respectively. On the LEVIR-CD dataset, our model outperformed the next-best method by 0.82%, 0.06%, and 1.37%, respectively. On the SYSU-CD dataset, our model was 1.41%, 0.71%, and 1.98% higher than the second-best method, respectively. The BCDD dataset mainly contains large sparse buildings, and LEVIR-CD contains small dense buildings. However, both contain pseudo-changes, and their data volumes are relatively small compared with SYSU-CD; these two datasets thus test the model's ability to learn and explore potential relationships with a small amount of data. SYSU-CD has a large amount of data but not highly accurate labeling, which tests the model's generalization ability. In addition, we conducted four sets of ablation experiments to prove the significance of each component of the model. Although the receptive field was increased by the MFE module, the maximum receptive field was calculated to be 23; hence, global features cannot be fully obtained in the shallow layers. Based on this, our subsequent work will focus on using the Transformer or an MLP to obtain global features and achieve higher performance.