1. Introduction
Remote sensing images have long served as a crucial conduit of information for humankind [1]. They are generated by capturing visual data through photographic or non-photographic methods from a distance, without physical contact with the objects of interest. Remote sensing techniques enable the swift and extensive surveillance of the Earth’s surface, furnishing accurate and impartial information on terrestrial features. Therefore, remote sensing technology represents the sole means currently at our disposal to perform the rapid and real-time monitoring of the Earth’s surface over large areas [2,3,4].
Change detection in remote sensing imagery extracts and analyzes changes on the Earth’s surface, including changes in the location, extent, and nature of an area, by applying mathematical models and image processing theory to multi-source remote sensing images of the same surface acquired at different times, combined with the characteristics of ground objects, the remote sensing imaging mechanism, and relevant geospatial data [5]. Change detection is now widely applied in monitoring urban changes [6], analyzing land use and land cover changes [7], assessing natural disasters [8], and environmental monitoring [9]. With the development of remote sensing platforms and sensors, a large number of multi-source, multi-scale, and multi-type remote sensing images have been accumulated; however, remote sensing imaging is easily affected by external conditions, and the images may be coarse and contain many mixed pixels. Super-resolution mapping (SRM) can analyze such mixed pixels and obtain mapping information at the sub-pixel level. Wang et al. [10] proposed a spatial–spectral correlation (SSC) method that uses a mixed spatial attraction model (MSAM) based on the linear Euclidean distance to obtain spatial correlation, combined with a spectral correlation method based on the nonlinear Kullback–Leibler distance (KLD); combining spatial and spectral correlation reduces the effect of linear and nonlinear imaging conditions. Li et al. [11] used the spatial distribution of fine-spatial-resolution images to improve the accuracy of spectral unmixing: spectral unmixing is performed on coarse-spatial-resolution images, the spatial distribution information of fine-spatial-resolution images is integrated into the original abundance images, and the improved abundance values are used to generate fine-spatial-resolution sub-pixel maps with sub-pixel mapping methods. Such sub-pixel-level mapping research can reduce the influence of external factors on remote sensing images. At the same time, remote sensing images also exhibit high resolution, strong timeliness, and diversity [12,13]. As a result, low-to-medium-resolution change detection is no longer sufficient to meet demands, and high-resolution remote sensing image change detection has become a popular but difficult research topic in the field of remote sensing [14].
In the early stages of remote sensing change detection research, manual visual interpretation was primarily used. However, such methods are labor-intensive, time-consuming, and have low interpretation efficiency [1], making them inadequate for the practical needs of production and development. Subsequently, traditional change detection methods were introduced, such as the post-classification comparison method [15,16,17,18], the direct comparison analysis method [19,20,21,22], the change vector analysis (CVA) method [23,24], and the principal component analysis (PCA) method [25,26,27,28]. While these methods can quickly identify change areas, they impose higher requirements on image preprocessing and often require threshold segmentation, cluster analysis, and similar steps. The change detection process therefore still relies heavily on manual intervention, which can lead to large errors and a failure to fully utilize the rich ground object information contained in high-resolution images, resulting in low change detection accuracy and poor performance. The development of techniques for the automatic, machine-based determination of change areas is thus a critical research direction in this field [29].
Since the introduction of AlexNet in 2012 [30], deep learning has flourished and has been rapidly applied to change detection; it can autonomously learn multi-level image features through training, without manual feature extraction, and has wide applicability [31]. Deep learning offers powerful image processing, data analysis, and automation capabilities. Change detection is essentially a classification problem that separates changed pixels from unchanged pixels in the image, and deep learning can be used to solve this type of problem effectively [32,33]. Unlike traditional methods, change detection combined with deep learning is treated as a semantic segmentation task, which eliminates the need for manual intervention and allows end-to-end detection. This approach can quickly process large amounts of data, improving both efficiency and accuracy. With the introduction of deep learning, change detection based on remote sensing images has developed rapidly, avoiding tedious manual feature extraction and classification and providing the technical basis for the intelligent and efficient processing of remote sensing images [34]. Based on network structure, deep learning change detection can currently be divided into two categories: single-branch and multi-branch structures [35].
The single-branch structure employs a single sub-network: the bi-temporal images are combined into a single input by difference comparison or stacking, so that only one core feature extraction model is needed for change detection. For example, Chang Zhenliang et al. [36] optimized the DeepLabv3+ network using heterogeneous receptive fields and multi-scale feature tensors. The change detection model proposed by Zhang et al. [37] uses a feature pyramid and a trapezoidal U-Net; in the encoder, the model incorporates dilated convolutions with different dilation rates to ensure sufficient receptive field coverage for objects of different sizes, enabling it to learn multi-scale features effectively. However, the U-Net architecture may capture less context because of the trade-off between precise localization and contextual information. Zhang Cuijun et al. [14] proposed a change detection model that combines asymmetric convolution blocks with the convolutional block attention module (CBAM) to improve U-Net, converting change detection into a pixel-level binary classification task and achieving end-to-end detection. Tian Qinglin et al. [38] proposed a building change detection method based on an attention pyramid network: dilated convolution and a pyramid pooling structure are added to the CNN encoder to expand the receptive field and extract multi-scale features, an attention mechanism is introduced in the decoder, and a top-down densely connected feature pyramid fully fuses multi-level features to address the missed detections and indistinct boundaries that occur for multi-scale targets. Ji Shunping et al. [7] proposed the FACNN model based on the U-Net framework, which obtains finer multi-scale information by replacing the normal convolutions in the encoder with dilated convolutions and using atrous spatial pyramid pooling (ASPP). Similarly, Peng et al. [39] proposed an improved U-Net++ detection model with a multi-side output strategy that exploits both global and fine-grained information to mitigate error propagation in change detection. Papadomanolaki et al. [40] used U-Net to extract spatial features and an LSTM to obtain temporal features, fully exploiting the spatio-temporal information of high-resolution images and improving change detection performance. Wang et al. [41] applied Faster R-CNN to the change detection of high-resolution images and designed two detection models: MFRCNN merges the bands of the bi-temporal images and feeds them into Faster R-CNN, while SFRCNN generates difference images before detection; their experiments show that SFRCNN achieves higher detection accuracy, whereas MFRCNN is simpler and more automatic. However, single-branch models, which typically feed images into the network through difference comparison or stacking, can lose the high-dimensional features of the original images and thus introduce detection errors.
The multi-branch structure in deep learning-based change detection requires two data inputs and fuses the bi-temporal features extracted by two sub-networks to obtain change information. It can be further divided into two categories: Siamese networks and pseudo-Siamese networks. The main difference is that the two branches of a Siamese network share weights during feature extraction, whereas those of a pseudo-Siamese network do not. Among multi-branch models with shared sub-network weights, Chopra et al. [42] used a Siamese network for face recognition verification, which preserves the complete spatial structure of the image and requires no prior category information, giving it an advantage over other networks. Zhan Y. et al. [43] used a Siamese convolutional neural network with a contrastive loss function for pixel-level remote sensing change detection: each pixel of the input image generates a 16-dimensional vector, and the similarity of the vectors generated for corresponding pixels of the two images is compared to determine, pixel by pixel, whether a change has occurred. Zhang et al. [44] employed a Siamese convolutional network to extract highly representative deep features and presented a deeply supervised Siamese fully convolutional neural network (IFN), whose deep supervision strategy supervises the model’s output and enhances the network’s discriminative ability. Similarly, Raza et al. [45] used a CBAM for feature fusion and designed an efficient convolution module to improve U-Net++, enhancing the model’s ability to extract fine-grained image features and improving detection performance. Wang et al. [46] proposed a deeply supervised network based on the CBAM with a Siamese structure, which makes full use of the ground feature information of the image. Similarly, Jiang et al. [47] proposed a Siamese network based on feature pyramids, which uses an encoder–decoder architecture with a shared attention module at the end of the encoder so that the model can find corresponding objects in the other image, addressing the complex relationship between buildings and their displacements in building change detection on remote sensing orthophotos. Shi et al. [48] used a convolutional neural network (CNN) with a Siamese structure to extract image features, introducing an attention mechanism and a deep supervision module to obtain change maps with rich spatial information and effectively reducing the pseudo-changes and noise caused by external factors. Zhang et al. [49] proposed the DSIFN model, which extracts deep image features with a Siamese CNN and performs change detection with a deeply supervised difference discrimination network; an attention mechanism in the discrimination network makes full use of the fine details and complex textures of high-resolution images. Bandara et al. [50] used two transformer encoders to extract multi-level features and designed a lightweight multilayer perceptron decoder to fuse feature differences and predict change information, simultaneously capturing long-range contextual information to identify changes at both spatial and temporal scales. Fan Wei et al. [51] proposed the MDFCD model, which extracts deep features from multiple layers with a Siamese CNN and fuses high-level and low-level features at multiple scales to make full use of the texture features and semantic information in images. Chen et al. [52] proposed the SiamCRNN model for the change detection of heterogeneous data, combining a CNN and an RNN: spatial–spectral features are extracted with a Siamese CNN (a pseudo-Siamese structure is used for heterogeneous data), and change information is then fully mined with the RNN. Fang et al. [53] proposed the DLSF model with a hybrid Siamese structure, consisting of two branches for dual-learning-based feature extraction and Siamese-based change detection, respectively. Wang et al. [54] proposed the DSCNH model, a Siamese convolutional neural network based on a hybrid convolutional feature module, to achieve the change detection of multi-sensor images. Additionally, Xu et al. [55], Liu et al. [56], Wiratama et al. [57], and Touati et al. [58] developed pseudo-Siamese networks, composed of two parallel convolutional neural networks with non-shared weights, to accomplish the change detection task. Multi-branch models can effectively preserve the high-dimensional features of the image; however, the complexity of high-resolution image information can still cause blurred detection boundaries, missed detection of small targets, and numerous pseudo-changes.
Continuous exploration by researchers has led to the widespread application and significant development of remote sensing change detection technology in many fields. However, the diversity and variability of remote sensing sensors, changed areas, and real-world scenarios preclude a change detection method that is universally applicable to all scenarios, leaving existing methods lacking in generality. Based on a comprehensive analysis of the current research status, high-resolution remote sensing image change detection faces several challenges and shortcomings:
Remote sensing images exhibit internal complexity, and their features are expressed differently than in everyday images. Remote sensing images are large-scale observations of the Earth, with original sizes far larger than those of regular images. As image resolution increases, finer details of ground features can be resolved, increasing the information content of the images; nevertheless, most research scenes are limited to specific scenarios with relatively low information content, which complicates change detection in high-resolution remote sensing images. Because the multi-source heterogeneous data produced by the continuous development of remote sensing technology differ in imaging modes, traditional change detection techniques cannot accurately identify change areas, and current methods, which mainly target change detection between specific data sources, do not generalize to multimodal application scenarios.
Low automation is a major challenge of traditional change detection methods, which require considerable manual intervention, leading to high uncertainty and a significant influence of human factors on the results. Automatically extracting features, simplifying the process, and improving the level of automation remain current difficulties. Moreover, owing to the diversity of the datasets used in research, change detection methods with better universality and generalization need further investigation.
Blurred boundaries and the missed detection of small targets. High-resolution remote sensing image information is complex, and shallow information is lost when the change detection network extracts features, resulting in blurred detection boundaries and missed small targets; further research on multi-scale information extraction is needed to improve detection performance. In addition, remote sensing imaging is often disturbed by external factors such as illumination, sensors, and human activity, which introduce pseudo-changes. Because change detection compares two remote sensing images acquired at different times, such interference during imaging leads to inaccurate research results. Reducing the impact of pseudo-changes has therefore become one of the primary challenges in remote sensing change detection, and it is critical to devise strategies that minimize their interference and exclude them from the results.
In response to the limitations of existing research, and to address the blurred detection boundaries, missed detection of small targets, and numerous pseudo-changes in the change detection of high-resolution remote sensing images, while fully utilizing the global information of images and resolving the shallow information loss and semantic misalignment caused by the direct concatenation of features during fusion, we developed a new change detection algorithm called Siam-FAUnet. The algorithm incorporates a Siamese structure, an atrous spatial pyramid pooling module, and a flow alignment module to more accurately extract the changed areas and features of bi-temporal remote sensing images.
The contributions of this research are highlighted as follows:
This paper proposes a high-resolution image change detection model, called Siam-FAUnet, based on the Siamese structure. The Siamese network was first proposed by Bromley et al. [59] in 1994; here it is used to extract change features and alleviate the influence of sample imbalance. Moreover, the two-channel input of the two temporally separated remote sensing images reduces the image distortion caused by channel stacking.
The proposed model employs an encoder–decoder architecture, in which an enhanced VGG16 performs feature extraction in the encoder. To capture contextual features across multiple scales, the atrous spatial pyramid pooling (ASPP) module [60] is employed: atrous convolutions with varying dilation rates sample the input features, and the results are concatenated to enlarge the network’s receptive field, enabling the finer classification of remote sensing images. In the decoder, the flow alignment module (FAM) [61] combines the features extracted by the encoder. FAM learns the semantic flow between features of different resolutions and efficiently transfers semantic information from coarse features to refined high-resolution features, mitigating the semantic misalignment caused by direct feature concatenation during fusion.
The Siam-FAUnet model is based on the Siamese structure, which effectively preserves the complexity of high-dimensional image information, reduces the impact of sample imbalance on detection, and reduces image distortion. The ASPP module extracts multi-scale contextual features, increasing the network’s receptive field, while the FAM fuses feature information to resolve the semantic misalignment caused by the direct concatenation of features. By combining these modules, the model addresses the loss of shallow information in high-dimensional image features, makes full use of the global information of images to achieve multi-scale information extraction, and effectively alleviates the blurred boundaries, missed detection of small targets, and numerous pseudo-changes in change detection.
The remainder of the article is organized as follows. Section 1 introduces the background and significance of remote sensing change detection research, summarizes the current research status, and identifies the research difficulties and shortcomings. Section 2 introduces the structure of the proposed model, including the overall architecture, each module, and the loss function, as well as the experimental setup: the experimental data, environment, evaluation metrics, and comparison methods. Section 3 presents the experimental results qualitatively and quantitatively. Section 4 summarizes and discusses the results and compares the findings with those of other studies. Section 5 summarizes the main contents and research work of the paper, analyzes its shortcomings, and provides an outlook on future research directions.
2. Materials and Methods
2.1. Structure of Change Detection Network Based on Siam-FAUnet
The proposed Siam-FAUnet model is illustrated in Figure 1 and comprises two parts: a prediction module and a residual correction module. The prediction module, composed of an encoder and a decoder, receives remote sensing images from two different temporal periods (T1 and T2). By combining the atrous spatial pyramid pooling (ASPP) and the flow alignment module (FAM), the network performs multiple consecutive convolution and deconvolution operations on the image information to extract deep, multi-scale features and predict the changing areas. The residual correction module reduces the residual between the prediction and the ground truth, thereby enhancing the accuracy of change detection and producing more accurate binary change maps.
2.2. Prediction Module
To address the problem that a limited receptive field cannot cover multi-scale target objects, which leads to insufficiently refined change detection results and indistinct change boundaries, the prediction module adopts a multi-branch encoder–decoder network. Single-branch models, which usually feed images into the network by difference comparison or stacking, easily lose the high-dimensional features of the original image. The Siamese structure of Siam-FAUnet effectively preserves these high-dimensional features; however, owing to the complexity of high-resolution image information and the loss of shallow feature information, detection results may still suffer from blurred boundaries, missed small targets, and pseudo-changes.
The atrous spatial pyramid pooling (ASPP) and the flow alignment module (FAM) are used within the encoder–decoder structure. The ASPP module introduces dilation rates into the convolutional layers, defining the spacing between the values sampled by the kernel; this avoids the loss of internal data structure, spatial hierarchical information, and small-object information that ordinary convolution and downsampling cause, thereby obtaining a larger receptive field, capturing multi-scale information, and improving the detection of small-object changes. However, atrous convolution can cause a gridding effect and break the continuity of information, and an excessively large dilation rate makes distant information irrelevant, so detection becomes effective only for large objects and semantic information is lost during feature transfer. The FAM learns the semantic flow between features of different resolutions and generates semantic flow fields, which effectively transfer semantic information from coarse features to refined high-resolution features and solve the semantic misalignment caused by direct concatenation during feature fusion. The multi-scale contextual information obtained by the ASPP module, combined with the feature mapping of the FAM, recovers part of the semantic information lost in the encoding stage, effectively fusing multi-scale information and addressing both shallow information loss and the alignment of feature positions across scales.
The encoder employs the first four blocks of VGG16 with a Siamese structure for feature extraction. Each block consists of two convolutional layers and a max pooling layer with a 2 × 2 window and a stride of 2; each convolutional layer uses a 3 × 3 kernel and is followed by batch normalization and a ReLU activation function.
The four blocks contain 32, 64, 128, and 256 convolution kernels, respectively. Because remote sensing images contain objects of varying scales with inconsistent textures and colors, using only pooling layers for semantic information extraction may fail to capture entire objects, leading to indistinct change boundaries in the results. To mitigate this, atrous spatial pyramid pooling (ASPP) is incorporated in the last layer of the encoder to fuse features, obtaining contextual information at multiple scales within the image and improving detection accuracy.
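As a concrete illustration, the shared-weight encoder described above can be sketched in PyTorch. This is a minimal sketch based only on the description in the text (two 3 × 3 convolutions per block, batch normalization, ReLU, 2 × 2 max pooling, and 32/64/128/256 kernels); the class and function names are ours, not the authors' released code:

```python
import torch
import torch.nn as nn

def vgg_block(in_ch, out_ch):
    # Two 3x3 convolutions, each followed by batch normalization and ReLU,
    # then 2x2 max pooling with stride 2, as described for each encoder block.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(kernel_size=2, stride=2),
    )

class SiameseEncoder(nn.Module):
    # First four VGG16-style blocks with 32, 64, 128, and 256 kernels.
    # The same weights process both temporal images, which is what makes
    # the structure Siamese rather than pseudo-Siamese.
    def __init__(self):
        super().__init__()
        self.blocks = nn.Sequential(
            vgg_block(3, 32),
            vgg_block(32, 64),
            vgg_block(64, 128),
            vgg_block(128, 256),
        )

    def forward(self, t1, t2):
        return self.blocks(t1), self.blocks(t2)
```

Feeding the two temporal images through this encoder yields two 256-channel feature maps at 1/16 of the input resolution, which the ASPP module then processes.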
The decoder, corresponding to the encoder, is also composed of five blocks. The first four blocks parallel the corresponding encoder blocks, and feature mapping is performed with the flow alignment module (FAM) to recover some of the semantic information lost during encoding. The FAM replaces the traditional direct concatenation during up-sampling, reducing the loss of semantic information during feature transfer.
Atrous spatial pyramid pooling (ASPP) addresses the large variation in object scale in remote sensing images and the randomness of change locations. Traditional pooling samples at a fixed scale, so the network cannot fully utilize the global information of the image, and the segmentation of objects of different scales varies significantly. The ASPP module effectively addresses these issues; its structure is shown in Figure 2. ASPP applies atrous convolutions with various dilation rates to sample the input features and then concatenates the results, after which a 1 × 1 convolutional layer reduces the number of channels of the feature map to the desired amount. Incorporating the ASPP module enlarges the network’s receptive field, allowing features to be extracted from a wider area while preserving the details of the original image, so the network can classify remote sensing images more thoroughly and improve change detection performance.
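The parallel-branch-and-concatenate structure just described can be sketched as follows. The dilation rates (1, 6, 12, 18) are typical ASPP values and an assumption here, since the exact rates used in the model are not listed at this point in the text:

```python
import torch
import torch.nn as nn

class ASPP(nn.Module):
    # Parallel atrous convolutions with different dilation rates sample the
    # input features; the branch outputs are concatenated, and a 1x1
    # convolution reduces the channels to the desired number. The varying
    # dilation rates enlarge the receptive field at no extra resolution cost.
    def __init__(self, in_ch, out_ch, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=r, dilation=r)
            for r in rates
        ])
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, kernel_size=1)

    def forward(self, x):
        feats = torch.cat([b(x) for b in self.branches], dim=1)
        return self.project(feats)
```

Because padding equals the dilation rate for each 3 × 3 branch, every branch preserves the spatial size of the input, so the outputs can be concatenated directly.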
The structure of the flow alignment module (FAM) is depicted in Figure 3, where Figure 3a illustrates the overall structure of FAM and Figure 3b shows a schematic diagram of the semantic flow alignment process. FAM takes the feature maps Fn and Fn−1 from two adjacent layers, where the resolution of Fn−1 is higher than that of Fn. FAM first increases the resolution of Fn to match that of Fn−1 through bilinear interpolation and then concatenates the two feature maps. The concatenated feature maps are processed by a sub-network comprising two 3 × 3 convolutional layers, which generates a semantic flow field from the input feature maps Fn and Fn−1. This flow field improves the up-sampling of the low-resolution features: it maps each pixel point Pn−1 on the spatial grid to level n through simple addition, and a differentiable bilinear sampling mechanism then interpolates the values of the four neighboring pixels of Pn (upper left, upper right, lower left, and lower right) to estimate the final output of the FAM module, F̃n. This final output can be represented as in Formula (1):

F̃n(Pn−1) = ∑P∈N(Pn) ωP·Fn(P),  (1)
where ωP is the bilinear interpolation kernel weight, estimated based on distance, Fn(P) represents the value of the low-resolution feature map at position P, and N(Pn) denotes the four neighboring positions of Pn. The semantic flow field allows FAM to learn the correlation between features of different resolutions. The generated offset field, also known as the semantic flow field, effectively transfers semantic information from coarse features to refined high-resolution features, resolving the semantic misalignment caused by direct feature concatenation during fusion.
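The flow-alignment idea can be sketched with PyTorch's differentiable `grid_sample`, which performs exactly the four-neighbor bilinear interpolation described above. This is a simplified illustration, not the authors' exact module: the width of the flow-prediction sub-network, the assumption that both inputs share a channel count, and the treatment of the flow in normalized grid coordinates are all simplifications of ours:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FAM(nn.Module):
    # Predicts a 2-channel semantic flow field from the concatenation of the
    # upsampled coarse features and the fine features, then warps the coarse
    # features with differentiable bilinear sampling.
    def __init__(self, channels):
        super().__init__()
        self.flow = nn.Sequential(
            nn.Conv2d(channels * 2, channels, kernel_size=3, padding=1),
            nn.Conv2d(channels, 2, kernel_size=3, padding=1),
        )

    def forward(self, f_coarse, f_fine):
        n, _, h, w = f_fine.shape
        # Bilinearly upsample the coarse map to the fine resolution first.
        up = F.interpolate(f_coarse, size=(h, w), mode="bilinear",
                           align_corners=False)
        # Two 3x3 convolutions on the concatenation produce the flow field.
        flow = self.flow(torch.cat([up, f_fine], dim=1))  # (n, 2, h, w)
        # Build a sampling grid in [-1, 1] and shift it by the flow
        # (flow treated directly in normalized grid units for simplicity).
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, h),
                                torch.linspace(-1, 1, w), indexing="ij")
        grid = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(n, -1, -1, -1)
        grid = grid + flow.permute(0, 2, 3, 1)
        # grid_sample interpolates the four neighbors of each mapped point,
        # mirroring the weighted sum in Formula (1).
        return F.grid_sample(up, grid, mode="bilinear", align_corners=False)
```

Because every step (interpolation, convolution, grid sampling) is differentiable, the flow field is learned end to end with the rest of the network.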
2.3. Residual Correction Module
If the detection results of the prediction module are not corrected, problems such as patchy holes and unsmooth detection boundaries are likely to remain, leaving a large difference between the final prediction and the actual changes. To improve the accuracy of change detection, the proposed model therefore includes a residual correction module based on dilated convolution. This module takes the output of the prediction module, i.e., the semantic information of the image, and learns the difference (the residual) between this output and the ground truth during training. Applying this residual correction further refines the prediction and yields more accurate semantic segmentation results.
High-resolution remote sensing images pose a significant challenge for feature extraction, as the characteristics of ground objects, such as texture, spectrum, and shape, vary greatly. Traditional methods fail to account for the different scales of ground objects and their surrounding background and thus cannot extract the global features of the image, leading to unclear object boundaries, incomplete detection, and other issues in the change detection results. To address this, the proposed dilated convolution residual correction module has multi-scale receptive fields, enabling the extraction of deeper image features.
The structure of the dilated convolution residual correction module is shown in Figure 4. The module comprises four convolutional layers with dilation rates of 2, 4, 8, and 16, each with 32 dilated convolution kernels, and fuses the feature maps of the different receptive fields by superposition; each convolutional layer is followed by batch normalization and a ReLU activation function. The input of the module contains the initial prediction information, and adding and fusing the input with the obtained feature maps yields more detailed change information. The final corrected change detection image is obtained by using the Softmax function to classify changed and unchanged pixels.
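A sketch of this module is given below. The parallel arrangement of the four dilated branches, their fusion by summation, and the 1 × 1 projection back to the input width are assumptions based on the description above, not a reproduction of the authors' implementation:

```python
import torch
import torch.nn as nn

class ResidualCorrection(nn.Module):
    # Four dilated convolutions (rates 2, 4, 8, 16; 32 kernels each) provide
    # multi-scale receptive fields. Their outputs are fused by superposition
    # (summation), projected back to the input width, added to the input
    # prediction features, and classified with softmax.
    def __init__(self, in_ch, num_classes=2):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, 32, kernel_size=3, padding=r, dilation=r),
                nn.BatchNorm2d(32),
                nn.ReLU(inplace=True),
            )
            for r in (2, 4, 8, 16)
        ])
        self.project = nn.Conv2d(32, in_ch, kernel_size=1)
        self.classify = nn.Conv2d(in_ch, num_classes, kernel_size=1)

    def forward(self, x):
        residual = sum(b(x) for b in self.branches)  # fuse by superposition
        corrected = x + self.project(residual)       # add residual to input
        # Softmax over the class dimension separates changed from
        # unchanged pixels in the final change map.
        return torch.softmax(self.classify(corrected), dim=1)
```

The additive skip from the input to the output is what makes this a residual correction: the branches only need to learn the discrepancy between the prediction and the ground truth.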
2.4. Loss Function
The cross-entropy (CE) loss function [62] is a loss function based on empirical error minimization; its definition is shown in Formula (2), where y is the actual value of a sample and y′ is the predicted value of the model output. Although the cross-entropy loss can optimize the overall classification accuracy of the model, when the numbers of samples of different categories in the dataset are unevenly distributed, the predictions of a network trained with this loss are biased towards the class with more samples, while classes with fewer samples are ignored, resulting in poor predictions for categories with small sample proportions.
The Dice loss function, defined in Formula (3), considers the proportion of the intersection of same-category pixels in the whole during training, rather than the overall classification accuracy as the cross-entropy loss does. This makes the model less sensitive to the uneven distribution of samples among categories, improving predictions for categories with small sample sizes. Here, M represents the total number of pixels, Pm is the prediction result, rm is the true value of the sample, and ε is a very small real number used to prevent division by zero.
To optimize the performance of the change detection model, this paper combines the CE loss function with the Dice loss function. The CE loss, commonly used in image classification tasks, measures the difference between the predicted output and the true sample values; however, in remote sensing change detection the distribution of changed and unchanged pixels is often imbalanced, which introduces errors. The Dice loss, by contrast, measures the proportion of the intersection of all same-category pixels in the whole, and is therefore less affected by sample imbalance. By assigning a weight λ, the combined CE–Dice loss allows the model to achieve better performance in detecting changes in remote sensing images. Its definition is shown in Formula (4). Here, LCE–Dice denotes the CE–Dice loss used in this work, defined as the weighted sum of the CE loss (LCE) and the Dice loss (LDice), where λ adjusts the weight of the two combined terms. Because performance depends on the value of λ, an experiment is conducted to determine the optimal λ for the proposed loss function.
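A minimal per-pixel sketch of the combined loss follows. Since Formulas (2)–(4) are not reproduced here, the exact weighting form λ·LCE + (1 − λ)·LDice and the binary (two-class) formulation are assumptions; the paper's tensor implementation would follow the same arithmetic.

```python
import math

def ce_loss(y_true, y_pred, eps=1e-7):
    # Binary cross-entropy averaged over all M pixels (cf. Formula (2)).
    return -sum(r * math.log(p + eps) + (1 - r) * math.log(1 - p + eps)
                for r, p in zip(y_true, y_pred)) / len(y_true)

def dice_loss(y_true, y_pred, eps=1e-7):
    # 1 - Dice coefficient; eps prevents division by zero (cf. Formula (3)).
    inter = sum(r * p for r, p in zip(y_true, y_pred))
    total = sum(y_true) + sum(y_pred)
    return 1.0 - (2.0 * inter + eps) / (total + eps)

def ce_dice_loss(y_true, y_pred, lam=0.5):
    # Assumed weighted combination controlled by lambda (cf. Formula (4)).
    return lam * ce_loss(y_true, y_pred) + (1 - lam) * dice_loss(y_true, y_pred)
```

Because the Dice term is a ratio of intersection to total foreground mass, a tiny changed region contributes to it as strongly as a large one, which is what counteracts the class imbalance that CE alone suffers from.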
2.5. Experimental Setup
2.5.1. Datasets
CDD: In 2018, Lebedev et al. [
63] created the Change Detection Dataset (CDD) and used it in a published paper. The dataset contains three types of images: synthetic images without relative object movement, synthetic images with small relative object movement, and real remote sensing images that change with the seasons. The experiment uses the real season-varying remote sensing images in the CDD, which comprise 10,000 training pairs, 3000 validation pairs, and 3000 test pairs. The satellite images are retrieved from Google Earth as 3-channel RGB images of 256 × 256 pixels, with ground resolution varying from 0.03 m/pixel to 1 m/pixel. The two instances of each image pair were usually acquired in different seasons, and the amount of variation was occasionally increased by manually adding objects. The change ground truth considers only the appearance or disappearance of objects between the two instances, irrespective of purely seasonal changes.
Figure 5 shows a sample plot of part of the CDD.
Table 1 summarizes the experimental datasets.
SZTAKI Dataset: a set of optical aerial image pairs provided by the Hungarian Society of Geodesy and Cartography [
64], comprising three sub-datasets: SZADA, TISZADOB, and ARCHIVE. Due to the poor quality of the remote sensing images in ARCHIVE, most studies use only the first two sub-datasets. The SZADA sub-dataset contains 7 pairs of remote sensing images collected from 2000 to 2005, each 952 × 640 pixels at a resolution of 1.5 m/pixel. The TISZADOB sub-dataset contains 5 pairs collected from 2000 to 2007, with the same size and resolution as SZADA. Therefore, this experiment uses a total of 12 remote sensing image pairs of 952 × 640 pixels at 1.5 m/pixel from the SZTAKI dataset. The 12 image pairs were cropped with overlap into blocks of 256 × 256 pixels, which were then randomly partitioned into training, validation, and testing sets in an 8:1:1 ratio. To augment the dataset, operations such as flipping and rotation were applied to the image pairs.
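The cropping and partitioning steps can be sketched as follows. The stride of 128 pixels (50% overlap) and the fixed shuffle seed are assumptions, since the paper states only the overlap, tile size, and 8:1:1 ratio.

```python
import random

def crop_positions(height, width, tile=256, stride=128):
    # Top-left corners of overlapping tile x tile crops; an extra crop is
    # appended so the bottom and right edges are always fully covered.
    ys = list(range(0, height - tile + 1, stride))
    xs = list(range(0, width - tile + 1, stride))
    if ys[-1] != height - tile:
        ys.append(height - tile)
    if xs[-1] != width - tile:
        xs.append(width - tile)
    return [(y, x) for y in ys for x in xs]

def split_8_1_1(blocks, seed=0):
    # Random 8:1:1 partition into training / validation / test sets.
    blocks = blocks[:]
    random.Random(seed).shuffle(blocks)
    n_train, n_val = int(len(blocks) * 0.8), int(len(blocks) * 0.1)
    return (blocks[:n_train],
            blocks[n_train:n_train + n_val],
            blocks[n_train + n_val:])

positions = crop_positions(952, 640)  # 28 overlapping blocks per image pair
train, val, test = split_8_1_1(positions)
```

With these assumed parameters, each 952 × 640 pair yields 28 blocks before flipping and rotation are applied.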
Figure 6 shows a sample plot of part of the SZTAKI dataset (wherein SZADA/2 denotes the second sample in the SZADA dataset and TISZADOB/3 denotes the third sample in the TISZADOB dataset).
2.5.2. Experimental Environment and Evaluation Criteria
To ensure the scientific rigor of the experiments, all experiments in this paper are carried out on the same desktop PC. The CPU is an Intel(R) Core(TM) [email protected] (manufacturer: Intel; Santa Clara, CA, USA) and the graphics card is an NVIDIA GeForce RTX 3070 8G (manufacturer: NVIDIA Corporation; Santa Clara, CA, USA); the software environment is Windows 10 64-bit, PyCharm 2021.2.3 (Community Edition), Anaconda3, Python 3.6, and TensorFlow 2.0.
The Adam algorithm [
65] was used as the optimizer to update the network parameters and speed up convergence. Common parameter update methods include the stochastic gradient descent (SGD) algorithm [
66], RMSprop algorithm [
67], Adam algorithm, etc. The Adam algorithm combines the advantages of AdaGrad [
68] and RMSProp: it uses first- and second-order moment estimates of the gradient with bias-correction factors, making learning-rate adjustment more stable and effectively alleviating problems caused by excessive gradient noise or sparsity. Careful consideration was given to the learning rate: too small a value slows parameter updates and risks trapping the model in a local minimum, while too large a value destabilizes the loss and prevents convergence. An appropriate learning rate of 0.0001 was therefore chosen. Additionally, the number of training epochs was set to 50 for the CDD and 200 for the SZTAKI dataset, and the batch size was set to 4 for both datasets.
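A minimal scalar sketch of the Adam update just described, using the paper's learning rate of 0.0001; the moment decay rates β1 = 0.9 and β2 = 0.999 are the common defaults and are an assumption, as the paper does not state them.

```python
import math

def adam_step(x, grad, m, v, t, lr=1e-4, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad       # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * grad ** 2  # second-moment (uncentered variance)
    m_hat = m / (1 - b1 ** t)          # bias-correction factors keep the
    v_hat = v / (1 - b2 ** t)          # estimates unbiased in early steps
    return x - lr * m_hat / (math.sqrt(v_hat) + eps), m, v

# Minimize f(x) = x^2 (gradient 2x) for a few steps starting from x = 1.
x, m, v = 1.0, 0.0, 0.0
for t in range(1, 101):
    x, m, v = adam_step(x, 2 * x, m, v, t)
```

Note how the effective step size is roughly lr regardless of the gradient's magnitude, since m_hat/sqrt(v_hat) normalizes it; this is the stability property the text refers to.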
To quantitatively evaluate the performance of neural network models on remote sensing change detection tasks, a set of standard and widely accepted evaluation metrics is employed: accuracy, recall, precision, and F1 score. Accuracy measures the proportion of correctly classified pixels; recall measures the proportion of truly changed pixels that are detected; precision measures the proportion of detected changed pixels that are truly changed; and the F1 score is the harmonic mean of precision and recall. Higher scores indicate better model performance.
where TP, TN, FP, and FN represent the number of true positives, true negatives, false positives, and false negatives, respectively.
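These metrics reduce to the standard confusion-matrix formulas; the counts in the usage line are illustrative only.

```python
def evaluate(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)   # correctly detected / all detected changes
    recall = tp / (tp + fn)      # correctly detected / all actual changes
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical counts: precision = 0.8, recall is about 0.889.
acc, p, r, f1 = evaluate(tp=80, tn=900, fp=20, fn=10)
```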
2.5.3. Comparison Methods
In this study, the proposed Siam-FAUnet change detection method is compared with several state-of-the-art methods, which include CDNet, FC-Siam-conc, FC-Siam-diff, IFN, DASNet, STANet, and SNUNet.
CDNet [
69] is a fully convolutional neural network built on the idea of stacked contracting and expanding blocks.
FC-Siam-conc [
70] is proposed based on the FCN model. A Siamese network performs dual-input feature extraction on the bi-temporal remote sensing images, and the extracted features are finally concatenated to obtain the change information.
FC-Siam-diff is proposed in [
70] and also uses the FCN as the backbone network. It differs from FC-Siam-conc in that the semantic restoration stage connects the feature maps by taking their difference, finally yielding the desired change image.
IFN [
44] is an encoder–decoder structure, and unlike U-Net, this model employs dense skip connections and implements implicit deep supervision in the decoder structure.
DASNet [
71] is proposed based on a dual attention mechanism to locate the changed areas and obtain more discriminant feature representations.
STANet [
72] is proposed based on a Siamese FCN that extracts the bi-temporal image feature maps, and designs a change detection self-attention module that updates the feature maps by exploiting spatial-temporal dependencies among individual pixels at different positions and times.
SNUNet [
73] is proposed based on the combination of a Siamese network and NestedUNet; it uses dense skip connections between the encoder and decoder and among decoder nodes.
4. Discussion
In this work, we propose a remote sensing image change detection method based on the Siam-FAUnet network, which uses an improved VGG16 to extract image features in the encoder, ASPP to extract multi-scale contextual features, and FAM to fuse the feature information extracted by the encoder in the decoder. The publicly available CDD and SZTAKI datasets are used for training and testing, and the evaluation metrics of the Siam-FAUnet model improve on both the baseline model and other state-of-the-art deep learning change detection methods. Meanwhile, visualization analysis verifies that the Siam-FAUnet model can effectively detect small change targets and overcome the challenges of unclear image boundaries and pseudo changes in the change regions.
Comparison and findings. To evaluate the effectiveness of ASPP and FAM, ablation experiments were conducted on the CDD and SZTAKI datasets. The results demonstrate that introducing either ASPP or FAM enhances the model's change detection performance, and integrating both further improves the network's overall evaluation metrics, even though deepening the feature extraction increases the classification difficulty. This improvement can be attributed to the enlarged receptive field provided by ASPP, which allows the model to make effective use of global image information, avoiding information loss and capturing multi-scale contextual information. Meanwhile, FAM fuses low-level features from the encoder into the decoder, facilitating the learning of semantic flow between features of varying resolutions and transferring semantic information from coarse features to high-resolution refined features.
To verify the superiority of the Siam-FAUnet model in solving the problems of blurred detection boundary, the missed detection of small targets and more pseudo changes in change detection, we conduct comparative experiments using seven state-of-the-art deep learning change detection methods on the CDD and SZTAKI datasets, with the same experimental environment and parameter settings as our proposed method. Our Siam-FAUnet approach achieves superior evaluation metrics and detection results on both datasets. Compared with CDNet [
69], FC-Siam-conc [
70], FC-Siam-diff [
70], IFN [
44], DASNet [
71], STANet [
72], and SNUNet [
73] models, Siam-FAUnet achieves the highest recall, reaching 93.56%, 73.79%, and 95.63% on the CDD and the two SZTAKI sub-datasets, respectively. A larger recall means fewer missed predictions and a lower miss rate, indicating that the Siam-FAUnet model is robust in reducing missed detections. As shown in
Figure 9 and
Figure 10, six pairs of images of changes in cars, roads, and buildings in the test set were selected to visualize the experimental results.
Figure 9 shows the visualization results of verifying the proposed model to solve the blurred detection boundary problem. From
Figure 9, we can see that the change regions detected by FC-Siam-conc and FC-Siam-diff have blurred boundaries, discontinuities, noise, and incomplete coverage of the change regions. CDNet reduces the noise but still suffers from blurred boundaries, internal holes, and incomplete detection. The boundaries detected by IFN and DASNet are smooth, but internal holes and discontinuous region boundaries remain. The boundaries detected by STANet are jagged and the detected change regions are incomplete, while the regions detected by SNUNet have blurred, unsmooth boundaries and incomplete detection. Siam-FAUnet gives the best visual result: the change boundaries are smooth and clear without jaggedness, there is little noise, the road changes are continuous and complete with less pseudo-change, and the result is closest to the ground truth.
Figure 10 shows the visualization results for validating the proposed model to solve the small target miss detection problem, and
Table 6 shows the comparison of the number of small targets detected by each model. From
Figure 10 and
Table 6, we can see that FC-Siam-conc and FC-Siam-diff perform poorly on small target detection, failing to detect most small targets and producing blurred boundaries and considerable noise. Although CDNet, DASNet, STANet, and SNUNet reduce the noise in the detected images, they can only detect a small portion of the small targets, with blurred boundaries. IFN can detect most small targets, but blurred, unsmooth detection boundaries and noise remain. In contrast, the algorithm proposed in this paper is superior to the other methods in detecting small target changes: it can detect changes such as cars and small buildings, and the change boundaries are smooth and clear, giving the best visual result.
Based on the experimental results of Siam-FAUnet and the seven other state-of-the-art methods on the CDD and SZTAKI datasets, every method exhibits better detection performance and evaluation metrics on the CDD than on the SZTAKI dataset. This discrepancy can be attributed to the CDD's inherent characteristics, including a wider range of diverse feature types and a larger sample size, which provide a richer and more representative dataset for training and evaluation. The seasonal differences between the before and after images of the CDD are large, yet the changes remain clearly visible because the ground truth considers only the appearance or disappearance of objects between the two instances, irrespective of purely seasonal changes. The SZTAKI dataset, by contrast, exhibits fragmented and less dense change areas with larger feature scales; the changes primarily comprise new built-up regions, building operations, the planting of large groups of trees, fresh plough-land, and ground work before building over. The dataset also contains small-scale changes that are difficult to capture, necessitating deeper features for accurate detection. Furthermore, the TISZADOB sub-dataset of SZTAKI outperforms the SZADA sub-dataset in detection efficacy and evaluation metrics, which can be attributed to the higher prevalence of small target changes in SZADA, posing greater challenges for change detection algorithms.