Article
Peer-Review Record

A Temporal-Reliable Method for Change Detection in High-Resolution Bi-Temporal Remote Sensing Images

Remote Sens. 2022, 14(13), 3100; https://doi.org/10.3390/rs14133100
by Fei Pan 1, Zebin Wu 1,*, Xiuping Jia 2, Qian Liu 1, Yang Xu 1 and Zhihui Wei 1
Reviewer 1:
Reviewer 2:
Reviewer 3:
Submission received: 9 May 2022 / Revised: 18 June 2022 / Accepted: 19 June 2022 / Published: 28 June 2022

Round 1

Reviewer 1 Report

The paper titled "A Temporal-Reliable Method for Change Detection in High Resolution Bi-temporal Remote Sensing Images" attempts to use a deep learning method for change detection purposes.

 

Review comments are as below:

 

1. Mostly the English is fine, but in some places it is odd and sentences are incomplete. The authors may do a thorough grammar check while submitting the modified version. Since there are no line numbers on the pages, I am not able to point out specific locations. For example, the last paragraph of the 2nd page has a lot of grammar issues.

 

 

2. In the first line of page 2, the authors have classified the change detection approaches into different categories. They need to include many more references here for each category. Some classical to recent papers addressing change detection issues at different spectral/spatial/radiometric resolutions may be provided for the benefit of readers.

 

3. It is confusing at times what the authors mean by fusion. In the remote sensing domain, FUSION refers to bringing the complementary information from two different data sources into a single data source. In this regard, the authors need to use different terminology if they mean something different. The process of fusion in the integration of downscaled and upscaled images needs to be revealed for the benefit of RS readers. The authors are from a Computer Science background and assume that all RS readers understand CNNs. Please provide an explanation.

 

4. In page 3, in the first para, they use a term "....novel objection function....". Authors need to change "objection" into "objective".

 

5. On page 3, the authors talk about the major contributions of their paper. Generally, such contributions should be moved to the discussion section. In the introduction they should only highlight the problems, issues and gaps of classical methods and CNN methods. Also, the sentence "The regularization term aims to enforce the bi-temporal images before and after exchanging the sequence to share similar features to realize the temporal reliable." is not clear. Some write-up is given in Section 4.2; however, if they can add a little more as a supplementary figure or table, it will add value.

 

6. On page 6, under Section 4.1, the authors say that the input has 6 channels. But in Figure 3, the authors show one colour image (RGB - 3 bands) and one B&W image (panchromatic - 1 band). In that case there are 4 channels (bands) in T1 and 4 channels in T2. It is not clear HOW these 8 channels are converted into the STAGE 1 input. What do they mean by extracted features? What is a feature map? Please elaborate. In RS, feature map has a different meaning!

 

7. Generally, bi-linear interpolation is used at similar resolutions for resampling, but how did the authors use this for upscaling/downscaling? Please provide a diagrammatic explanation (and keep this figure as a supplementary figure, if they wish).

 

8. The authors say that "As the spatial resolution decreases, the number of feature channels increases". They need to provide one FIGURE explaining this statement for the benefit of the readers. They also have to show figuratively the meaning of "The number of exchanging information across different resolutions in stage 2, stage 3, stage 4 is set to 1, 4, 3."

 

9. In the BCE, the authors mention the ground truth, i.e., y. It is not explained in Figure 3 nor in Section 4.1 how the authors are introducing the ground truth. Please add a few lines in Section 4.1 to avoid a gap.

 

10. Generally, in entropy measures we talk about frequency and probability measures. Here it is not clear what the authors are doing with y and y-hat.

 

11. In Equation 1, it is not clear what they mean by ground truth. Do they mean samples used for training the module?

 

12. In Equation 6, the authors should reveal the value range of t1(i,j) and t2(i,j) in the possibility map. Do they mean 0 to 1? Please confirm in the write-up.

 

13. In Equation 4, the authors mention that P represents the change map. But in the equation they have to state the values of P instead of the map. What are the values of P? What is the difference between P and t1(i,j)? Why do they use different notation for the prediction map? Please make it uniform.

 

14. In Section 5.1, the authors mention a dataset from Google Earth, but they did not mention how they obtained that data. Google Earth data is not real REMOTE SENSING DATA; it is just RGB data made for visual purposes only (not for any scientific analysis of radiometry). This is the major weakness of the study. The authors must use REAL REMOTE SENSING DATA and should download the data from ESA (Sentinel) and NASA (Landsat) sites. If the authors are confident that these data are correct, then they should reveal the WEBSITE and the proper steps involved in downloading the data. Did they do any atmospheric correction? Are there any previous studies which used these datasets?

 

15. What is the LEVIR-CD dataset? Please explain in detail. Why did they use this dataset?

 

16. In Figure 5, one can see a huge amount of change in the vegetation, but the authors have considered only man-made objects. If this is the case, then they should clearly mention these MAN-MADE CHANGES in the title (instead of generalised change detection).

 

17. It is observed that the authors have shown results from the area where they have given ground truth. Please show the CHANGE DETECTION OUTPUT over a bigger area where the authors have not given any ground truth. Please take completely different RS DATA, use the trained model, and show the output.

 

18. The scientific questions related to the approach are not revealed in the results. The authors have simply shown the different evaluation metrics without actually revealing the DIFFERENT OPTIMISATION MEASURES and the problems they faced.

 

 

Considering the above, the paper requires major revision.  

 

Author Response

Dear Editors and Reviewers:

Thank you very much for your comments concerning our manuscript entitled “A Temporal-Reliable Method for Change Detection in High Resolution Bi-temporal Remote Sensing Images” (ID: remotesensing-1741154). Those comments were all extremely valuable and very helpful for revising and improving our paper, as well as for providing important insights and significance to our research. We have addressed the comments carefully and have made corrections to our manuscript, which we hope will meet your approval.

Please find below our description of the changes made to the manuscript according to comments received (these changes have been highlighted in blue color in the revised version of the manuscript in order to facilitate their identification with regards to the previous version). We are indebted to the Editors and Reviewers for their outstanding comments and suggestions, which greatly helped us to improve the technical quality and presentation of our manuscript.

 

--------------------------------------------------------------------------------------------------------------

Reviewer #1:

  1. Reviewer’s comment:

Mostly the English is fine, but in some places it is odd and sentences are incomplete. The authors may do a thorough grammar check while submitting the modified version. Since there are no line numbers on the pages, I am not able to point out specific locations. For example, the last paragraph of the 2nd page has a lot of grammar issues.

Response :

Thanks for the reviewer's helpful advice. We have performed a comprehensive grammar check on the full text. At the same time, we also reviewed and corrected the grammar issues with the help of additional tools.

 

  2. Reviewer’s comment:

In the first line of page 2, the authors have classified the change detection approaches into different categories. They need to include many more references here for each category. Some classical to recent papers addressing change detection issues at different spectral/spatial/radiometric resolutions may be provided for the benefit of readers.

Response :

We appreciate the reviewer for pointing this out. According to the suggestion, we have added the related references:

“According to the different strategies of change detection, the change detection algorithms can be divided into the following categories: image algebra methods [7-10], classification methods [11-13], image transformation methods [14-16], and deep learning based methods [17-29].”

 

[12] Wu, C.; Du, B.; Cui, X.; Zhang, L. A post-classification change detection method based on iterative slow feature analysis and Bayesian soft fusion. Remote Sensing of Environment 2017, 199, 241–255.

[13] Wan, L.; Xiang, Y.; You, H. A post-classification comparison method for SAR and optical images change detection. IEEE Geoscience and Remote Sensing Letters 2019, 16, 1026–10.

[15] Liu, Z.; Li, G.; Mercier, G.; He, Y.; Pan, Q. Change detection in heterogeneous remote sensing images via homogeneous pixel transformation. IEEE Transactions on Image Processing 2017, 27, 1822–1834.

[16] Zhang, P.; Gong, M.; Su, L.; Liu, J.; Li, Z. Change detection based on deep feature representation and mapping transformation for multi-spatial-resolution remote sensing images. ISPRS Journal of Photogrammetry and Remote Sensing 2016, 116, 24–4.

 

  3. Reviewer’s comment:

It is confusing at times what the authors mean by fusion. In the remote sensing domain, FUSION refers to bringing the complementary information from two different data sources into a single data source. In this regard, the authors need to use different terminology if they mean something different. The process of fusion in the integration of downscaled and upscaled images needs to be revealed for the benefit of RS readers. The authors are from a Computer Science background and assume that all RS readers understand CNNs. Please provide an explanation.

Response:

Thanks for the reviewer's outstanding comment and advice. In deep learning, fusion mainly includes two operations: point-wise addition and concatenation. In particular, fusion here refers to the process of turning two different feature tensors into one feature tensor, which is different from the definition in remote sensing. We have therefore changed "fusion" to "tensor fusion" in our paper. The process of fusion is briefly introduced in Figure 3. As depicted in Figure 3 (a), the number of channels of the two features with lower spatial resolution is first modified by a convolution operation. Then the bilinear interpolation algorithm is used to upsample and keep the spatial resolution consistent. Finally, the three features are fused via element-wise summation. In Figure 3 (b), the process of upsampling is the same as that in Figure 3 (a), and we will introduce the process of downsampling. A convolution with a kernel size of 3 x 3 and a stride of 2 can be used to downsample and halve the spatial resolution; the detailed process can be found in response 7. At the same time, the convolution operation ensures that the number of channels is consistent with the feature of medium resolution. In Figure 3 (c), the process of downsampling is the same as in Figure 3 (b). We have added the explanation in the revised manuscript:

Figure 3. Tensor fusion module exchanging information across different resolutions. (a) illustrates the fusion at high resolution. (b) illustrates the fusion at medium resolution. (c) illustrates the fusion at low resolution. The down-sampling is implemented by a 3 x 3 convolution with stride 2. The up-sampling is implemented by bilinear interpolation.

“The feature representations can exchange across the different resolutions by the tensor fusion module as depicted in Figure 3. The process of tensor fusion from low resolution to high resolution is illustrated in Figure 3 (a). A convolution kernel of size 1 x 1 and stride 1 is applied first on the low-resolution representations to ensure that the number of channels is the same as that of the high-resolution representation. Then, the bilinear interpolation algorithm is adopted for up-sampling. The process of tensor fusion from high resolution to low resolution is illustrated in Figure 3 (c). A convolution kernel of size 3 x 3 and stride 2 is used once or twice to reduce the spatial resolution. After obtaining representations of the same spatial resolution, the features are fused via element-wise summation. The up-sampling and down-sampling operations in Figure 3 (b) are the same as those in Figure 3 (a) and Figure 3 (c). More details can be found in [48].”
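For RS readers who would like to see these operations concretely, the following minimal PyTorch sketch illustrates the kind of tensor fusion described above (channel matching by a 1 x 1 convolution, bilinear up-sampling, element-wise summation). The channel numbers and tensor sizes are illustrative assumptions only, not the exact configuration of our network:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Two feature tensors from different resolution branches (sizes are assumed for illustration).
high = torch.randn(1, 64, 64, 64)    # high-resolution branch: 64 channels, 64 x 64 pixels
low = torch.randn(1, 128, 32, 32)    # low-resolution branch: 128 channels, 32 x 32 pixels

# 1 x 1 convolution so that the low-resolution feature has the same number of channels
# as the high-resolution feature.
match_channels = nn.Conv2d(in_channels=128, out_channels=64, kernel_size=1, stride=1)
low_matched = match_channels(low)                                   # (1, 64, 32, 32)

# Bilinear interpolation brings the low-resolution feature to the same spatial size.
low_up = F.interpolate(low_matched, size=high.shape[2:], mode="bilinear", align_corners=False)

# Element-wise summation fuses the two representations into a single feature tensor.
fused = high + low_up                                               # (1, 64, 64, 64)
```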

 

  4. Reviewer’s comment:

In page 3, in the first para, they use a term "....novel objection function....". Authors need to change "objection" into "objective".

Response:

Many thanks for the reviewer’s helpful comments. We have corrected this in the revised manuscript:

“we propose a novel objective function to learn a temporal-reliable feature and propose an effective network to solve the lack of small changes for change detection in high-resolution bi-temporal remote sensing images.”

 

 

  5. Reviewer’s comment:

On page 3, the authors talk about the major contributions of their paper. Generally, such contributions should be moved to the discussion section. In the introduction they should only highlight the problems, issues and gaps of classical methods and CNN methods. Also, the sentence "The regularization term aims to enforce the bi-temporal images before and after exchanging the sequence to share similar features to realize the temporal reliable." is not clear. Some write-up is given in Section 4.2; however, if they can add a little more as a supplementary figure or table, it will add value.

Response:

Thanks for the reviewer's helpful advice. Regarding moving the contributions to the discussion section: it may be that we are in a different research direction from yours. In all the articles we have read, the contributions are summarized at the end of the introduction. Therefore, we have not modified this in the revised manuscript and hope you can understand.

In response to the reviewer’s second question, we have drawn another figure to help readers understand better and have explained it further while introducing the network architecture in Figure 3.

 

“The regular term aims to enforce the bi-temporal images before and after exchanging the sequence to share similar features to realize temporal reliability. As depicted in Figure 1, input1 and input2 are the results before and after exchanging the sequence of bi-temporal images, and output1 and output2 are the features extracted by our model. The goal of the regular term is to enable the same backbone network to capture the same difference information when the input is input1 and input2. The extracted features are called temporal-reliable features in our paper.”

Figure 2. The principle of the regular term in the objective function. The features of the red block are temporal-reliable and the features of the yellow block need to be corrected by the regular term in the objective function.

 

“The extracted features are output1 and output2 correspondingly. If the regular term is not included in the objective function, the extracted features are different when the input is input1 and input2. As depicted in Figure 4, the red and yellow blocks represent different features. The regular term aims to force the model to capture the same features regardless of whether the input is input1 or input2. The two blocks of the same red color represent the same features in Figure 4.”
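To make the role of the regular term more concrete, the following minimal PyTorch sketch shows the idea under simplifying assumptions (the small convolutional backbone below is a hypothetical stand-in, not our actual TRCD network): the same backbone processes the bi-temporal pair in both orders, and the regularization loss penalizes the difference between the two outputs.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in backbone; the real model is the TRCD network described in the paper.
backbone = nn.Sequential(
    nn.Conv2d(6, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(16, 1, kernel_size=1),
    nn.Sigmoid(),
)

t1 = torch.randn(1, 3, 256, 256)  # pre-change image (T1)
t2 = torch.randn(1, 3, 256, 256)  # post-change image (T2)

input1 = torch.cat([t1, t2], dim=1)  # original temporal sequence
input2 = torch.cat([t2, t1], dim=1)  # exchanged temporal sequence

output1 = backbone(input1)
output2 = backbone(input2)

# Regular term: enforce the outputs before and after exchanging the sequence to be similar,
# so that the extracted features are temporal-reliable.
reg_loss = torch.mean((output1 - output2) ** 2)
```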

 

  6. Reviewer’s comment:

On page 6, under Section 4.1, the authors say that the input has 6 channels. But in Figure 3, the authors show one colour image (RGB - 3 bands) and one B&W image (panchromatic - 1 band). In that case there are 4 channels (bands) in T1 and 4 channels in T2. It is not clear HOW these 8 channels are converted into the STAGE 1 input. What do they mean by extracted features? What is a feature map? Please elaborate. In RS, feature map has a different meaning!

Response:

Thanks very much for the outstanding comments and suggestions. The input contains two RGB (3-band) images, so the concatenated input has six channels. The B&W image (1 band) is the ground truth and is used to train the network model.

We are sorry for the confusion about extracted features and feature maps caused by our imprecise wording. A feature map is the result of convolving the input image through the neural network. For example, if the input layer is a grayscale image, there is only one feature map; if it is a color image, there are generally 3 feature maps (red, green and blue). There are many convolution kernels between layers. Each feature map in the previous layer is convolved with a convolution kernel and a feature map of the next layer is generated. Extracted features refer to the multiple feature maps of the same layer. We have checked the paper and corrected the unsuitable wording in the revised manuscript.
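As an illustration of these terms (using assumed image sizes and an assumed number of kernels, purely for demonstration), the sketch below shows how the two RGB images are concatenated into a 6-channel input and how one convolutional layer produces one feature map per kernel:

```python
import torch
import torch.nn as nn

# Two RGB images (3 bands each) acquired at times T1 and T2; the 256 x 256 size is illustrative.
t1 = torch.randn(1, 3, 256, 256)
t2 = torch.randn(1, 3, 256, 256)

# Concatenating along the channel dimension gives the 6-channel input of stage 1.
x = torch.cat([t1, t2], dim=1)         # shape: (1, 6, 256, 256)

# A convolutional layer with 32 kernels (an assumed number) produces 32 feature maps,
# i.e. one feature map per convolution kernel; together they form the "extracted features".
conv = nn.Conv2d(in_channels=6, out_channels=32, kernel_size=3, padding=1)
feature_maps = conv(x)                 # shape: (1, 32, 256, 256)
```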

 

  7. Reviewer’s comment:

Generally, bi-linear interpolation is used at similar resolutions for resampling, but how did the authors use this for upscaling/downscaling? Please provide a diagrammatic explanation (and keep this figure as a supplementary figure, if they wish).

Response:

Thanks for the reviewer's outstanding comments. In our paper, upsampling is achieved by bilinear interpolation and downsampling is achieved by convolution operation.

The bilinear interpolation algorithm can achieve arbitrary scaling of an image. This algorithm has corresponding implementations in many frameworks; for example, in PyTorch it can be called with F.interpolate(input, size, mode='bilinear').

The principle of bilinear interpolation is shown in the figure above: Q12, Q22, Q11 and Q21 are known, and the point to be interpolated is point P. First, the two points R1 and R2 are interpolated in the x-axis direction, and then the point P is interpolated according to R1 and R2.
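A minimal usage example of the call mentioned above (tensor sizes are arbitrary and only for demonstration):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 32, 32)

# Up-sample to twice the spatial resolution with bilinear interpolation.
x_up = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
print(x_up.shape)   # torch.Size([1, 3, 64, 64])

# The target size can also be given explicitly, so any scaling factor is possible.
x_any = F.interpolate(x, size=(100, 100), mode="bilinear", align_corners=False)
print(x_any.shape)  # torch.Size([1, 3, 100, 100])
```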

 

The following is a set of diagrams representing the process of downsampling; the size of the convolution kernel is 3, the stride is 2 and the padding is 1:

  

(1) Since the padding of the kernel is equal to 1, we need to add a border of zeros around the image.

(2) Kernel and image operation: point-wise multiplication followed by summation.

(3) Kernel moves 2 strides to the right.

(4) Kernel moves 2 strides down and starts again from the left.

(5) Kernel moves 2 strides to the right.

(6) Getting the result.

It is obvious that the resolution of the ‘image’ is twice that of the ‘result’.
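The same halving of the spatial resolution can be verified with a short PyTorch sketch (the channel numbers below are assumptions for illustration only):

```python
import torch
import torch.nn as nn

# A 3 x 3 convolution with stride 2 and padding 1 halves the width and height,
# as in the step-by-step example above.
x = torch.randn(1, 64, 64, 64)
down = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, stride=2, padding=1)
y = down(x)
print(y.shape)  # torch.Size([1, 128, 32, 32]) -- half the spatial resolution of x
```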

 

We believe that the process of upsampling and downsampling is not very relevant to the topic of the paper, so we do not intend to show the above process in the paper.

 

 

  8. Reviewer’s comment:

The authors say that "As the spatial resolution decreases, the number of feature channels increases". They need to provide one FIGURE explaining this statement for the benefit of the readers. They also have to show figuratively the meaning of "The number of exchanging information across different resolutions in stage 2, stage 3, stage 4 is set to 1, 4, 3."

Response:

Thanks for the reviewer's outstanding comments. The process of "As the spatial resolution decreases, the number of feature channels increases" has been shown in Figure 4.

Figure 4. Framework of the proposed TRCD. Input1 and input2 are combinations of bi-temporal images at different sequences. Features of different spatial resolutions will perform multiple information interactions in the same stage. The classification loss is calculated between output1 and change map. The regularization loss is calculated between output1 and output2.

 

    

Take stage 4 in Figure 4 as an example: "As the spatial resolution decreases" means that the width (W) and height (H) of the feature are decreasing from the first row to the last row, and "the number of feature channels increases" means that the channels (C) are increasing from the first row to the last row. As can be seen from top to bottom in the figure, the size of the block is getting smaller and the thickness of the block is getting bigger.

 

  

The process of "The number of exchanging information across different resolutions in stage 2, stage 3, stage 4 is set to 1, 4, 3" is also shown in Figure 4. Take stage 4 in Figure 4 as an example: the number of red boxes is the number of information exchanges across different resolutions. Therefore, there are 3 information interactions in stage 4. Similarly, there are 1 and 4 interactions in stage 2 and stage 3, respectively.

 

  9. Reviewer’s comment:

In the BCE, the authors mention the ground truth, i.e., y. It is not explained in Figure 3 nor in Section 4.1 how the authors are introducing the ground truth. Please add a few lines in Section 4.1 to avoid a gap.

Response:

Thanks for the reviewer's valuable advice. Our research method belongs to supervised learning and the training set is labeled. In ground truth, the value of 1 means the pixel has changed and the value of 0 means the pixel has not changed. We have added the explanation in revised manuscript as follows:

“The y(i,j) is the value of the ground truth located at (i,j) of the change map in Figure 3. A value of y(i,j) equal to 1 means a changed pixel pair at (i,j) and a value equal to 0 means an unchanged pixel pair at (i,j).”

 

  10. Reviewer’s comment:

Generally, in entropy measures we talk about frequency and probability measures. Here it is not clear what the authors are doing with y and y-hat.

Response:

Thanks for the reviewer's valuable advice. Binary cross entropy is a commonly used loss function in binary classification problems and is used to judge the quality of the prediction results of a binary classification model. In Equation 1, -log(y-hat) is the classification loss when y equals 1 and -log(1 - y-hat) is the classification loss when y equals 0. For the case where the label y is 1, if the predicted value y-hat approaches 1, then the value of the loss function approaches 0. Conversely, if the predicted value y-hat is close to 0, then the value of the loss function becomes very large, which is consistent with the properties of the log function. We have added the explanation in the revised manuscript:

 

“where y stands for the ground truth and y-hat stands for the probability map. The value of y is 0 or 1 and the value range of y-hat is 0-1. The term -log(y-hat) is the classification loss when y equals 1 and -log(1 - y-hat) is the classification loss when y equals 0.”
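For readers who prefer a numerical illustration, the following sketch (with made-up pixel values) computes the pixel-wise binary cross entropy both explicitly and with the built-in PyTorch call:

```python
import torch
import torch.nn.functional as F

# Predicted change probabilities y_hat (values in (0, 1)) and binary ground truth y.
y_hat = torch.tensor([[0.9, 0.2],
                      [0.7, 0.1]])
y = torch.tensor([[1.0, 0.0],
                  [1.0, 0.0]])   # 1 = changed pixel, 0 = unchanged pixel

# Explicit form: -[y * log(y_hat) + (1 - y) * log(1 - y_hat)], averaged over all pixels.
bce_manual = -(y * torch.log(y_hat) + (1 - y) * torch.log(1 - y_hat)).mean()

# The same value using the built-in functional call.
bce_builtin = F.binary_cross_entropy(y_hat, y)
print(bce_manual.item(), bce_builtin.item())
```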

 

  11. Reviewer’s comment:

In Equation 1, it is not clear what they mean by ground truth. Do they mean samples used for training the module?

Response:

We greatly appreciate the reviewer for pointing this out. The training samples include the bi-temporal images and the change maps. Yes, the ground truth is the change map, which is used to train our model.

 

  12. Reviewer’s comment:

In Equation 6, the authors should reveal the value range of t1(i,j) and t2(i,j) in the possibility map. Do they mean 0 to 1? Please confirm in the write-up.

Response:

Thanks for the reviewer's helpful advice. The value ranges of t1(i,j) and t2(i,j) are both 0-1. We have confirmed this in the revised manuscript:

“As depicted in formula (6), the regular term uses the $L2$ norm and aims to enforce the features under the different sequences to be similar. The smaller the regular term, the more similar the extracted features, and thereby the extracted features are temporally reliable. The value ranges of t1(i,j) and t2(i,j) are both 0-1.”

 

  13. Reviewer’s comment:

In Equation 4, the authors mention that P represents the change map. But in the equation they have to state the values of P instead of the map. What are the values of P? What is the difference between P and t1(i,j)? Why do they use different notation for the prediction map? Please make it uniform.

Response:

Thanks for the reviewer's outstanding advice. The value of P is equal to t1(i,j). As suggested, we have revised Equation 6 and made the notation uniform in the revised manuscript:

“The value of D is between 0 and 1 and we aim to maximize D.”

 

  14. Reviewer’s comment:

In Section 5.1, the authors mention a dataset from Google Earth, but they did not mention how they obtained that data. Google Earth data is not real REMOTE SENSING DATA; it is just RGB data made for visual purposes only (not for any scientific analysis of radiometry). This is the major weakness of the study. The authors must use REAL REMOTE SENSING DATA and should download the data from ESA (Sentinel) and NASA (Landsat) sites. If the authors are confident that these data are correct, then they should reveal the WEBSITE and the proper steps involved in downloading the data. Did they do any atmospheric correction? Are there any previous studies which used these datasets?

Response :

Thanks for the reviewer's outstanding advice. Our research direction is change detection for optical remote sensing images. The CDD dataset was released in [49] and is available for public access at: https://drive.google.com/file/d/1GX656JqqOyBi_Ef0w65kDGVto-nHrNs9. The CDD dataset is a popular public dataset for change detection in high-resolution bi-temporal remote sensing images and has been used in many previous studies [17,19,21,22,24,25,26], so we are confident that these data are correct. We have revealed the website in the revised manuscript:

“The released CDD dataset contains 10000 training pairs, 3000 validation pairs and 3000 testing pairs. The CDD dataset can be downloaded at: https://drive.google.com/file/d/1GX656JqqOyBi_Ef0w65kDGVto-nHrNs9.”

 

  15. Reviewer’s comment:

What is the LEVIR-CD dataset? Please explain in detail. Why did they use this dataset?

Response:

Thanks for the reviewer's outstanding advice. The dataset covers various types of buildings, such as villa residences, tall apartments, small garages and large warehouses. Most of the changes are due to the construction of new buildings. Therefore, this dataset can better verify the robustness of our proposed method, and it is also used in many previous studies [22, 23, 25, 26]. We have explained this in the revised manuscript:

 

“The changes in the LEVIR-CD dataset involve a variety of buildings. As depicted on the right of Figure 6, the changed buildings include villa residences, large warehouses, etc. It is worth emphasizing that most of the changes are formed by the addition of new buildings. The dataset is randomly split into three parts: 445 pairs of images for training, 64 pairs of images for validating and 128 pairs of images for testing. The LEVIR-CD is available for public access at: https://www.google.com/permissions/geoguidelines”

 

  16. Reviewer’s comment:

In Figure 5, one can see a huge amount of change in the vegetation, but the authors have considered only man-made objects. If this is the case, then they should clearly mention these MAN-MADE CHANGES in the title (instead of generalised change detection).

Response:

Thanks for the reviewer's valuable advice. The vegetation is different because of seasonal changes, and changes caused by season differences are called pseudo-changes in our paper. We have added the explanation in the revised manuscript:

Figure 6. The left images are selected from the CDD dataset and the changes are man-made. The right images are selected from the LEVIR dataset and most changes are formed by the addition of new buildings. (a) Prechange, (b) Postchange, (c) Ground Truth.

 

As depicted on the left of Figure 6, the dataset contains a large number of pseudo-changes caused by season differences and small region changes such as cars.

 

  17. Reviewer’s comment:

It is observed that the authors have shown results from the area where they have given ground truth. Please show the CHANGE DETECTION OUTPUT over a bigger area where the authors have not given any ground truth. Please take completely different RS DATA, use the trained model, and show the output.

Response:

Thanks for the reviewer's outstanding comments. Our research direction is change detection for optical remote sensing images. As noted in comment 14, the datasets we used are optical remote sensing images, which differ from the REAL REMOTE SENSING DATA the reviewer refers to. Considering our research direction, we do not think it is necessary to retrain the model on completely different RS DATA, and we hope the reviewer can understand. At the same time, the size of the released images after cropping is 256 x 256 and we cannot download the images before cropping, so unfortunately we have no way to display a larger area. In Figure 7, (a) and (b) represent the bi-temporal images and (c) represents the ground truth. The images shown are selected from the test set. In order to visualize the prediction results of the model more clearly, we re-visualized the results. We have optimized this in the revised manuscript:

White means true positive, black means true negative, green means false positive and red means false negative.

Figure 7. Qualitative comparison in terms of the detected changes from image t1 to image t2 on the CDD dataset. TP is shown in white, TN is shown in black, FP is shown in green and FN is shown in red. (a) Prechange, (b) Postchange, (c) Ground Truth, (d) FC-siam-conc, (e) UNet++MSOF, (f) IFN, (g) DASNet, (h) Ours, (i) Attention Map

Figure 8. Qualitative comparison including the detected changes from image t1 to image t2 and the changes from image t2 to image t1 on the CDD dataset. The odd and even rows represent the detection results from image t1 to image t2 and from image t2 to t1, respectively. TP is shown in white, TN is shown in black, FP is shown in green and FN is shown in red. (a) Prechange, (b) Postchange, (c) Ground Truth, (d) FC-siam-diff, (e) UNet++MSOF, (f) IFN, (g) DCFF-Net, (h) Ours, (i) Attention Map

Figure 9. Qualitative comparison in terms of the detected changes from image t1 to image t2 on the LEVIR dataset. TP is shown in white, TN is shown in black, FP is shown in green and FN is shown in red. (a) Prechange, (b) Postchange, (c) Ground Truth, (d) FC-siam-diff, (e) UNet++MSOF, (f) IFN, (g) DASNet, (h) STANet, (i) Ours, (j) Attention Map

Figure 10. Qualitative comparison including the detected changes from image t1 to image t2 and the changes from image t2 to image t1 on the LEVIR dataset. The odd and even rows represent the detection results from image t1 to image t2 and from image t2 to t1, respectively. TP is shown in white, TN is shown in black, FP is shown in green and FN is shown in red. (a) Prechange, (b) Postchange, (c) Ground Truth, (d) FC-siam-diff, (e) UNet++MSOF, (f) IFN, (g) DCFF-Net, (h) Ours, (i) Attention Map
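For completeness, a possible way to produce this white/black/green/red visualization is sketched below (this is an illustrative NumPy snippet, not the plotting code actually used for Figures 7-10):

```python
import numpy as np

def color_code(pred, gt):
    """pred and gt are binary H x W arrays; returns an H x W x 3 RGB visualization."""
    vis = np.zeros((*pred.shape, 3), dtype=np.uint8)
    vis[(pred == 1) & (gt == 1)] = [255, 255, 255]  # TP: white
    vis[(pred == 0) & (gt == 0)] = [0, 0, 0]        # TN: black
    vis[(pred == 1) & (gt == 0)] = [0, 255, 0]      # FP: green
    vis[(pred == 0) & (gt == 1)] = [255, 0, 0]      # FN: red
    return vis

# Example with random masks of the same size as the released 256 x 256 image tiles.
pred = np.random.randint(0, 2, (256, 256))
gt = np.random.randint(0, 2, (256, 256))
vis = color_code(pred, gt)
```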

 

  18. Reviewer’s comment:

The scientific questions related to the approach are not revealed in the results. The authors have simply shown the different evaluation metrics without actually revealing the DIFFERENT OPTIMISATION MEASURES and the problems they faced.

Response:

Thanks for the reviewer's outstanding advice. As suggested, we have added the related analysis in revised manuscript:

Our algorithm is significantly more advanced than other SOTA algorithms for the change detection of small targets and edges. The explanation for the satisfying detection of object edges and small targets is that our designed framework can extract finer features. The information interaction between features with different spatial resolutions enhances the discriminative ability of the features.

 

The highest Precision, Recall, F1 and OA are achieved by our method and the performance is much better than that of the other methods. The benchmark methods compared are all fusion-based, and the temporal relationship between the bi-temporal images is not considered during image or feature fusion, resulting in an ideal change map being obtained only under one of the temporal sequences. After adding a regular term to the objective function in our method, temporal-reliable features can be obtained. Thus, bi-temporal images with different sequences obtain similar feature outputs in our model. For systematic comparisons, the different prediction change maps and attention maps are displayed in Figure 8. An image pair is tested twice, before and after exchanging the sequence of the bi-temporal images. The other fusion-based methods are extremely sensitive to the sequence of the bi-temporal images, while our method is robust. As depicted in Figure 8, our proposed method can obtain similar change maps for the raw bi-temporal images with different sequences. Besides, the attention map indicates that the confidence of the CD results is very high.

 

At this time, other fusion-based methods almost fail while our method shows strong robustness. It is worth emphasizing that the main contribution to obtaining almost the same change map for bi-temporal images of different temporal sequences is the regular term in the objective function, so that the model learns the ability to extract temporal-reliable features during the iterative update process.

 

Last but not least, we gratefully thank the reviewer again for his/her outstanding comments and suggestions, which greatly helped us to improve the technical quality and presentation of our manuscript.

 

--------------------------------------------------------------------------------------------------------------

 

Author Response File: Author Response.docx

Reviewer 2 Report

The authors propose a method to learn temporal features for change detection, which leverages a custom objective function. 

The method aims to determine whether the pixel is changed or unchanged in bi-temporal images.

The change detection network processes a pair of bi-temporal images and generates a binary change map which has the same spatial resolution as the input.

In the encoder-decoder architecture, networks firstly learn low-resolution representations and recover high-resolution representations in the following steps.

A regularization term in the objective function is devised to enforce the extracted features before and after exchanging sequences of bi-temporal images to be similar to each other.

Experiments are performed on two public datasets and showcase the potential of the method.

The paper is interesting and well-structured. Major concerns are the following:

- Related work: The authors seem to neglect recent change point detection methods for high-dimensional time series data, which should be incorporated (see DOI: 10.1007/s10489-021-02532-x and 10.1109/BigData52589.2021.9671962)

- Method: I think that it is really important to have a clear description of the method and emphasize its novelty.  

From my understanding, Figure 1 shows an existing method that inspired the proposed method, which is actually shown in Figure 3.

The caption in Figure 3 should briefly describe the conceptual workflow of the method.

- Experiments: 1) The ingestion pipeline is not fully clear. Changes are initially highlighted at a pixel level. 

Are pixel-level predictions combined to label a full image as change/not-change?

2) If a change is predicted, does the change indicate a significant difference for Image 2 with respect to Image 1?

3) Measures of deviation such as the standard deviation should be reported if available, in order to assess the model's predictions.

Minor issues:

- Figure 5: There is a typo: "Thruth."

- "Regular term" : is it regularization?

I encourage that the authors prepare a revised version of the manuscript.

Author Response

Dear Editors and Reviewers:

Thank you very much for your comments concerning our manuscript entitled “A Temporal-Reliable Method for Change Detection in High Resolution Bi-temporal Remote Sensing Images” (ID: remotesensing-1741154). Those comments were all extremely valuable and very helpful for revising and improving our paper, as well as for providing important insights and significance to our research. We have addressed the comments carefully and have made corrections to our manuscript, which we hope will meet your approval.

Please find below our description of the changes made to the manuscript according to comments received (these changes have been highlighted in blue color in the revised version of the manuscript in order to facilitate their identification with regards to the previous version). We are indebted to the Editors and Reviewers for their outstanding comments and suggestions, which greatly helped us to improve the technical quality and presentation of our manuscript.

 

-------------------------------------------------------------------------------------------------------

Reviewer #2:

  1. Reviewer’s comment:

The authors propose a method to learn temporal features for change detection, which leverages a custom objective function. The method aims to determine whether a pixel is changed or unchanged in bi-temporal images. The change detection network processes a pair of bi-temporal images and generates a binary change map which has the same spatial resolution as the input. In the encoder-decoder architecture, networks firstly learn low-resolution representations and recover high-resolution representations in the following steps. A regularization term in the objective function is devised to enforce the extracted features before and after exchanging the sequences of the bi-temporal images to be similar to each other. Experiments are performed on two public datasets and showcase the potential of the method. The paper is interesting and well-structured.

Response:

We gratefully thank the reviewer for the positive feedback and for his/her accurate summary of the main contributions of our work. We have carefully revised the paper following all the suggestions kindly provided by the reviewer.

 

  2. Reviewer’s comment:

The authors seem to neglect recent change point detection methods for high-dimensional time series data, which should be incorporated (see DOI: 10.1007/s10489-021-02532-x and 10.1109/BigData52589.2021.9671962).

Response:

We appreciate the reviewer for pointing this out. According to the suggestion, we have added the references in revised manuscript:

 

“1) the performance is extremely sensitive to the sequence of the bi-temporal images and the robustness is extremely poor with respect to different sequences of bi-temporal images. In change point detection tasks, whether for multivariate time series [45] or high-dimensional time series [46], the time series usually contain complex correlations. The sequences of the bi-temporal images are equally critical in CD tasks, and there is a certain correlation.”

 

[45] Du, H.; Duan, Z. Finder: A novel approach of change point detection for multivariate time series. Applied Intelligence 2022, 52, 2496–2509.

[46] Faber, K.; Corizzo, R.; Sniezynski, B.; et al. WATCH: Wasserstein Change Point Detection for High-Dimensional Time Series Data. In Proceedings of the 2021 IEEE International Conference on Big Data (Big Data), 2021; pp. 4450–4459.

 

  3. Reviewer’s comment:

I think that it is really important to have a clear description of the method and emphasize its novelty. From my understanding, Figure 1 shows an existing method that inspired the proposed method, which is actually shown in Figure 3. The caption in Figure 3 should briefly describe the conceptual workflow of the method.

Response:

Thanks very much for the outstanding comments and suggestions. Yes, Figure 2 shows the architecture of the model in [42]. The structure of HRNet is introduced first, which aims to give readers a better understanding of the backbone network before reading our proposed method. The backbone network of our proposed method is improved from Figure 2. Specifically, the interaction of information between different resolutions is more frequent. In HRNet, the information interaction between features of different resolutions is only carried out in different stages. In our improved backbone network, features of different resolutions perform multiple information interactions within the same stage. We have added the description of the method and emphasized its novelty in the revised manuscript:

 

“Our framework receives the raw image pair and produces the binary change map, which is an end-to-end architecture. In HRNet, the information interaction between features of different resolutions is only carried out in different stages. In our improved backbone network, features of different resolutions perform multiple information interactions within the same stage. It is worth emphasizing that our improved network architecture benefits the extraction of semantically richer and spatially more precise information. The network architecture is illustrated in Figure 4.”

Figure 4. Framework of the proposed TRCD. Input1 and input2 are combinations of bi-temporal images at different sequences. Features of different spatial resolutions will perform multiple information interactions in the same stage. The classification loss is calculated between output1 and change map. The regularization loss is calculated between output1 and output2.

 

  4. Reviewer’s comment:

The ingestion pipeline is not fully clear. Changes are initially highlighted at a pixel level. Are pixel-level predictions combined to label a full image as change/not-change?

Response:

Thanks for the reviewer's outstanding and valuable comments. If we highlighted the changed pixels directly in the bi-temporal images, it would overwrite the original pixels and result in the loss of important information in the bi-temporal images. In order to visualize the prediction results of the model more clearly, we re-visualized the results: white means true positive, black means true negative, green means false positive and red means false negative. We have optimized this in the revised manuscript:

 

Figure 7. Qualitative comparison in terms of the detected changes from image t1 to image t2 on the CDD dataset. TP is shown in white, TN is shown in black, FP is shown in green and FN is shown in red. (a) Prechange, (b) Postchange, (c) Ground Truth, (d) FC-siam-conc, (e) UNet++MSOF, (f) IFN, (g) DASNet, (h) Ours, (i) Attention Map

Figure 8. Qualitative comparison including the detected changes from image t1 to image t2 and the changes from image t2 to image t1 on the CDD dataset. The odd and even rows represent the detection results from image t1 to image t2 and from image t2 to t1, respectively. TP is shown in white, TN is shown in black, FP is shown in green and FN is shown in red. (a) Prechange, (b) Postchange, (c) Ground Truth, (d) FC-siam-diff, (e) UNet++MSOF, (f) IFN, (g) DCFF-Net, (h) Ours, (i) Attention Map

Figure 9. Qualitative comparison in terms of the detected changes from image t1 to image t2 on the LEVIR dataset. TP is shown in white, TN is shown in black, FP is shown in green and FN is shown in red. (a) Prechange, (b) Postchange, (c) Ground Truth, (d) FC-siam-diff, (e) UNet++MSOF, (f) IFN, (g) DASNet, (h) STANet, (i) Ours, (j) Attention Map

Figure 10. Qualitative comparison including the detected changes from image t1 to image t2 and the changes from image t2 to image t1 on the LEVIR dataset. The odd and even rows represent the detection results from image t1 to image t2 and from image t2 to t1, respectively. TP is shown in white, TN is shown in black, FP is shown in green and FN is shown in red. (a) Prechange, (b) Postchange, (c) Ground Truth, (d) FC-siam-diff, (e) UNet++MSOF, (f) IFN, (g) DCFF-Net, (h) Ours, (i) Attention Map

 

  5. Reviewer’s comment:

If a change is predicted, does the change indicate a significant difference for Image 2 with respect to Image 1?

Response:

Thanks for the reviewer's outstanding comments. First of all, our answer to this question is yes. The CDD dataset contains a large number of pseudo-changes caused by season differences, but the real changes are all man-made. The LEVIR dataset contains changes in various buildings. We have added the description in the revised manuscript:

Figure 6. The left images are selected from the CDD dataset and the changes are man-made. The right images are selected from the LEVIR dataset and most changes are formed by the addition of new buildings. (a) Prechange, (b) Postchange, (c) Ground Truth.

 

  6. Reviewer’s comment:

Measures of deviation such as standard deviation should be reported if available, in order to assess model's prediction.

Response:

Thanks for the reviewer's outstanding comment. In hyperspectral image classification tasks, the training and testing sets are randomly divided according to a certain ratio. Due to the randomness of the data division, multiple experiments are required to ensure the fairness of the experiments. Different from the random division of hyperspectral classification datasets, the division into training, validation and test sets used in our paper was fixed when the datasets were released. The released CDD dataset contains 10000 training pairs, 3000 validation pairs and 3000 testing pairs. The released LEVIR-CD is split into three parts: 70% of the samples for training, 10% for validation and 20% for testing. We only need to adopt the same data division to conduct the experiments. Besides, we have set the random seed, which fixes the training result of the model on the GPU every time. Last but not least, the pretrained weights in [42] are used as the initialization weights of our model. Therefore, our experimental results are not random and are reproducible. We have added the explanation in the revised manuscript:

 

“For the CDD dataset, we obtain 10000 pairs of images with a size of 256 x 256 for training, 3000 pairs of images for validating and 3000 pairs of images for testing. For the LEVIR-CD dataset, we obtain 7120 pairs of images with a size of 256 x 256 for training, 1024 pairs of images for validating and 2048 pairs of images for testing. The division of the data sets was fixed when the data sets were released to the public. The pretrained weights in [48] are used as the initialization weights of our model. Therefore, our experiments are stable and reproducible.”
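As an illustration of what fixing the random seed typically involves in PyTorch (the exact seed value and flags used for our experiments are not listed here, so the snippet below is only a hedged sketch of the general practice):

```python
import random
import numpy as np
import torch

def set_seed(seed: int = 42):
    """Fix all relevant random number generators so that training runs are repeatable."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed(42)  # 42 is an arbitrary example value, not necessarily the seed used in the paper
```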

 

  7. Reviewer’s comment:

Figure 5: There is a typo: "Thruth". "Regular term": is it regularization?

Response:

Many thanks for the reviewer’s helpful comments. We use "regular term" when referring to the objective function and "regularization loss" when calculating the loss. We have revised the expression and carefully checked it throughout the manuscript.

 

Last but not least, we gratefully thank the reviewer again for his/her outstanding comments and suggestions, which greatly helped us to improve the technical quality and presentation of our manuscript.

 

-------------------------------------------------------------------------------------------------------

Author Response File: Author Response.docx

Reviewer 3 Report

I have a few but major comments. 

The authors should compare further against more recent SOTA models and present attention maps to increase the value of the work.

Author Response

Dear Editors and Reviewers:

Thank you very much for your comments concerning our manuscript entitled “A Temporal-Reliable Method for Change Detection in High Resolution Bi-temporal Remote Sensing Images” (ID: remotesensing-1741154). Those comments were all extremely valuable and very helpful for revising and improving our paper, as well as for providing important insights and significance to our research. We have addressed the comments carefully and have made corrections to our manuscript, which we hope will meet your approval.

Please find below our description of the changes made to the manuscript according to comments received (these changes have been highlighted in blue color in the revised version of the manuscript in order to facilitate their identification with regards to the previous version). We are indebted to the Editors and Reviewers for their outstanding comments and suggestions, which greatly helped us to improve the technical quality and presentation of our manuscript.

 

-------------------------------------------------------------------------------------------------------

Reviewer #3:

  1. Reviewer’s comment:

The authors should compare further against more recent SOTA models and present attention maps to increase the value of the work.

Response:

Thanks for the reviewer's outstanding comment and advice. We have added a more recent SOTA model called DCFF-Net [22], which was proposed in November 2021, for comparison. The attention maps are provided in the result comparison.

Table II. Evaluation results of TRCD and other advanced algorithms on CDD with 6000 pairs of images.

Method          Pre.     Rec.     F1       OA
FC-EF           58.62    30.56    40.18    89.26
FC-Siam-conc    88.58    46.27    60.78    92.96
FC-Siam-diff    89.51    56.85    69.53    94.12
UNet++MSOF      45.96    56.47    50.68    87.03
IFN             84.69    52.94    65.15    93.32
DCFF-Net        80.38    66.28    72.66    93.84
TRCD/w          94.60    88.13    91.25    97.91

 

 

Table IV. Evaluation results of TRCD and other advanced algorithms on LEVIR with 4096 pairs of images.

Method          Precision   Recall    F1       OA
FC-EF           48.77       41.84     45.04    95.70
FC-Siam-conc    73.74       42.22     53.70    96.93
FC-Siam-diff    80.90       44.24     57.20    97.21
UNet++MSOF      51.76       42.55     46.70    95.91
IFN             80.86       42.82     55.99    97.16
DCFF-Net        89.64       69.13     78.06    98.02
TRCD/w          88.43       81.10     84.60    98.50
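For readers unfamiliar with the reported measures, the conventional definitions of Precision, Recall, F1 and OA from the confusion-matrix counts are sketched below (we assume the paper follows these standard formulas; the example counts are arbitrary):

```python
def change_detection_metrics(tp: int, fp: int, fn: int, tn: int):
    """Standard Precision, Recall, F1 and Overall Accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    oa = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, oa

# Arbitrary example counts, not taken from the experiments in the paper.
print(change_detection_metrics(tp=900, fp=50, fn=120, tn=8930))
```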

 

Figure 7. Qualitative comparison in terms of the detected changes from image t1 to image t2 on the CDD dataset. TP is shown in white, TN is shown in black, FP is shown in green and FN is shown in red. (a) Prechange, (b) Postchange, (c) Ground Truth, (d) FC-siam-conc, (e) UNet++MSOF, (f) IFN, (g) DASNet, (h) Ours, (i) Attention Map

Figure 8. Qualitative comparison including the detected changes from image t1 to image t2 and the changes from image t2 to image t1 on the CDD dataset. The odd and even rows represent the detection results from image t1 to image t2 and from image t2 to t1, respectively. TP is shown in white, TN is shown in black, FP is shown in green and FN is shown in red. (a) Prechange, (b) Postchange, (c) Ground Truth, (d) FC-siam-diff, (e) UNet++MSOF, (f) IFN, (g) DCFF-Net, (h) Ours, (i) Attention Map

Figure 9. Qualitative comparison in terms of the detected changes from image t1 to image t2 on the LEVIR dataset. TP is shown in white, TN is shown in black, FP is shown in green and FN is shown in red. (a) Prechange, (b) Postchange, (c) Ground Truth, (d) FC-siam-diff, (e) UNet++MSOF, (f) IFN, (g) DASNet, (h) STANet, (i) Ours, (j) Attention Map

Figure 10. Qualitative comparison including the detected changes from image t1 to image t2 and the changes from image t2 to image t1 on the LEVIR dataset. The odd and even rows represent the detection results from image t1 to image t2 and from image t2 to t1, respectively. TP is shown in white, TN is shown in black, FP is shown in green and FN is shown in red. (a) Prechange, (b) Postchange, (c) Ground Truth, (d) FC-siam-diff, (e) UNet++MSOF, (f) IFN, (g) DCFF-Net, (h) Ours, (i) Attention Map

 

 

Last but not least, we gratefully thank the reviewer again for his/her outstanding comments and suggestions, which greatly helped us to improve the technical quality and presentation of our manuscript.

 

-------------------------------------------------------------------------------------------------------

 

Author Response File: Author Response.docx

Round 2

Reviewer 1 Report

Authors have incorporated suggested changes. 

 

My only concern is about the method. Most remote sensing researchers may not really know the intricacies of deep learning. In order to make this work more citable, I request the authors to give a good explanation of the fundamental processes involved in the network.

 

For example, give some example diagrams about how the downsampling is done and how the upsampling is done.

How does the tensor fusion work? Give some more explanation with 10x10 raster data.

Figure 5 is not clear at all. Please give some workable example with some 10x10 raster pixel values, if possible. What is "relu"?

 

 

Author Response

Dear Editors and Reviewers:

Thank you very much for your comments concerning our manuscript entitled “A Temporal-Reliable Method for Change Detection in High Resolution Bi-temporal Remote Sensing Images” (ID: remotesensing-1741154). Those comments were all extremely valuable and very helpful for revising and improving our paper, as well as for providing important insights and significance to our research. We have addressed the comments carefully and have made corrections to our manuscript, which we hope will meet your approval.

Please find below our description of the changes made to the manuscript according to comments received (these changes have been highlighted in blue color in the revised version of the manuscript in order to facilitate their identification with regards to the previous version). We are indebted to the Editors and Reviewers for their outstanding comments and suggestions, which greatly helped us to improve the technical quality and presentation of our manuscript.

 

-------------------------------------------------------------------------------------------------------

Reviewer #1:

  1. Reviewer’s comment:

My only concern is about the method. Most remote sensing researchers may not really know the intricacies of deep learning. In order to make this work more citable, I request the authors to give a good explanation of the fundamental processes involved in the network.

 

For example, give some example diagrams about how the downsampling is done and how the upsampling is done.

 

How does the tensor fusion work? Give some more explanation with 10x10 raster data.

 

Figure 5 is not clear at all. Please give some workable example with some 10x10 raster pixel values, if possible. What is "relu"?

 

Response:

Thanks for the reviewer's helpful advice. We have added an explanation of the fundamental processes of up-sampling, down-sampling and tensor fusion in the revised manuscript. In order to make our work more clearly understood by most remote sensing researchers, we have added the relevant references [51, 52]. In [51], up-sampling algorithms are thoroughly compared. In [52], convolution arithmetic is described in detail using a number of graphs and is very helpful for understanding the process of down-sampling. The relu (rectified linear unit) is an activation function commonly used in artificial neural networks. The formula is as follows:

relu(x) = max(0, x)

The fundamental processes involved in the network can be seen in Figure 4. The up-sampling is implemented by an interpolation algorithm [51] and the down-sampling is achieved by a 2D convolution [52]. The tensor fusion is illustrated in Figure 4 (c); the two inputs denote the values of the feature tensors. When the fusion weights are equal to 1, the fusion reduces to element-wise summation. Besides, the weights can be made learnable by concatenating the two feature tensors and applying a convolution with a kernel size of 1 x 1.

 

Figure 4. The fundamental processes involved in the network. (a) The diagram of up-sampling. (b) The diagram of down-sampling. (c) The diagram of tensor fusion.

 

[51] Kolarik, M.; Burget, R.; Riha, K. Upsampling algorithms for autoencoder segmentation neural networks: A comparison study. In Proceedings of the 2019 11th International Congress on Ultra Modern Telecommunications and Control Systems and Workshops (ICUMT), 2019; pp. 1–5.

[52] Dumoulin, V.; Visin, F. A guide to convolution arithmetic for deep learning. arXiv preprint arXiv:1603.07285, 2016.
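To complement the diagrams in Figure 4, the following minimal PyTorch sketch puts the three fundamental operations side by side (up-sampling by bilinear interpolation, down-sampling by a strided 2D convolution, and tensor fusion with fixed or learnable weights). All tensor sizes and channel numbers are illustrative assumptions, not the exact configuration of our network:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

a = torch.randn(1, 64, 32, 32)   # feature tensor A (assumed size)
b = torch.randn(1, 64, 32, 32)   # feature tensor B (assumed size)

# Up-sampling: bilinear interpolation doubles the spatial resolution.
up = F.interpolate(b, scale_factor=2, mode="bilinear", align_corners=False)

# Down-sampling: a 3 x 3 convolution with stride 2 halves the spatial resolution.
down = nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1)(a)

# Tensor fusion, case 1: fixed weights equal to 1 -> element-wise summation.
fused_sum = a + b

# Tensor fusion, case 2: learnable weights -> concatenate along channels and mix
# with a 1 x 1 convolution.
mix = nn.Conv2d(128, 64, kernel_size=1)
fused_learned = mix(torch.cat([a, b], dim=1))

# ReLU activation, as mentioned above: relu(x) = max(0, x).
activated = F.relu(fused_learned)
```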

 

 

Last but not least, we gratefully thank the reviewer again for his/her outstanding comments and suggestions, which greatly helped us to improve the technical quality and presentation of our manuscript.

-------------------------------------------------------------------------------------------------------

Author Response File: Author Response.docx

Reviewer 2 Report

The authors addressed the comments provided in the first review stage. In particular, they:

 

- Improved the related work section

- Clarified crucial steps of the method and emphasized its novelty.  

- Added details on the data ingestion pipeline in the experiments

- Fixed typos and language issues

 

The quality of the paper has significantly improved.

For these reasons I recommend the acceptance of the paper in its current form.

Author Response

Dear Editors and Reviewers:

Thank you very much for your comments concerning our manuscript entitled “A Temporal-Reliable Method for Change Detection in High Resolution Bi-temporal Remote Sensing Images” (ID: remotesensing-1741154). Those comments were all extremely valuable and very helpful for revising and improving our paper, as well as for providing important insights and significance to our research. We have addressed the comments carefully and have made corrections to our manuscript, which we hope will meet your approval.

Please find below our description of the changes made to the manuscript according to comments received (these changes have been highlighted in blue color in the revised version of the manuscript in order to facilitate their identification with regards to the previous version). We are indebted to the Editors and Reviewers for their outstanding comments and suggestions, which greatly helped us to improve the technical quality and presentation of our manuscript.

-------------------------------------------------------------------------------------------------------

 

Reviewer #2:

  1. Reviewer’s comment:

The authors addressed the comments provided in the first review stage. In particular, they:

- Improved the related work section

- Clarified crucial steps of the method and emphasized its novelty. 

- Added details on the data ingestion pipeline in the experiments

- Fixed typos and language issues

The quality of the paper has significantly improved.

For these reasons I recommend the acceptance of the paper in its current form.

Response:

We gratefully thank the reviewer for the positive feedback and for his/her comments. Thanks again to the reviewers for acknowledging our work.

 

Last but not least, we gratefully thank the reviewer again for his/her outstanding comments and suggestions, which greatly helped us to improve the technical quality and presentation of our manuscript.

-------------------------------------------------------------------------------------------------------

Author Response File: Author Response.docx
