Article
Peer-Review Record

Cloud Removal in Remote Sensing Using Sequential-Based Diffusion Models

Remote Sens. 2023, 15(11), 2861; https://doi.org/10.3390/rs15112861
by Xiaohu Zhao and Kebin Jia *
Reviewer 1:
Reviewer 2: Anonymous
Submission received: 11 April 2023 / Revised: 24 May 2023 / Accepted: 26 May 2023 / Published: 31 May 2023

Round 1

Reviewer 1 Report

General comment

The article proposes a novel method, Sequential-based Diffusion Models (SeqDMs), for cloud removal. SeqDMs is based on two components: (1) the multi-modal diffusion model (MmDMs) and (2) the sequential-based training and inference strategy (SeqTIS). Generally, the article is written coherently. The following are recommendations for the authors.

 

Recommendations

 

1. In Section 3.1, lines 144-147, the authors make an assumption that “We assume the main modality X (i.e., optical satellite data) is susceptible to clouds or haze, while auxiliary modalities A (e.g., SAR or other modalities more robust to the corruption of clouds) are free from these influences.” The authors should provide an explicit basis that justifies their assumptions, perhaps an additional sentence to qualify them.

 

2. In lines 156-157 (below Eq. 13 and above Eq. 14), the sentence needs clarification; it is not clear what the authors mean.

 

3. In Sections 3.2.2 and 3.2.3, the authors refer to Algorithms 1 and 2. However, the algorithms are presented after the References section. The authors should ensure each algorithm appears on the same page as its reference, or on the next page.

 

4. In Section 4.4, page 10, the authors should contextualize and elaborate on the architecture they use for MmDMs instead of just citing the source, i.e., [9].

 

5. In the discussion section, in support of their claim that “SeqDMs outperform several other state-of-the-art multi-modal [methods],” the authors should provide citations in line 401 or refer to the results in the tables provided in the article.

 

6. The authors should include a Conclusion section. Also, they should move lines 412 to 414 to that section, i.e., “In the future, we will pursue the direction of studying the characteristics of each band of the multi-spectral optical satellite imagery to extract more helpful information for further reducing the semantic gap between the reconstructions and target images.”

Comments for author File: Comments.pdf

Author Response

Thank you for your detailed comments that help us improve this paper. Below is the detailed response to all the comments.

Point 1: In Section 3.1, lines 144-147, the authors make an assumption that “We assume the main modality X (i.e., optical satellite data) is susceptible to clouds or haze, while auxiliary modalities A (e.g., SAR or other modalities more robust to the corruption of clouds) are free from these influences.” The authors should provide an explicit basis that justifies their assumptions, perhaps an additional sentence to qualify them.

Response 1: Rather than making an assumption, it is more precise to describe the relation between the main and auxiliary modalities here. Thus, we have rewritten this sentence as follows: “Since optical satellite data is susceptible to haze or clouds and SAR or other modalities are more robust against these influences [6,19], we consider optical satellite data as the main modality X and SAR or other modalities as auxiliary modalities A in this paper.” The first half of the revised sentence now provides a strong basis by citing relevant references.

Point 2: In lines 156-157 (below Eq. 13 and above Eq. 14), the sentence needs clarification; it is not clear what the authors mean.

Response 2: We added the phrase “described in the Equation (5) of DDPMs” in this sentence to make it clearer what we mean.

Point 3: In Sections 3.2.2 and 3.2.3, the authors refer to Algorithms 1 and 2. However, the algorithms are presented after the References section. The authors should ensure each algorithm appears on the same page as its reference, or on the next page.

Response 3: We have moved these algorithms to the same pages as the passages that refer to them.

Point 4: In section 4.4, page 10, the authors should contextualize and elaborate on the architecture they use for MmDMs instead of just citing the source, i.e., [9].

Response 4: We have described the architecture used for MmDMs instead of just citing the reference.

Point 5: In the discussion section, in support of their claim that “SeqDMs outperform several other state-of-the-art multi-modal [methods],” the authors should provide citations in line 401 or refer to the results in the tables provided in the article.

Response 5: We now support our claim by referring to the results in the tables provided in the article.

Point 6: The authors should include a Conclusion section. Also, they should move lines 412 to 414 to that section, i.e., “In the future, we will pursue the direction of studying the characteristics of each band of the multi-spectral optical satellite imagery to extract more helpful information for further reducing the semantic gap between the reconstructions and target images.”

Response 6: We have included a Conclusion section and moved lines 412 to 414 of the original paper into it.

Reviewer 2 Report

This paper addresses the common problem of cloud removal in spaceborne imagery. The novelty of this work consists in the use of a sequential-based diffusion model combined with multi-modal data. To develop and test the model, the authors make use of the SEN12MS-CR-TS dataset, which provides SAR and multispectral data from Sentinel-1 and Sentinel-2, respectively.

I think the method is sufficiently described in the paper (with algorithms included), and the results show advantages of this method over other known methods like STGAN or Seq2point. All in all, the paper is in good shape for publication and the results are definitely interesting. I have only minor suggestions for the authors:

- I miss in the introduction references to other, admittedly older, cloud removal methods prior to the use of DNNs in order to give a more complete overview. E.g., there is a somewhat old review from 2015: "H. Shen, X. Li, Q. Cheng, C. Zeng, G. Yang, H. Li, and L. Zhang, “Missing information reconstruction of remote sensing data: A technical review,” IEEE Geosci. Remote Sens. Mag., vol. 3, no. 3, pp. 61–85, Sep. 2015."

- I miss in the paper some information about the training and inference cost in comparison with the other networks used. It is probably not a critical point for evaluating the models, but it would be interesting to compare.

- The method was trained with a sequence of length L=3. Later we see that, in many cases, inferring with longer sequences improves the results. Did you try to train with longer sequences? In any case, the ability to change the inference length without re-training is definitely an advantage, but it also makes me wonder whether the comparison with the retrained Seq2point is fair, since SeqDMs was only trained with L=3.

- In general, the results look very good with the proposed method, but I find the SAM results in Table 1 a bit disappointing. That could be a killer for certain applications. Have you considered network alternatives to better preserve the spectral information in the inferred data?

- In the Discussion section, you plan to study the characteristics of the different bands. Actually, for this work you have upscaled to 10 m several Sentinel-2 bands whose native resolution is 20 or even 60 m. Have you noticed differences in performance in the present study that could be related to the upscaling pre-processing?

- In addition to the obvious effect of clouds blocking the view of the ground, cloud shadows also affect the images. Are cloud shadows handled in any special way in this work? (There is no mention in the paper.) That is, are they inside or outside the cloud masks? In some cases, inferring the shadowed areas as well could be a benefit.

Author Response

Thank you for your detailed comments that help us improve this paper. Below is the detailed response to all the comments.

Point 1: I miss in the introduction references to other, admittedly older, cloud removal methods prior to the use of DNNs in order to give a more complete overview. E.g., there is a somewhat old review from 2015: "H. Shen, X. Li, Q. Cheng, C. Zeng, G. Yang, H. Li, and L. Zhang, “Missing information reconstruction of remote sensing data: A technical review,” IEEE Geosci. Remote Sens. Mag., vol. 3, no. 3, pp. 61–85, Sep. 2015."

Response 1: We have added references to cloud removal methods that predate DNNs to give a more complete overview.

Point 2: I miss in the paper some information about the training and inference cost in comparison with the other networks used. It is probably not a critical point for evaluating the models, but it would be interesting to compare.

Response 2: We strongly agree that quantitative information about training and inference time costs would make the paper more comprehensive. However, each model/experiment requires several GPU-days to train or infer. Moreover, we trained and inferred our models on lab GPU devices shared with other colleagues, resulting in unstable time statistics for training and inference. Nevertheless, it is still evident that our method reduces training time by avoiding model re-training (at least several GPU-days) when handling data of different lengths. Once the hardware conditions are met, we would like to evaluate the time cost quantitatively in future work.

Point 3: The method was trained with a sequence of length L=3. Later we see that, in many cases, inferring with longer sequences improves the results. Did you try to train with longer sequences? In any case, the ability to change the inference length without re-training is definitely an advantage, but it also makes me wonder whether the comparison with the retrained Seq2point is fair, since SeqDMs was only trained with L=3.

Response 3: Due to the high training cost, we did not train our proposed method with longer sequences. As for whether the comparison is fair, we believe that SeqDMs trained with longer sequences (e.g., L=4 or 5) would perform better than those trained only with L=3, without significantly increasing training time: longer sequences provide more information to the model, leading to better performance, while the number of training iterations does not increase. Thus, we think the results of SeqDMs trained only with L=3 are sufficient to support our claims in this paper, and they highlight the advantages of our method even more.

Point 4: In general, the results look very good with the proposed method, but I find the SAM results in Table 1 a bit disappointing. That could be a killer for certain applications. Have you considered network alternatives to better preserve the spectral information in the inferred data?

Response 4: In our discussion, we have already planned to study the characteristics of the bands. Since each band has a unique cloud penetrability due to its wavelength, the extent of information corruption may differ from band to band. Thus, in our next work we will study a specialized method, such as a channel-wise attention mechanism, to extract useful information more efficiently based on cloud transparency, rather than simply separating cloudy and cloudless areas as in the current work.
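For concreteness, the channel-wise attention mentioned in this response can be illustrated with a minimal squeeze-and-excitation style sketch in NumPy. All shapes, weights, and the reduction ratio below are hypothetical assumptions for illustration, not the paper's implementation:

```python
import numpy as np

def channel_attention(feats, w1, w2):
    """Squeeze-and-excitation style channel attention (illustrative sketch).

    feats: (C, H, W) feature map, one channel per spectral band.
    w1: (C//r, C) and w2: (C, C//r) bottleneck weights (reduction ratio r).
    Returns the feature map with each channel rescaled by a learned gate.
    """
    squeeze = feats.mean(axis=(1, 2))                 # (C,) global average pool
    hidden = np.maximum(w1 @ squeeze, 0.0)            # ReLU bottleneck
    weights = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))    # sigmoid gate in (0, 1)
    return feats * weights[:, None, None]             # per-band reweighting

# Toy example: 13 Sentinel-2 bands, reduction ratio ~3 (shapes illustrative)
rng = np.random.default_rng(0)
x = rng.standard_normal((13, 8, 8))
w1 = rng.standard_normal((4, 13)) * 0.1
w2 = rng.standard_normal((13, 4)) * 0.1
y = channel_attention(x, w1, w2)
```

In a trained network the gate would learn to down-weight bands whose information is heavily corrupted by clouds and emphasize the more transparent ones.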

Point 5: In the Discussion section, you plan to study the characteristics of the different bands. Actually, for this work you have upscaled to 10 m several Sentinel-2 bands whose native resolution is 20 or even 60 m. Have you noticed differences in performance in the present study that could be related to the upscaling pre-processing?

Response 5: This is an interesting point about the upscaling pre-processing. In fact, we directly used the public SEN12MS-CR-TS dataset curated by Ebel et al., and the upscaling is a step performed in their dataset construction. According to their statement, the upscaling is designed to translate the raw data into a format that cloud-removal neural networks can handle. In the future, we would like to further study the performance differences caused by different upscaling processes.

Point 6: In addition to the obvious effect of clouds blocking the view of the ground, cloud shadows also affect the images. Are cloud shadows handled in any special way in this work? (There is no mention in the paper.) That is, are they inside or outside the cloud masks? In some cases, inferring the shadowed areas as well could be a benefit.

Response 6: In this work, we only used the cloud detector s2cloudless provided with SEN12MS-CR-TS and did not use a special mechanism such as a cloud-shadow detector to extract shadowed areas. During our experiments, we found that s2cloudless imprecisely includes partial shadows in some samples, and the detected shadowed areas are then also removed to some extent by our method (as shown in column 2 of Fig. 3 in the paper). In the future, we will look for a dedicated method to detect both clouds and shadows more accurately, leading to better inference results in both cloudy and shadowed areas.
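To make the role of the cloud mask concrete, here is a generic sketch of how a per-pixel cloud probability map (such as the one s2cloudless produces) can gate between observed and reconstructed pixels. The threshold and the hard binary blend are assumptions for illustration, not the paper's exact procedure; note that shadows falling below the threshold would be left untouched, which is the limitation discussed above:

```python
import numpy as np

def composite_with_mask(cloudy, reconstructed, cloud_prob, threshold=0.4):
    """Keep observed pixels where cloud probability is low; otherwise
    fall back to the model's reconstruction.

    cloudy, reconstructed: (H, W, B) images with B spectral bands.
    cloud_prob: (H, W) per-pixel cloud probabilities in [0, 1].
    """
    mask = cloud_prob > threshold                            # True where clouds
    return np.where(mask[..., None], reconstructed, cloudy)  # broadcast over bands

# Toy example: the pixel flagged as cloudy (prob 0.9) takes the reconstruction,
# the clear pixel (prob 0.1) keeps its observed value.
cloudy = np.zeros((1, 2, 3))
reconstructed = np.ones((1, 2, 3))
prob = np.array([[0.9, 0.1]])
out = composite_with_mask(cloudy, reconstructed, prob)
```

A dedicated shadow detector would simply extend `mask` to cover shadowed pixels as well, so they too would be replaced by the reconstruction.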
