by Patrice E. Carbonneau

Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Younghyun Cho

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The manuscript proposes a novel approach to addressing the cloud-related limitations of Sentinel-2 imagery by employing a style transfer model to convert Sentinel-1 SAR data into Sentinel-2-like composites and by integrating long-term (eight-year) Band 8 observations to generate cloud-free synthetic Sentinel-2 images. 


The references cited in the paper appear insufficiently rigorous and lack depth. The authors should incorporate more authoritative and up-to-date literature to strengthen the scientific foundation of the study.


The criteria and procedures used for selecting training and validation samples are not clearly described. A detailed explanation of the sampling process is necessary to ensure reproducibility and to justify the representativeness of the chosen data.


From a general perspective, the volume of training data employed in the study appears limited. Given the global claims made in the paper, the dataset is inadequate to demonstrate the robustness and generalizability of the proposed model convincingly.


The author mentions the use of a GAN model but simultaneously points out its inefficiencies. This contradictory stance requires clarification. If the inefficiencies are significant, the rationale for adopting GANs in this context should be explained more explicitly, and possible alternatives should be discussed.

Author Response

Review 1.

The manuscript proposes a novel approach to addressing the cloud-related limitations of Sentinel-2 imagery by employing a style transfer model to convert Sentinel-1 SAR data into Sentinel-2-like composites and by integrating long-term (eight-year) Band 8 observations to generate cloud-free synthetic Sentinel-2 images. 

Comment 1
The references cited in the paper appear insufficiently rigorous and lack depth. The authors should incorporate more authoritative and up-to-date literature to strengthen the scientific foundation of the study.

Response 1

This comment is vague and very broad. However, in line with the comments of reviewer 3, we have improved our literature coverage, notably in the areas of SAR image transfer and Sentinel-1 performance in water classification.

 

Comment 2
The criteria and procedures used for selecting training and validation samples are not clearly described. A detailed explanation of the sampling process is necessary to ensure reproducibility and to justify the representativeness of the chosen data.

 

Response 2

This comment is once again not very specific. Our initial manuscript had a description of sample site selection for both training and validation comprising ~1000 words of text and 5 figures. We also produced a GitHub site that contains a notebook showing the exact code we use to download Sentinel-1 and Sentinel-2 data and that provides the model weights, so that readers can reproduce our synthetic image inference process. Furthermore, we have made our 267 validation tiles available to the journal so that, in effect, readers can use the code in the notebook and reproduce our quality results. In terms of descriptions, in the original manuscript we detailed the full process, going from a random selection from a pool of 2 million samples to a final set of 2678 tiles, which involved a random trial-and-error process to find quasi-synchronous acquisitions from both Sentinel-1 and Sentinel-2. We presented a detailed count of the number of pixels in each semantic class for both training and validation. We also produced 2 figures detailing both the spatial coverage (figure 1) and the temporal coverage (figure 2) of our data, and added a third figure showing the spatial coverage of our external benchmark dataset. Figures 5 and 6 then show examples of training tiles for both small water bodies (figure 5) and large water bodies (figure 6).
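To illustrate the kind of query involved, the sketch below shows one way quasi-synchronous Sentinel-1 and Sentinel-2 acquisitions can be searched for with the Earth Engine Python API. This is an illustrative reconstruction, not the actual notebook code on our GitHub site; the site location, date window, cloud threshold and the ±1 day tolerance are all placeholder assumptions.

```python
# Hypothetical sketch of a quasi-synchronous Sentinel-1/Sentinel-2 search
# with the Earth Engine Python API (not the authors' actual notebook code).
import ee

ee.Initialize()

aoi = ee.Geometry.Point([7.68, 45.07]).buffer(5000)  # placeholder site
start, end = '2022-06-01', '2022-06-30'              # placeholder window

s2 = (ee.ImageCollection('COPERNICUS/S2_SR_HARMONIZED')
      .filterBounds(aoi)
      .filterDate(start, end)
      .filter(ee.Filter.lt('CLOUDY_PIXEL_PERCENTAGE', 5)))

s1 = (ee.ImageCollection('COPERNICUS/S1_GRD')
      .filterBounds(aoi)
      .filterDate(start, end)
      .filter(ee.Filter.eq('instrumentMode', 'IW'))
      .filter(ee.Filter.listContains('transmitterReceiverPolarisation', 'VV'))
      .filter(ee.Filter.listContains('transmitterReceiverPolarisation', 'VH')))

def count_s1_matches(img):
    # Count Sentinel-1 passes within +/- 1 day of each Sentinel-2 scene.
    t = ee.Date(img.get('system:time_start'))
    n = s1.filterDate(t.advance(-1, 'day'), t.advance(1, 'day')).size()
    return img.set('s1_matches', n)

pairs = s2.map(count_s1_matches).filter(ee.Filter.gt('s1_matches', 0))
print(pairs.size().getInfo(), 'candidate quasi-synchronous Sentinel-2 scenes')
```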

We have made 2 minor edits to improve the clarity of our data description. First, we have clarified which map projections are used; second, we have edited the captions of figures 5 and 6 to make it clearer that these are examples drawn from our training data. Aside from these 2 minor edits, we do not understand which aspects of data selection were left unclear, and the reviewer has not identified any specific point. We have provided code, model weights and validation data so that our results are in fact reproducible. We also note that the other 2 reviewers had no issues with this aspect of the manuscript. Without any specific issues, we do not see how we can make appropriate edits to our manuscript based on this comment.

 

Comment 3

From a general perspective, the volume of training data employed in the study appears limited. Given the global claims made in the paper, the dataset is inadequate to demonstrate the robustness and generalizability of the proposed model convincingly.

 

Response 3

The reviewer does not suggest what they would consider a suitable data volume, nor do they cite published literature indicating what a suitable amount of data might be. To our knowledge, we in fact have the largest training dataset in the literature. Notably, Brown et al. (2022), published in Nature Scientific Data, have a total of 6.9 × 10⁹ training pixels. Nyberg et al. (2023), published in Nature Communications, have a total of 5.7 × 10⁶ training pixels. Both these papers work at global scales. In our case, for rivers, lakes and gravels, we have a total of 1.2 × 10¹⁰ pixels, to which we add a total of 1.2 × 10¹² background pixels. All this was described in the original manuscript on page 14. In terms of validation, our total dataset is 1.4 × 10⁹ pixels, which is larger than Nyberg et al. and comparable to Brown et al. To this we have added the published data of Wieland et al. (about 6.5 × 10⁹ pixels) to conduct a validation against an independent benchmark. Our dataset is therefore comparable to or larger than similar datasets published in top-tier journals. Turning to figure 1, present in the original manuscript, we can see that we cover every biome with the possible exception of the Sahara desert, which is obviously not relevant to this work, and with perhaps slightly weaker coverage of Polynesia. We cannot find a published example of similar work with better spatial coverage. Figure 2 shows that we have data for all years from 2017 to 2024 (inclusive) and for every month of the year. We therefore see no justification for the reviewer's comment. They have not provided any evidence to support this claim and seem to reject, or are perhaps not aware of, a body of existing literature published in top-tier journals that sets a benchmark which we have slightly exceeded. Without evidence, we do not accept this comment and have made no changes.

 

Comment 4
The author mentions the use of a GAN model but simultaneously points out its inefficiencies. This contradictory stance requires clarification. If the inefficiencies are significant, the rationale for adopting GANs in this context should be explained more explicitly, and possible alternatives should be discussed.

 

Response 4

The reviewer seems to think that we have used a GAN. We would suggest a closer reading of our paper. Specifically, the original manuscript had the following text at the start of the Model Architecture section:

“Style transfers and/or translations between sensor formats are typically done with Generative Adversarial Networks (GANs) [27–30] or Fully Convolutional Networks (eg Unets) [31,32]. GANs require the training of multiple networks and are significantly more processing intensive. For example, Li et al [33] report that GANs can be one or two orders of magnitude more computationally demanding. Given our objective of working at global scales (objective 3), we therefore opted for a Unet type of architecture.”

This text, present in the original manuscript, clearly states that we did not use a GAN and “opted for a Unet type of architecture”. We mentioned GANs because many style transfer methods do use them; we therefore expected many readers to assume we had used a GAN and to appreciate a justification of our avoidance of this architecture. We accordingly pointed out the high computational cost of GANs in the context of our global-scale application and gave a clear description of our chosen Unet architecture in both the text and figure 4. This comment seems to originate from an incomplete reading of our manuscript and we have not actioned it.
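For readers unfamiliar with the trade-off being described, the sketch below gives a toy U-Net-style encoder-decoder in PyTorch: a single feed-forward network, in contrast to the generator-plus-discriminator pair a GAN would require. This illustrates the design principle only, not the architecture of figure 4; the depth, channel widths and the 3-band-in (VV, VH, Band 8) / 4-band-out layout are placeholder assumptions.

```python
# Toy U-Net-style encoder-decoder (illustrative; not the paper's model).
import torch
import torch.nn as nn

def conv_block(cin, cout):
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(cout, cout, 3, padding=1), nn.ReLU(inplace=True))

class TinyUNet(nn.Module):
    def __init__(self, in_ch=3, out_ch=4):   # assumed band counts
        super().__init__()
        self.enc1 = conv_block(in_ch, 32)
        self.enc2 = conv_block(32, 64)
        self.pool = nn.MaxPool2d(2)
        self.up = nn.ConvTranspose2d(64, 32, 2, stride=2)
        self.dec = conv_block(64, 32)         # 32 skip + 32 upsampled
        self.head = nn.Conv2d(32, out_ch, 1)

    def forward(self, x):
        e1 = self.enc1(x)                     # full-resolution features
        e2 = self.enc2(self.pool(e1))         # half-resolution features
        d = self.dec(torch.cat([self.up(e2), e1], dim=1))  # skip connection
        return self.head(d)

# One forward pass on a single 256x256 tile.
y = TinyUNet()(torch.randn(1, 3, 256, 256))   # -> shape (1, 4, 256, 256)
```

A single network like this trains with one optimizer and one loss, which is the source of the computational saving over adversarial training.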

Reviewer 2 Report

Comments and Suggestions for Authors

The paper entitled "Style Transfer from Sentinel-1 to Sentinel-2 for Fluvial Scenes with Multi-Modal and Multi-Temporal Image Fusion" presents the application of a deep learning technique, developed by the same author, to Copernicus products through GEE. In this work, the author demonstrates how effectively the fusion of radar and optical images can enhance the analysis in terms of the accuracy of the results. In particular, this fusion helps to take advantage of the main "good features" of the two types of sensors. Cloud coverage is a limit that can be accurately bypassed.

The paper is complete and well presented in all sections, so there are only minor revisions here, more related to form than substance.

  • rows 8-12: I think these should be deleted as they are part of the original template
  • remove all the "eg" close to the citations
  • row 78: probably this sentence misses a dot after citation 9
  • row 81: please specify the acronyms the first time they appear in the text (IoU)
  • row 83: maybe the performances drop from Sentinel-2 to Sentinel-1 and not the opposite
  • I do not think there is a need to create the two subparagraphs 1.1 and 1.2; I would put everything in the introduction
  • row 110: space is missing "i.e. translation"
  • rows 134-136: actually, using polarization VV is less sensitive to the "ripple" effects, so I would make this sentence more precise
  • row 151: I would specify which datum and cartographic projection you have used
  • row 193: there is a repetition of the word "composed"
  • row 196: a dot is missing after "of the method"
  • row 218: acquire "also" probably, and not "both"
  • row 275: Sentinel-2 does not have VH band, I think there is a typo
  • row 288: figure 5 and not figures 5
  • row 426: I cannot find correspondence between point 2) description and figure 9 on the left, please check it.

Author Response

Review 2

The paper entitled "Style Transfer from Sentinel-1 to Sentinel-2 for Fluvial Scenes with Multi-Modal and Multi-Temporal Image Fusion" presents the application of a deep learning technique, developed by the same author, to Copernicus products through GEE. In this work, the author demonstrates how effectively the fusion of radar and optical images can enhance the analysis in terms of the accuracy of the results. In particular, this fusion helps to take advantage of the main "good features" of the two types of sensors. Cloud coverage is a limit that can be accurately bypassed.

The paper is complete and well presented in all sections, so there are only minor revisions here, more related to form than substance.

 

We thank the reviewer for their support. Below we detail how we have edited the paper in response to each comment.

 

  • rows 8-12: I think these should be deleted as they are part of the original template

We have tightened the layout of the presentation page and we will monitor the draft of the final version.

  • remove all the "eg" close to the citations

In British academia, the use of the Latin eg as a prefix in citations is common, and we find it useful to distinguish cases where the list of references is to be seen as a set of examples rather than an exhaustive list.

  • row 78: probably this sentence misses a dot after citation 9

Corrected as requested.

  • row 81: please specify the acronyms the first time they appear in the text (IoU)

We have expanded the acronym for IoU in the abstract where it first occurs.

  • row 83: maybe the performances drop from Sentinel-2 to Sentinel-1 and not the opposite

Corrected.  We thank the reviewer for noticing this error.

  • I do not think there is a need to create the two subparagraphs 1.1 and 1.2; I would put everything in the introduction
  • row 110: space is missing "i.e. translation"

Corrected

  • rows 134-136: actually, using polarization VV is less sensitive to the "ripple" effects, so I would make this sentence more precise
  • row 151: I would specify which datum and cartographic projection you have used

We have clarified our usage of projections: “In order to facilitate the process of data augmentation described below, the data is downloaded as a single 6-channel stack, projected to a UTM (Universal Transverse Mercator) coordinate reference system (CRS) using the WGS84 datum and with a spatial resolution of 10 meters. We manage the multiple CRSs in the data using EPSG (European Petroleum Survey Group) codes, with EPSG numbers ranging from 32601 to 32660 for UTM zones 1 to 60 in the northern hemisphere and from 32701 to 32760 for the corresponding zones in the southern hemisphere.”
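As an aside, the EPSG convention quoted above maps directly from geographic coordinates; a minimal sketch of that mapping (ours, for illustration, ignoring the UTM exception zones around Norway and Svalbard):

```python
# UTM EPSG code from WGS84 latitude/longitude, following the convention
# described in the quoted text (32601-32660 north, 32701-32760 south).
def utm_epsg(lat: float, lon: float) -> int:
    zone = int((lon + 180) // 6) + 1              # UTM zones 1..60
    return (32600 if lat >= 0 else 32700) + zone

assert utm_epsg(45.07, 7.68) == 32632             # northern Italy -> 32N
assert utm_epsg(-33.9, 151.2) == 32756            # Sydney -> 56S
```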

 

  • row 193: there is a repetition of the word "composed"

Corrected

  • row 196: a dot is missing after "of the method"

Corrected

  • row 218: acquire "also" probably, and not "both"

Corrected by deleting “both”.  

  • row 275: Sentinel-2 does not have VH band, I think there is a typo

Corrected: “Training samples are composed of an input image with Sentinel-1 VV, VH and cloud-free Sentinel-2 Band 8 mosaics composited as described above.”

 

  • row 288: figure 5 and not figures 5

Corrected.

  • row 426: I cannot find correspondence between point 2) description and figure 9 on the left, please check it.

We understand the reviewer's confusion. We had used an insufficient number of significant figures to allow readers to match the text and the figure. We have therefore edited this whole paragraph and now use 2 decimal places with simple truncation. We have also emphasised the difference between True and Predicted pixels to help readers make the connection between figure 9 and the relevant text.

Reviewer 3 Report

Comments and Suggestions for Authors

Dear Authors,

This manuscript presents a method that fuses Sentinel-1 SAR imagery with long-term cloud-free composites of Sentinel-2 Band 8 to generate synthetic Sentinel-2 imagery. The approach addresses the critical challenge of cloud contamination in optical data and achieves classification performance nearly equivalent to native Sentinel-2 imagery.

Limitations and Suggestions

Novelty and differentiation: While style transfer and SAR-to-optical translation have been explored in prior studies, the main novelty here lies in the inclusion of the Band 8 fusion. The manuscript would benefit from a clearer discussion on how this contribution advances the state of the art beyond existing approaches.

Comparison with Sentinel-1 performance: Although the paper briefly reports that Sentinel-1 alone yields much lower performance (IoU ~0.70), the results and discussion largely omit a more explicit comparison with direct Sentinel-1–based water classification. Including this comparison more systematically—ideally with references to existing SAR-based water mapping studies—would better highlight the added value of the proposed style transfer approach.

Class-specific performance: Although rivers and lakes are classified with high accuracy, the gravel bar class shows a relatively high error rate (~25%). A deeper discussion of the practical implications of this limitation for geomorphic studies would strengthen the paper.

Computational considerations: The authors emphasize global scalability and efficiency, but more details on computational requirements (e.g., runtime, hardware needs, data handling) would enhance the practical significance of the work.

Broader applicability: The study focuses primarily on fluvial monitoring. A brief discussion on whether the synthetic Sentinel-2 imagery could be applied to other domains (e.g., agriculture, land cover mapping) would increase the paper’s potential impact.

The writing style of the manuscript is somewhat conversational, and I recommend revising it to conform to a formal academic style.

Overall Evaluation
This is a solid and well-executed study that makes an important contribution to remote sensing applications for global river and water monitoring. By enabling semantic classification under any cloud condition, the proposed method provides significant value for hydrology and geomorphology research. With further clarification of its novelty relative to prior work, a more systematic comparison against Sentinel-1–based performance, expanded discussion on class-specific limitations, and elaboration on computational and broader application aspects, the manuscript would be further strengthened and its scientific contribution more clearly highlighted.

Author Response

Review 3

This manuscript presents a method that fuses Sentinel-1 SAR imagery with long-term cloud-free composites of Sentinel-2 Band 8 to generate synthetic Sentinel-2 imagery. The approach addresses the critical challenge of cloud contamination in optical data and achieves classification performance nearly equivalent to native Sentinel-2 imagery.

 

We thank the reviewer for this supportive review.  We have made many edits that have significantly improved the paper.

 

Limitations and Suggestions

Novelty and differentiation: While style transfer and SAR-to-optical translation have been explored in prior studies, the main novelty here lies in the inclusion of the Band 8 fusion. The manuscript would benefit from a clearer discussion on how this contribution advances the state of the art beyond existing approaches.

We have added a paragraph (see below, in our response on the errors of other methods) that makes a more explicit comparison with the results of other SAR work. This also serves to highlight the novelty of our results and shows that the addition of S2 band 8 data makes a significant improvement.

Comparison with Sentinel-1 performance: Although the paper briefly reports that Sentinel-1 alone yields much lower performance (IoU ~0.70), the results and discussion largely omit a more explicit comparison with direct Sentinel-1–based water classification. Including this comparison more systematically—ideally with references to existing SAR-based water mapping studies—would better highlight the added value of the proposed style transfer approach.

The reviewer is correct to point out that we made limited comparisons to the existing body of work reporting water classification performance based on Sentinel-1 data. Our initial rationale for this was that the vast majority of papers on the topic are local or regional studies. In our experience, all types of machine learning semantic classification models perform better at smaller scales because the intrinsic variability of the data is lower. This is why we chose the work of Wieland et al. and their S1S2water dataset as a benchmark. To our knowledge this is the only work in the current literature that reports classification success for both Sentinel-1 and Sentinel-2 at truly global scales, and we felt that those results were the most suitable comparator. Our approach was therefore to use our synthetic Sentinel-2 imagery and compare to their benchmarks for both Sentinel-1 and Sentinel-2. We felt that the finding that our synthetic Sentinel-2 imagery delivers a classification IoU of 0.93, which almost matches the performance of 0.94 from native Sentinel-2 data, served as a powerful illustration that our method can provide a high-quality alternative to cloud-free Sentinel-2 data.

 

However, we have now expanded our discussion to include a broader range of comparator studies. We have added a new paragraph to the discussion which shows that our results are in fact better by a clear margin; this paragraph therefore also highlights the novelty of our work. We thank the reviewer for this comment; it has improved our paper. New paragraph:



“We find that our F1 result of 0.96, along with an IoU of 0.93, compares favorably to other reports of water classification performance using Sentinel-1 data. Fakhri and Gknatsios [37] report a best F1 score of 0.84 in a flood detection study in New South Wales, Australia. Zhang et al [38] report an IoU of 0.83 in a flood detection study of Hainan island, China. Ghosh et al [39] develop global-scale flood detection models using Sentinel-1 data with a range of deep architectures; when validated against a flood event in Florence, they report a best IoU of 0.75. Finally, Zhang et al [40] use a range of models and SAR datasets to detect flood inundation extents. They report a best IoU of 0.812 and a best F1 score of 0.86. However, we should note that these studies are all focused on flood mapping and often include urban areas, so comparisons should be treated with caution. Nevertheless, our results compare favorably to these findings, which suggests that the fusion of band 8 data from Sentinel-2 with VV and VH data from Sentinel-1 makes a significant contribution to semantic classification performance and differentiates this work from others.”
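When reading these comparisons, it is worth recalling that, for the same confusion counts, the F1 (Dice) score and IoU are algebraically linked by F1 = 2·IoU/(1 + IoU); a minimal check (ours, for illustration) confirms that the reported pair of 0.96 and 0.93 is internally consistent:

```python
# F1 (Dice) and IoU computed from the same confusion counts satisfy
# F1 = 2*IoU / (1 + IoU); check the reported pair.
def f1_from_iou(iou: float) -> float:
    return 2 * iou / (1 + iou)

print(round(f1_from_iou(0.93), 2))  # -> 0.96
```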



Class-specific performance: Although rivers and lakes are classified with high accuracy, the gravel bar class shows a relatively high error rate (~25%). A deeper discussion of the practical implications of this limitation for geomorphic studies would strengthen the paper.

We have expanded our discussion of the errors associated with the exposed sediment class in the model. The initial manuscript read:

“We see in figure 9 (left) that 25% of gravel bars predicted from native Sentinel-2 images are confused with the background class if we use synthetic imagery. Having higher errors for this class is not new. The gravel bar class is the hardest to predict given its spectral similarity with bare ground and, in some cases, senescent vegetation [3]. The wide diversity of shapes associated with sediment deposits further hinders semantic classification.”

This has been changed and expanded.  It now reads:

“Having higher errors for this class is not new. Carbonneau and Bizzi [3] faced similar issues. From an ontological perspective, the gravel class is arguably the most weakly defined. Rivers and lakes share similar spectral characteristics. Specifically, the use of an infrared band in the input data allows both human observers and models to distinguish water as patches that are dark in the infrared, red and green, whilst vegetation is bright in the infrared and darker in the reds and greens. Furthermore, rivers and lakes tend to have more distinct shapes. Fluvial sediment patches do not benefit from such clear distinctions. Fallow fields and bare ground patches that are not connected to a water body can have similar spectral characteristics. This makes the identification of gravel patches with purely spectral criteria highly error prone. In terms of shape, gravel bars can take a variety of forms. Whilst some shapes such as point bars are effectively learned by the model, visible bars that result from changes in water level can have very diverse shapes that are hard to capture with convolutional or even ViT (Vision Transformer) approaches. Additionally, the contrast of the water-connected edge of a gravel bar will vary with both the slope and the turbidity of the water. If a gravel bar has a low slope, then the shallow submerged portion of the bar will be visible through the water. This lowers the contrast of the dry gravels compared to the wet gravels and makes delineation more difficult. This is compounded by the level of turbidity, which can make the water more or less clear and change the Secchi depth, thus making the shallow submerged portion of a bar more or less visible. These factors combine to make the gravel class the most difficult class to predict.”

 

  

 

Computational considerations: The authors emphasize global scalability and efficiency, but more details on computational requirements (e.g., runtime, hardware needs, data handling) would enhance the practical significance of the work.

We have added a paragraph at the end of our discussion to show the requirements of global-scale processing on our chosen hardware setup:

“Thinking ahead to global-scale deployment, we find that our chosen Unet architecture has delivered fast inference with a low computational load. We have deployed the algorithm on a modest workstation with an older XEON ES-260 CPU running at 2.1 GHz and equipped with an NVIDIA GeForce 1080Ti GPU with 11 GB of RAM. Our first step is to acquire the full global dataset of cloud-free Sentinel-2 band 8 images. Given that the cloud-free composites are created from the full archive of Sentinel-2 data, the processing requirement is high and the download requires approximately 3 weeks. This set of single-band images, in 8-bit format and with a spatial resolution of 20 meters, requires 70 GB of storage at high compression. Fortunately, this is a process that only needs to be done once. We have then tested inference speed on data from the Po basin in northern Italy. We find that our system can infer synthetic imagery at a rate of approximately 100 km² per second. This would lead to a total processing time of 17 days for the 148 million km² land surface area of the globe on our relatively modest older-generation GPU. However, in studies specifically focused on river corridors, we find that using existing datasets to establish river corridors can cut the area to process by as much as 90%. This means that, in theory, global inference of synthetic imagery can be achieved in as little as ~2 days. In practice, however, we have found that the bottleneck of this pipeline is the Sentinel-1 VV and VH download speed from Google Earth Engine. This can be variable and, depending on the current traffic on Google servers, individual users may get 2 or 3 parallel workers to execute jobs. Global-scale downloads of the needed Sentinel-1 data require an estimated 7-10 days. This is satisfactory because it aligns well with our Sentinel-2 classification workflow, which requires ~10 days to process global-scale data [2].”
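The timing figures quoted above follow from simple arithmetic; a back-of-envelope check (illustrative only, using the rates reported in the paragraph):

```python
# Back-of-envelope check of the global inference timings quoted above.
RATE_KM2_PER_S = 100                  # measured rate on the Po basin test
GLOBAL_LAND_KM2 = 148e6               # global land surface area

full_days = GLOBAL_LAND_KM2 / RATE_KM2_PER_S / 86400
corridor_days = full_days * 0.10      # river corridors: ~90% of area masked

print(f"full globe: {full_days:.0f} days")                # ~17 days
print(f"river corridors only: {corridor_days:.1f} days")  # ~1.7 days
```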




Broader applicability: The study focuses primarily on fluvial monitoring. A brief discussion on whether the synthetic Sentinel-2 imagery could be applied to other domains (e.g., agriculture, land cover mapping) would increase the paper’s potential impact.

We understand the reviewer's viewpoint here, but one limitation of our method is that it is specific to water environments. We have chosen a state-of-the-art loss function, perceptual loss, to achieve our goal of obtaining synthetic images that are tailored to an existing model and thus augment our existing data processing pipeline. Unfortunately, by definition, training with a perceptual loss function produces results that are tailored to the embeddings produced by the model used in the loss function. The high quality of the outputs comes at the cost of their generalisation to other applications.

We have now made this clearer to readers. We have added the following text in the discussion:

“Another key limitation that readers should note pertains to additional uses of the synthetic imagery generated with our method. Our choice of an inverse-weighted RMSE loss and of a perceptual loss component calculated from an existing trained model designed for semantic classification of rivers and lakes will, by definition, create synthetic imagery that is tailored to our model and, at best, to water classification. An obvious weakness of perceptual loss functions is that they produce models trained to mimic the performance of the model used to calculate the perceptual loss. This means that the synthetic imagery presented here is less well suited to applications not focused on water features. For example, preliminary assessments of the similarity of NDVI values calculated from native Sentinel-2 imagery and matching synthetic imagery show errors as large as 0.2, which is quite significant within the range of expected NDVI values. Readers interested in applying the methods shown here to non-water-facing problems are encouraged to re-train the style transfer model with a suitable loss function.”
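To make the mechanism concrete, the sketch below shows the general shape of such a combined loss in PyTorch. It is a hypothetical illustration, not our implementation: the loss weights and the embedding function are placeholders, and the inverse-weighted RMSE is simplified to a plain RMSE.

```python
# Hypothetical sketch of an RMSE + perceptual loss (not the paper's code).
import torch
import torch.nn.functional as F

def combined_loss(pred, target, frozen_classifier, w_pixel=1.0, w_perc=0.1):
    # Pixel term: plain RMSE here; the paper uses an inverse-weighted RMSE.
    pixel = torch.sqrt(F.mse_loss(pred, target))
    # Perceptual term: distance between embeddings from a frozen, pre-trained
    # classifier (its parameters receive no gradient updates).
    with torch.no_grad():
        emb_target = frozen_classifier(target)
    emb_pred = frozen_classifier(pred)   # gradients flow through pred only
    perceptual = F.mse_loss(emb_pred, emb_target)
    return w_pixel * pixel + w_perc * perceptual
```

Because the perceptual term is measured in the embedding space of one specific classifier, the trained style-transfer model inherits that classifier's biases, which is exactly the limitation described in the added text.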

 

The writing style of the manuscript is somewhat conversational, and I recommend revising it to conform to a formal academic style.

 

We have made several small edits to tighten up the text and improve the flow of reading.

 

Overall Evaluation
This is a solid and well-executed study that makes an important contribution to remote sensing applications for global river and water monitoring. By enabling semantic classification under any cloud condition, the proposed method provides significant value for hydrology and geomorphology research. With further clarification of its novelty relative to prior work, a more systematic comparison against Sentinel-1–based performance, expanded discussion on class-specific limitations, and elaboration on computational and broader application aspects, the manuscript would be further strengthened and its scientific contribution more clearly highlighted.

 

We thank the reviewer for this supportive statement.