Article
Peer-Review Record

Land Use Land Cover Classification with U-Net: Advantages of Combining Sentinel-1 and Sentinel-2 Imagery

Remote Sens. 2021, 13(18), 3600; https://doi.org/10.3390/rs13183600
by Jonathan V. Solórzano 1, Jean François Mas 2,*, Yan Gao 2 and José Alberto Gallardo-Cruz 3
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 20 June 2021 / Revised: 31 August 2021 / Accepted: 2 September 2021 / Published: 9 September 2021

Round 1

Reviewer 1 Report

GENERAL COMMENTS:

The topic of the paper is timely, relevant, and appropriate for the journal.  However, a large number of wording revisions and clarifications are necessary. 

 

As per the title of the paper and the stated objectives, the emphasis is U-net and Sentinel data for a very particular classification problem.  First, exercise caution in generalizing the results of the study beyond these constraints.  Second, keep the results and discussion more tightly focused on the stated main emphasis.  The paper is rather long for a narrowly focused study.  Try to reduce the length of the paper to about 500 lines.

 

At the very beginning of Section 2.3, the authors should include a brief overview of the technical details of CNN and U-net, but in terms that general readers will understand.  Constrain the technical details regarding the methods to this section.  The authors' literature review in the Introduction is about the right length, but a forward reference to Section 2.3 could be included “for additional details.”

 

In the Introduction, the authors should include a brief section on how these methods have been used previously for similar problems, and they should briefly indicate why they were selected for this study, i.e., what particular advantages do they bring?  But, keep it brief.

 

Although the estimation method described in the Olofsson et al paper [84] is appropriate (line 271), this entire procedure is more correctly characterized as simply stratified estimation and sampling.  It has been well-known since long, long before Olofsson et al, and is described in nearly all statistical books on sampling.  As such, the authors should refer to it as stratified sampling and estimation with references to familiar textbooks such as Cochran (Sampling Techniques, 1977), but also to Olofsson et al if the authors prefer.  In particular, do NOT convey the impression that this approach was first conceived and developed by Olofsson et al.  In addition, instead of the awkward, confusing and inappropriate characterization of the estimates as “corrected” as in Olofsson, simply refer to them as “the stratified estimates” in Table 2 and elsewhere.  The stratified estimators are known to be unbiased, so there is no need for a correction.  Olofsson started with the area estimates as simply the areas of the classes from the image classification, a biased method sometimes called “pixel counting”, and then used the stratified estimators to “correct” these estimates.  Because the authors make no mention of this biased “pixel counting” approach, there is no reason to refer to any corrections.  See Giannetti et al, 2020, Remote Sensing 12, 3720, and Nguyen et al, 2020, Remote Sensing, 12, 1367 for similar applications of stratified sampling and estimation for land cover area estimation.

 

The authors must describe in greater detail how the stratified estimates were calculated (lines 289-294).  First, the stratified estimators are crucial to understanding the methods and results.  Second, not everyone understands or is familiar with Openforis.  Specifically, the authors should include the mathematical expressions for the estimators.

 

Avoid the use of subjective terms such as “best” and “better” whose criteria are not explicitly stated (lines 22, 74, 116, 173,176, and many, many other places).  Specifically, “best” or “better” with respect to what?  Similarly, what is meant by “relatively good” (line 346)?

 

The first dictionary definitions of terms such as "high", "low", "over", "under", etc, refer to height, altitude, or vertical position, whereas the first dictionary definitions of terms such as "large", "small", "greater", "less", etc, refer to size or amount. Despite widespread use to the contrary in casual conversation and the applied literature, in scientific papers use the latter terms to refer to size or amount. Technical writing consultants strongly recommend using terms with only one meaning or definition, or if a term with multiple definitions is necessary, then use the first definition.   For resolution, use “fine” or “coarse” rather than “high” or “low.”  For example, at line 45, what is meant by “lower AGB”?  Is it AGB in valleys instead of on mountains?  Is it AGB low in the canopy?  Or, do the authors mean “less AGB.”  At line 103, what is meant by “highest deforestation rates”?  Does it refer to deforestation at high altitudes or does it mean “greatest deforestation rates”?  At line 137, what is meant by “lowest cloud cover”?  Does it mean that the clouds were close to the ground?

 

SPECIFIC COMMENTS:

Line 21: The convention for scientific papers is to use words for numbers “nine” and less and digits for number “10” and greater.  See line 190.

 

Line 28:  What are the advantages?

 

Lines 43, 262: To what does “its” refer?

 

Line 49:  Remote sensors do not provide information on land surface.  Rather, spectral sensors provide information on the intensity of light, and laser scanners provide information on distance from pulse emission to pulse interception. However, remote sensors provide information that can be used to PREDICT land surface features, albeit with error and uncertainty.

 

Line 55:  What is a “feature engineering step”?

 

Line 56:  Perhaps “among” the most used but not necessarily THE “most” used.

 

Line 60:  How is this definition of “image segmentation” different from “image classification”?  Where does the segmentation come in?

 

Lines 90, 237, 616:  In his original paper, Breiman used the term “random forests” with lower case letters and “forests” in the plural.

 

Line 159:  What is meant by “one-hot coded”?

 

Lines 166-168, 271, 471, elsewhere:   A sample consists of a set of sample units so that the term “samples” means multiple sets of sample units.  Do all these references to “samples” really each mean multiple sets of sample units, or should all these references to “samples” really be references to “sample units”?

 

Line 185:  What are “skip connections”?

 

Line 186:  What is meant by “upsample”?

 

Line 286: How accurate were the visual interpretations?  What kind of checks were done?  See McRoberts et al, 2018, ISPRS Journal of Photogrammetry and Remote Sensing 142: 292–300, who show that even small interpreter errors for these kinds of classification problems can have serious effects on estimates, particularly estimates of uncertainty.

 

Line 376:  Specify that the advantages are for this particular study area, these particular land cover classes, and relative to RF, but not necessarily for all possible study areas, all land cover classes, or all competing classification techniques.

 

Lines 413, 468:  The discussions in Sections 4.2.1 and 4.3 are quite good.

 

Line 580: What is meant by “evenly distributed”?  Is there such a thing as an “even distribution”?  Do the authors mean a systematic distribution or some kind of random distribution?

 

Author Response

We thank the reviewer for her/his very relevant comments and suggestions. In the following paragraphs, the way each comment was addressed is indicated in bold.

Reviewer 1

Open Review: I would not like to sign my review report.

English language and style: English language and style are fine/minor spell check required.

Does the introduction provide sufficient background and include all relevant references? Yes.
Is the research design appropriate? Yes.
Are the methods adequately described? Can be improved.
Are the results clearly presented? Yes.
Are the conclusions supported by the results? Yes.



GENERAL COMMENTS:

The topic of the paper is timely, relevant, and appropriate for the journal.  However, a large number of wording revisions and clarifications are necessary. 

As per the title of the paper and the stated objectives, the emphasis is U-net and Sentinel data for a very particular classification problem.  First, exercise caution in generalizing the results of the study beyond these constraints.  Second, keep the results and discussion more tightly focused on the stated main emphasis.  The paper is rather long for a narrowly focused study.  Try to reduce the length of the paper to about 500 lines.

We made an effort to reduce the length of the manuscript; however, most of the parts that were removed were balanced by further additions requested by the reviewers, leaving the manuscript at almost the same length.

At the very beginning of Section 2.3, the authors should include a brief overview of the technical details of CNN and U-net, but in terms that general readers will understand.  Constrain the technical details regarding the methods to this section.  The authors' literature review in the Introduction is about the right length, but a forward reference to Section 2.3 could be included “for additional details.”

A section of the introduction was moved to the beginning of the Methods section, in order to avoid repetition and to present the details of CNN and U-net there. Additionally, several technical details were removed from the Methods section in order to avoid very specialized descriptions that could hinder fluid reading. Finally, a forward reference to Section 2.3 was included in the Introduction (Lines 63 - 74).

In the Introduction, the authors should include a brief section on how these methods have been used previously for similar problems, and they should briefly indicate why they were selected for this study, i.e., what particular advantages do they bring?  But, keep it brief.

In the Introduction section we already mention certain applications for which the U-net was used in similar LULC classification problems (lines 75 - 76). Additionally, earlier in that section we mention that the U-net has been frequently used due to its capability of summarizing patterns in both the spectral and spatial domains (lines 71 - 74). We consider that this sentence explains the advantages of using this algorithm. Finally, at the end of the Introduction, we mention that we expect more accurate results with the U-net, in comparison with random forests, due to this capability of summarizing patterns in the spectral and spatial domains, in contrast with the spectral domain only (as is the case for random forests; lines 98 - 104).

Although the estimation method described in the Olofsson et al paper [84] is appropriate (line 271), this entire procedure is more correctly characterized as simply stratified estimation and sampling.  It has been well-known since long, long before Olofsson et al, and is described in nearly all statistical books on sampling.  As such, the authors should refer to it as stratified sampling and estimation with references to familiar textbooks such as Cochran (Sampling Techniques, 1977), but also to Olofsson et al if the authors prefer.  In particular, do NOT convey the impression that this approach was first conceived and developed by Olofsson et al.  In addition, instead of the awkward, confusing and inappropriate characterization of the estimates as “corrected” as in Olofsson, simply refer to them as “the stratified estimates” in Table 2 and elsewhere.  The stratified estimators are known to be unbiased, so there is no need for a correction.  Olofsson started with the area estimates as simply the areas of the classes from the image classification, a biased method sometimes called “pixel counting”, and then used the stratified estimators to “correct” these estimates.  Because the authors make no mention of this biased “pixel counting” approach, there is no reason to refer to any corrections.  See Giannetti et al, 2020, Remote Sensing 12, 3720, and Nguyen et al, 2020, Remote Sensing, 12, 1367 for similar applications of stratified sampling and estimation for land cover area estimation.

We acknowledge that the stratified random sampling procedure was not conceived and developed by Olofsson et al., 2014. We apologize for giving this wrong impression in the manuscript. We changed the sentence to indicate that we followed a stratified random sampling design (Cochran, 1977; Card, 1982; Olofsson et al., 2014). Additionally, we agree with the reviewer that a better name for the “corrected estimates” is “unbiased estimates”. The new version now includes this change (lines 295 - 301).

The authors must describe in greater detail how the stratified estimates were calculated (lines 289-294).  First, the stratified estimators are crucial to understanding the methods and results.  Second, not everyone understands or is familiar with Openforis.  Specifically, the authors should include the mathematical expressions for the estimators.

We agree with the reviewer that further details should be given in order to understand the stratified estimators. In the current version of the manuscript, two equations were added to give more details in this respect: the equation for calculating the standard error of the estimated area proportion for each class, as well as the calculation of the 95 % confidence intervals for the area occupied by each class (lines 319 - 332).
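For reference, the stratified estimators in question have the following standard form (a sketch following Cochran, 1977 and Olofsson et al., 2014; the notation here is illustrative and the exact expressions in the revised manuscript may differ):

```latex
% Estimated area proportion of class k under stratified random sampling.
% W_i: mapped area proportion of stratum i; n_ik: number of sample units
% mapped as stratum i whose reference class is k; n_i.: sample size in
% stratum i.
\hat{p}_k = \sum_{i} W_i \, \frac{n_{ik}}{n_{i\cdot}}

% Standard error of the estimated proportion:
\mathrm{SE}(\hat{p}_k) = \sqrt{\sum_{i} W_i^{2} \,
  \frac{\frac{n_{ik}}{n_{i\cdot}}\left(1 - \frac{n_{ik}}{n_{i\cdot}}\right)}{n_{i\cdot} - 1}}

% Estimated area of class k with its 95% confidence interval,
% where A_tot is the total mapped area:
\hat{A}_k = A_{\mathrm{tot}} \, \hat{p}_k \pm 1.96 \, A_{\mathrm{tot}} \, \mathrm{SE}(\hat{p}_k)
```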

Avoid the use of subjective terms such as “best” and “better” whose criteria are not explicitly stated (lines 22, 74, 116, 173,176, and many, many other places).  Specifically, “best” or “better” with respect to what?  Similarly, what is meant by “relatively good” (line 346)?

We changed the terms “better” and “best” to several more specific alternatives, e.g., “higher F1-score”. In the sentences where the terms “better” or “best” referred to a comparison between or among different elements, both terms were preserved. The term “best U-net architecture” is defined as the one with the highest F1-score; we consider that there is no clear substitute for this term, so it was maintained. Finally, the term “relatively good” was eliminated and the idea was explained more precisely (line 388).

The first dictionary definitions of terms such as "high", "low", "over", "under", etc, refer to height, altitude, or vertical position, whereas the first dictionary definitions of terms such as "large", "small", "greater", "less", etc, refer to size or amount. Despite widespread use to the contrary in casual conversation and the applied literature, in scientific papers use the latter terms to refer to size or amount. Technical writing consultants strongly recommend using terms with only one meaning or definition, or if a term with multiple definitions is necessary, then use the first definition.   For resolution, use “fine” or “coarse” rather than “high” or “low.”  For example, at line 45, what is meant by “lower AGB”?  Is it AGB in valleys instead of on mountains?  Is it AGB low in the canopy?  Or, do the authors mean “less AGB.”  At line 103, what is meant by “highest deforestation rates”?  Does it refer to deforestation at high altitudes or does it mean “greatest deforestation rates”?  At line 137, what is meant by “lowest cloud cover”?  Does it mean that the clouds were close to the ground?

We thank the reviewer for her / his suggestion. We changed the terms “high” and “low” to “fine” and “coarse” when referring to resolution (lines 478, 584). In the case of “lower AGB values” we changed the phrase to “less AGB”, as it was more concise (line 46). For the other two cases, deforestation rates and cloud cover, we decided that high and low could be used, as they refer to numerical rates or percentages, so we kept the original terms “high / low cloud cover” (line 145) and “high / low deforestation rates” (line 110).

SPECIFIC COMMENTS:

Line 21: The convention for scientific papers is to use words for numbers “nine” and less and digits for number “10” and greater.  See line 190.

We thank the reviewer for pointing this out. In the updated version of the manuscript we changed the mentioned numbers to words (lines 214 - 216).

Line 28:  What are the advantages?

We removed the word “advantages” and now state that the U-net with MS and SAR images obtains a higher accuracy and F1-score in comparison with the other evaluated methods (lines 28 - 30).

Lines 43, 262: To what does “its” refer?

We changed the word “its” to refer explicitly to the differences between old-growth forests and plantations, as well as secondary forests, due to their differences in environmental management and biodiversity conservation (lines 43 - 44).

Line 49:  Remote sensors do not provide information on land surface.  Rather, spectral sensors provide information on the intensity of light, and laser scanners provide information on distance from pulse emission to pulse interception. However, remote sensors provide information that can be used to PREDICT land surface features, albeit with error and uncertainty.

We thank the reviewer for this suggestion. In the current version of the manuscript, we clarified this by changing the sentence to: “Previous studies have relied on remote sensors to obtain reflectance or backscattering signals of the land surface to predict different LULC” (lines 50 - 52).

Line 55:  What is a “feature engineering step”?

A feature engineering step refers to a procedure where the raw inputs of a machine or deep learning algorithm are transformed into features that can facilitate the discrimination among the classes of interest. For example, a common procedure is using vegetation indices instead of the multispectral bands to train these algorithms. We consider that this term might not be familiar to all readers; thus, we removed it and instead explain that no previous transformation of the inputs is required, e.g., calculating spectral transformations such as vegetation indices (lines 58 - 59).
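As an illustration of the removed term, the following minimal Python sketch (hypothetical arrays and band names, not code from the study) contrasts a hand-engineered vegetation-index feature with the raw band stack that a CNN such as the U-net can consume directly:

```python
import numpy as np

def ndvi_feature(red: np.ndarray, nir: np.ndarray) -> np.ndarray:
    """A classic feature-engineering step: derive NDVI from raw bands
    before training a classifier."""
    return (nir - red) / (nir + red + 1e-8)  # epsilon avoids division by zero

# Hypothetical 256 x 256 reflectance tiles.
red = np.random.rand(256, 256).astype(np.float32)
nir = np.random.rand(256, 256).astype(np.float32)

engineered = ndvi_feature(red, nir)        # hand-crafted input feature
raw_stack = np.stack([red, nir], axis=-1)  # raw bands a U-net can ingest as-is
```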

Line 56:  Perhaps “among” the most used but not necessarily THE “most” used.

We agree with the reviewer that the sentence is better formulated with “among” instead of “most”. The new sentence now incorporates this change (lines 60 - 61).

Line 60:  How is this definition of “image segmentation” different from “image classification”?  Where does the segmentation come in?

Previous studies have referred to image classification tasks using deep learning algorithms as image segmentation. However, this term might be confusing, as “image segmentation” has also been used to name the process of segmenting images into objects. This choice of terms is unfortunate, as the image segmentation to which several deep learning studies refer can be conceived as a process including both image segmentation and classification tasks. We consider that this term is not critical for the manuscript; thus, we removed it and only refer to the task as “obtaining a class prediction for each pixel in an image” (line 65).

Lines 90, 237, 616:  In his original paper, Breiman used the term “random forests” with lower case letters and “forests” in the plural.

We thank the reviewer for noticing this detail and agree with her / his suggestion. Although both terms can be found in the literature (i.e., random forest and random forests), we consider it better to call the algorithm as in the original paper. Thus, we updated all mentions of the algorithm to “random forests”.

Line 159:  What is meant by “one-hot coded”?

One-hot encoding is a term mostly used in the Tensorflow programming community that refers to the process of transforming the data into binary images for each class. We consider that future readers might not be familiar with this term, so we removed it, as the process itself is already explained (line 170).
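For illustration, a minimal sketch of the process (hypothetical label values; TensorFlow's tf.one_hot is used here only as an example of the operation the authors describe):

```python
import numpy as np
import tensorflow as tf

# Hypothetical 4 x 4 label mask with three LULC classes (0, 1, 2).
labels = np.array([[0, 1, 1, 2],
                   [0, 0, 1, 2],
                   [2, 2, 1, 0],
                   [0, 1, 2, 2]])

# One-hot encoding: one binary layer per class, shape (4, 4, 3).
one_hot = tf.one_hot(labels, depth=3)
print(one_hot.shape)  # (4, 4, 3)
```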

Lines 166-168, 271, 471, elsewhere:   A sample consists of a set of sample units so that the term “samples” means multiple sets of sample units.  Do all these references to “samples” really each mean multiple sets of sample units, or should all these references to “samples” really be references to “sample units”?

We thank the reviewer for making this suggestion. We agree that a sample consists of a set of sample units; thus, the use of the term “samples” might be slightly confusing to refer to observations. Following this comment, we decided to change the use of the term “samples” either to “observations” or “sample units”.

Line 185:  What are “skip connections”?

Skip connections refer to a process where the output of a hidden layer in the encoder part of the U-net is concatenated with the output of the decoder part in the following hidden layer. This process has the purpose of providing information with finer spatial resolution than the output of each hidden layer in the decoder part of the U-net. Thus, the skip connections help in obtaining detailed classifications as a final output. We reduced the section that described the U-net algorithm to only mention the general process followed by this architecture; thus, the passage mentioning the skip connections was removed (line 209).

Line 186:  What is meant by “upsample”?

Upsampling refers to increasing the spatial resolution of an image. In the U-net, the output of the encoder part is an image with increased spectral depth but coarser spatial resolution in comparison with the input image. Thus, in the decoder part of the U-net, the output image of the encoder part is returned to the original spatial resolution of the input image. We reduced the section that described the U-net algorithm to only mention the general process followed by this architecture (line 209).
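For readers curious about the two removed terms, the following minimal Keras sketch of a single U-net decoder step (hypothetical layer sizes; not the architecture trained in the study) shows both the upsampling and a skip connection:

```python
import tensorflow as tf
from tensorflow.keras import layers

def decoder_block(decoder_input, encoder_skip, filters):
    """One U-net decoder step: upsample the coarse feature map, then
    concatenate the skip connection from the matching encoder level
    to recover fine spatial detail."""
    x = layers.Conv2DTranspose(filters, 2, strides=2, padding="same")(decoder_input)
    x = layers.Concatenate()([x, encoder_skip])  # skip connection
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return x

# Hypothetical shapes: a 32x32 bottleneck map is upsampled to 64x64 and
# fused with the 64x64 encoder feature map from the same level.
bottleneck = tf.keras.Input(shape=(32, 32, 128))
skip = tf.keras.Input(shape=(64, 64, 64))
out = decoder_block(bottleneck, skip, filters=64)
print(out.shape)  # (None, 64, 64, 64)
```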

Line 286: How accurate were the visual interpretations?  What kind of checks were done?  See McRoberts et al, 2018, ISPRS Journal of Photogrammetry and Remote Sensing 142: 292–300, who show that even small interpreter errors for these kinds of classification problems can have serious effects on estimates, particularly estimates of uncertainty.

As in any other training set, we acknowledge that certain errors can be found in ours; however, we tried to minimize these errors by following several steps. First, the field data recorded in March 2019 were regarded as the information with the highest confidence; thus, they were always used as a reference to determine the class of each area. In cases where no field data were available to assign the class of a certain area, it was visually compared with other areas where field data were available. Additionally, VHR images provided by Google, Yandex and Bing were very useful for determining certain classes, particularly old-growth plantations and young plantations. Moreover, the interpreter who made the classifications is familiar with the study site and its different LULC classes, and was also the one responsible for acquiring the field data. Finally, by consulting other images from different periods (e.g., Sentinel-2), we had more information about the temporal variation of the land surface, which sometimes provided vital information for assigning an area to a certain class.

We thank the reviewer for sharing the abovementioned paper. Although it would have been desirable to perform a check similar to the one in McRoberts et al., 2018, some of the LULC classes used in our classification system cannot be contrasted with other data sources, as they are not considered in other datasets (e.g., roads, human settlements, and young and old-growth plantations are absent from the Global Forest Change dataset). Additionally, other available data sources might have more LULC classes that could be used to contrast our visual interpretations (e.g., Serie VI from Mexico’s National Institute of Statistics and Geography); however, these data have a scale (1:250 000) that is not useful for the scale of the interpretations we made.

Line 376:  Specify that the advantages are for this particular study area, these particular land cover classes, and relative to RF, but not necessarily for all possible study areas, all land cover classes, or all competing classification techniques.

We tried to narrow down the conclusions drawn from our results to the conditions and comparisons made in our study. In the opening section of the discussion we added a pair of sentences to mention that the results are relative to the algorithm and image comparisons made and that future studies could further explore the potential of these algorithms in different study sites and LULC systems (lines 406 - 424).

Lines 413, 468:  The discussions in Sections 4.2.1 and 4.3 are quite good.

We thank the reviewer for this comment.

Line 580: What is meant by “evenly distributed”?  Is there such a thing as an “even distribution”?  Do the authors mean a systematic distribution or some kind of random distribution?

We thank the reviewer for noticing this unclear phrase. We agree that a better formulation of this phrase is “randomly distributed”. The current version now includes this change (line 634).

Author Response File: Author Response.docx

Reviewer 2 Report

The paper titled ‘Land use land cover classification with U-net: advantages of combining Sentinel-1 and Sentinel-2 imagery’ deals with the combined use of multispectral and SAR imagery in the LULC classification using a deep learning approach. Moreover, the deep learning approach was compared with a machine learning approach using the random forest algorithm (RF). The defined classification system included twelve LULC classes.

This paper deals with a very interesting topic and it is well organized and written. I also appreciated the quality of the English language, very good and perfectly in tune with the scientific style. However, some major methodological issues concerning the image composition and the MS imagery used must be addressed. Definitely, I strongly agree to support this paper for its acceptance, but I ask the authors to consider my questions carefully.

Having said that, I agree with the acceptance of this manuscript after providing the following major corrections:

  • Lines 58-59 – actually, image segmentation does not rely upon the class prediction for each pixel in an image. Image segmentation deals with the grouping of pixels into objects. Please specify this better.
  • In line 146, the authors stated that ‘all the image processing was carried out in Google Earth Engine’. Indeed, several operations are described as implemented using the R platform and related packages. Maybe some packages like ‘rgee’ (Google Earth Engine for R) have been used? I ask the authors to clarify these aspects throughout the manuscript.
  • About the LULC classes, the authors also defined the ‘cloud’ class. A few remarks about that. First, I do not agree with that, considering that clouds are not a land cover or land use. Rather, I would expect the insertion of a cloud cover threshold for MS data. Indeed, using SAR data, clouds are not detected (as can be noticed by observing figure 4). This can be interpreted as an error in the accuracy assessment, but it is not if we refer to the actual land cover. The same reasoning should be applied to the cloud shadows, which potentially could be any one of the other eight LULC classes. As proposed in some papers dealing with the use of GEE and recently published in different scientific journals (i.e., https://doi.org/10.3390/ijgi10070464; https://doi.org/10.3390/rs13040586; https://doi.org/10.3390/rs13132510), I suggest the authors improve the dataset used by optimizing the input image composition to exclude the cloud cover. Finally, consider that in a multitemporal approach, considering clouds and their shadows could be misleading when comparing the obtained datasets.
  • By the way, also looking at figure 5, it can be noticed that clouds cover a significant part of the study area. As I suggested, this class should be avoided and a cloud-free image composite should be implemented. Another question is why all these images contain such consistent cloudiness.

Minor corrections

In figure 4, the LULC class ‘mature plantations’ is reported. This class does not appear in the table or elsewhere in the manuscript. Please revise accordingly.

Author Response

We thank the reviewer for her/his very relevant comments and suggestions. In the following paragraphs, the way each comment was addressed is indicated in bold.



Reviewer 2

Open Review: I would not like to sign my review report.

English language and style: English language and style are fine/minor spell check required.

Does the introduction provide sufficient background and include all relevant references? Can be improved.
Is the research design appropriate? Must be improved.
Are the methods adequately described? Yes.
Are the results clearly presented? Can be improved.
Are the conclusions supported by the results? Can be improved.



The paper titled ‘Land use land cover classification with U-net: advantages of combining Sentinel-1 and Sentinel-2 imagery’ deals with the combined use of multispectral and SAR imagery in the LULC classification using a deep learning approach. Moreover, the deep learning approach was compared with a machine learning approach using the random forest algorithm (RF). The defined classification system included twelve LULC classes.

This paper deals with a very interesting topic and it is well organized and written. I also appreciated the quality of the English language, very good and perfectly in tune with the scientific style. However, some major methodological issues concerning the image composition and the MS imagery used must be addressed. Definitely, I strongly agree to support this paper for its acceptance, but I ask the authors to consider my questions carefully.

Having said that, I agree with the acceptance of this manuscript after providing the following major corrections:

Lines 58-59 – actually, image segmentation does not rely upon the class prediction for each pixel in an image. Image segmentation deals with the grouping of pixels into objects. Please specify this better.

Previous studies have referred to image classification tasks using deep learning algorithms as image segmentation. However, this term might be confusing, as “image segmentation” has also been used to name the process of segmenting images into objects. This choice of terms is unfortunate, as the image segmentation to which several deep learning studies refer can be conceived as a process including both image segmentation and classification tasks. We consider that this term is not critical for the manuscript; thus, we removed it and only refer to the task as “obtaining a class prediction for each pixel in an image” (line 65).

In line 146, the authors stated that ‘all the image processing was carried out in Google Earth Engine’. Indeed, several operations are described as implemented using the R platform and related packages. Maybe some packages like ‘rgee’ (Google Earth Engine for R) have been used? I ask the authors to clarify these aspects throughout the manuscript.

Although there are packages like “rgee” that allow working with Google Earth Engine from an R environment, we did not use them. We are more familiar with the Google Earth Engine Javascript API; thus, this tool was used to preprocess and export the images, which were then handled in an R environment. In the current version of the manuscript we specify that the Google Earth Engine Javascript API was used for the image preprocessing (line 155).



About the LULC classes, the authors also defined the ‘cloud’ class. A few remarks about that. First, I do not agree with that, considering that clouds are not a land cover or land use. Rather, I would expect the insertion of a cloud cover threshold for MS data. Indeed, using SAR data, clouds are not detected (as can be noticed by observing figure 4). This can be interpreted as an error in the accuracy assessment, but it is not if we refer to the actual land cover. The same reasoning should be applied to the cloud shadows, which potentially could be any one of the other eight LULC classes. As proposed in some papers dealing with the use of GEE and recently published in different scientific journals (i.e., https://doi.org/10.3390/ijgi10070464; https://doi.org/10.3390/rs13040586; https://doi.org/10.3390/rs13132510), I suggest the authors improve the dataset used by optimizing the input image composition to exclude the cloud cover. Finally, consider that in a multitemporal approach, considering clouds and their shadows could be misleading when comparing the obtained datasets.

We agree with the reviewer that clouds and shadows cannot be considered a LULC class. In the case of SAR inputs, the manually classified samples corresponded to the LULC classes without clouds or shadows. However, in the case of MS and MS + SAR inputs, the manually classified images included the clouds and shadows classes, as we expected (based on previous studies) that MS would give more information to correctly classify the images (and clouds and shadows are only detectable in MS imagery). We are aware that there are methods that enable removing clouds and shadows using multitemporal composites. This approach might be adequate for pixel-oriented classification schemes, as each pixel is classified independently from its neighbors; thus, the effect of multitemporal artifacts, such as spectral variations caused by a difference in the time of acquisition, might be irrelevant (although it might be important for certain classes with high temporal variation in their spectral signal). However, for algorithms such as the U-net that take into account spatial features, the effect of these artifacts could be more significant, since the spatial configuration of these temporal spectral variations might affect what the U-net learns to detect as a particular LULC class. Furthermore, few studies have addressed the possible effect of multitemporal composites on the LULC classification potential of deep learning algorithms, as most of the studies that evaluate the U-net for LULC classification have used a single-date image. Therefore, we consider that the approach we followed has the advantage of associating a certain LULC with a particular date, without having to assess the possible effect of multitemporal composites on the obtained LULC. We agree that testing whether the time window or function (e.g., mean, median) used to build a multitemporal composite affects the capability of algorithms such as the U-net to obtain LULC maps is a very promising line of research for future studies; however, we decided to make an initial evaluation of the use of a deep learning algorithm to obtain a LULC map with a single-date image. Nevertheless, we added a sentence in the Methods section to indicate that clouds and shadows cannot be considered a LULC class and that these areas would correspond to masked areas in the final LULC map (lines 168 - 170).

On the other hand, we agree that in multitemporal evaluations the areas with clouds and shadows will vary with the time of acquisition. However, correctly identifying these areas in order to mask them afterwards is also useful for obtaining detailed LULC maps, especially for studies that do not use multitemporal composites for the reasons mentioned above. We are aware that Sentinel-2 images have their own quality assessment band (QA60) that could be used to mask clouds and shadows prior to performing the U-net and RF training; however, the U-net is incapable of working with masked images (as it works with convolutions in the spatial domain), so every pixel in an image needs to have a class assigned to it. For this reason, we considered it a better approach to work with the clouds and shadows classes, instead of grouping these two into a “masked” class. Finally, we agree that in order to obtain the LULC map of the study area, predictions need to be made in the areas covered by clouds and shadows. However, the main objective of the study was to show the advantages of using the U-net vs the RF algorithm, as well as using MS + SAR imagery instead of only one of them; thus, the emphasis was placed in this respect.
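For context on the QA60 band mentioned above, this is roughly what the masking-and-compositing approach suggested by the reviewer looks like in the Google Earth Engine Python API (an illustrative sketch with an arbitrary date window; the authors used the JavaScript API and deliberately did not adopt this compositing approach):

```python
import ee

ee.Initialize()  # assumes prior authentication with ee.Authenticate()

def mask_s2_clouds(image):
    """Mask opaque clouds and cirrus using the Sentinel-2 QA60 bitmask."""
    qa = image.select("QA60")
    cloud_bit, cirrus_bit = 1 << 10, 1 << 11
    mask = (qa.bitwiseAnd(cloud_bit).eq(0)
              .And(qa.bitwiseAnd(cirrus_bit).eq(0)))
    return image.updateMask(mask)

# A cloud-free median composite over a hypothetical six-month window.
composite = (ee.ImageCollection("COPERNICUS/S2_SR")
               .filterDate("2019-01-01", "2019-06-30")
               .map(mask_s2_clouds)
               .median())
```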

By the way, also looking at figure 5, it can be noticed that clouds cover a significant part of the study area. As I suggested, this class should be avoided and a cloud-free image composite should be implemented. Another question is why all these images contain such consistent cloudiness.

We consider that the main purpose of the study was to compare a deep learning algorithm against a machine learning algorithm. This comparison will help future scientists assess the advantages of deep learning algorithms over other machine learning equivalents. A large portion of the study area is covered by clouds for two reasons. The first is that the study area has a tropical humid climate and is thus usually covered by clouds throughout the year. The second is that the field data were considered the most reliable information, as the LULC assignation was performed with direct observations of the land surface; for this reason, the images selected as inputs for the LULC mapping were the ones closest in date to the field data collection. Finally, due to the dynamic nature of the LULC classes in the region, this procedure helps minimize the error of assigning wrong classes in the manual classification images, especially for LULC classes that are difficult to assign with only remote sensing imagery (e.g., young and old-growth plantations, as well as secondary forest).
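Minor corrections

In figure 4, the LULC class ‘mature plantations’ is reported. This class does not appear in the table or elsewhere in the manuscript. Please revise accordingly.

We thank the reviewer for noticing this error. We also noticed that this error was present in Fig. 5. Both figures were updated to include the class “old-growth plantation” instead of “mature plantation”.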





Author Response File: Author Response.docx

Reviewer 3 Report

 

It is not clear how the authors made the training dataset. In Section 2.2.2 they present the way they did it, but it remains unclear how it was done. Also, it is important to have an idea about the accuracy of the training dataset and any effects it might have on the training phase.

I was wondering also if a classical accuracy assessment with omission and commission errors would help to understand more about the performance of the methods.

Author Response

We thank the reviewer for her/his very relevant comments and suggestions. In the following paragraphs, the way each comment was addressed is indicated in bold.



Reviewer 3


Open Review: I would not like to sign my review report.

English language and style: English language and style are fine/minor spell check required.

Does the introduction provide sufficient background and include all relevant references? Can be improved.
Is the research design appropriate? Can be improved.
Are the methods adequately described? Can be improved.
Are the results clearly presented? Can be improved.
Are the conclusions supported by the results? Can be improved.

Comments and Suggestions for Authors

 

It is not clear how the authors made the training dataset. In Section 2.2.2 they present the way they did it, but it remains unclear how it was done. Also, it is important to have an idea about the accuracy of the training dataset and any effects it might have on the training phase.

We now mention that the training dataset was made using visual interpretation based on in-field observations and other remote sensing inputs, such as Sentinel-2 images, Planet images, as well as Google, Bing and Yandex very high resolution images (lines 162 - 163). Additionally, we mention that the clouds and shadows classes were assigned by interpreting only the Sentinel-2 image (line 168). Finally, in the current version of the manuscript we added the accuracy and F1-scores achieved on the training datasets by the tested architectures and algorithms.

I was wondering also if a classical accuracy assessment with omission and commission errors would help to understand more about the performance of the methods.

We chose the F1-score metric to evaluate and compare the different architectures and algorithms, so the results for all the classes were comparable with a single metric. We acknowledge that a classical accuracy assessment might be more easily interpreted; thus, the appendix tables contain the error matrix with its corresponding user’s and producer’s accuracy. These accuracies are inversely related to the commission and omission errors, i.e., omission error = 1 – producer’s accuracy and commission error = 1 – user’s accuracy (lines 698 - 727).
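To make this correspondence concrete, a minimal sketch (hypothetical counts, with map classes on the rows and reference classes on the columns) computing user’s and producer’s accuracy and the per-class F1-score from an error matrix:

```python
import numpy as np

# Hypothetical 3-class error matrix of sample-unit counts:
# rows = map (predicted) class, columns = reference class.
cm = np.array([[50,  3,  2],
               [ 4, 40,  6],
               [ 1,  5, 45]])

users_acc = np.diag(cm) / cm.sum(axis=1)      # 1 - commission error
producers_acc = np.diag(cm) / cm.sum(axis=0)  # 1 - omission error
f1 = 2 * users_acc * producers_acc / (users_acc + producers_acc)

print(np.round(users_acc, 3))      # per-class user's accuracy
print(np.round(producers_acc, 3))  # per-class producer's accuracy
print(np.round(f1, 3))             # per-class F1-score
```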

Author Response File: Author Response.docx

Round 2

Reviewer 1 Report

GENERAL COMMENTS: The paper is almost unreadable with all the change tracker red text and red lines and comments.  Use simple yellow highlighting to indicate revisions.

The authors have correctly revised the text related to stratified estimation and have included the appropriate statistical estimators.

The authors state that their objective is to evaluate the potential of the U-net in combination with Sentinel-1 and Sentinel-2 images to develop a detailed LULC classification with four particular emphases: (1) to differentiate young and old-growth plantations, (2) to differentiate secondary and old-growth forests, (3) to compare U-net and the random forests (RF) algorithm, and (4) to evaluate the effectiveness of various sources of remotely sensed data on classification accuracy.  The authors conclude that the combination of MS+SAR data with U-net produced the most accurate classification, which apparently addresses the third and fourth objectives.  Regarding the first two objectives, the authors state that old-growth forests were apparently well-classified (line 486), but that young plantations and secondary forests were poorly classified (lines 457, 486).  Thus, it seems that regardless of which combination of remotely sensed data and classification method was most accurate, none were accurate enough to achieve the primary classification objectives.

A strong comment for the original version of this paper, which ran to 900 lines, was that the paper was too long relative to the content.  This version of the paper is not only not shorter but, in fact, more than 200 lines longer.  The objectives of the paper, the content and the questionable achievement of the objectives simply do not warrant a paper of 1100 lines.  Again, as per the comment for the previous version of the paper, the length should be reduced to no more than 500 lines. Focus explicitly and exclusively on these primary objectives and delete everything peripheral or tangential to them.

For the previous version of the paper, a comment was provided regarding the accuracy of the visual interpretations and a reference was provided regarding the consequences of misinterpretations.  The authors were advised to include an assessment of the accuracy of their visual interpretations.   The authors have declined to do so. 

For the previous version of the paper, the authors were strongly encouraged to use terms such as “large” and “small” rather than “high” and “low” when referring to size or amount.  The authors responded that they have done so except, for example, when referring to “high/low” cloud cover.  What does this mean?  That the clouds were low (close to the earth) or that their coverage was sparse?  For this version of the paper, accuracies and scores are large or small, they are not high or low.  This version of the paper is replete with use of “high”, “low” and related terms to refer to size or amount.  Check the entire paper for each instance and revise accordingly.

For the previous version of the paper, the authors were strongly encouraged to avoid subjective terms such as “best” and “better” whose criteria are not explicitly stated.  The authors have made a few changes, but not enough.  Instead of referring to the “best architecture” refer to the “most accurate architecture.”  Revise the entire paper accordingly.

SPECIFIC COMMENTS: Line 337:  The verb “to sample” means to select a subset of a population.  Did the authors select a subset of the observations as the statement indicates, or did they select a subset of the population?  Presumably the latter.   Change to “sample observations”, not “sampled observations.” 

Lines 193, 221, 342, 607, 762, elsewhere: The term “data” is plural and requires a plural verb such as “were” or “are.”

Line 400:  Bias is a property of a statistical estimator (a formula, a procedure) but not an estimate.  If something is unbiased, it means it equals the true value.  How do the authors know the estimates equalled the true value, assuming they do not know the true value?  Further, if they do know the true value, then why do they need estimates?

Line 786:  What kind of randomness?  Simple random?  Stratified random?  Two-stage? Something else?

Author Response

We thank the reviewer for her/his very relevant comments and suggestions. In the following paragraphs, the way each comment was addressed is indicated in blue.

Reviewer 1

GENERAL COMMENTS: The paper is almost unreadable with all the change tracker red text and red lines and comments. Use simple yellow highlighting to indicate revisions.

We apologize for the inconvenient format used to track the changes. The present version now marks the changes with a yellow background.


The authors have correctly revised the text related to stratified estimation and have included the appropriate statistical estimators.

The authors state that their objective is to evaluate the potential of the U-net in combination with Sentinel-1 and Sentinel-2 images to develop a detailed LULC classification with four particular emphases: (1) to differentiate young and old-growth plantations, (2) to differentiate secondary and old-growth forests, (3) to compare U-net and the random forests (RF) algorithm, and (4) to evaluate the effectiveness of various sources of remotely sensed data on classification accuracy. The authors conclude that the combination of MS+SAR data with U-net produced the most accurate classification, which apparently addresses the third and fourth objectives. Regarding the first two objectives, the authors state that old-growth forests were apparently well-classified (line 486), but that young plantations and secondary forests were poorly classified (lines 457, 486). Thus, it seems that regardless of which combination of remotely sensed data and classification method was most accurate, none were accurate enough to achieve the primary classification objectives.


A strong comment for the original version of this paper, which ran to 900 lines, was that the paper was too long relative to the content. This version of the paper is not only not shorter but, in fact, more than 200 lines longer. The objectives of the paper, the content and the questionable achievement of the objectives simply do not warrant a paper of 1100 lines. Again, as per the comment for the previous version of the paper, the length should be reduced to no more than 500 lines. Focus explicitly and exclusively on these primary objectives and delete everything peripheral or tangential to them.

In the last version, we found that the pdf that included the changes reduced the width of the sheet; thus, the length of the manuscript was artificially increased to 1100 lines. We apologize for giving the wrong impression that the length of the manuscript had in fact increased. The present version of the manuscript has a length of approximately 650 lines without the references, while the complete document consists of approximately 940 lines. We again tried to compensate for the additions requested by the reviewers by removing additional sentences, especially in the discussion section.

We apologize for insisting on the length of the manuscript, but we do not share the reviewer’s suggestion to reduce the length any further. We consider that removing any of the current sections of the manuscript would decrease the quality of the paper. The updated version of the manuscript offers a complete discussion of the potential of the different image inputs (MS, SAR, MS + SAR) and algorithms (U-net, RF), the possible sources of error, a comparison with similar scientific literature and certain methodological considerations. These topics cannot be viewed as peripheral in light of our methods and results. Nevertheless, we decided to shorten the opening paragraph of the discussion in order to remove some lines of the manuscript and compensate for the newer additions.

Additionally, the instructions for authors of the journal only mention a minimum length of 18 pages for articles such as ours, and the present version of the manuscript complies with this requirement. Finally, we asked the editors of the journal whether there was a problem with the actual length of the manuscript, and they answered that there was not. Thus, given the arguments provided above, we decided to keep the approximately 900-line manuscript for publication, in order to maintain a high-quality paper.


For the previous version of the paper, a comment was provided regarding the accuracy of the visual interpretations and a reference was provided regarding the consequences of misinterpretations. The authors were advised to include an assessment of the accuracy of their visual interpretations. The authors have declined to do so.

The reference and method suggested by the reviewer in McRoberts et al., 2018 indeed present a method to assess the accuracy of the visual interpretations. We thank the reviewer for recommending this paper. Because a single interpreter generated the manual classification data used to train the U-net and random forests algorithms, we made an additional assessment of the accuracy of the visual interpretations by making predictions over a sample of the ground data without access to the “ground truth” labels. In this procedure, the same images (Yandex, Bing, Google, Planet and Sentinel-2) were interpreted by a single interpreter in a single repetition. We briefly mention this in the Methods and Discussion sections to provide further insight into the assessment of the visual interpretations. We now acknowledge in the paper that using more than one interpreter would be one possible modification of the study that could result in training data with less error, as well as a mechanism to evaluate the interpreter error in the data.

Although the McRoberts paper proposes two methods to correct the estimates for interpreter error, our training data were produced by a single interpreter; thus, we lack visual interpretations from multiple interpreters, as well as repetitions. We could have asked multiple interpreters to make a visual interpretation of the same observations; however, this could potentially change the training data with which all the U-net and random forests algorithms were trained, as the reference data would then correspond to the most voted class for each observation in the training data (instead of a single observation). Thus, we limited ourselves to mentioning the accuracy obtained in this procedure and the possible implications of these results.


For the previous version of the paper, the authors were strongly encouraged to use terms such as “large” and “small” rather than “high” and “low” when referring to size or amount. The authors responded that they have done so except, for example, when referring to “high/low” cloud cover. What does this mean? That the clouds were low (close to the earth) or that their coverage was sparse? For this version of the paper, accuracies and scores are large or small, they are not high or low. This version of the paper is replete with use of “high”, “low” and related terms to refer to size or amount. Check the entire paper for each instance and revise accordingly.

In the updated version of the manuscript we added the word “percentage” to refer to cloud cover. Thus, the manuscript now reads “low cloud cover percentage”.

We understand the reviewer’s point that high and low do not refer to size or amount. However, we found that several previous papers use the terms high and low to describe accuracies and scores. Thus, we consider that these terms might sound more familiar to future readers. Some of the cited studies where we found these terms are: Flood et al., 2019; Stoian et al., 2019; Du et al., 2020; Wagner et al., 2019; Gargiulo et al., 2020; Hoeser et al., 2020; Kattenborn et al., 2021; Ma et al., 2019 and others not included in the article, such as Olofsson et al., 2013.

Olofsson, P., Foody, G. M., Stehman, S. V., & Woodcock, C. E. (2013). Making better use of accuracy data in land change studies: Estimating accuracy and area and quantifying uncertainty using stratified estimation. Remote Sensing of Environment, 129, 122–131. https://doi.org/10.1016/j.rse.2012.10.031



For the previous version of the paper, the authors were strongly encouraged to avoid subjective terms such as “best” and “better” whose criteria are not explicitly stated. The authors have made a few changes, but not enough. Instead of referring to the “best architecture” refer to the “most accurate architecture.” Revise the entire paper accordingly.

We followed the recommendation made by the reviewer and changed the term “best architecture” to “most accurate architecture” throughout the manuscript.


SPECIFIC COMMENTS: Line 337: The verb “to sample” means to select a subset of a population. Did the authors select a subset of the observations as the statement indicates, or did they select a subset of the population? Presumably the latter. Change to “sample observations”, not “sampled observations.”

We agree with the reviewer that the correct term should be “sample observations”. We added this change in the updated version of the manuscript.

Lines 193, 221, 342, 607, 762, elsewhere: The term “data” is plural and requires a plural verb such as “were” or “are.”

We apologize for not making consistent use of plural verbs to refer to data. The updated manuscript now includes this change.


Line 400: Bias is a property of a statistical estimator (a formula, a procedure) but not an estimate. If something is unbiased, it means it equals the true value. How do the authors know the estimates equalled the true value, assuming they do not know the true value? Further, if they do know the true value, then why do they need estimates?

We are aware of the statistical meaning of bias. In this case, the unbiased estimates are not single values, but rather the interval of values between the unbiased area reported in Table 2 ± its 95 % CI (confidence interval). This means that, with 95 % confidence, the population parameter (“true value”) is contained in that interval. Thus, we do not know the exact value for the population, but rather an interval that has a 95 % probability of containing the true value. Additionally, we used the same term, “unbiased area” estimates, as the studies cited in the paper, such as Cochran 1977; Card 1982; Olofsson et al., 2014, and other studies, e.g., Bullock et al., 2019; Arévalo et al., 2020; McRoberts et al., 2018.

Bullock, E. L., Nolte, C., Segovia, A. R., & Woodcock, C. E. (2019). Ongoing forest disturbance in Guatemala’s protected areas. Remote Sensing in Ecology and Conservation, 6, 141–152. https://doi.org/10.1002/rse2.130

Arévalo, P., Bullock, E. L., Woodcock, C. E., & Olofsson, P. (2020). A Suite of Tools for Continuous Land Change Monitoring in Google Earth Engine. Frontiers in Climate, 2, 576740. https://doi.org/10.3389/fclim.2020.576740

McRoberts, R. E., Stehman, S. V., Liknes, G. C., Næsset, E., Sannier, C., & Walters, B. F. (2018). The effects of imperfect reference data on remote sensing-assisted estimators of land cover class proportions. ISPRS Journal of Photogrammetry and Remote Sensing, 142(February), 292–300. https://doi.org/10.1016/j.isprsjprs.2018.06.002




Line 786: What kind of randomness? Simple random? Stratified
random? Two-stage? Something else?

We refer to simple random sampling. The updated version of the paper now reads "random spatial distribution" in order to be more specific about what we are referring to.
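For clarity, "random spatial distribution" here means simple random sampling of locations, i.e., every position in the study area has the same probability of selection. A minimal sketch of the idea (the bounding box and sample size are hypothetical, not those of this study):

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Hypothetical study-area bounding box in projected coordinates (m).
xmin, ymin, xmax, ymax = 600_000.0, 1_880_000.0, 640_000.0, 1_920_000.0

n_points = 300                          # hypothetical sample size
xs = rng.uniform(xmin, xmax, n_points)  # each x drawn uniformly
ys = rng.uniform(ymin, ymax, n_points)  # each y drawn uniformly
points = np.column_stack([xs, ys])      # (n_points, 2) sample locations
```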





Author Response File: Author Response.pdf

Reviewer 2 Report

This revised version of the paper titled 'Land use land cover classification with U-net: advantages of combining Sentinel-1 and Sentinel-2 imagery' has improved somewhat. However, I must note that some of the rebuttals provided by the authors are not satisfactory.

As specified in my previous comments, the present paper deals with the combined use of multispectral and SAR imagery for LULC classification using a deep learning approach. Moreover, the deep learning approach was compared with a machine learning approach using the random forest (RF) algorithm, and the defined classification system included twelve LULC classes. I confirm that the research topic covered by this paper is very interesting.

However, some major methodological issues concerning the image composition and the MS imagery used must be addressed. I strongly support this paper for acceptance, but I ask the authors to consider my questions carefully.

That said, I agree with the acceptance of this manuscript after the following major corrections are made, which I requested previously but for which suitable revisions were not provided in the manuscript. In fact, more information can be found in the responses to my comments; therefore, I ask the authors to follow my suggestions more closely in the next version of this paper. Moreover, they can use some of the rebuttals given in the answers to my previous comments to enrich this manuscript.

As I specified in my previous comments, the authors also defined a 'cloud' class among the other LULC classes. As written in my earlier comments, I do not agree with that, considering that clouds are not a land cover or land use. Therefore, a suitable answer to this issue must be provided in the manuscript. Instead, I would expect the insertion of a cloud cover threshold for the MS data. Why did the authors not provide this threshold? Indeed, using SAR data, clouds are not detected (as can be noticed by observing Figure 4). This can be interpreted as an error in the accuracy assessment, but it is not if we refer to the actual land cover. The same reasoning should be applied to the cloud shadows, which could potentially be any one of the other eight LULC classes.

Moreover, as proposed in some papers dealing with the use of GEE and recently published in different scientific journals (i.e., https://doi.org/10.3390/ijgi10070464; https://doi.org/10.3390/rs13040586; https://doi.org/10.3390/rs13132510), I suggest the authors discuss input image composition optimization, which can surely help this research. Moreover, input image composition optimization can significantly help in excluding cloud cover. Finally, consider that, in a multitemporal approach, retaining clouds and their shadows could be misleading when comparing the obtained datasets. Indeed, as specified in my previous comments, looking at Figure 5 it can be noticed that clouds cover a significant part of the study area. As I suggested, this class should be avoided, and a cloud-free image composite should be implemented.

Technical comments

Figure 2 is not readable in its current graphical form.

Author Response

We thank the reviewer for her/his very relevant comments and suggestions. In the following paragraphs, the way each comment was addressed is indicated in blue.

Reviewer 2

This revised version of the paper titled 'Land use land cover classification with U-net: advantages of combining Sentinel-1 and Sentinel-2 imagery' has improved somewhat. However, I must note that some of the rebuttals provided by the authors are not satisfactory.
As specified in my previous comments, the present paper deals with the combined use of multispectral and SAR imagery for LULC classification using a deep learning approach. Moreover, the deep learning approach was compared with a machine learning approach using the random forest (RF) algorithm, and the defined classification system included twelve LULC classes. I confirm that the research topic covered by this paper is very interesting.


However, some major methodological issues concerning the image composition and the MS imagery used must be addressed. I strongly support this paper for acceptance, but I ask the authors to consider my questions carefully.

That said, I agree with the acceptance of this manuscript after the following major corrections are made, which I requested previously but for which suitable revisions were not provided in the manuscript. In fact, more information can be found in the responses to my comments; therefore, I ask the authors to follow my suggestions more closely in the next version of this paper. Moreover, they can use some of the rebuttals given in the answers to my previous comments to enrich this manuscript.

We thank the reviewer for her/his suggestion. In the present version of the manuscript, we added a paragraph in the final part of the discussion on the implications of working with a single-date image versus multitemporal composites. We hope this addition meets the reviewer's expectations.


As I specified in my previous comments, the authors also defined a 'cloud' class among the other LULC classes. As written in my earlier comments, I do not agree with that, considering that clouds are not a land cover or land use. Therefore, a suitable answer to this issue must be provided in the manuscript. Instead, I would expect the insertion of a cloud cover threshold for the MS data. Why did the authors not provide this threshold? Indeed, using SAR data, clouds are not detected (as can be noticed by observing Figure 4). This can be interpreted as an error in the accuracy assessment, but it is not if we refer to the actual land cover.

We agree with the reviewer that clouds and shadows cannot be considered LULC classes. We initially considered that showing the complete classification results, including clouds and shadows, could give a more general idea of the advantages of the U-net in comparison with random forests. However, given that the paper focuses on LULC, we decided to limit the results to the strictly LULC classes. Therefore, we eliminated the verification observations made for clouds and shadows from the error matrices and recalculated the accuracies and F1-scores for the most accurate U-net architectures, the random forest method, and the estimates for the complete LULC map. Additionally, the proportions of the study area occupied by each class were recalculated. We hope this change highlights the potential of the U-net to classify LULC classes and resolves the problem of including clouds and shadows in the classification system.
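As an illustration of the recalculation, the sketch below drops the rows and columns of non-LULC classes from a confusion matrix and recomputes the overall accuracy and per-class F1-scores; the matrix and class indices are hypothetical, not our actual verification data:

```python
import numpy as np

# Hypothetical confusion matrix (rows = map labels, columns = reference
# labels); the last two classes stand in for 'cloud' and 'shadow'.
cm = np.array([[50, 4, 2, 1],
               [3, 60, 1, 0],
               [1, 2, 30, 2],
               [0, 1, 1, 25]], dtype=float)

keep = [0, 1]                     # indices of the strictly LULC classes
cm_lulc = cm[np.ix_(keep, keep)]  # remove cloud/shadow rows and columns

oa = np.trace(cm_lulc) / cm_lulc.sum()              # overall accuracy
precision = np.diag(cm_lulc) / cm_lulc.sum(axis=0)  # user's accuracy
recall = np.diag(cm_lulc) / cm_lulc.sum(axis=1)     # producer's accuracy
f1 = 2 * precision * recall / (precision + recall)

print(f"OA = {oa:.3f}; per-class F1 = {np.round(f1, 3)}")
```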

The same reasoning should be applied to the cloud shadows, which could potentially be any one of the other eight LULC classes. Moreover, as proposed in some papers dealing with the use of GEE and recently published in different scientific journals (i.e., https://doi.org/10.3390/ijgi10070464; https://doi.org/10.3390/rs13040586; https://doi.org/10.3390/rs13132510), I suggest the authors discuss input image composition optimization, which can surely help this research. Moreover, input image composition optimization can significantly help in excluding cloud cover.

We are aware that these methods exist; however, none of these studies uses a CNN-based method. For example, Pan et al., 2021 use both Sentinel-2 and Landsat 7-8 to construct an annual map of winter crops using a time-series approach; Luo et al., 2021 use gap-filled Sentinel-2 monthly composites to predict LULC with random forests; and Praticò et al., 2021 also use a per-pixel approach, performing classifications with SVM, RF and CART. All of them use a pixel-based approach for their classification tasks, which does not consider the spatial domain of the image when classifying each pixel.

At the planning phase of this study, we faced a dilemma as to which images should be used to make the LULC predictions. We considered two options: 1) using a single-date image, which would minimize the differences between the field data and the image, as well as potential artifacts caused by a scene containing pixels from different dates (such as unremoved clouds or shadows, or reflectance differences caused by different acquisition dates); and 2) using a multitemporal composite, which would obtain reflectance information for the complete scene but could include larger differences with respect to the in-field data and potentially more image artifacts. Among previous studies that have used the U-net for different classification tasks (e.g., Wagner et al., 2019; Flood et al., 2019), we found that most used a single-date image. Additionally, we did not find studies that evaluated the effect of multitemporal composition on the results obtained with a CNN-based algorithm. We agree with the reviewer that this aspect is worth studying, particularly for CNN-based algorithms, which take into account the spatial context of each pixel to determine its class. However, we considered it beyond the scope of this study (i.e., optimizing the selection of the image to obtain the highest accuracies and F1-scores, as in Praticò et al., 2021). The present study was more interested in evaluating the performance of a CNN-based algorithm in comparison with a random forest algorithm to produce a LULC map, rather than in selecting the image that could obtain the highest accuracies. Nevertheless, we now discuss this topic in the final part of the discussion.
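For reference, a cloud-screened multitemporal composite of the kind the reviewer suggests can be built in a few lines with the GEE Python API; the sketch below is only illustrative, and the geometry, date range and cloud threshold are assumptions, not the settings of this study:

```python
import ee

ee.Initialize()

# Hypothetical area of interest and acquisition window.
aoi = ee.Geometry.Rectangle([-91.2, 17.0, -90.6, 17.6])

composite = (
    ee.ImageCollection('COPERNICUS/S2_SR')       # Sentinel-2 surface reflectance
    .filterBounds(aoi)
    .filterDate('2020-01-01', '2020-12-31')
    .filter(ee.Filter.lt('CLOUDY_PIXEL_PERCENTAGE', 20))  # scene-level cloud threshold
    .median()                                    # per-pixel median composite
    .clip(aoi)
)
```

Such a composite removes most clouds at the cost of mixing acquisition dates, which is precisely the trade-off described above.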


Finally, consider that, in a multitemporal approach, retaining clouds and their shadows could be misleading when comparing the obtained datasets. Indeed, as specified in my previous comments, looking at Figure 5 it can be noticed that clouds cover a significant part of the study area. As I suggested, this class should be avoided, and a cloud-free image composite should be implemented.

We discussed this aspect, and the changes made to the manuscript following this recommendation, in our response to the first comment made by this reviewer.


Technical comments


Figure 2 is not readable in its current graphical form.

The font size in the figure was increased so that it is easier to read.

Reviewer 3 Report

none

Author Response

We thank the reviewer for the review of the paper.
