Article
Peer-Review Record

AFMUNet: Attention Feature Fusion Network Based on a U-Shaped Structure for Cloud and Cloud Shadow Detection

Remote Sens. 2024, 16(9), 1574; https://doi.org/10.3390/rs16091574
by Wenjie Du 1, Zhiyong Fan 1,2,*, Ying Yan 1,2, Rui Yu 1 and Jiazheng Liu 1
Reviewer 1: Anonymous
Reviewer 2:
Reviewer 3:
Submission received: 24 January 2024 / Revised: 19 March 2024 / Accepted: 19 April 2024 / Published: 28 April 2024
(This article belongs to the Special Issue Remote Sensing Image Classification and Semantic Segmentation)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

Review of the manuscript (article) titled AFMUNet: Attention Feature Fusion Network based on a U-shaped Structure

 

In this article, a convolutional self-attention mechanism feature fusion network model for cloud shadow segmentation based on a U-shaped structure is proposed. Compared with previous deep learning and segmentation methods, the approach proposed in this article achieves a significant improvement in accuracy on cloud shadow segmentation tasks. Since the experimental results on the selected dataset show that the segmentation results of this method are better than those of existing methods, this article is a valuable contribution, and the proposed model is of great significance for practical cloud shadow segmentation, because it applies repeated down-sampling to extract high-level features.

 

Below are some suggestions for improving your article.

·       In lines 5 and 7, the e-mail addresses of the authors are missing.

·       In lines 46–49 should be written „... other parts, such as Saunders and Kriebel [3] who processed the NOAA-9 dataset over a week by determining thresholds for a range of physical parameters such as cloud-top temperatures, optical depths, and liquid water content.“ instead of „... other parts, such as SAUNDERS et al. who processed the NOAA-9 dataset over a week by determining thresholds for a range of physical parameters such as cloud-top temperatures, optical depths, and liquid water content [Error! Reference source not found.].“ (two authors)

·       In lines 59–62 should be written „... for images, Danda et al. and Liu et al. [8,9] constructed skeleton features to help analyze the morphology of the cloud and thus separate it from other regions by using a gray-level morphological edge extraction method, ....“ instead of „... for images, Danda and Xiang Liu et al. constructed skeleton features to help analyze the morphology of the cloud and thus separate it from other regions by using a gray-level morphological edge extraction method [8,9], ....“

·       In lines 70–73 should be written „... clouds, e.g., Amato et al. [11] used PCA and nonparametric density estimation applied to the SEVIRI sensor dataset, and Wylie et al. [12] combined time-series analyses of more than 20 years of polar-orbiting satellite cloud data to predict future cloud trends.“ instead of „... clouds, e.g., Amato et al. used PCA and nonparametric density estimation applied to the SEVIRI sensor dataset [11], and Wylie et al. combined time-series analyses of more than 20 years of polar-orbiting satellite cloud data to predict future cloud trends [12].“

·       In lines 76–79 should be written „... features of images, e.g., Abuhussein et al. and Shao et al. [13,14] completed segmentation by analyzing the GLCM gray level covariance matrix to capture the spatial relationship and covariance frequency between pixels of different gray levels in the image, and obtaining the key information about the texture of the image.“ instead of „... features of images, e.g., Abuhussein et al. completed segmentation by analyzing the GLCM gray level covariance matrix to capture the spatial relationship and covariance frequency between pixels of different gray levels in the image, and obtaining the key information about the texture of the image [13,14].“

·       In lines 79–83 should be written „Reiter, Gupta and Panchal, Changhui et al. [15-17] completed segmentation by using the wavelet transform to detect texture features and edge information in the image at different spatial scales, and decompose the cloud image into details at different scales to obtain local and global features of the cloud, and Surya et al. [18] used a clustering algorithm to group out texture regions similar to the cloud shadow.“ instead of „Reiter and Changhui et al. completed segmentation by using the wavelet transform to detect texture features and edge information in the image at different spatial scales, and decompose the cloud image into details at different scales to obtain local and global features of the cloud [15-17], and Surya et al. used a clustering algorithm to group out texture regions similar to the cloud shadow [18].“

·       In lines 87–95 should be written „For instance, Li et al. [19] proposed a classifier based on support vector machines to detect clouds in images, while Ishida et al. [20] quantitatively guided the support vector machines with the help of classification effect metrics to improve the feature space used for detecting cloud shadows and to reduce the frequency of erroneous results, Fu et al. [21] combined the ensemble thresholding method and random forest to the FY-2G image set to improve the meteorological satellite cloud detection technique, and Jin et al. [22] established a BP neural network backpropagation model for the MODIS dataset, which improved the learning model to a certain extent.“ instead of „For instance, Li et al. proposed a classifier based on support vector machines to detect clouds in images [19], while Ishida et al. quantitatively guided the support vector machines with the help of classification effect metrics to improve the feature space used for detecting cloud shadows and to reduce the frequency of erroneous results [20], Fu et al. combined the ensemble thresholding method and random forest to the FY-2G image set to improve the meteorological satellite cloud detection technique [21], and Jin et al. established a BP neural network backpropagation model for the MODIS dataset, which improved the learning model to a certain extent [22].“

·       In lines 103–106 should be written „Long et al. [23] first proposed a fully convolutional neural network, FCN, for semantic segmentation in 2015, which can directly realize end-to-end-pixel-by-pixel classification, and Mohajerani et al. [24] applied the FCN network to the remote sensing image Landsat dataset cloud detection technique in 2018, which ...“ instead of „Long et al. first proposed a fully convolutional neural network, FCN, for semantic segmentation in 2014 [23], which can directly realize end-to-end-pixel-by-pixel classification, and Mohajerani et al. applied the FCN network to the remote sensing image Landsat dataset cloud detection technique in 2018 [24], which ...“

·       In lines 110–113 should be written „... proposed, in 2015, Badrinarayanan et al. [25] proposed a segmentation network SegNet based on an encoder-decoder structure, with up-sampling using unpooling operation, and in 2019 Lu et al. [26] built the SegNet network model applied to remote sensing image cloud recognition, which ...“ instead of „... proposed, in 2015, Badrinarayanan et al. proposed a segmentation network SegNet based on an encoder-decoder structure, with up-sampling using unpooling operation [25], and in 2019 Lu et al. built the SegNet network model applied to remote sensing image cloud recognition [26], which ...“

·       In lines 116–119 should be written „In 2016, Chen et al. [27] designed an inflated convolutional network DeepLab, which is used to expand the sensory field by introducing a null in the convolutional kernel by introducing voids in the kernel to expand the sensory field, which ...“ instead of „In 2016, Chen et al. designed an inflated convolutional network DeepLab, which is used to expand the sensory field by introducing a null in the convolutional kernel by introducing voids in the kernel to expand the sensory field [27], which ...“

·       In lines 123–124 should be written „In 2015, Ronneberger et al. [28] proposed the UNet image segmentation network, named .....“ instead of „in 2017. Ronneberger et al. proposed the UNet image segmentation network [28], named .....“

·       In lines 128–132 should be written „In 2017, Zhao et al. [29] designed a pyramidal scene parsing network structure, PSPNet, which integrates contextual information from different regions, applies convolutional kernels of different sizes, and employs a multi-scale sensory field to efficiently combine local and global cues, and in 2022 Zhang et al. [30] proposed a dual pyramidal network, DPNet, inspired by PSPNet.“ instead of „In the same year, Zhao et al. designed a pyramidal scene parsing network structure, PSPNet, which integrates contextual information from different regions, applies convolutional kernels of different sizes, and employs a multi-scale sensory field to efficiently combine local and global cues [29], and in 2022 Zhang et al. proposed a dual pyramidal network, DPNet, inspired by PSPNet [30].“

 

·      In line 364 should be written „From Table 1, ...“ instead of „From the above table, ...“

·      In lines 522–542, the text should be modified and completed.

·       All references listed in References were cited in the text. However, some references are not listed according to the template from the Instructions for Authors, and some references should be completed or corrected.

o    Some of the cited articles are available online, so it would be good to list DOI.

Comments for author File: Comments.pdf

Author Response

In this article, a convolutional self-attention mechanism feature fusion network model for cloud shadow segmentation based on a U-shaped structure is proposed. Compared with previous deep learning and segmentation methods, the approach proposed in this article achieves a significant improvement in accuracy on cloud shadow segmentation tasks. Since the experimental results on the selected dataset showed that the segmentation results of this method are better than those of existing methods, this article is a valuable contribution, and the proposed model is of great significance for practical cloud shadow segmentation, because it applies repeated down-sampling to extract high-level features.

Response: Thank you very much for your comments. We have improved our writing and rephrased some sentences. Please see the responses to the “Detailed comments”.

Detailed comments:

1. In lines 5 and 7, the e-mail addresses of the authors are missing.
  2. In lines 46-49 reference source is not found and there is a loss in author information.

Response: Thank you very much for your comments. We have modified the reference source and author name.

Please see lines 54-61 on page 2 of the paper for details:” Early on in the research people used fixed thresholds to distinguish clouds from other parts. For instance, Saunders and Kriebel [3] who processed the NOAA-9 dataset over a week by determining thresholds for a range of physical parameters including cloud-top temperatures, optical depths, and liquid water content.” 

3. In lines 59-62 there is a mistake in reference source.

Response: Thank you very much for your comments. We have adjusted the position of the citation. Please see lines 69-75 on page 2 of the paper for a detailed explanation:” Danda and Xiang Liu et al. [8,9] constructed skeleton features to help analyze the morphology of the cloud and thus separate it from other regions by using a gray-level morphological edge extraction method. Moreover, Tom et al. [10] established a common method based on morphological an efficient computational paradigm for the combination of simple nonlinear grayscale operations so that the cloud detection filter exhibits spatial high-pass properties, emphasizes cloud shadow regions in the data, and suppresses all other clutter.”

4. In lines 70-73 there is a mistake in reference source.

Response: Thank you very much for your comment. We have adjusted the position of the citation. Please see lines 80-82 on page 2 of the paper for an elaborate description:” Amato et al. [11] used PCA and nonparametric density estimation applied to the SEVIRI sensor dataset, and Wylie et al. [12] combined time-series analyses of more than 20 years of polar-orbiting satellite cloud data to predict future cloud trends.”

5. In lines 76-79 there is a mistake in reference source.

Response: Thank you very much for your comment. We have adjusted the position of the citation. Please see lines 86-88 on page 2 of the paper for details:” For example, Abuhussein et al. [13,14] conducted segmentation by analyzing the GLCM (Gray-Level Co-occurrence Matrix) to capture spatial relationships and covariance frequencies between pixels of varying gray levels in the image.”.

6. In lines 79-83 there is a mistake in reference source.

Response: Thank you very much for your comment. We have adjusted the position of the citation. Please see lines 89-94 on page 2 of the paper for a detailed explanation:” Reiter and Changhui et al. [15-17] completed segmentation by using the wavelet transform to detect texture features and edge information in the image at different spatial scales, and decompose the cloud image into details at different scales to obtain local and global features of the cloud, while Surya et al. [18] used a clustering algorithm to group out texture regions similar to the cloud shadow.”

7. In lines 87-95 there is a mistake in reference source.

Response: Thank you very much for your comment. We have adjusted the position of the citation. Please see lines 98-105 on page 3 of the paper for details:” For instance, Li et al. [19] proposed a classifier based on support vector machines to detect clouds in images, while Ishida et al. [20] quantitatively guided the support vector machines with the help of classification effect metrics to improve the feature space used for detecting cloud shadows and to reduce the frequency of erroneous results, Fu et al. [21] combined the ensemble thresholding method and random forest to the FY-2G image set to improve the meteorological satellite cloud detection technique, and Jin et al. [22] established a BP neural network backpropagation model for the MODIS dataset, which improved the learning model to a certain extent.”

8. In lines 103-106 there is a mistake in reference source.

Response: Thank you very much for your comment. We have corrected the position and time of the citation. Please see lines 113-116 on page 3 of the paper for an elaborate description:” Long et al. [23] first proposed a fully convolutional neural network, FCN, for semantic segmentation in 2015, which can directly realize end-to-end-pixel-by-pixel classification, and Mohajerani et al. [24] applied the FCN network to the remote sensing image Landsat dataset cloud detection technique in 2018 ”

9. In lines 110-113 there is a mistake in reference source.

Response: Thank you very much for your comment. We have adjusted the position of the citation. Please see lines 120-123 on page 3 of the paper for details:” In 2015, Badrinarayanan et al. [25] introduced SegNet, a segmentation network based on an encoder-decoder structure, utilizing up-sampling with unpooling operation. Subsequently, in 2019, Lu et al. [26] adapted the SegNet network model for cloud recognition in remote sensing images.”

10. In lines 116-119 there is a mistake in reference source.

Response: Thank you very much for your comment. We have adjusted the position of the citation. Please see lines 127-129 on page 3 of the paper for a detailed explanation:” In 2016, Chen et al. [27] designed an inflated convolutional network called DeepLab, aimed at expanding the sensory field by introducing voids in the convolutional kernel.”

11. In lines 123-124 there is a mistake in reference source.

Response: Thank you very much for your comment. We have corrected the position and time of the citation. Please see lines 133-135 on page 3 of the paper for an elaborate description:” In 2015, Ronneberger et al. [28] proposed the UNet image segmentation network, named because the network framework is shaped like the letter U.”

12. In lines 128-132 there is a mistake in reference source.

Response: Thank you very much for your comment. We have corrected the position and time of the citation. Please see lines 137-142 on page 3 of the paper for details:” In 2017, Zhao et al. [29] designed a pyramidal scene parsing network structure, PSPNet, which integrates contextual information from different regions, applies convolutional kernels of different sizes, and employs a multi-scale sensory field to efficiently combine local and global cues. In 2022, Zhang et al. [30] proposed a dual pyramidal network, DPNet, inspired by PSPNet.”

13. In line 364 the table should be cited in the main text as Table 1.

Response: Thank you very much for your comment. We have corrected all the citation of the tables, equations and figures.

Please see lines 285-286 on page 7 of the paper for a detailed explanation:” The above computational process is expressed as the Equation (1) shown below.”

Please see lines 310-311 on page 8 of the paper for a detailed explanation:” The above computational process is expressed as the Equation (2) shown below.”

Please see line 327 on page 8 of the paper for a detailed explanation:” The steps of the FFM module are depicted in Figure 5.”

Please see lines 340-341 on page 9 of the paper for a detailed explanation:” The above computational process is expressed as Equation (3) shown below.”

Please see lines 391-392 on page 11 of the paper for a detailed explanation:” From Table 1, it is evident that the last row, which utilizes different weight proportions in the loss function weighted combination, achieves the best performance.”
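For readers who want to see how such a weighted combination can be wired up, the following is a minimal PyTorch sketch of a cross-entropy plus Dice loss of the kind discussed above; the class name, the 0.6/0.4 weights, and the three-class label convention are illustrative assumptions, not the paper's actual settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedCEDiceLoss(nn.Module):
    """Weighted sum of cross-entropy and a multi-class Dice loss (illustrative weights)."""
    def __init__(self, ce_weight: float = 0.6, dice_weight: float = 0.4, eps: float = 1e-6):
        super().__init__()
        self.ce_weight, self.dice_weight, self.eps = ce_weight, dice_weight, eps

    def forward(self, logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        # logits: (N, C, H, W); target: (N, H, W) with class indices, e.g. {0: background, 1: cloud, 2: shadow}
        ce = F.cross_entropy(logits, target)
        probs = torch.softmax(logits, dim=1)
        one_hot = F.one_hot(target, num_classes=logits.shape[1]).permute(0, 3, 1, 2).float()
        intersection = (probs * one_hot).sum(dim=(0, 2, 3))
        cardinality = probs.sum(dim=(0, 2, 3)) + one_hot.sum(dim=(0, 2, 3))
        dice = (2 * intersection + self.eps) / (cardinality + self.eps)
        # Weighted combination of the two terms, as discussed for Table 1.
        return self.ce_weight * ce + self.dice_weight * (1 - dice.mean())
```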

Please see lines 428-429 on page 12 of the paper for a detailed explanation:” In Equations (10)-(15) above, TP represents true positives, which correspond to the number of pixels correctly identified as positive samples.”

14. In lines 522-542, the text should be modified and completed.

Response: Thank you very much for your comment. We have completed the content of Author Contributions, Funding, and Data Availability.

Please see lines 568-574 on page 17 of the paper for details:”

Author Contributions: Conceptualization, Zhiyong Fan; methodology, Wenjie Du and Zhiyong Fan; validation, Wenjie Du, Ying Yan and Rui Yu; writing—original draft preparation, Wenjie Du and Jiazheng Liu; writing—review and editing, Ying Yan; visualization, Jiazheng Liu; supervision, Rui Yu. All authors have read and agreed to the published version of the manuscript.

Funding: This research received no external funding.

Data Availability Statement: The data and the code of this study are available from the corresponding author upon request. ([email protected])”

15. All references listed in References were cited in the text. However, some references are not listed according to the template from the Instructions for Authors, and some references should be completed or corrected. Some of the cited articles are available online, so it would be good to list the DOI.

Response: Thank you very much for your comment. We have revised and completed the references according to the GB/T 7714 format.

Please see lines 577-676 on pages 17-19 of the paper for details:”

  1. Mahajan S, Fataniya B. Cloud detection methodologies: Variants and development—A review[J]. Complex & Intelligent Systems, 2020, 6(2): 251-261. DOI: 10.1007/s40747-019-00128-0.
2. Saunders R W, Kriebel K T. An improved method for detecting clear sky and cloudy radiances from AVHRR data[J]. International Journal of Remote Sensing, 1988, 9(1): 123-150.
3. Hutchinson K D, Hardy K R. Threshold functions for automated cloud analyses of global meteorological satellite imagery[J]. International Journal of Remote Sensing, 1995, 16(18): 3665-3680. DOI: 10.1080/01431169508954653.

4.Xiong Q, Wang Y, Liu D, et al. A cloud detection approach based on hybrid multispectral features with dynamic thresholds for GF-1 remote sensing images[J]. Remote Sensing, 2020, 12(3): 450. DOI: 10.3390/rs12030450.

5.Derrien M, Farki B, Harang L, et al. Automatic cloud detection applied to NOAA-11/AVHRR imagery[J]. Remote Sensing of Environment, 1993, 46(3): 246-267. DOI: 10.1016/0034-4257(93)90046-Z.

6.Derrien M, Farki B, Harang L, et al. Automatic cloud detection applied to NOAA-11/AVHRR imagery[J]. Remote Sensing of Environment, 1993, 46(3): 246-267. DOI: 10.1016/0034-4257(93)90046-Z.

  1. Danda S, Challa A, Sagar B S D. A morphology-based approach for cloud detection[C]//2016 IEEE International Geoscience and Remote Sensing Symposium (IGARSS). IEEE, 2016: 80-83. DOI: 10.1109/IGARSS.2016.7729011.
  2. Liu X, Shen J P, Huang Y. Cloud automatic detection in high-resolution satellite images based on morphological features[C]//Eleventh International Conference on Graphics and Image Processing (ICGIP 2019). SPIE, 2020, 11373: 159-166. DOI: 10.1117/12.2557221.
  3. Tom V T, Peli T, Leung M, et al. Morphology-based algorithm for point target detection in infrared backgrounds[C]//Signal and Data Processing of Small Targets 1993. SPIE, 1993, 1954: 2-11. DOI: 10.1117/12.157758.
  4. Amato U, Antoniadis A, Cuomo V, et al. Statistical cloud detection from SEVIRI multispectral images[J]. Remote Sensing of Environment, 2008, 112(3): 750-766. DOI: 10.1016/j.rse.2007.06.004.
  5. Wylie D, Jackson D L, Menzel W P, et al. Trends in global cloud cover in two decades of HIRS observations[J]. Journal of climate, 2005, 18(15): 3021-3031. DOI: 10.1175/JCLI3461.1.
  6. Abuhussein M, Robinson A. Obscurant Segmentation in Long Wave Infrared Images Using GLCM Textures[J]. Journal of Imaging, 2022, 8(10): 266. DOI: 10.3390/jimaging8100266.
  7. Shao L, He J, Lu X, et al. Aircraft Skin Damage Detection and Assessment From UAV Images Using GLCM and Cloud Model[J]. IEEE Transactions on Intelligent Transportation Systems, 2023. DOI: 10.1109/TITS.2023.3323529.
  8. Gupta R, Panchal P. Cloud detection and its discrimination using Discrete Wavelet Transform in the satellite images[C]//2015 International Conference on Communications and Signal Processing (ICCSP). IEEE, 2015: 1213-1217. DOI: 10.1109/ICCSP.2015.7322699.
  9. Changhui Y, Yuan Y, Minjing M, et al. Cloud detection method based on feature extraction in remote sensing images[J]. The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, 2013, 40: 173-177. DOI: 10.5194/isprsarchives-XL-2-W1-173-2013.
  10. Surya S R, Rahiman M A. Cloud detection from satellite images based on Haar wavelet and clustering[C]//2017 International Conference on Nextgen Electronic Technologies: Silicon to Software (ICNETS2). IEEE, 2017: 163-167. DOI: 10.1109/ICNETS2.2017.8067921.
  11. Li P, Dong L, Xiao H, et al. A cloud image detection method based on SVM vector machine[J]. Neurocomputing, 2015, 169: 34-42. DOI: 10.1016/j.neucom.2014.09.102.
  12. Ishida H, Oishi Y, Morita K, et al. Development of a support vector machine based cloud detection method for MODIS with the adjustability to various conditions[J]. Remote sensing of environment, 2018, 205: 390-407. DOI: 10.1016/j.rse.2017.11.003.
  13. Fu H, Shen Y, Liu J, et al. Cloud detection for FY meteorology satellite based on ensemble thresholds and random forests approach[J]. Remote Sensing, 2018, 11(1): 44. DOI: 10.3390/rs11010044.
  14. Badrinarayanan V, Kendall A, Cipolla R. Segnet: A deep convolutional encoder-decoder architecture for image segmentation[J]. IEEE transactions on pattern analysis and machine intelligence, 2017, 39(12): 2481-2495. DOI: 10.1109/TPAMI.2016.2644615.
  15. Lu J, Wang Y, Zhu Y, et al. P_SegNet and NP_SegNet: New neural network architectures for cloud recognition of remote sensing images[J]. IEEE Access, 2019, 7: 87323-87333. DOI: 10.1109/ACCESS.2019.2925565.
  16. Chen L C, Papandreou G, Kokkinos I, et al. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs[J]. IEEE transactions on pattern analysis and machine intelligence, 2017, 40(4): 834-848. DOI: 10.1109/TPAMI.2017.2699184.

28.Ronneberger O, Fischer P, Brox T. U-net: Convolutional networks for biomedical image segmentation[C]//Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18. Springer International Publishing, 2015: 234-241. DOI: 10.1007/978-3-319-24574-4_28.

  1. Zhao H, Shi J, Qi X, et al. Pyramid scene parsing network[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2017: 2881-2890.
  2. Zhang Z, Yang S, Liu S, et al. Ground-based remote sensing cloud detection using dual pyramid network and encoder–decoder constraint[J]. IEEE Transactions on Geoscience and Remote Sensing, 2022, 60: 1-10. DOI: 10.1109/TGRS.2022.3163917.
  3. Tsotsos J K. Analyzing vision at the complexity level[J]. Behavioral and brain sciences, 1990, 13(3): 423-445. DOI: 10.1017/S0140525X00079577.
  4. Hu K, Li Y, Zhang S, et al. FedMMD: A Federated weighting algorithm considering Non-IID and Local Model Deviation[J]. Expert Systems with Applications, 2024, 237: 121463. DOI: 10.1016/j.eswa.2023.121463.
  5. Guo M H, Xu T X, Liu J J, et al. Attention mechanisms in computer vision: A survey[J]. Computational visual media, 2022, 8(3): 331-368. DOI: 10.1007/s41095-022-0271-y.
  6. Hu K, Zhang D, Xia M, et al. LCDNet: Light-weighted cloud detection network for high-resolution remote sensing images[J]. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2022, 15: 4809-4823. DOI: 10.1109/JSTARS.2022.3181303.
7. Hu K, Zhang E, Dai X, Xia M, Zhou F, Weng L, Lin H. MCSGNet: A Encoder–Decoder Architecture Network for Land Cover Classification. Remote Sensing. 2023; 15(11): 2810. DOI: 10.3390/rs15112810.
  8. Wang Q, Ma Y, Zhao K, et al. A comprehensive survey of loss functions in machine learning[J]. Annals of Data Science, 2020: 1-26. DOI:10.1007/s40745-020-00253-5.
  9. Hu K, Weng C, Shen C, et al. A multi-stage underwater image aesthetic enhancement algorithm based on a generative adversarial network[J]. Engineering Applications of Artificial Intelligence, 2023, 123: 106196. DOI: 10.1016/j.engappai.2023.106196.
  10. Ma J. Segmentation loss odyssey[J]. arXiv preprint arXiv:2005.13449, 2020. DOI: 10.48550/arXiv.2005.13449.
  11. Li Z, Shen H, Li H, et al. Multi-feature combined cloud and cloud shadow detection in GaoFen-1 wide field of view imagery[J]. Remote sensing of environment, 2017, 191: 342-358. DOI:10.1016/j.rse.2017.01.026.
  12. Fu J, Liu J, Tian H, et al. Dual attention network for scene segmentation[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2019: 3146-3154.
  13. Yu C, Gao C, Wang J, et al. Bisenet v2: Bilateral network with guided aggregation for real-time semantic segmentation[J]. International Journal of Computer Vision, 2021, 129: 3051-3068. DOI: 10.1007/s11263-021-01515-2.
  14. Wang J, Sun K, Cheng T, et al. Deep high-resolution representation learning for visual recognition[J]. IEEE transactions on pattern analysis and machine intelligence, 2020, 43(10): 3349-3364. DOI: 10.1109/TPAMI.2020.2983686.
  15. Qu Y, Xia M, Zhang Y. Strip pooling channel spatial attention network for the segmentation of cloud and cloud shadow[J]. Computers & Geosciences, 2021, 157: 104940. DOI: 10.1016/j.cageo.2021.104940.
  16. Hu K, Zhang D, Xia M. CDUNet: Cloud detection UNet for remote sensing imagery[J]. Remote Sensing, 2021, 13(22): 4533. DOI: 10.3390/rs13224533.”

 

Author Response File: Author Response.docx

Reviewer 2 Report

Comments and Suggestions for Authors

Overall Suggestions:

The title does not match the content of the paper, and the innovative aspects are not clearly articulated.

1. The title "AFMUNet: Attention Feature Fusion Network based on a U-shaped Structure" does not align with the content, which focuses on cloud shadow detection. There is a mismatch between the title and the content.

2. The literature review discusses cloud detection, but it fails to clarify the relationship between cloud detection and cloud shadow detection. Does cloud detection encompass cloud shadow detection?

3. There are various types of clouds in remote sensing imagery, such as thick clouds and thin clouds. Which type of clouds and cloud shadows does the paper's method target? Is the algorithm design tailored to specific types of clouds? The specific problem points addressed by the paper's algorithm are unclear.

4. The conclusion states, "In this paper, an attention mechanism feature aggregation algorithm is proposed for cloud shadow segmentation, fully leveraging the advantages of convolutional neural networks in deep learning." However, there is no clear explanation in the methodology section regarding the design for cloud shadow detection. The dataset used is a conventional cloud detection dataset, but does it include data specifically tailored to cloud shadow features?

5. The writing of the paper is very rough, including directly inserting a screenshot for Equation 2.

 

6. The main innovation mentioned in the paper lies in the loss function, but its innovativeness is low, as it simply involves weighting and adding two common loss functions together.

Comments on the Quality of English Language

Moderate editing of English language required.

Author Response

 

The title does not match the content of the paper, and the innovative aspects are not clearly articulated.

Detailed comments:

1. The title "AFMUNet: Attention Feature Fusion Network based on a U-shaped Structure" does not align with the content, which focuses on cloud shadow detection. There is a mismatch between the title and the content.

Response: Thank you very much for your comments. Our model aims to segment clouds, cloud shadows, and background from images, thus performing a three-class classification task. We have modified the title to emphasize this aim. Please see lines 1 to 2 on page 1 for details: “AFMUNet: Attention Feature Fusion Network Based on a U-Shaped Structure for Cloud and Cloud Shadow Detection”.

2. The literature review discusses cloud detection, but it fails to clarify the relationship between cloud detection and cloud shadow detection. Does cloud detection encompass cloud shadow detection?

Response: Thank you very much for your comments. Introduction fails to discuss the relationship between cloud detection and cloud shadow detection. Cloud detection and cloud shadow detection are common tasks in remote sensing image analysis, both aimed at enhancing the quality and usability of remote sensing image data. However, they focus on slightly different aspects. Cloud detection involves identifying and extracting areas covered by clouds in remote sensing images, aiming to accurately identify cloud regions in the image for subsequent processing. On the other hand, cloud shadow detection refers to identifying and extracting areas of shadow formed by cloud projections in remote sensing images. The objective of cloud shadow detection is to accurately identify these dark areas caused by cloud shadows, facilitating more precise tasks such as land cover classification and surface feature extraction. We have added descriptions of our work. It not only involves identifying clouds from images but also segmenting out cloud shadows.

Please see lines 39-48 in paragraph 1 for an elaborate description:” However, merely identifying the location of cloud cover is insufficient. The presence of cloud shadows can obstruct analysis in precision agriculture and other fields, leading to biases in the results. Therefore, applications of cloud shadow detection are increasingly widespread in meteorological forecasting, environmental monitoring, and natural disaster detection. The cloud detection technology is inadequate, and utilizing cloud shadow detection technology to accurately detect cloud cover from remote sensing images is a crucial preprocessing step for most satellite imagery. In this paper, we propose a segmentation algorithm for separating the three components of clouds, cloud shadows, and background in remote sensing images.”

3. There are various types of clouds in remote sensing imagery, such as thick clouds and thin clouds. Which type of clouds and cloud shadows does the paper's method target? Is the algorithm design tailored to specific types of clouds? The specific problem points addressed by the paper's algorithm are unclear.

Response: Thank you very much for your comments. We have incorporated an additional experimental detail in the paper to compare the segmentation results between thick and thin clouds. Please see lines 524-536 on pages 15-16 of the paper for details: “

 

   
     
     

Figure 9. Results of Thin Clouds Segmentation. (a) the original image; (b) corresponding label; (c) the prediction of the proposed AFMUNet.

Figure 10. Results of Thick Clouds Segmentation. (a) the original image; (b) corresponding label; (c) the prediction of the proposed AFMUNet.

To further analyze our algorithm, we compared the segmentation results of different types of clouds, as shown in Figures 9 and 10. It can be observed that our proposed model performs well in segmenting both thin and thick clouds, effectively delineating the overall contours of the clouds and shadows and clearly distinguishing them from the background. However, upon comparing the third row on the left with the second row on the right, it is evident that AFMUNet exhibits superior segmentation performance for thick clouds compared to thin clouds. Thick clouds only lose some fine texture details, while thin clouds tend to lose fragmented point cloud and shadow information during segmentation.”

Besides, we have added the corresponding content in the Conclusion. Please see details in lines 552-556 and 559-560 on page 17 of the paper: “Experiments demonstrate the remarkable noise resistance and identification capabilities of this method. It accurately locates cloud shadows and segments fine cloud crevices in complex environments, while also producing smoother edge segmentation. Particularly noteworthy is its performance in the task of identifying thick clouds.” “2) Refinement is still needed for the segmentation of thin clouds to capture fragmented information of cloud shadows;”

4. The conclusion states, "In this paper, an attention mechanism feature aggregation algorithm is proposed for cloud shadow segmentation, fully leveraging the advantages of convolutional neural networks in deep learning." However, there is no clear explanation in the methodology section regarding the design for cloud shadow detection. The dataset used is a conventional cloud detection dataset, but does it include data specifically tailored to cloud shadow features?

Response: Thank you very much for your comments. We apologize for the mistake in the name of the dataset, confusing the HRC_WHU and GF1_WHU datasets. We have modified all the contents about the dataset. All experimental results and conclusions are based on the latter one instead of HRC_WHU. The image labels delineate three categories: clouds, cloud shadows, and background. Each original image has three RGB channels, while in the labels white denotes clouds, gray denotes cloud shadows, and black denotes the background.

Please see lines 23-25 on page 1 of the paper for details: “Finally, the experimental results on the dataset GF1_WHU show that the segmentation results of this method are better than the existing methods.”

The introduction of the dataset is in lines 399-412 on page 11 of the paper: “To further validate the generalization performance of the proposed model, we employed the GF1_WHU cloud shadow dataset created by Li et al. [42] as a generalization dataset. This dataset utilizes high-resolution GF-1 Wide Field of View (WFV) images with a spatial resolution of 16 meters and covers four multispectral bands spanning from visible to near-infrared spectral regions. The dataset consists of 108 GF-1 WFV 2a-level scene images, manually labeled by experts in remote sensing image interpretation at the SENDIMAGE laboratory of Wuhan University. These images encompass five main land cover types, including water, vegetation, urban areas, snow and ice, and barren land, representing different regions worldwide.
During the model training process, we cropped the images to 256×256 pixels, removing black borders and unclear images, resulting in a total of 5428 images used for training and 1360 images for validation and testing to evaluate the model's training results, detection accuracy, and generalization performance. To illustrate the dataset effectively, we selected images from different scenes as shown in Figure 6.

         
         

Figure 6. The examples of the GF1_WHU Wuhan University Cloud Shadow Dataset. (a) water; (b) vegetation; (c) snow; (d) ice; (e) barren.”
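As a rough illustration of the preprocessing described in the quoted passage (cropping scenes into 256×256 tiles and discarding black borders), the following is a minimal Python sketch; the function name, file layout, and the threshold used to decide that a tile is mostly border are assumptions, not the authors' actual pipeline.

```python
from pathlib import Path
import numpy as np
from PIL import Image

def tile_scene(image_path: str, out_dir: str, tile: int = 256, min_valid_fraction: float = 0.99) -> int:
    """Cut a large GF-1 RGB scene into 256x256 tiles and skip tiles dominated by black border pixels."""
    img = np.array(Image.open(image_path))
    h, w = img.shape[:2]
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    kept = 0
    for y in range(0, h - tile + 1, tile):
        for x in range(0, w - tile + 1, tile):
            patch = img[y:y + tile, x:x + tile]
            # Treat all-zero pixels as the scene's black border and drop mostly-empty tiles (assumed criterion).
            valid = np.any(patch.reshape(-1, patch.shape[-1]) != 0, axis=1).mean()
            if valid >= min_valid_fraction:
                Image.fromarray(patch).save(Path(out_dir) / f"{Path(image_path).stem}_{y}_{x}.png")
                kept += 1
    return kept
```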

Please see lines 437-440 on page 12 of the paper for details: “Our experiments are conducted on the GF1_WHU dataset with an initial learning rate of 0.001, the number of samples used in each round is 4, the number of training samples is 5428, the number of training times is 150, and the quantitative metrics are used as PR, RC, MIoU and F1.”

Actually, the dataset (GF1_WHU) we used in this paper includes data specifically tailored to cloud shadow features. We apologize for any confusion this may have caused.

5. The writing of the paper is very rough, including directly inserting a screenshot for Equation 2.

Response: Thank you very much for your comments. We have corrected the problem and reviewed the equations in the full text to make sure such errors don't occur again. All equations were rewritten with MathType 7.0 (a formula editor). The manuscript has been enhanced with concise sentences, refined language, and the reduction of awkward constructions.

Please see lines 58-61 on page 2 of the paper for details: “While the fixed threshold method is straightforward and user-friendly, it lacks the adaptability needed to accommodate various meteorological conditions, lighting scenarios, geographical regions, and times of day. Additionally, it often necessitates manual threshold adjustments, which pose numerous shortcomings and limitations.”

Please see lines 63-67 on page 2 of the paper for details: “The dynamic thresholding method adjusts thresholds based on environmental conditions through the construction of diverse physical models, thereby enhancing the accuracy of automatic cloud analysis. However, for complex cloud and feature types, this method can be challenging to apply to the background, and it also incurs significant computational costs.”

Please see lines 84-89 on page 2 of the paper for details: “Fourthly, texture feature method identifies cloudy and non-cloudy regions by extracting the texture features of images. For example, Abuhussein et al. [13,14] conducted segmentation by analyzing the GLCM (Gray-Level Co-occurrence Matrix) to capture spatial relationships and covariance frequencies between pixels of varying gray levels in the image. This process enables the extraction of crucial information regarding the image texture.”

Please see lines 118-127 on page 3 of the paper for details: “Since then, there has been a surge in deep learning networks, with numerous CNN frameworks continuously being proposed. In 2015, Badrinarayanan et al. [25] introduced SegNet, a segmentation network based on an encoder-decoder structure, utilizing up-sampling with unpooling operation. Subsequently, in 2019, Lu et al. [26] adapted the SegNet network model for cloud recognition in remote sensing images. Their approach improved the accuracy of cloud recognition by preserving positional indices during the pooling process, thus retaining image details through a symmetrical parallel structure. Although it demonstrated some ability in cloud-snow differentiation, its training time was found to be excessively long and inefficient.”

Please see lines 166-173 on page 4 of the paper for details: “After inputting the image, the high-level image features are initially extracted through down-sampling. Subsequently, during the up-sampling process and enhancement of feature map resolution, we progressively enhance the receptive field adaptively and employ different channel operations. In addition, the feature fusion module is utilized in each layer to integrate contextual information more accurately and fuse low-level and high-level information. Furthermore, an innovative loss function is employed during the training process, and classification results are outputted after multiple samplings.”

Please see lines 391-396 on page 11 of the paper for details: “From Table 1, it is evident that the last row, which utilizes different weight proportions in the loss function weighted combination, achieves the best performance. This finding aligns with our initial conjecture. The Dice loss effectively distinguishes between overlap regions and boundaries, aiding in completing the classification task more effectively. Moreover, continuous training is essential for further enhancing the model's classification accuracy.”

Please see lines 449-457 on pages 12-13 of the paper for details: “In order to enhance deep feature extraction, alleviate information loss resulting from constant downsampling, and effectively capture multi-scale contextual information, as indicated by the ablation results of the deep feature sampling process, the CSAM Attention Mechanism Module proves beneficial for information recovery to capture detailed information. Additionally, the FFM Module aids in better integrating contextual information, facilitating the fusion of features from different scales. Table 2 demonstrates a significant improvement in model performance following the introduction of these modules. Notably, the introduction of the Feature Fusion Module alone does not yield substantial improvements to the original model.”

Please see lines 467-470 on page 13 of the paper for details: “Among all the networks considered, SegNet and FCN8 exhibit the poorest performance in terms of the metrics evaluated. While the metrics of the other models have shown improvement over successive iterations, they still fall short of the performance achieved by the models proposed in this paper.”

Please see lines 512-516 on page 15 of the paper for details: “When confronted with remote sensing images containing significant noise interference, the performance of UNet, SegNet, and HRNet models is deemed insufficient. Instances of omission and misdetection, such as in the snowy mountain zones depicted in the comparative images, are observed. These models encounter challenges in accurately distinguishing between ice, snow, and clouds.”

6. The main innovation mentioned in the paper lies in the loss function, but its innovativeness is low, as it simply involves weighting and adding two common loss functions together.

Response: Thank you very much for your comments. In addition to utilizing a weighted loss function, our model presents other innovative aspects. Please see lines 174-189 on page 4 of the paper for details: “The main contributions of this paper's work are as follows:

· An integrated module of channel space attention mechanism, suitable for cloud shadow segmentation tasks within a U-shaped structure, has been proposed. This model facilitates dynamic adjustment of feature map weights, enhancing the capability to capture crucial image features and thereby improving segmentation accuracy;
· The feature fusion operation of the original network is updated, which helps to better understand the target and background in the image, segment the image using information from different scales, and deal with cloud shadow targets of different sizes and shapes;
· An innovative weighted loss function is developed for the dataset, which improves the accuracy of model learning and optimizes the model performance to some extent;
· A network that integrates the above three features and combines them with a feature extraction network is proposed to segment high-resolution remote sensing images.”
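For context on the training settings quoted earlier in this response (learning rate 0.001, batch size 4, 150 epochs, PyTorch), a minimal training-loop sketch is given below; the optimizer choice (Adam) and the data-loader details are assumptions, since the response does not state them.

```python
import torch
from torch.utils.data import DataLoader

def train(model, train_set, loss_fn, epochs: int = 150, batch_size: int = 4, lr: float = 1e-3):
    """Training loop matching the reported settings: lr 0.001, batch size 4, 150 epochs."""
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device)
    loader = DataLoader(train_set, batch_size=batch_size, shuffle=True, num_workers=4)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)  # optimizer is an assumption
    for epoch in range(epochs):
        model.train()
        running = 0.0
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = loss_fn(model(images), labels)
            loss.backward()
            optimizer.step()
            running += loss.item() * images.size(0)
        print(f"epoch {epoch + 1}/{epochs}  loss {running / len(train_set):.4f}")
```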

 

Author Response File: Author Response.docx

Reviewer 3 Report

Comments and Suggestions for Authors

The methodology is sound, the research entails sufficient technicality and formalism, the amount of work is substantial, and the results are validated and credible.

The writing, however, is quite verbose, while at the same time the explanation is inadequate in some important places, e.g., equation (1) on page 7 should be explained in detail: what is sigma, W1, W0?  Even if readers with CNN expertise can figure out what these symbols denote, they should be clearly annotated.  Likewise, equations (2) and (3) require better explanation.  If necessary or desirable, consider breaking them into a sequence of equations to enhance readability.

Comments on the Quality of English Language

The paper presents a convolutional network based on a U-shaped structure to achieve improved performance in cloud shadow segmentation. The technical contribution is solid.

The presentation, writing, and editing, however, all require improvement. 

--Error of reference source on line 48.

--On line 62, the author's (of reference 10) last name is Tom, not Victor.

--The writing can be improved with more concise sentences and more polished language, e.g., a single sentence (in line 116-123) has more than 100 words and is poorly structured and punctuated.  It is followed by another sentence with incorrect punctuation and capitalization (the end of line 123).

--Improve the many awkward sentences, e.g., on line229-230: "The concept of attention mechanism originally originated from the field of natural language processing ... "

Author Response

The methodology is sound, the research entails sufficient technicality and formalism, the amount of work is substantial, and the results are validated and credible. The writing, however, is quite verbose, while at the same time, the explanation is inadequate in some important places

Comments on the Quality of English Language:

The paper presents a convolutional network based on a U-shaped structure to achieve improved performance in cloud shadow segmentation. The technical contribution is solid. The presentation, writing, and editing, however, all require improvement. 

Detailed comments:

  1. Equation (1) on page 7 should be explained in detail: what is sigma, W1, W0?  Even if readers with CNN expertise can figure out what these symbols denote, they should be clearly annotated.  Likewise, equations (2) and (3) require better explanation.  If necessary or desirable, consider breaking them into a sequence of equations to enhance readability.

Response: Thank you very much for your comments. We have carefully reviewed all the formulas and added explanatory notes where necessary.  Above each equation in 2.2 and 2.3, there is also a corresponding description of the calculation process to help understand.

Please see Equation (1) on page 7 for details: “Step 1: Firstly, the input feature map F is subjected to global average and maximum pooling operations respectively, and the input information is compressed and downgraded to obtain two 1×1 pooled features, the average-pooled feature F_avg^c and the max-pooled feature F_max^c. Step 2: Then, they are fed into a weight-sharing two-layer neural network, MLP. Step 3: Finally, the MLP output features are subjected to an element-by-element summation operation, which is applied to the input feature map after activation by the Sigmoid function to generate the final Channel Attention Feature, M_c(F). The above computational process is expressed as Equation (1) shown below:

M_c(F) = σ(MLP(AvgPool(F)) + MLP(MaxPool(F))) = σ(W_1(W_0(F_avg^c)) + W_1(W_0(F_max^c)))    (1)

where σ means the sigmoid function and W_0, W_1 represent the weights of the hidden/output layer. The parameters of W_0 and W_1 are shared in the MLP.”
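A minimal PyTorch sketch of the channel attention step described above (Equation (1)) is given below for illustration; the class name, the reduction ratio of 16, and the use of 1×1 convolutions to implement the shared MLP are assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    """Channel attention as described for Equation (1): a shared two-layer MLP over
    global average- and max-pooled descriptors, summed and passed through a sigmoid."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Weight-sharing two-layer MLP implemented with 1x1 convolutions (W0, W1).
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1, bias=False),  # W0
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1, bias=False),  # W1
        )
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg_desc = self.mlp(F.adaptive_avg_pool2d(x, 1))  # MLP(AvgPool(F))
        max_desc = self.mlp(F.adaptive_max_pool2d(x, 1))  # MLP(MaxPool(F))
        weights = self.sigmoid(avg_desc + max_desc)       # M_c(F)
        return x * weights  # re-weight the input feature map channel-wise
```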

Please see Equation (2) on page 8 for details: “Step 1: First, the feature map output from the CAB module is used as the input of this module, F′, and global maximum pooling and average pooling are done on the channel dimensions respectively, and then these two results are done as a splicing operation. Step 2: Next, a 7×7 convolution kernel is chosen to do a convolution operation on the splicing result, and the channel dimensions are reduced to 1. Step 3: Finally, after the Sigmoid activation function maps the weights between 0 and 1 to represent the order of importance of each position, these spatial attention weights are applied to the inputs to generate the feature map of the spatial channel attention mechanism, M_s(F′). The above computational process is expressed as Equation (2) shown below:

M_s(F′) = σ(f^{7×7}([AvgPool(F′); MaxPool(F′)]))    (2)

where f^{7×7} stands for a convolution with a 7×7 kernel. This size performs better than others.”
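Likewise, the spatial attention step of Equation (2) can be sketched as follows; the class name and module boundaries are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Spatial attention as described for Equation (2): channel-wise average and max
    pooling, concatenation, a 7x7 convolution, and a sigmoid gate."""
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg_map = torch.mean(x, dim=1, keepdim=True)    # average pooling along the channel dimension
        max_map, _ = torch.max(x, dim=1, keepdim=True)  # max pooling along the channel dimension
        weights = self.sigmoid(self.conv(torch.cat([avg_map, max_map], dim=1)))  # M_s
        return x * weights  # re-weight each spatial position
```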

Please see Equation (3) on page 9 for details: “Step 1: Accept two feature maps with different resolutions from the encoder and decoder sections respectively as input. Step 2: Perform a series of operations such as splicing, convolution, and so on, to fuse them into an enhanced hybrid feature map, which strengthens the representation of the hybrid features and makes them more suitable for subsequent processing. Step 3: Then we perform global average pooling of the hybrid feature map to reduce the spatial dimension to 1×1 and obtain global channel statistics. Step 4: Then we introduce two consecutive 1×1 convolution operations via ReLU and Sigmoid activation functions, respectively, in order to enhance the nonlinearity and show the importance of each channel. Step 5: Multiply the channel attention weights element-by-element with the hybrid feature map obtained from Step 2 (mul operation) to obtain a weighted feature map. Step 6: Finally, the weighted feature map obtained from Step 5 is subjected to an element-by-element add-sum operation with the hybrid feature map obtained from Step 2 to produce the final fused feature map. The above computational process is expressed as Equation (3) shown below:

F_m = Conv([F_low; F_high]),  t = σ(Conv_{1×1}(ReLU(Conv_{1×1}(GAP(F_m))))),  F_out = F_m ⊗ t + F_m    (3)

where F_m means the fusion of the input from the shallow and deep layers and t represents the enhanced nonlinear result as an intermediate variable.”
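Since the six steps above describe the computation end to end, a hedged PyTorch sketch of such a fusion module is given below; the class name, channel counts, reduction ratio, and the bilinear upsampling used to align the two inputs are assumptions, not the paper's actual FFM code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureFusionModule(nn.Module):
    """Sketch of the fusion steps described for Equation (3): concatenate encoder and decoder
    features, fuse with a convolution, derive channel weights via global average pooling and
    two 1x1 convolutions, then re-weight and add back the hybrid features."""
    def __init__(self, low_channels: int, high_channels: int, out_channels: int, reduction: int = 4):
        super().__init__()
        self.fuse = nn.Sequential(                      # Steps 1-2: splice and convolve
            nn.Conv2d(low_channels + high_channels, out_channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )
        self.attend = nn.Sequential(                    # Steps 3-4: GAP + two 1x1 convolutions
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(out_channels, out_channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels // reduction, out_channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, low: torch.Tensor, high: torch.Tensor) -> torch.Tensor:
        # Upsample the deeper (lower-resolution) features to match the shallow ones (assumed alignment step).
        high = F.interpolate(high, size=low.shape[2:], mode="bilinear", align_corners=False)
        hybrid = self.fuse(torch.cat([low, high], dim=1))  # F_m
        weights = self.attend(hybrid)                      # t
        return hybrid * weights + hybrid                   # Steps 5-6: multiply, then add
```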

Please see Equation (7) on page 10 for details: “

Dice(X, Y) = 2|X ∩ Y| / (|X| + |Y|)    (7)

where |X ∩ Y| means the intersection between samples X and Y, |X| represents the number of samples in X, and |Y| stands for the number of samples in Y.”

Please see Equation (8) on page 10 for details: “

IoU(X, Y) = |X ∩ Y| / |X ∪ Y|    (8)

where |X ∪ Y| depicts the union between samples X and Y.”
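The two definitions can be written directly as a few lines of PyTorch on binary masks; this is a generic sketch of Equations (7) and (8), not the authors' code, and the smoothing constant eps is an added assumption to avoid division by zero.

```python
import torch

def dice_coefficient(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Dice = 2|X ∩ Y| / (|X| + |Y|) on binary masks, as in Equation (7)."""
    pred, target = pred.float().flatten(), target.float().flatten()
    intersection = (pred * target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)

def iou(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """IoU = |X ∩ Y| / |X ∪ Y| on binary masks, as in Equation (8)."""
    pred, target = pred.float().flatten(), target.float().flatten()
    intersection = (pred * target).sum()
    union = pred.sum() + target.sum() - intersection
    return (intersection + eps) / (union + eps)
```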

2. Error of reference source on line 48.

Response: Thank you very much for your comments. We have corrected reference source on line 48.

3. On line 62, the author's (of reference 10) last name is Tom, not Victor.

Response: Thank you very much for your comments. We have modified the author’s last name in line 65. Please see lines 72-75 on page 2 of the paper for detailed:” Moreover, Tom et al. [10] established a common method based on morphological an efficient computational paradigm for the combination of simple nonlinear grayscale operations so that the cloud detection filter exhibits spatial high-pass properties, emphasizes cloud shadow regions in the data, and suppresses all other clutter.”

4. The writing can be improved with more concise sentences and more polished language, e.g., a single sentence (in lines 116-123) has more than 100 words and is poorly structured and punctuated. It is followed by another sentence with incorrect punctuation and capitalization (the end of line 123).

Response: Thank you very much for your comments. We have rephrased some sentences and modified the structure.

Please see lines 54-58 on page 2 of the paper for detailed:” Early on in the research people used fixed thresholds to distinguish clouds from other parts. For instance, Saunders and Kriebel [3] who processed the NOAA-9 dataset over a week by determining thresholds for a range of physical parameters including cloud-top temperatures, optical depths, and liquid water content.”

Please see lines 127-133 on page 3 of the paper for detailed:” In 2016, Chen et al. [27] designed an inflated convolutional network called DeepLab, aimed at expanding the sensory field by introducing voids in the convolutional kernel. DeepLab enhances the robustness of image segmentation. However, it imposes specific requirements on the size of the segmented target. It excels in segmenting foreground targets within the general size range. Nonetheless, when faced with extreme size variations in the target, such as very small or very large targets, DeepLab exhibits poor performance and suffers from segmentation instability.”

Please see lines 156-162 on page 4 of the paper for detailed:” The encoder-decoder architecture of UNet effectively extracts and restores feature information across various scales, making it particularly suitable for smaller-scale datasets. Therefore, we adopt this U-shaped network structure as our baseline and integrate the channel attention mechanism and spatial attention mechanism module into it. This integration allows for adaptive attention to different channels of the image and feature map information, with the goal of enhancing the fine detection of cloud shadows.”

Please see lines 173-175 on page 4 of the paper for detailed:” Through the combined effect of the above several modules, the detection accuracy of our network has been substantially improved. The main contributions of this paper's work are as follows:”

Please see lines 419-424 on page 11 of the paper for detailed:” In this section, based on the dataset introduced in the preceding section, we employed the PyTorch library to train and test all models on a GeForce RTX2080Ti graphics card. This comprehensive evaluation aimed to assess the efficiency and accuracy of our proposed network model for cloud shadow segmentation. Through a series of ablation experiments and comparison experiments, we thoroughly evaluated our model from both qualitative and quantitative perspectives.”

Please see lines 518-524 on page 15 of the paper for detailed:” None of the aforementioned models are suitable for the challenging task of cloud shadow segmentation across diverse and complex environments. In contrast, the algo-rithm proposed in this paper adeptly addresses cloud shadow segmentation in various situations and scenarios. By optimizing deeper features and leveraging the enhanced channel and feature fusion capabilities enabled by the spatial attention mechanism module, our algorithm effectively recovers high-definition remote sensing images.”

5. Improve the many awkward sentences, e.g., on lines 229-230: "The concept of attention mechanism originally originated from the field of natural language processing ... "

Response: Thank you very much for your comment. We have improved awkward sentences and provided some examples to help understand.

Please see lines 105-108 on page 3 of the paper for detailed: “Although these methods are indeed more effective, they necessitate manual feature engineering to select suitable labels for training and testing a large volume of data annotations. Furthermore, the quality of the model is even directly influenced by the features selected.”

Please see lines 241-250 on pages 5-6 of the paper for detailed: “The concept of attention mechanism originated in the field of natural language processing. It serves to emphasize words at different positions within an input sentence, thereby facilitating improved translation into the target language [32,33]. For instance, in machine translation, attention mechanism helps the model focus on relevant parts of the input sentence when generating each word of the translation. This allows for more accurate and contextually appropriate translations, especially in cases where the input sentence is long or complex. Similarly, in text summarization, attention mechanism aids in identifying important sentences or phrases to include in the summary, resulting in more concise and informative summaries.”

Please see lines 265-267 on page 6 of the paper for detailed: “This module enhances our ability to focus on the channel information of the image during cloud shadow segmentation tasks, thereby improving cloud perception and segmentation accuracy.”

Please see lines 428-433 on page 12 of the paper for detailed: “In Equations (10)-(15) above, TP represents true positives, which correspond to the number of pixels correctly identified as positive samples. Similarly, FP denotes false positives, indicating the number of pixels incorrectly classified as positive samples. TN refers to true negatives, representing the number of pixels accurately identified as negative samples. Lastly, FN signifies false negatives, indicating the number of pixels incorrectly classified as negative samples.”
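As an illustration of how these counts translate into the reported metrics (PR, RC, F1, MIoU), a small NumPy sketch is given below; the function name and the per-class averaging are assumptions, since the exact averaging convention is not restated here.

```python
import numpy as np

def segmentation_metrics(pred: np.ndarray, target: np.ndarray, num_classes: int = 3):
    """Per-class precision (PR), recall (RC), F1, and mean IoU from a confusion matrix,
    following the TP/FP/TN/FN definitions quoted above."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    for t, p in zip(target.flatten(), pred.flatten()):
        cm[t, p] += 1                      # rows: true class, columns: predicted class
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp               # predicted as class c but belonging to another class
    fn = cm.sum(axis=1) - tp               # belonging to class c but predicted as another class
    precision = tp / np.maximum(tp + fp, 1)
    recall = tp / np.maximum(tp + fn, 1)
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    miou = np.mean(tp / np.maximum(tp + fp + fn, 1))
    return precision, recall, f1, miou
```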

Please see lines 488-494 on page 14 of the paper for detailed: “UNet is a classic segmentation network known for its superior performance in training on smaller datasets and producing smoother segmentation edges. However, it still requires improvement in processing details. Our model addresses this limitation to some extent, effectively recognizing cloud and cloud shadow boundaries while enhancing detail processing. Nonetheless, further refinement is needed to effectively handle very low light or thin cloud bodies.”

 

 

Author Response File: Author Response.docx
