Article
Peer-Review Record

Interpretable Deep Learning Applied to Rip Current Detection and Localization

Remote Sens. 2022, 14(23), 6048; https://doi.org/10.3390/rs14236048
by Neelesh Rampal 1, Tom Shand 2, Adam Wooler 3 and Christo Rautenbach 4,5,*
Reviewer 1: Anonymous
Reviewer 2:
Submission received: 26 August 2022 / Revised: 13 November 2022 / Accepted: 18 November 2022 / Published: 29 November 2022

Round 1

Reviewer 1 Report

Enclosed is a review for “Interpretable Deep Learning applied to Rip Current Detection and Localization” by Rampal et al. submitted to Remote Sensing. A machine learning approach to identifying rip currents in static images is detailed, with novelty related to the ability to evaluate the method’s performance by revealing false positives. I commend the authors on a very interesting technique, and although the below review requests some further analysis and details, I believe the revisions will not require a major revision/effort as they do not request any fundamental changes to the methods.

 

Major Comments:

My first general comment is that I think the paper could be made more impactful with just a little further analysis to demonstrate the kind of unique products you can obtain with this method. “The ability to capture some shape characteristics of the amorphous rip current structure” is repeated a number of times in the text across the abstract, intro, methods, discussion, and conclusion - so much so that I was excited to see a figure demonstrating some metrics extracted from the identified rips. I think this is a quality methods paper that could be turned into a more citable work, with relevance to future rip current quantification, if you took the next step and actually identified something like the width, duration, and offshore extent of evolving rip currents identified in even just one of the 23 validation videos. This would be a final discussion figure truly demonstrating what the paper currently claims is an advantage of this method, and explicitly states as the number 1 aim in line 132.

 

The discussion of performance could be improved in a number of ways. 

  • The bar chart used in Figure 6 is a poor representation of accuracy. This could be improved by using a box plot in place of the right-hand bar for the accuracy across all 23 videos. This would better communicate the average, the typical range, and better acknowledge the occurrence of the outliers that are currently buried in the appendix table (I’d recommend this even if you can only generate the box plot for your own model and can’t for the accuracy of de Silva). Otherwise, this bar chart can just be communicated as a list in the text and is not a useful visual.

  • I also think that a limitation of the presented method is that you have created only 1 model. If you removed a different 20% for validation and trained with a different 80%, how repeatable is your performance? Not knowing this is a major limitation to a methods-based paper that ideally would give confidence to the reader that this method would be a useful approach for them to adapt. K-folding your methods across the entire library (say removing a unique 20% five different times to create five different versions of the model) would be a good way to communicate the repeatability. 

  • This may be beyond what’s necessary for this publication, but I am curious just how frequent of a problem the issue of identifying a rip for the wrong reasons occurs. You’ve given an example in Figure 3, but what percentage of times did you find this? As in, how big of a problem might this actually be in simpler methods that just give a positive/negative output? I’m envisioning the ability to make an extended confusion matrix with a 3rd column/row to account for this phenomenon could really benefit the metrics reported, as this would highlight true/false positives/negatives and then support the claim that the interpretable AI is necessary.

 

While I generally agree with the perspectives on advantages/limitations of different rip current identification algorithms, I do think you are missing some of the most recent literature. In the last year, Anderson et al. [2021] (doi:10.3390/rs13040690) introduced an optical flow-based algorithm that runs in a matter of seconds (effectively real-time) and Rodriguez-Padilla et al. [2021] (doi:10.3390/rs13101874)  further enhanced the filtering. Both works clearly have the caveat of requiring a fixed camera and recording video but they can similarly capture amorphous rip structures (see last figure of Anderson et al. [2021]). I think the “orders of magnitude slower” comment in the table should be removed (or at least caveated to recognize that it is dependent on the algorithm - PIV vs. optical flow), and I think because both of those works are from the same journal you are submitting to, that you should be encouraged to reference them and highlight potential advantages over them.

 

Another recent work from this journal and likely worth highlighting in your discussion of timex imagery to identify rips is Ellenson et al. [2020], who use a CNN + image augmentation to identify different beach states with the presence of rip channels being associated with the presence of specific classes. As presently written, the manuscript infers that these images can only help with identifying wave breaking, but I think it should acknowledge that you can do more processing with those images that provides a downscaled product with greater use than simply a spatial wave breaking pattern.

 

The paper notes the speed of this algorithm as an advantage, but never quantifies the computational resources and time required. How long was the upfront fitting on the GPU? How quickly does it process the validation videos in terms of time or per frame? If the reader wanted to replicate your method, what computational needs do they need to plan for?

 

Minor comments:

 

Line 48: Correct formatting so start of the sentence is “Brander et al. [2016]” not [11].

 

Line 63: missing an end parenthesis.

 

Line 69: This perspective is a little too Argus-centric - there’s actually a pretty rich number of platforms developed in other regions of the world outside of these two: HORUS, CoastalCOMS, KOSTASYSTEM, COSMOS (Taborda & Silva, 2012; doi:10.1016/j.cageo.2012.07.013), SIRENA (Nieto et al. 2010, doi:10.1002/esp.2025), Beachkeeper (Brignone et al. 2012, doi:10.1016/j.cageo.2012.06.008), and ULISES (Simarro et al. 2017, doi:10.2112/JCOASTRES-D-16-00022)

 

Line 126-127: You have already defined what a CNN is in line 97 so don’t need to repeat that abbreviation’s definition here.

 

Line 153: You have not introduced/defined YOLO to the reader at this point.

 

Line 158: “surf-live saving” … is this surf-life saving?

 

Line 167: Probably should spell out Grad-CAM’s full name here, and define the abbreviation, rather than in lines 254-255.

 

Line 215: replace “as appose to” with “as opposed to” 

 

Lines 272 & 273: Looks like an issue with automatic latex referencing. 

 

Line 276: Replace GRADCAM with Grad-CAM

 

Figure 5: I’m surprised to see such rigid polygons for the random shadow. Is this supposed to be representative of passing cloud cover?

 

Line 374: able “to” accurately

 

Line 382-384: Are there any hypotheses for why these videos are not working?

 

Line 424-427: But does the method in this paper actually identify feeder currents? I would argue that Figure 7 indicates it does not. Do you have a better example you could provide as evidence?

 

Line 474-475: The only performance metric I can find is the accuracy across the 23 videos, is there supposed to be more supplementary/appendix material with a list of other metrics?

Author Response

Please find our responses, inline, in blue text below each comment. All changes have been made via track changes in the manuscript.

Reviewer 1:

Enclosed is a review for “Interpretable Deep Learning applied to Rip Current Detection and Localization” by Rampal et al. submitted to Remote Sensing. A machine learning approach to identifying rip currents in static images is detailed, with novelty related to the ability to evaluate the method’s performance by revealing false positives. I commend the authors on a very interesting technique, and although the below review requests some further analysis and details, I believe the revisions will not require a major revision/effort as they do not request any fundamental changes to the methods.

I’d like to thank the reviewer for their time, they have made some fantastic suggestions that will overall increase the quality of our work.  

Major Comments:

My first general comment is that I think the paper could be made more impactful with just a little further analysis to demonstrate the kind of unique products you can obtain with this method. “The ability to capture some shape characteristics of the amorphous rip current structure” is repeated a number of times in the text across the abstract, intro, methods, discussion, and conclusion - so much so that I was excited to see a figure demonstrating some metrics extracted from the identified rips. I think this is a quality methods paper that could be turned into a more citable work, with relevance to future rip current quantification, if you took the next step and actually identified something like the width, duration, and offshore extent of evolving rip currents identified in even just one of the 23 validation videos. This would be a final discussion figure truly demonstrating what the paper currently claims is an advantage of this method, and explicitly states as the number 1 aim in line 132.

 

This is an excellent suggestion, and I think this would absolutely add significant value to our existing work. We are currently working on including more metrics to quantify shape and width, but in particular we also want to classify different types of rip currents (e.g., feeder rips), and in domains with multiple rip currents. While we have not added any further work from this suggestion, we have added a description of some future work (for which we are currently applying for funding), and some limitations of this work in the conclusions.

 

The discussion of performance could be improved in a number of ways. 

  • The bar chart used in Figure 6 is a poor representation of accuracy. This could be improved by using a box plot in place of the right-hand bar for the accuracy across all 23 videos. This would better communicate the average, the typical range, and better acknowledge the occurrence of the outliers that are currently buried in the appendix table (I’d recommend this even if you can only generate the box plot for your own model and can’t for the accuracy of de Silva). Otherwise, this bar chart can just be communicated as a list in the text and is not a useful visual.

 

This is good feedback; I’ve removed this figure and converted it to a table. Note that we decided not to include a box plot, as most of the accuracies are 100% (for the videos), which makes the distributions look strange. I’ve added some more text to communicate these results better (lines 448–460).

 

  • I also think that a limitation of the presented method is that you have created only 1 model. If you removed a different 20% for validation and trained with a different 80%, how repeatable is your performance? Not knowing this is a major limitation to a methods-based paper that ideally would give confidence to the reader that this method would be a useful approach for them to adapt. K-folding your methods across the entire library (say removing a unique 20% five different times to create five different versions of the model) would be a good way to communicate the repeatability. 

This is a good point; we’ve repeated our analysis with K-fold validation. This has highlighted that the CNN’s accuracy of 0.57 (and other experiments) is of course not statistically significant. We’ve added some more discussion of this in the relevant sections. These results are incorporated in lines 448–460.
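For readers who want to reproduce this check, below is a minimal sketch of the five-fold scheme described above (the helpers build_model, load_images, and accuracy are hypothetical placeholders rather than the code used in the paper):

```python
import numpy as np
from sklearn.model_selection import KFold

def kfold_accuracies(image_paths, labels, n_splits=5):
    """Hold out a unique 20% of the labelled library five times and report the spread.

    image_paths and labels are NumPy arrays; build_model, load_images and accuracy
    are hypothetical placeholders standing in for the training pipeline.
    """
    scores = []
    splitter = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_idx, val_idx in splitter.split(image_paths):
        model = build_model()                                   # fresh model for every fold
        model.fit(load_images(image_paths[train_idx]), labels[train_idx])
        preds = model.predict(load_images(image_paths[val_idx]))
        scores.append(accuracy(labels[val_idx], preds))
    return float(np.mean(scores)), float(np.std(scores))        # mean and spread across folds
```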

 

  • This may be beyond what’s necessary for this publication, but I am curious just how frequent of a problem the issue of identifying a rip for the wrong reasons occurs. You’ve given an example in Figure 3, but what percentage of times did you find this? As in, how big of a problem might this actually be in simpler methods that just give a positive/negative output? I’m envisioning the ability to make an extended confusion matrix with a 3rd column/row to account for this phenomenon could really benefit the metrics reported, as this would highlight true/false positives/negatives and then support the claim that the interpretable AI is necessary.

 

This is a good point. I think this would require a lot more work and more careful inspection. We’ve added some commentary about this point beneath Figure 3 and in the conclusions. Text was added to lines 317–320, and more information on the augmentation is added in lines 380–389.
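Purely for illustration, a small sketch of the extended tally the reviewer envisions; the “right reason” flag is a hypothetical judgement of whether the Grad-CAM heatmap overlaps the annotated rip location:

```python
def extended_confusion_counts(true_labels, predicted_labels, heatmap_on_rip):
    """Tally predictions into three categories instead of the usual two.

    true_labels / predicted_labels: sequences of 0 (no rip) and 1 (rip);
    heatmap_on_rip: booleans indicating whether the Grad-CAM heatmap overlapped
    the annotated rip (a hypothetical manual or IoU-based judgement).
    """
    counts = {"correct, right reason": 0, "correct, wrong reason": 0, "incorrect": 0}
    for truth, pred, focus_ok in zip(true_labels, predicted_labels, heatmap_on_rip):
        if truth != pred:
            counts["incorrect"] += 1
        elif truth == 1 and not focus_ok:
            counts["correct, wrong reason"] += 1      # right answer, wrong evidence
        else:
            counts["correct, right reason"] += 1      # negatives have no heatmap to judge
    return counts
```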

 

 

While I generally agree with the perspectives on advantages/limitations of different rip current identification algorithms, I do think you are missing some of the most recent literature. In the last year, Anderson et al. [2021] (doi:10.3390/rs13040690) introduced an optical flow-based algorithm that runs in a matter of seconds (effectively real-time) and Rodriguez-Padilla et al. [2021] (doi:10.3390/rs13101874) further enhanced the filtering. Both works clearly have the caveat of requiring a fixed camera and recording video, but they can similarly capture amorphous rip structures (see last figure of Anderson et al. [2021]). I think the “orders of magnitude slower” comment in the table should be removed (or at least caveated to recognize that it is dependent on the algorithm - PIV vs. optical flow), and I think because both of those works are from the same journal you are submitting to, that you should be encouraged to reference them and highlight potential advantages over them.

 

Thank you for these great references. The words “orders of magnitude” have been removed, and the references have been added on lines 105 to 107.

Another recent work from this journal and likely worth highlighting in your discussion of timex imagery to identify rips is Ellenson et al. [2020], who use a CNN + image augmentation to identify different beach states with the presence of rip channels being associated with the presence of specific classes. As presently written, the manuscript infers that these images can only help with identifying wave breaking, but I think it should acknowledge that you can do more processing with those images that provides a downscaled product with greater use than simply a spatial wave breaking pattern.

 

Thank you. This reference was added on lines 109 to 110.

 

The paper notes the speed of this algorithm as an advantage, but never quantifies the computational resources and time required. How long was the upfront fitting on the GPU? How quickly does it process the validation videos in terms of time or per frame? If the reader wanted to replicate your method, what computational needs do they need to plan for?

Very good point; we should have outlined this. We have added some more information in the model architecture and training sections.
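As a rough guide for readers planning such a benchmark, a hedged per-frame timing sketch (the model and frames are placeholders, not the authors’ benchmark code):

```python
import time
import numpy as np

def mean_inference_seconds(model, frames, n_warmup=5):
    """Average single-frame prediction time for a Keras-style model (placeholder inputs)."""
    for frame in frames[:n_warmup]:
        model.predict(frame[np.newaxis, ...])          # warm-up: GPU initialisation and caching
    start = time.perf_counter()
    for frame in frames:
        model.predict(frame[np.newaxis, ...])
    return (time.perf_counter() - start) / len(frames)
```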

Minor comments:

Line 48: Correct formatting so start of the sentence is “Brander et al. [2016]” not [11].

Modified

Line 63: missing an end parenthesis.
Added

 

Line 69: This perspective is a little too Argus-centric - there’s actually a pretty rich number of platforms developed in other regions of the world outside of these two: HORUS, CoastalCOMS, KOSTASYSTEM, COSMOS (Taborda & Silva, 2012; doi:10.1016/j.cageo.2012.07.013), SIRENA (Nieto et al. 2010, doi:10.1002/esp.2025), Beachkeeper (Brignone et al. 2012, doi:10.1016/j.cageo.2012.06.008), and ULISES (Simarro et al. 2017, doi:10.2112/JCOASTRES-D-16-00022)

Thank you, these references have been added on lines 75 and 76.

 Line 126-127: You have already defined what a CNN is in line 97 so don’t need to repeat that abbreviation’s definition here.

Modified

Line 153: You have not introduced/defined YOLO to the reader at this point.

Modified

Line 158: “surf-live saving” … is this surf-life saving?
Modified

Line 167: Probably should spell out Grad-CAM’s full name here, and define the abbreviation, rather than in lines 254-255.

Modified, and updated the full name earlier in the text.

Line 215: replace “as appose to” with “as opposed to” 

Modified in text.

Lines 272 & 273: Looks like an issue with automatic latex referencing. 

It appears the PDF conversion introduced this issue.

Line 276: Replace GRADCAM with Grad-CAM

Modified in text.

Figure 5: I’m surprised to see such rigid polygons for the random shadow. Is this supposed to be representative of passing cloud cover?

Correct; however, they can also represent shadows of buildings. The idea here is to make the algorithm more generalizable to imagery that it has never seen before. While the synthetically created imagery isn’t necessarily an accurate depiction of real shadows, clouds, or buildings, it can improve the accuracy and ability of the model to generalize. Moreover, the introduction of synthetic data provides an intermediate solution for generating more data without manually collecting it.
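For illustration, a sketch of this kind of synthetic-weather pipeline; the transform names in Table 2 resemble the albumentations library, but that is an assumption rather than the authors’ confirmed implementation:

```python
import albumentations as A

# Assumed albumentations-style pipeline: polygonal shadows (clouds or buildings),
# fog, rain, and colour perturbations applied at random to each training image.
augment = A.Compose([
    A.RandomShadow(p=0.3),
    A.RandomFog(p=0.3),
    A.RandomRain(p=0.3),
    A.ChannelShuffle(p=0.2),
    A.RandomBrightnessContrast(p=0.3),
])
# augmented_image = augment(image=image)["image"]
```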

Line 374: able “to” accurately

Modified in text.

Line 382-384: Are there any hypotheses for why these videos are not working?

We are not sure but hoping that the answers will become clear as we further develop the rip size and type classifications.

Line 424-427: But does the method in this paper actually identify feeder currents? I would argue that Figure 7 indicates it does not. Do you have a better example you could provide as evidence?

In some instances, our approach is capable of identifying feeder rip currents; however, in many it is not. We have slightly redacted this comment to reflect the performance of our model.

Line 474-475: The only performance metric I can find is the accuracy across the 23 videos, is there supposed to be more supplementary/appendix material with a list of other metrics?

There should be a large table in the appendix that has the accuracy metric for each individual video. Because each video consists of only one class, I think it should be relatively informative alone. I’ve added a confusion matrix to the results section also.

Author Response File: Author Response.pdf

Reviewer 2 Report

This review is compiled from 3 independent reviewers from the same institution.

Reviewer A

Paper Summary:

This paper proposes a method for the identification and localization of rip currents from either still images or video frames. The underlying method is an ML method called Grad-CAM, which was originally designed to provide transparency in ML algorithms. The authors use a MobileNet CNN model to determine the presence of a rip; the output layer of that model is then used to generate a heatmap that accounts for the portions of the image that led to that determination. The authors also performed various data augmentations to make the detection model more robust to environmental conditions. Finally, the authors compared their work with a previously published method and briefly described a variation that aggregates video frames to find rip boundaries.
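For context, a minimal Grad-CAM sketch in Keras (an illustrative reimplementation of the published technique rather than the authors’ code; the convolutional layer name and rip-class index are assumptions):

```python
import numpy as np
import tensorflow as tf

def grad_cam(model, image, last_conv_layer_name, class_index=0):
    """Return a Grad-CAM heatmap in [0, 1] for one image of shape (H, W, 3)."""
    # Model mapping the input to the last conv feature map and the class scores.
    grad_model = tf.keras.Model(
        model.inputs,
        [model.get_layer(last_conv_layer_name).output, model.output],
    )
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[np.newaxis, ...])
        score = preds[:, class_index]                       # score for the assumed "rip" class
    grads = tape.gradient(score, conv_out)                  # d(score) / d(feature map)
    weights = tf.reduce_mean(grads, axis=(1, 2))            # global-average-pool the gradients
    cam = tf.reduce_sum(conv_out * weights[:, None, None, :], axis=-1)[0]
    cam = tf.nn.relu(cam)                                   # keep only positive evidence
    return (cam / (tf.reduce_max(cam) + 1e-8)).numpy()      # normalise to [0, 1]
```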

High Level Review:

Overall, I found the paper interesting; it addresses an important problem and provides a viable alternative solution (with some caveats). In general, the paper is also well written and easy to understand. However, there are also some mistakes/omissions that need further explanation.

Points for improvement:

1.  Figure 3 is used to illustrate a case where classification is correct but attributed to the wrong reason. The paper mentions that they address such shortcomings by augmenting the training set. It would be very useful to provide the heatmap of the same scene after the model was retrained with data augmentation. Also, please provide more details about the process of how this is carried out in steps 3-4 (lines 311-314). Presumably, there are some systematic issues in the model (or a lack of generality in the training data) that would account for the incorrect focus on the beach rather than the rip in Figure 3. Which data augmentation scheme was responsible for moving the focus away from the beach and towards the actual rip? Is it also correct to assume that there was no manual annotation of the image to indicate where the rip is located? That is, there is no feedback loop to Grad-CAM informing it which part of the heatmap is correct/incorrect? This is not clear from the description.

2.  Figure 8 shows 2 examples where the images are classified as having rips, but the heatmaps are incorrect. Could further data augmentation address such a problem? How does one go about designing appropriate augmentations?

3.  In Table 2, how does histogram normalization, etc. create data augmentations for rocky outcrops, etc.?

4.  In Figure 7, it seems like the 10th percentile exceedance contour describes the boundary of breaking waves rather than the boundary of rip current.

5.  Which version of MobileNet is used -- original, V2, V3?

6.  It would be useful to see an example of where multiple rips are present in the scene.

7.  Limitation:  The type of rip current is limited to what is provided in the training set i.e. currently, bathymetry-controlled rips.  I believe that this would also apply to feeder rips -- unless there is sufficient training data for these, images of the main rip alone may not be sufficient to localize the feeders.

8.  The paper is missing a citation to a relevant reference of an operational system for detecting flash rips based on image processing:

Lifeguarding Operational Camera Kiosk System (LOCKS) for flash rip warning: Development and application. https://www.researchgate.net/publication/335303719_Lifeguarding_Operational_Camera_Kiosk_System_LOCKS_for_flash_rip_warning_Development_and_application

9.  Double check references. Many are incomplete e.g. [8][40][43], some are missing page numbers, [10] has errors.

Specific Comments (Sequential, more or less):

1.  line 63: missing )

2.  line 126: [20] and [26] do not use CNN's

3.  line 135:  Please elaborate on what is meant by "to classify".  In its current form, I believe the model can only detect bathymetry controlled rips but not other types of rip current.  Perhaps, the authors meant "to determine"?

4.  line 168: This step does not ..

5.  line 182:  rip current expert ..

6.  line 211:  it would be useful to provide a short description of the "gradient problem".

7.  line 215:  as opposed to ..

8.  line 226 and throughout:  have a consistent spelling of MobileNet.

9.  line 244:  are trained with an initial ..

10. line 272/273:  missing reference

11. line 276 and throughout:  have a consistent spelling of Grad-CAM.  

12. throughout: Place figures and tables close to where they are referred to in the paper.  I found myself having to flip pages back and forth.

13. throughout: Use consistent naming e.g. Table 1 lists 2 models as CNN and MobileNet, but in Figure 6, I think only MobileNet has transfer learning of rip currents and is simply referred to as Transfer Learning. I would suggest using longer, more complete labels for these. On this note, was the simple CNN model trained on the same dataset as MobileNet and then retrained with rip currents?

14. line 368: against Table 4 of ..

15. line 374: is able to accurately ..

16. line 475: obtaining an overall ..

17. line 479: and disadvantages ..

18. line 483: enable monitoring ..

 

Reviewer B

Summary:

This paper applies an interpretable AI method, Grad-CAM, to interpret the predictions of MobileNet and a baseline CNN on a rip current image dataset. Additionally, these interpretations are used to create data augmentations to improve the classification accuracy of MobileNet and the baseline CNN on the same dataset.

General Concept Comments: 

The authors used a data augmentation scheme to improve accuracy. However, the comparison between training with and without augmentation is incomplete. In Figure 6, the authors only show with and without augmentation for the baseline CNN. Adding the comparison for MobileNet with and without augmentation would strengthen the paper.

The authors used different types of data augmentation methods to augment the dataset. However, the motivations listed by the authors in Table 2 for why some of the augmentation methods were used are not entirely convincing. For example, regarding channel shuffle and shift, what is the authors' objective in using these augmentations? Random rain, fog, and shadows are also used; however, none of the test videos had these cases. Adding a description of what motivated the authors to include a particular type of augmentation would improve the paper.

The authors used rotation as a data augmentation method, as shown in Figure 5. However, in the process, the authors are introducing a boundary with sharp edges to the training data. The test data does not contain this type of boundary with sharp edges. From my understanding, the trained model expects to see a sharp boundary on the test data to predict the rotated/oblique rip currents effectively. The authors should address the effect of this boundary on the trained models. Alternatively, the authors could zoom in after rotating to remove this boundary. 
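A brief sketch of the reviewer's suggested fix, using assumed albumentations-style transforms (not the authors' pipeline): rotate, centre-crop inside the rotated frame so the sharp black border is discarded, then resize back to the assumed 224-pixel network input.

```python
import albumentations as A

rotate_without_border = A.Compose([
    A.Rotate(limit=30, p=1.0),                   # random rotation up to +/- 30 degrees
    A.CenterCrop(height=160, width=160, p=1.0),  # stay inside the rotated 224 x 224 frame
    A.Resize(height=224, width=224),             # back to the assumed input size
])
```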

The authors used transfer learning (line 227) and fine-tuning to improve the convergence time. However, the description of the steps the authors followed to do transfer learning is lacking. From what I understand, the MobileNet model trained for 1000 categories has a different fully connected layer than the MobileNet trained for two categories (rip/no-rip). Adding more information on what changes the authors made to the original model trained on ImageNet, followed by the steps used to do the transfer learning, would improve the paper.
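To make the requested steps concrete, a hedged transfer-learning sketch in Keras; MobileNetV2 and the input size are used purely for illustration (the paper's exact MobileNet version is queried above), so this is not the authors' code:

```python
import tensorflow as tf

# Load MobileNetV2 pretrained on ImageNet without its 1000-class head, freeze the
# convolutional base, and attach a new binary (rip / no-rip) classification head.
base = tf.keras.applications.MobileNetV2(
    include_top=False, weights="imagenet", input_shape=(224, 224, 3))
base.trainable = False                                       # frozen for the first phase

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),          # rip / no-rip
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="binary_crossentropy", metrics=["accuracy"])

# ... train the new head, then fine-tune: unfreeze the base and recompile
# with a lower learning rate before continuing training.
base.trainable = True
model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),
              loss="binary_crossentropy", metrics=["accuracy"])
```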

The authors claim that object detectors such as Faster RCNN and YOLO are less generalizable in line 419. The authors further claim that their method can learn complex patterns such as feeder currents, while Faster RCNN and YOLO cannot. The authors do not provide enough evidence to support these claims. No images were provided of the detected feeder rip currents. Removing this claim from the paper or adding evidence to support these claims would improve the paper. 

In Table 3, the authors indicate that some methods are real-time while others are not. From what I understand, the execution time of a method heavily depends on the type of machine used (how many parallel GPUs, how many CPUs, how much RAM, etc.). Therefore, it would be valuable for the reader to know which type of machine was used to make these comparisons. 

Specific Comments:

148: missing citation for gradient CAM

182: exert ? expert

272: missing references 

273: missing references

 

Reviewer C

This paper presents a new approach for rip current detection using an interpretable AI method based on Grad-CAM, a method proposed in [38] and implemented in [40] in earlier works. The authors discussed the existing rip current detection methods from various previous works and compared the advantages of their work over them (e.g., not a bounding box-based representation, amorphous rip current detection, etc.). In the results section, they quantitatively compared their rip current detection results with [6], using the same dataset provided by that work.

The paper is well-written, organized, and easy to read. The challenges addressed in the article are defined clearly. A reasonable amount of previous works was covered in the paper. The paper also presents the details of a proposed data augmentation strategy and demonstrates it on the existing dataset, which can potentially be applied to generalize ML methods.

Figure 2 could show the actual architecture and more detail of the MobileNet used, as it should be available in the original paper. Lines 212-213, "While there exists a wide variety of notable alternatives to Mobile-Net, that have typically more model parameters (e.g., Res-Net50 or VGG-16) they did not show any significant improvements in accuracy" should be backed by a reference.

Figure 3: the explanation is not very clear. It is hard to understand what the "wrong reasons" are and how or why they lead the model to detect the beach and the people as the rip current.

The paper proposes an augmentation strategy to overcome the lack of diversity in model training data. However, the augmentation results do not look visually realistic in some of the example images shown in Figure 5. In the last three augmentation results in Figure 5, the random shadow, fog, and rain do not look realistic. As it is a top-down view, the rain shouldn't look like straight lines, and the fog shouldn't look like circular patches similar to lens flare. Careful consideration and more evidence may be needed to understand how the trained model will work in real-world applications (such as in rain or fog) if trained with augmented image data that is not visually realistic.

Even though the challenges are defined, the contributions and conclusion of the work do not seem too strong to me. The final results (accuracy) are not better than the previous works (also mentioned in the paper). However, the authors claim the flexibility of this method provides opportunities that are not evaluated clearly. There are mentions of drone applications multiple times (including future work). However, it is not discussed how the work will benefit drone applications, which I feel would contribute to the completeness of the paper.

Overall, I feel the paper presents a new method of rip current detection based on an existing technique (Grad-CAM) but does not fully evaluate the method for the proposed advantages and applications.

Author Response

Please find our responses, inline, in blue text below each comment. All changes have been made via track changes in the manuscript.

 

 

Author Response File: Author Response.pdf

Round 2

Reviewer 2 Report

The revised paper is much better.  There are a couple of lingering questions which I think can be handled with a minor revision without another round of reviews.  There are also some minor typos that can be easily fixed.

1.  line 384-385:  how/when do you know the model is sufficiently trained?

2.  line 414: distorting the bounding boxes together with the input image should not be a problem for FRCNN to handle

3.  will the software be published/available with the publication?

 

Typos:

1. line 296, 297: missing "." i.e. sentence breaks

2. line 325, 327: missing figure #

3. line 326: garbled sentence

4. line 385: .. model was ..

5. line 389: .. cases ..

6. line 449: missing citation, mask r-cnn should be faster r-cnn

 

 

Author Response

The revised paper is much better.  There are a couple of lingering questions which I think can be handled with a minor revision without another round of reviews.  There are also some minor typos that can be easily fixed.

Thanks again for your extremely valuable feedback.

  1. line 384-385:  how/when do you know the model is sufficiently trained?

This had been outlined slightly earlier in the text, but we added further clarification. Modifications were made on lines 295–297.

“To ensure that the models do not overfit, the learning rate is adapted through regular monitoring of the validation cost function (on the validation dataset). The learning rate is reduced by a factor of 3 if the validation loss does not decrease after three epochs. Furthermore, if the validation loss fails to reduce after 10 epochs, training is stopped”
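The quoted schedule maps directly onto standard Keras callbacks; a sketch under that assumption (not necessarily the authors' exact implementation; restore_best_weights is an added convenience):

```python
import tensorflow as tf

callbacks = [
    # Divide the learning rate by 3 when the validation loss stalls for 3 epochs.
    tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=1/3, patience=3),
    # Stop training if the validation loss has not improved for 10 epochs.
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                     restore_best_weights=True),
]
# model.fit(train_ds, validation_data=val_ds, epochs=100, callbacks=callbacks)
```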

  2. line 414: distorting the bounding boxes together with the input image should not be a problem for FRCNN to handle

This is a good point. I’ve retracted some text to make this distinction clear.
Lines 417-419 now read:

In previous work by (de Silva, Mori et al. 2021), the data augmentation was limited to rotating the images 90° either side. Our approach to data augmentation could also be beneficial to object detection methods such as Faster R-CNN or YOLO, as augmentations such as image shearing, perspective transformations, and image zooming can also be applied to the bounding box coordinates.
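As an illustration of how such augmentations carry over to detectors, a sketch assuming the albumentations library (the augmentation package is not stated in the response), where geometric transforms update the bounding boxes together with the image:

```python
import albumentations as A

bbox_augment = A.Compose(
    [
        A.Affine(shear=(-10, 10), scale=(0.8, 1.2), p=0.5),   # shearing and zooming
        A.Perspective(scale=(0.05, 0.1), p=0.5),              # perspective transformation
        A.HorizontalFlip(p=0.5),
    ],
    # Bounding boxes in pascal_voc format (x_min, y_min, x_max, y_max) are
    # transformed alongside the image pixels.
    bbox_params=A.BboxParams(format="pascal_voc", label_fields=["labels"]),
)
# out = bbox_augment(image=image, bboxes=[[x_min, y_min, x_max, y_max]], labels=["rip"])
```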

  3. will the software be published/available with the publication?

We are committed to ensuring that our data and methods are “open”. We are finalizing some modifications to our experiments and plan to release our code on GitHub as soon as possible.

Typos:

  1. line 296, 297: missing "." i.e. sentence breaks

Modified

2. line 325, 327: missing figure #
Modified

  3. line 326: garbled sentence

Modified

  4. line 385: .. model was ..
    Modified
  5. line 389: .. cases ..
    Modified
  6. line 449: missing citation, mask r-cnn should be faster r-cnn
    Modified