1. Introduction
Rips are defined as strong, localized currents that move along and away from the shore, through the breaker zone [1]. A rip current forms due to conservation of mass and momentum: breaking waves push surface water towards the shoreline, and this excess water flows back towards open water under the force of gravity. The returning water follows the route of least resistance, so there are often preferential locations where rip currents form, such as localized undulations or breaks in a sandbar, or areas of reduced or absent wave breaking. Rip currents are not restricted to oceans and seas; they can also form in large lakes when there is sufficient wave energy. Multiple factors can enable the preferential development of rip currents, including beach morphology, wave height, wind direction, and tides. As a result, some coastlines are more vulnerable to rip currents than others. Due to the complexity of forecasting morphology, numerous studies have adopted probabilistic forecasting methods [2,3,4,5].
Rip currents have been reported as the most hazardous safety risk to beachgoers around the world [1,6] and, in Australia, are responsible for more deaths than floods, hurricanes, and tornadoes combined [7,8]. Beaches present varying levels of risk to beachgoers, depending on the season and location. For example, exposed beaches with large waves, strong winds, and significant tidal variations tend to present greater risks in general [9]. Significant research efforts have also gone into the communication of rip current-related hazards [10]. Ref. [11] revisited the mitigation and escape measures communicated to individuals caught in a rip current and suggested new approaches. That study was a collaboration between academic institutes and Surf Life Saving Australia. The authors highlight the importance of lifeguards on beaches, but also the importance of directly surveying and interviewing rip current survivors to gain valuable insights into the human behavioral aspects of incidents [11].
While rip currents are a well-known ocean phenomenon [1], many beachgoers do not know how to reliably identify and localize them [12]. This also extends to lifeguards, who, due to the highly oblique angle from which they generally observe the ocean, can likewise struggle to identify certain rip currents [11], especially when the coastal morphology is complex or the meteorological-ocean (metocean) conditions change rapidly [13]. Despite warning signs and educational campaigns, this coastal process still poses serious threats to beach safety, with some countries reporting increases in fatalities [3]. Thus, research and development of new techniques for the effective identification and forecasting of this dynamic process are ongoing. These technologies, like the methods presented here, aim to take beachgoer safety from reactive to preventative and will require efficient and clear warning/notification dissemination. The aim is for accurate forecasting and/or identification of rip currents (e.g., [14]) to inform the public where rip currents are occurring, so that they can make informed decisions about the safest place to swim.
Popular beaches are often patrolled by lifeguards, with some beaches being equipped with cameras. These cameras can serve security purposes, provide live weather and beach condition information, or, in some cases, monitor coastal processes [15]. Coastal imagery has been used for over 30 years to detect wave characteristics and beach and nearshore morphology [16,17], and comprehensive, semi-automated systems such as Argus [18] have been developed in the United States, United Kingdom, Netherlands, and Australia, and Cam-Era in New Zealand [19]. Other systems include HORUS, CoastalCOMS, KOSTASYSTEM, COSMOS [20], SIRENA [21], Beachkeeper [22], and ULISES [23]. The Lifeguarding Operational Camera Kiosk System (LOCKS) for flash rip warning [24] is another example. While many beaches utilize single cameras or camera networks, few, if any, have real-time processing to identify features such as rips. As a result, most rip current detection is done manually by lifeguards and beachgoers [12]. Any rip current forecast or real-time identification tool could therefore assist lifeguards and beachgoers in reducing rip-related rescues and drownings. While in situ measurements such as acoustic Doppler current profilers (ADCPs), floating drifters, and dye have been used to study and quantify rip currents [25,26], these are time-consuming and expensive, and must be deployed where a rip is occurring. This makes them less useful for identification of rips compared to image processing techniques, which can observe large areas at low cost and effort.
There has also been significant uptake in the use of AI and other image and signal processing techniques for classifying and localizing rip currents (e.g., [6,25,27]), wave breaking [28,29], and coastal morphology [5,30]. Image and other signal processing techniques often use time-exposed images, created by averaging a series of frames. This technique works well for rip currents located where waves do not break, as these regions appear visually darker: places with consistent wave breaking appear as blurred white, while the location of a rip appears darker. There are several limitations to these techniques. Firstly, because of time-averaging over periods of at least 10 min, they cannot detect and capture non-stationary, rapidly evolving rip currents, which is needed in the context of surf-life saving. Secondly, there are significant challenges in automatically deriving thresholds for rip currents, which vary as a function of the underlying bathymetry; there is no one-size-fits-all threshold for detecting rip currents with this method (also due to ambient light conditions). Optical flow methods, which capture the motion between individual image frames, are another promising technique [31]. This technique overcomes the issue of detecting rapidly evolving rip currents inherent to time-averaging. However, automating such an approach is also challenging, as the algorithm needs to distinguish between the wave action, the possible rip current, and the motion in the background. Recent studies (e.g., [25]) have indicated these approaches are sensitive to the beach bathymetry, and thus thresholds for rip current detection vary from location to location [32]. These approaches have often led to many false positives. Furthermore, due to computational constraints, these techniques are challenging to deploy in real time. Ref. [33] did, however, present recent research that utilizes two-dimensional wave-averaged currents (optical flow) in the surf zone, making use of a fixed camera angle. This was further developed by [34], and both studies can capture amorphous rip current structures. Ref. [35] used an image augmentation strategy to identify different beach states, with the presence of rip channels being associated with specific classes; enabling a greater amount of relevant, beach-related information is thus useful for physical process identification. A solution to many of the limitations of traditional image processing techniques is offered by deep learning models such as convolutional neural networks (CNNs). While deep learning models can be relatively slow to train, they are fast to deploy and apply in a real-time context, which is also promising for drone technology and part of the envisaged future plans of the current study.
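The time-exposure technique described above can be sketched in a few lines. Assuming greyscale video frames are available as a NumPy array, averaging over time blurs persistent breaking to bright white, and a darkness threshold flags candidate rip channels; the threshold value used here is purely illustrative, since, as noted, no single value works across bathymetries and light conditions:

```python
import numpy as np

def time_exposure(frames: np.ndarray) -> np.ndarray:
    """Average a stack of greyscale frames (T, H, W) into a time-exposure image.

    Persistent wave breaking blurs to bright white; rip channels stay dark.
    """
    return frames.mean(axis=0)

def candidate_rip_mask(timex: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Flag dark (non-breaking) pixels as candidate rip locations.

    NOTE: the threshold is illustrative only; no single value generalizes
    across bathymetries and ambient light conditions.
    """
    return timex < threshold

# Toy example: bright 'surf' everywhere except a dark vertical channel.
rng = np.random.default_rng(0)
frames = rng.uniform(0.7, 1.0, size=(20, 8, 8))        # breaking waves (bright)
frames[:, :, 3] = rng.uniform(0.0, 0.2, size=(20, 8))  # dark rip channel

timex = time_exposure(frames)
mask = candidate_rip_mask(timex)
```

This also makes the first limitation above concrete: any rip that migrates during the averaging window is smeared across the time-exposure image and lost.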
In the present study, we investigate the usefulness of interpretable AI, particularly in the context of model improvement. We also highlight some of the advantages of supervised learning through deep learning techniques such as CNNs: their ability to learn from experience and to learn complex dependencies and features in order to derive a set of model weights/parameters that produces maximum accuracy. CNNs also require less human input, which is advantageous over traditional image-processing techniques, for which thresholds need to be defined. As many AI algorithms such as CNNs have many tunable parameters, they require a large amount of training data. A lack of diversity in training data can also result in poor model generalization; in the context of rip current detection, models require training data from beaches representing a wide variety of environmental settings. While the amount of data required for training CNNs can be extremely large (and could practically be unobtainable), there are many approaches to overcome this and reduce overfitting. For example, data augmentation increases the amount of training data by manipulating each individual image through a series of transformations such as rotations and perspective transforms. Additionally, transfer learning has become a widely used technique for training AI-based models on small datasets, where a model is first trained on an extremely large dataset and then fine-tuned on the smaller dataset.
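As a concrete illustration of data augmentation, the following minimal sketch (NumPy only; the helper name and the particular transformations are our own illustrative choices, not the study's full pipeline) expands one labelled image into several transformed variants without requiring any new labels:

```python
import numpy as np

def augment(image: np.ndarray) -> list[np.ndarray]:
    """Generate simple augmented variants of one (H, W, C) image.

    Real pipelines (arbitrary-angle rotations, perspective transforms,
    synthetic rain, etc.) are richer; this sketch only shows the idea of
    multiplying the effective training-set size.
    """
    return [
        np.fliplr(image),            # horizontal mirror
        np.rot90(image, k=1),        # 90-degree rotation
        np.rot90(image, k=2),        # 180-degree rotation
        np.clip(image * 0.8, 0, 1),  # darker exposure (ambient light change)
    ]

rng = np.random.default_rng(1)
img = rng.uniform(size=(16, 24, 3))      # toy RGB image
augmented = augment(img)                 # four variants from one image
```

Each variant inherits the original image's label (rip/no rip), which is what makes augmentation attractive when labelled coastal imagery is scarce.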
Typical AI research questions focus on detecting objects with well-defined boundaries (e.g., humans, dogs, cars, etc.). To train an AI-based model to both classify and localize the position of the object(s) within each image, a bounding box needs to be defined around each object. Several studies have successfully used object classification algorithms to both localize and predict the occurrence of rip currents. Ref. [27] used a CNN to predict the occurrence of rip currents, and more recently, [6] used a Faster R-CNN (region-based CNN) to both localize (predict a bounding box) and predict the occurrence of rip currents, achieving an accuracy of over 98% on a test dataset. The other challenge with rip current detection is that a rip current is not necessarily visible within each individual video frame, but rather over a sequence of images. Because training on video sequences is both time-consuming and requires significantly more training data (e.g., unique videos), existing approaches (e.g., [6]) have made use of CNNs on static images of rip currents. To avoid instances where the rip current was not observed, [6] used a frame aggregation technique in which predictions are aggregated over a time interval. They noted that when predictions are aggregated over a period, the false positive/negative rates are reduced.
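The frame aggregation idea can be sketched as a majority vote over per-frame classifier outputs within a sliding window. The window length and voting rule below are illustrative assumptions, not the configuration published in [6]:

```python
from collections import deque

def aggregate_predictions(frame_preds, window=10):
    """Majority-vote per-frame rip/no-rip predictions over a sliding window.

    frame_preds: iterable of 0/1 per-frame classifier outputs.
    Returns one aggregated 0/1 decision per frame; isolated single-frame
    flips (a common source of false positives/negatives) are smoothed out.
    """
    recent = deque(maxlen=window)
    out = []
    for p in frame_preds:
        recent.append(p)
        out.append(1 if sum(recent) * 2 > len(recent) else 0)
    return out

# A spurious single-frame detection at index 2 is suppressed, while the
# sustained detection from index 5 onwards is eventually confirmed.
noisy = [0, 0, 1, 0, 0, 1, 1, 1, 1, 1]
smoothed = aggregate_predictions(noisy, window=5)  # -> [0,0,0,0,0,0,1,1,1,1]
```

The trade-off is latency: a genuine rip must persist for roughly half the window before the aggregated output confirms it.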
While these approaches have demonstrated early success in the context of rip current detection and localization, there are several issues with the implementation of AI-based algorithms in a real-world setting, which the present study aims to address:
Lack of methods to classify the amorphous structure of rip currents,
AI-model interpretability, to understand whether the model is learning the correct features of a rip current and whether there are deficiencies within a model,
Alternative data augmentation methods to enhance the generalization of an AI-model, and
Building trust in the AI-based model predictions.
A major advantage of the methods presented here is that they are not reliant on bounding boxes. Bounding boxes are usually predefined, and models trained on them learn only from the information contained within each box. Here, the model can capture some characteristics of the amorphous structure (rip current shape) because it learned a variety of possible coastal features with no bounding boxes. This enables the use of this technology with drones (changing camera views along a track), and not just fixed-angle cameras, which is part of the deployment options planned for the present study.
The present study introduces an interpretable AI method, namely gradient-weighted class-activation maps (Grad-CAM, [36]), to interpret the predictions from the trained AI-based models. Grad-CAM opens up the typical AI black box, revealing which regions/pixels of the input image have influenced the AI-based model's prediction. This, in turn, also enables the prediction of amorphous boundaries for a classified rip current. The present approach does not constrain the model to learn features specific to a placed bounding box, as in Faster R-CNN and you-only-look-once (YOLO) object-detection approaches [37], where the algorithm is forced to learn very specific supervised features even though other characteristics may be relevant. The present approach also introduces interpretable AI in the context of identifying subjective model deficiencies that are independent of traditional accuracy metrics. These approaches can help inform better model development and augmentation strategies to improve the generalization of an AI-based model. Complex AI-based models whose decisions cannot be well understood can be hard to trust, particularly in the context of surf-life saving, where human health and safety are at stake. There is thus a clear need for trustworthy, flexible (no bounding box), and high-performing AI-based rip detection models for real-world applications, which the present study aims to address.
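The core Grad-CAM computation [36] is compact enough to sketch with NumPy. Given the feature maps of the last convolutional layer and the gradient of the class score with respect to them (filled here with toy values, since obtaining real gradients requires a deep learning framework), the heatmap is a ReLU of the gradient-weighted channel sum:

```python
import numpy as np

def grad_cam(feature_maps: np.ndarray, gradients: np.ndarray) -> np.ndarray:
    """Grad-CAM heatmap from last-conv feature maps and class-score gradients.

    feature_maps, gradients: arrays of shape (C, H, W).
    Channel weights alpha_c are the spatially averaged gradients; the
    heatmap is the ReLU of the weighted channel sum, normalized to [0, 1].
    """
    weights = gradients.mean(axis=(1, 2))              # alpha_c, shape (C,)
    cam = np.tensordot(weights, feature_maps, axes=1)  # weighted sum -> (H, W)
    cam = np.maximum(cam, 0)                           # ReLU
    if cam.max() > 0:
        cam = cam / cam.max()                          # normalize to [0, 1]
    return cam

# Toy example: channel 0 activates over a 'rip' region with positive
# gradient; channel 1 is diffuse background with negative gradient.
fmap = np.zeros((2, 4, 4))
fmap[0, 1:3, 1:3] = 1.0   # activation over the rip region
fmap[1, :, :] = 0.5       # diffuse background activation
grads = np.stack([np.full((4, 4), 1.0), np.full((4, 4), -1.0)])

heatmap = grad_cam(fmap, grads)
# An amorphous boundary can then be drawn by thresholding the heatmap,
# e.g., mask = heatmap > 0.5
```

Because the heatmap is thresholded rather than box-shaped, the highlighted region can follow whatever outline the activations take, which is what permits amorphous rip boundaries.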
3. Results
The two models outlined in Table 2 were applied to a series of 23 testing videos, with and without rip currents, and compared to the Faster R-CNN model trained by [6]. In Figure 6, the percentage accuracy score of each training method is presented for all 23 testing videos. The present results demonstrate that transfer learning and data augmentation improve the overall accuracy score of the model on the test dataset. It is important to highlight that the validation accuracy (on aerial imagery) differs significantly from the overall test accuracy (oblique imagery). All model configurations achieved at least 90% K-fold accuracy on the validation dataset; however, very different results were obtained on the independent test dataset.
Without augmentation and transfer learning, the CNN model struggles to generalize on the test set, achieving a maximum accuracy of about 0.59 (or 59%) and an average K-fold accuracy of 0.51, as illustrated in Table 3; this is not useful in a real-world context. When data augmentation is used, the accuracy increases dramatically to 0.75, with a K-fold accuracy of 0.69. Similarly, MobileNet without transfer learning or augmentation (weights not initialized from ImageNet) performs little better than chance, with an accuracy of 0.51 and a K-fold accuracy of 0.48. When augmentation or transfer learning on ImageNet [44] is used alone, the accuracy improves to 0.68 and 0.70, respectively. However, when augmentation and transfer learning are used together, we obtained a maximum accuracy of 0.89 and a K-fold accuracy of 0.85. The confusion matrix for the MobileNet model using augmentation and transfer learning is shown in Table 4. Overall, the accuracy for each class (rip current or no rip current) is relatively well balanced.
Overall, these experiments highlight the value of both transfer learning and data augmentation in an image classification context. Augmentation alone adds significant value to the 3-layer CNN, resulting in 75% accuracy, but with the MobileNet model we are unable to achieve comparable accuracy through augmentation alone. This is likely due to the MobileNet model being overparameterized, its large number of parameters making it extremely challenging to train and for its loss function to converge. Similarly, with transfer learning alone, the MobileNet model still struggles to generalize to the testing dataset. Here, the model is no longer overparameterized, as only several layers are tunable; rather, our results suggest that the lack of diversity in the training data is a limiting factor in the model's accuracy and its ability to generalize. However, when augmentation and transfer learning are combined, the accuracy improves dramatically.
Upon applying the interpretable AI method (Grad-CAM) to each of the models, it became clear that poorer-performing models tend to focus on other objects, such as trees, coastlines, rocks, and people, instead of the rip currents (e.g., Figure 3). Even in scenarios where the model prediction is correct, the heatmap from Grad-CAM illustrates that the model is not 'looking' at the correct features; this is a clear indication of poor model generalization. We further tested whether our best-performing model would generalize to the testing dataset when augmentation was applied (e.g., random rain and perspective transformations), and the accuracy scores remained relatively unchanged. This is further illustrated in Figure 6, where the Grad-CAM heatmap consistently rotates with the rotation of the input image.
Note that in the present approach no aggregated or averaged predictions are required from the AI-based model. Accuracy metrics for the model are provided for each video in the testing dataset and are benchmarked against [6] in Table A1 of Appendix A.
Overall, using the interpretable AI method Grad-CAM, the present study's model is able to accurately localize the position of a rip current. The position of a rip current can be localized by examining the heatmap of pixel importance for a prediction made by the MobileNet model (or any CNN in general). In the leftmost column of Figure 7, the most important pixels are the warmest (reddest) and align strongly with the position of the actual rip current. Grad-CAM is also able to capture some characteristics of the amorphous structure of the rip current. This approach to localization differs significantly from existing AI methods, which identify the bounding box in which the rip current is located. The examples shown in Figure 7 are from the test dataset provided by [6]. It is important to emphasize that this approach was applied to all 23 videos and performs well across nearly all of them, aside from the videos named rip_03.mp4, rip_05.mp4, and rip_15.mp4, where the performance/accuracy is lower. Examples of poorly classified images are given in Figure A1 in Appendix A.
To ensure the boundaries of the rip currents predicted by the interpretable AI are realistic, a comparison was made between the predicted boundaries and those identified using more traditional image processing techniques. Traditional image processing techniques (after [16,53]) tend to utilize the average pixel intensity over a fixed duration to determine areas of wave breaking (lighter areas corresponding to broken whitewater) and non-breaking (darker areas without whitewater present). However, such approaches may lose considerable information during the averaging process and require further thresholding and interpretation to quantify areas of non-breaking. In an alternative approach, [54] identified the area of wave breaking in each frame and summed these areas into a cumulative breaking portion over the observation period. The threshold for defining breaking within individual frames is adjustable, yet this approach provides more control and definition than time-averaged techniques. A breaking exceedance value can then be extracted from the cumulative breaking portion, as illustrated in Figure 7 (right panels), where a 10% exceedance value has been plotted over the equivalent time-averaged image. A rip current is assumed where breaking is not occurring, or occurs only a low proportion of the time. Overall, the AI-predicted boundaries are consistent with those identified by the cumulative breaking portion. It should also be noted that the image processing technique requires a sequence of images or video taken from a relatively stable platform, and does not necessarily identify a rip, but rather areas where breaking is not occurring, which usually correspond to a rip current. Rip currents of short temporal duration (flash rips), or rip currents containing large amounts of foam, are unlikely to be well resolved using this method.
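The cumulative breaking portion approach of [54] can be sketched as follows, applied to synthetic frames. The 10% exceedance level matches the value discussed in the text, while the per-frame intensity threshold (0.6) is an illustrative assumption; as noted, this threshold is adjustable and site-dependent:

```python
import numpy as np

def breaking_portion(frames: np.ndarray, intensity_thresh: float) -> np.ndarray:
    """Fraction of frames in which each pixel is 'breaking' (bright whitewater).

    frames: greyscale stack of shape (T, H, W). A pixel counts as breaking
    in a frame when its intensity exceeds intensity_thresh (an adjustable,
    site-dependent value; 0.6 here is purely illustrative).
    """
    breaking = frames > intensity_thresh   # per-frame binary breaking maps
    return breaking.mean(axis=0)           # cumulative portion in [0, 1]

def rip_mask(portion: np.ndarray, exceedance: float = 0.10) -> np.ndarray:
    """Assume a rip where breaking occurs less than the exceedance fraction."""
    return portion < exceedance

rng = np.random.default_rng(2)
frames = rng.uniform(0.5, 1.0, size=(50, 6, 6))        # frequent breaking
frames[:, :, 2] = rng.uniform(0.0, 0.4, size=(50, 6))  # quiescent rip channel

portion = breaking_portion(frames, intensity_thresh=0.6)
mask = rip_mask(portion, exceedance=0.10)
```

Unlike the plain time-exposure average, the per-frame binarization preserves how often each pixel breaks, which is the quantity the exceedance contour is drawn from.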
4. Discussion
The present study's overall accuracy does not exceed the 0.98 accuracy obtained by [6] with their frame aggregation approach (F-RCNN + FA), but matches closely with their results without frame aggregation (FA). Through further fine-tuning of the present model, a higher accuracy could be achieved. There are several advantages to using Grad-CAM for localization of rip currents over traditional approaches (Faster R-CNN in [6]). Firstly, this approach is semi-supervised (in the context of object localization), and there is no penalization (during model training) of the CNN's prediction of the position of the rip current within an image, unlike in Faster R-CNN or YOLO. This means that there is no need to provide the bounding box for where the rip current is located, which is a time-consuming process when labelling image data. While Faster R-CNN algorithms are very useful in many different applications, penalizing the algorithm on where the rip current is positioned tends to constrain the model to only learn features that are specific to localizing rip currents within the labelled bounding box. This means that algorithms such as Faster R-CNN and YOLO could be less flexible in the features/information that they can learn. Moreover, we have demonstrated that it is possible to 'localize' rip currents relatively precisely without informing the model where the rip current is within the image. Object detectors such as Faster R-CNN (as implemented in [6]) cannot operate in real time and are often not mobile friendly. Our image classification approach can run in excess of several hundred frames per second on a GPU (~400 fps) and in real time on a CPU (20–30 fps). Our interpretable AI method of Grad-CAM, however, can only run at approximately 6–10 frames per second on either a CPU or GPU. Given that rip currents are not present all the time, the interpretable AI method can be run on demand. Our application of Grad-CAM has not yet been optimized, and further improvements in speed are likely possible. The lightweight nature of MobileNet means it can be deployed on lightweight devices such as mobile phones or drones, unlike Faster R-CNN. It is important to highlight that there are alternatives to Faster R-CNN, such as the YOLO object detection algorithm, some variants of which can run on mobile devices and in real time.
Secondly, interpretable AI model development can be subjective and can be further improved through human experience. This subjective element is important, as understanding 'the why' of how an AI algorithm makes a particular prediction reduces the black box and adds an additional level of confidence for the user. Furthermore, subjective quality control is also very important for applications such as deploying this method to aid surf lifeguards. In the context of real-time model deployment (e.g., applied to beach-mounted cameras or drones), interpretable AI has the potential to identify model drift, defined as changes in the model accuracy, or scenarios where the equipment is performing poorly.
Thirdly, the interpretable framework enables the training of effective models and thus informs future developments and potential limitations within the data, as illustrated in the life cycle of interpretable AI development in Figure 4. How Grad-CAM can help identify issues within AI-based models, and how data augmentation strategies can be devised to overcome them, is elaborated in Section 2.
Method Comparison
Both artificial intelligence techniques (interpretable AI and CNNs) can be used to classify and localize rip currents. However, it is important that the limitations of these methods are pointed out and compared to existing technologies that could also be deployed. While human observation will continue to be an important approach to rip current detection [12], other approaches using imagery or video feeds will require the deployment of cameras on a wide variety of beaches. Furthermore, the cameras need to have unobstructed views of the beach and need ongoing maintenance. While large-scale deployment of cameras across beaches is expensive, it would fundamentally change beach safety and would have a huge impact on beaches where lifeguards are off duty or not present at all. This AI technology also opens the door for remotely sensed patrol patterns making use of drone technology. Early notifications through applications could take rip current-related rescue from reactive to preventative, especially given the low numbers of beachgoers who are able to identify rip currents [12]. This is especially true as rip currents are the leading cause of drownings (on wave-exposed beaches) worldwide [12].
In Table 5, the advantages and disadvantages of several rip current detection methods are given, including well-established image processing techniques. Some limitations of AI methods are also given, including poor model generalization and difficulty resolving the complex, amorphous structure of rip currents. This study has highlighted that, with careful consideration of data augmentation (outlined in Section 2.3.1) and interpretable AI, improvements over the existing limitations of AI can be made.
5. Conclusions
The present study investigated the remote identification of rip currents using visual data from cameras and videos. More specifically, the use of artificial intelligence was investigated based on a review of the current status quo. Several recent studies have reported great progress in the field of object or feature detection, with particular focus on rip currents. Although these studies have reported great success in their identification accuracy, they were limited by requiring bounding boxes in the images. These boxes also limited the AI methods' ability to learn features other than rip currents and thus limited the generalization of those methods. The interpretable AI model presented here circumvents that limitation and thus enables the deployment of the new rip current identification tool in more applications. The method presented here also made use of a synthetic data augmentation strategy that enables the model to generalize further and thus become more robust to practical constraints. These include rain on the camera lens, fog, varying sand exposure (tides), and image and video capture perspective angles. A detailed list of model performance metrics is provided for the newly proposed method, which obtains an overall accuracy of 89% (based on the test datasets of 23 videos). Although this number is lower than that of other methods, the flexibility of this method makes it attractive and still presents opportunities for further optimization. Established image processing techniques are also discussed, as these are widely used and reliable. To highlight the functionality of these methods, a table of advantages and disadvantages has been added as part of the discussion. This will enable the reader to get a clear and quick understanding of which methods will be the best solution to a given problem. A future prospect of this study is to test and validate this method using drone technology as well.
This will enable the monitoring of beaches and move rip current-related incidents from a reactive to a preventative approach. It will also enable the monitoring of large beaches, which will then not be constrained to the perspective of singular cameras.
There are certainly many challenges regarding detecting different types of rip currents, including feeder rip currents and images containing multiple rip currents. The current approach demonstrates significant skill in the localization of single rip currents through interpretable AI. Future work will focus on examining whether Grad-CAM can detect multiple rip currents in one image, such as feeder rip currents, which may require further refinement or enhancement of training data and revisions to the Grad-CAM technique. Further work will also investigate the amorphous structure of rip currents, including the ability to capture shape characteristics such as width and length. We will also measure other accuracy metrics in the context of object localization, such as the Intersection over Union (IoU) score. A more diverse range of coastlines, in different environmental conditions, should also be considered, as this will further help progress the understanding of how data augmentation can help models generalize. As AI becomes data-centric, the focus will be on improving the data rather than the models themselves. Both data augmentation and interpretable AI can help in understanding how to better improve the training data so that better model generalization may be obtained. Further studies will also investigate how these techniques can be deployed in an operational setting, e.g., making use of drones. These can then be used to patrol remote or large beaches, beyond the sight of lifeguards and fixed camera angles. Here, model generalization will be even more important.
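For reference, the Intersection over Union score mentioned above reduces, for binary masks (e.g., a thresholded Grad-CAM heatmap versus a hand-labelled rip region), to a few lines; the toy masks below are our own illustration:

```python
import numpy as np

def iou(pred_mask: np.ndarray, true_mask: np.ndarray) -> float:
    """Intersection over Union between two boolean masks of equal shape."""
    intersection = np.logical_and(pred_mask, true_mask).sum()
    union = np.logical_or(pred_mask, true_mask).sum()
    return float(intersection) / float(union) if union else 1.0

pred = np.zeros((4, 4), dtype=bool)
true = np.zeros((4, 4), dtype=bool)
pred[0:2, 0:2] = True   # predicted rip region (4 pixels)
true[1:3, 0:2] = True   # labelled rip region (4 pixels), overlapping by 2
score = iou(pred, true)  # 2 shared pixels / 6 total pixels = 1/3
```

Because it operates on pixel masks rather than boxes, this form of IoU can score amorphous predicted boundaries directly against amorphous labels.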