1. Introduction
Wave observation is an important aspect of physical oceanography and is crucial for marine forecasting, disaster prevention and mitigation, ocean engineering, and maritime safety. Conducting observational research on waves therefore has significant scientific and practical value. The parameters observed for nearshore waves primarily include the mean wave height, wave direction, wave period, and significant wave height (SWH), with SWH being particularly important. SWH represents the average height of the highest one-third of waves within a given period. Current wave observation methods include manual observation, instrument measurement, and remote sensing inversion [
1]. At present, wave observation technology primarily relies on instrument measurement, with buoys and wave staffs being widely used for nearshore observations. Wave staff observation has advantages such as a simple structure and quick response, but it is difficult to apply in open sea areas and requires frequent maintenance. Buoy observation is easier to maintain but is prone to issues such as anchor dragging in harsh sea conditions. In recent years, the application of remote sensing radar in wave observation has been increasing. X-band radar, HF radar, lidar, and synthetic aperture radar (SAR) have significantly expanded the observable regions and achieve promising inversion accuracy. Zhu et al. [
2] used wave mode data from the GF-3 SAR combined with existing model algorithms to invert SWH; when compared to reanalysis data from the European Centre for Medium-Range Weather Forecasts (ECMWF), the inverted SWH achieved a root mean square error (RMSE) of 0.57 m. Klotz et al. [
3] employed the Ice, Cloud, and land Elevation Satellite-2 (ICESat-2) land surface algorithm to determine wave and wind characteristics, with the inverted SWH yielding an RMSE of 0.3 m compared to ERA5 reanalysis data. Liu et al. [
4] utilized ensemble empirical mode decomposition (EEMD) with X-band radar sea surface images to estimate SWH, achieving an RMSE of 0.36 m against buoy observations. Similarly, Zhao et al. [
5] employed high-frequency radar data to obtain wave information, where the RMSE between the inverted SWH and buoy data was found to be 0.29 m. However, satellite and radar observations often involve high costs, and their observational accuracy is limited by spatial resolution. Additionally, research on using binocular vision technology to invert wave surface information from wave images captured by binocular cameras has made some progress. However, this technique requires a complex calibration process to ensure inversion accuracy, and thus most research remains in the experimental phase [
6,
7,
8,
9]. With the development of deep learning in the field of computer vision, research that integrates video imagery with various deep learning techniques for image recognition and object detection has emerged across multiple domains, demonstrating the advantages of deep learning methods in terms of accuracy and computational efficiency [
10,
11]. Currently, research on inverting sea conditions using various types of wave images has made some progress. Compared to previous measurement methods, acquiring nearshore wave video images is more cost-effective, and when combined with artificial intelligence technology, it can produce more accurate and timely inversion results.
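As a concrete illustration of the SWH definition given above (the mean of the highest one-third of individual waves), the following minimal sketch computes SWH from a hypothetical wave-height record; the values and function name are illustrative only:

```python
import numpy as np

def significant_wave_height(heights):
    """SWH: mean of the highest one-third of individual wave heights."""
    h = np.sort(np.asarray(heights, dtype=float))[::-1]  # sort descending
    n_third = max(1, len(h) // 3)                        # highest one-third
    return h[:n_third].mean()

# Hypothetical record of individual wave heights (m)
record = [0.8, 1.2, 0.5, 2.1, 1.7, 0.9, 1.4, 0.6, 1.1]
print(round(significant_wave_height(record), 2))  # → 1.73
```

Here the three largest waves (2.1, 1.7, 1.4 m) are averaged, giving roughly 1.73 m.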
With the rapid development of computers and the continuous improvement in computational power, deep learning has gradually gained widespread attention. Deep neural networks are a significant branch of deep learning, enhancing the ability to extract features from targets by constructing more complex network structures. Compared to traditional neural networks, deep neural networks utilize multiple hidden layers to perform nonlinear transformations, making them more capable of handling complex environments and problems. Convolutional neural networks (CNNs) [
12,
13] are a type of deep learning model commonly used for image and video processing. CNNs have a better capability for handling image and sequential data because they can automatically learn features from images and extract the most useful information. Krizhevsky et al. [
14] enhanced the basic structure of a CNN by deepening the hierarchical structure and using the nonlinear activation function ReLU along with the Dropout method, resulting in the AlexNet model. The success of AlexNet greatly propelled the development of CNNs, encouraging other researchers to propose models like VGGNet [
15] and GoogLeNet [
16] to tackle more complex image recognition problems. Traditional convolutional or fully connected networks often suffer from information loss and degradation during transmission, as well as vanishing or exploding gradients, which make training very deep networks challenging. To address the issue where training accuracy saturates and then rapidly degrades as the network depth increases, He et al. [
17] introduced the deep residual network (ResNet). ResNet alleviates this problem by passing input information directly to the output through shortcut connections, preserving information integrity; the network then only needs to learn the residual between the input and output, which simplifies the learning target and reduces the training difficulty. Depending on the number of network layers, ResNet can be divided into ResNet18, ResNet34, ResNet50, ResNet101, and ResNet152. This study explores the capability of various deep learning models in extracting features from wave video images, including AlexNet, VGGNet, MobileNet [
18], DenseNet [
19], and ResNet. Additionally, the inclusion of attention mechanisms can enhance the ability of CNNs to capture and express important features. Attention mechanisms simulate the human brain’s focus on specific regions at particular moments, selectively acquiring more useful information while ignoring irrelevant data [
20]. The principle of attention mechanisms can thus be understood as the model focusing on the input information that is important for a specific task while ignoring less important information; performance is improved by weighting different parts of the input to adjust the model's focus accordingly. Currently, common attention mechanism modules include the SE module [
21], ECA module [
22], and CBAM module [
23]. Among these, the SE module and ECA module are channel attention mechanisms, while the CBAM module combines both channel and spatial attention mechanisms. Channel attention mechanisms focus on which channel features are more meaningful, whereas spatial attention mechanisms focus on which spatial parts are more significant [
24].
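To make the channel attention idea concrete, the following NumPy sketch implements an SE-style squeeze-and-excitation step on a single feature map. The shapes, reduction ratio, and random weights are illustrative assumptions, not the configuration of the modules used in this study:

```python
import numpy as np

rng = np.random.default_rng(0)

def se_attention(feature_map, w1, w2):
    """SE-style channel attention on a (C, H, W) feature map.
    Squeeze: global average pooling per channel.
    Excitation: two small dense layers producing per-channel weights in (0, 1).
    """
    squeeze = feature_map.mean(axis=(1, 2))        # (C,) channel descriptor
    hidden = np.maximum(0.0, w1 @ squeeze)         # ReLU bottleneck, (C//r,)
    scale = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))   # sigmoid gate, (C,)
    return feature_map * scale[:, None, None]      # reweight each channel

C, r = 8, 2                                        # channels, reduction ratio
x = rng.standard_normal((C, 16, 16))
w1 = rng.standard_normal((C // r, C)) * 0.1        # illustrative weights
w2 = rng.standard_normal((C, C // r)) * 0.1
y = se_attention(x, w1, w2)
print(y.shape)  # → (8, 16, 16)
```

Because the sigmoid gate lies in (0, 1), each channel is attenuated according to its learned importance while the spatial layout is left unchanged, which is the essence of a channel attention mechanism.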
Against the backdrop of the rapid development and application of deep learning and machine learning, deep learning algorithms have already been applied across various domains within oceanography. In wave forecasting, numerous studies have emerged that utilize CNNs, long short-term memory (LSTM) networks, and various optimized models for long-term prediction of wave parameters [
25,
26,
27,
28]. Additionally, there has been considerable research on predicting and inverting factors such as sea ice concentration and sea surface temperature using GPR data [
29], infrared remote sensing images [
30], and imaging data based on MWRI and SSMI [
31,
32].
In recent years, video images combined with deep learning and computer vision methods have shown great potential in wave information recognition and monitoring. Andriolo et al. [
33] developed two new methods for estimating breaking wave height based on Timex images. One method provides accurate breaking wave height estimates by integrating a series of video-derived parameters and the beach profile data, while the other does not require local water depth data, demonstrating cost-effectiveness and applicability across a wide range of field conditions. Scardino et al. [
34] proposed a monitoring technique that combines CNN and Optical Flow technology to assess tidal and storm parameters from video recordings. The results indicate that the system achieves good accuracy in wave flow and height monitoring in a low-cost manner. Additionally, Valentini et al. [
35] further demonstrated the application of a low-cost video surveillance system for wave monitoring. By combining a CNN with superpixel segmentation, they achieved precise classification of coastal images, showcasing its practicality in environmental monitoring. In another study, Andriolo [
36] proposed a technique for automatically extracting nearshore wave transformation domains from images captured by coastal video monitoring stations, demonstrating the method’s reliability and providing strong support for nearshore hydrodynamics and sediment transport research. In addition, research using video images as a data source has also been applied to visibility inversion and prediction. Hu et al. [
37] proposed a cloud image retrieval method for sea fog recognition using a dual-branch residual neural network, effectively identifying sea fog and low-level clouds. Additionally, several studies have used image processing and deep learning techniques to detect sea fog from various image data sources, including video images captured by cameras and multispectral images [
38,
39,
40].
Utilizing deep learning algorithms and wave images or videos for wave parameter inversion has emerged as a novel approach in recent years and has been employed by numerous researchers to enhance the inversion accuracy of various wave parameters. Xue et al. [
41] proposed a method for SWH inversion based on CNNs and Sentinel-1 SAR data, constructing an inversion dataset with over 3000 images. When compared with buoy measurements, the method achieved an RMSE of 0.32 m for the inverted SWH. Liu et al. [
42] utilized LeNet to classify waves using sea clutter data, demonstrating excellent wave height inversion capability on high-quality datasets. The method achieved an average accuracy of over 93% on the experimental dataset, and its generalization ability was validated using data from different periods and sea areas. However, this study lacked inversion results for actual wave height values. Choi et al. [
43] employed deep learning techniques to estimate significant wave height from a single ocean image, achieving an accuracy of 84% with their proposed classification model. Additionally, they introduced a regression model based on a convolutional long short-term memory (ConvLSTM) network to estimate continuous significant wave height from a sequence of ocean images, achieving a mean squared error of 0.02 m on their proposed dataset. Song et al. [
44] utilized images extracted from nearshore wave monitoring videos to capture both the static and dynamic features of waves. Two independent Network In Network (NIN) networks were constructed to learn the spatial and temporal characteristics of the waves, and the two types of features were fused using a central network to determine the SWH. The results achieved a relative error of 6.4% ± 4.9%, and the authors pointed out that this method can well meet the operational requirements for nearshore wave forecasting. It is therefore necessary to evaluate the performance of any proposed method against operational standards. In this study, the performance of the model was assessed based on China’s nearshore observation standards [
45]. The standard stipulates that the absolute error between instrument measurements and actual observed values should be maintained within 15% of the actual observed values. Gal et al. [
46] proposed a method to estimate the height of breaking waves within the breaking zones from video recordings, which in turn estimates nearshore wave height, achieving results comparable to buoy observations. Sun et al. [
47] conducted airborne interferometric radar altimeter experiments in the Yellow Sea; they used a mean filtering algorithm to invert sea surface height and its wave spectrum, achieving good results for swell but facing limitations in the application to wind waves. Jinah et al. [
48] used various deep learning methods and coastal video images to track wave propagation. Learning the behavior of transformed and propagated waves in the surf zone, they successfully estimated the instantaneous wave speed of each crest and breaking wave in the video domain. Yun-Ho et al. [
49] combined CNNs and LSTMs to classify sea state and average wave height from monocular ocean videos. These video-based learning methods achieved classification accuracies exceeding 93% but require substantial training time. To more comprehensively learn the spatial and temporal information of waves, some research proposed using three-dimensional convolutional networks for wave parameter monitoring, achieving good accuracy [
50]. Nonetheless, these approaches are still limited to using wave images alone to represent wave information. Overall, research on using nearshore video images for wave inversion is still in its early stages and mostly focuses on wave classification; there is relatively little research on inverting SWH from a regression perspective. Analysis of the current research status reveals that many methods in this field remain to be explored and improved.
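For reference, the operational criterion cited above (an absolute error within 15% of the observed value) reduces to a simple compliance-rate check. The sketch below uses hypothetical SWH samples; the function and variable names are our own illustrative choices, not taken from the standard:

```python
def compliance_rate(inverted, observed, tol=0.15):
    """Fraction of samples whose absolute error is within tol * observed."""
    pairs = list(zip(inverted, observed))
    ok = sum(abs(p - o) <= tol * o for p, o in pairs)
    return ok / len(pairs)

# Hypothetical SWH samples (m): inverted vs. buoy-observed
inv = [1.10, 0.95, 2.40, 0.60]
obs = [1.00, 1.00, 2.00, 0.50]
print(compliance_rate(inv, obs))  # → 0.5
```

In this toy batch the first two samples fall within 15% of the observed value and the last two do not, so half of the samples would meet the operational requirement.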
In summary, there is currently limited research on using deep learning algorithms and wave video images to achieve SWH inversion. In this study, a deep learning classification method for distinguishing wind waves and swell in nearshore instantaneous wave video images was first proposed. Subsequently, a deep learning regression method based on CNNs and a multilayer perceptron (MLP) was introduced to invert SWH from instantaneous wave video images together with meteorological and oceanographic observation data. Additionally, specialized models for wind waves and swell were trained on the wind wave and swell inversion datasets, respectively, and independent inversion was performed on the wind wave and swell samples in the test set. Finally, the impact of an improved loss function on the inversion accuracy was discussed. This paper is organized as follows:
Section 2 describes the materials and methods;
Section 3 presents the experimental results of wind wave and swell classification and SWH inversion, and discusses the impact of various factors on the inversion accuracy; and
Section 4 provides the main conclusions.
4. Conclusions
This study explored the feasibility of using deep learning techniques to extract wave features from instantaneous wave video images. First, various mainstream CNN models were compared, revealing that the ResNet architecture has a stronger ability to extract wave features from instantaneous wave images. Furthermore, the SE attention mechanism was incorporated into ResNet50, and, inspired by the design of the Swin Transformer network structure, improvements were made to the ResNet architecture, leading to the development of the ResNet-SW model, which is specifically designed to identify wave types from wave images. The results show that ResNet-SW can accurately classify wind waves and swell, even in the presence of a small amount of noise and label configuration bias, achieving a classification accuracy of 94.61%. Building on this, the effectiveness of using the ResNet-SW network structure to invert SWH from instantaneous wave video images was examined. The results showed that the method is effective for SWH inversion, but because it relies on the image quality of instantaneous wave images to represent wave information, the model’s stability is poor. To address these limitations, various meteorological and oceanographic factors were introduced to jointly represent wave information, and the Inversion-Net algorithm, which combines CNNs and MLP, was developed. Feature selection was conducted before model training to analyze the impact of various factors on the inversion process and to select suitable input features. The SWH inversion results obtained using this method showed an RMSE of 0.11 m compared to buoy observations. To further enhance the accuracy and stability of the inversion results, specialized models for wind waves and swell were trained on the wind wave and swell inversion datasets, respectively.
The results showed that this method effectively increased the proportion of samples meeting operational observation standards, with a CR reaching 84.07%. Additionally, we attempted to enhance the stability of the model by adding conditional constraints. The SWH observations from the nearshore and offshore buoys were fitted to derive a linear relationship, which was then embedded into the loss function. The results showed that this approach significantly improved the model’s stability and inversion accuracy. Finally, the model’s performance was evaluated over an entire wave process. In the application phase, the wind wave and swell classification models were first used to determine the wave type and assess whether the test samples met the constraints proposed in this study. Subsequently, the samples were input into the corresponding inversion models, and time series inversion results with a 1-hour resolution were synthesized from the outputs of multiple models. The results demonstrate that the method not only meets operational observation requirements but also maintains a low error margin.
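As an illustration of how such a fitted linear relationship might be embedded in a loss function, the following sketch adds a penalty term to the MSE. The functional form, coefficients, and weighting factor are our own assumptions for exposition and not the exact formulation used in this study:

```python
import numpy as np

def constrained_mse(pred, target, nearshore_ref, a, b, lam=0.1):
    """MSE plus a penalty tying predictions to a fitted linear relation
    offshore_SWH ~ a * nearshore_SWH + b (names and lam are illustrative)."""
    mse = np.mean((pred - target) ** 2)                       # data-fit term
    penalty = np.mean((pred - (a * nearshore_ref + b)) ** 2)  # constraint term
    return mse + lam * penalty

# Hypothetical fitted coefficients and a toy batch of SWH values (m)
a, b = 0.9, 0.05
pred = np.array([1.00, 1.60])
target = np.array([0.95, 1.70])
near = np.array([1.05, 1.80])
loss = constrained_mse(pred, target, near, a, b)
print(loss >= 0.0)  # → True
```

The penalty discourages predictions that stray far from the linear relation derived from the buoy pair, which is one plausible way a conditional constraint of this kind can stabilize training.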
In summary, deep learning techniques can effectively extract wave characteristics from instantaneous wave video images and use them for wave type classification and SWH inversion. Additionally, integrating multiple factors and improving the training process can significantly enhance the inversion accuracy and model stability. Nevertheless, the current study has certain limitations, and these shortcomings point to directions for future research.