Article

A Multi-Spatial-Scale Ocean Sound Speed Profile Prediction Model Based on a Spatio-Temporal Attention Mechanism

1 State Key Laboratory of Submarine Geoscience and Second Institute of Oceanography, Ministry of Natural Resources, Hangzhou 310012, China
2 School of Oceanography, Shanghai Jiao Tong University, Shanghai 200030, China
3 College of Geodesy and Geomatics, Shandong University of Science and Technology, Qingdao 266590, China
4 Department of Military Oceanography and Hydrography & Cartography, Dalian Naval Academy, Dalian 116000, China
* Authors to whom correspondence should be addressed.
J. Mar. Sci. Eng. 2025, 13(4), 722; https://doi.org/10.3390/jmse13040722
Submission received: 5 March 2025 / Revised: 29 March 2025 / Accepted: 2 April 2025 / Published: 3 April 2025
(This article belongs to the Special Issue Underwater Acoustic Field Modulation Technology)

Abstract

Marine researchers rely heavily on ocean sound velocity, a crucial hydroacoustic environmental metric that exhibits large geographical and temporal changes. Nowadays, spatio-temporal series prediction algorithms are emerging, but their prediction accuracy requires improvement. Moreover, in terms of ocean sound speed, most of these models predict an ocean sound speed profile (SSP) at a single coordinate position, and only a few predict multi-spatial-scale SSPs. Hence, this paper proposes a new data-driven method called STA-Conv-LSTM that combines convolutional long short-term memory (Conv-LSTM) and spatio-temporal attention (STA) to predict SSPs. We used a 234-month dataset of monthly mean sound speeds in the eastern Pacific Ocean from January 2004 to June 2023 to train the prediction model. We found that using 24 months of SSPs as the inputs to predict the SSPs of the following month yielded the highest accuracy. The results demonstrate that STA-Conv-LSTM can achieve predictions with an accuracy of more than 95% for both single-point and three-dimensional scenarios. We compared it against recurrent neural network, LSTM, and Conv-LSTM models with optimal parameter settings to demonstrate the model’s superiority. With a fitting accuracy of 95.12% and the lowest root-mean-squared error of 0.8978, STA-Conv-LSTM clearly outperformed the competition with respect to prediction accuracy and stability. This model not only predicts SSPs well but also will improve the spatial and temporal forecasts of other marine environmental factors.

1. Introduction

The measurement of hydrographic elements, hydroacoustic communications, and the study of marine resources are just a few marine applications that make extensive use of acoustic waves, the primary channel for underwater information transmission [1,2]. Accurate measurements of the properties of a maritime environment related to the propagation of sound waves are required to advance the development of these applications. One such parameter that aids in determining the properties of sound propagation is the speed of sound in saltwater [3,4]. The speed of sound is affected primarily by changes in temperature; however, it is also affected by salinity and pressure in saltwater [5,6]. Because of this, the dynamic ocean environment exhibits a spatial and temporal variation in the speed of sound. Because of the ocean’s vertical stratification, the structure of sound velocity is similarly vertically stratified [3]. Waves, internal waves, currents, and seasonal changes are only a few of the ocean’s physical processes that can alter the ocean ecosystem, both temporarily and permanently. The speed of sound can vary in both space and time due to the complicated interplay of several periodic physical processes [7].
Ocean sound velocity is an essential and fundamental parameter for multibeam seafloor topography sounding: it is used directly to correct for sound ray refraction and can therefore affect the accuracy of bathymetry [8,9,10]. In current ocean research, real-time sound speed information is mainly derived from in situ measurements of the sound speed profile (SSP), which captures changes in sound speed from the water surface to the seabed [11]. A marine SSP can be measured in two different ways. One is to use expensive and inefficient sound velocity profilers (e.g., SV Plus V2 and AQSV-1500) to measure the time it takes for sound signals to travel a known distance through seawater. The other is to determine the SSP indirectly from measured temperature and salinity profiles, using devices such as Argo buoys, conductivity–temperature–depth (CTD) instruments, and expendable bathythermographs. Because of their inefficiency and high expense, traditional at-sea observations only give sparse point-by-point SSPs. Raw-data-based methods for predicting and inverting the speed of sound have been extensively researched over the past several decades, thanks to the proliferation of various observation technologies.
Many SSP inversion methods have been developed since the advent of ocean acoustic tomography by Munk and Wunsch [12], which has been an essential tool in ocean research. These approaches include matched acoustic peak arrivals [13] and matched field validation methods [14,15]. Ocean forecasting issues have been addressed by Kalman filtering, an optimization approach for state estimation [16,17]. A promising new field of study, compressive sensing in acoustics, has garnered considerable interest in the past decade [18]. Instead of employing sparse representations of an infinite number of SSPs, compressive sensing inversion approaches efficiently estimate fine-scale SSPs [19,20]. Machine learning has also been shown to be a useful tool for solving problems in ocean science, opening new avenues for using data-driven approaches in marine environment forecasting [21,22,23]. Predicting SSPs in a three-dimensional (3D) marine area using the convolutional long short-term memory (Conv-LSTM) model, which is based on deep learning, has an average prediction error of less than 1.7 m∙s−1 [24].
Ocean sound speed analysis and prediction techniques have been the subject of substantial research for the past forty years. The complexity of the marine environment makes the precise prediction of ocean SSPs a considerable problem. The use of computationally intensive and inefficient ocean numerical simulations has long been the backbone of spatio-temporal prediction approaches to address changes in the marine environment. Indeed, there is already a wealth of useful knowledge regarding the ocean’s internal dynamics and external forces contained in the time series of recorded ocean data [25,26]. To efficiently extract the data’s intrinsic features and the rules of physics, deep-learning models can learn from massive datasets [27]. Researchers in the field of ocean prediction have found success using deep learning, an influential and widely used technique [28,29,30,31].
Machine learning applications in the field of acoustics have advanced rapidly in recent years, offering new ideas and methods for ocean sound speed prediction. For instance, Yu et al. [32] proposed a sound speed inversion method using radial basis function neural networks. Zhang et al. [33] introduced a prediction model based on LSTM neural networks for sea surface temperature forecasting, achieving higher accuracy than that of traditional regression methods. Ali A et al. [34] compared the effects of deep learning and traditional statistical methods on the prediction of sea surface temperature, significant wave heights, and other marine parameters. The results show that the prediction performance of the deep learning model is much better than that of the statistical model. Li et al. [24] developed a marine sound speed model based on Conv-LSTM that can capture the temporal and spatial characteristics of historical data. Ou et al. [35,36] proposed an SSP inversion algorithm based on a comprehensive learning model using random forest and a method for reconstructing SSPs using the extreme gradient boosting model. Piao et al. [37] proposed an orthogonal representation of SSPs considering background field variations. Based on the statistical characteristics of time-series SSPs, high-precision SSP prediction was realized by LSTM. Wu et al. [38] introduced a data-fusion-driven multi-input multi-output convolutional regression neural network, integrating satellite-based real-time remote sensing of sea surface temperatures, historical SSP feature vectors, and corresponding spatial coordinate information. The model eliminates the dependence on sonar observation data and can be applied in a wider spatial region. Gao et al. [39] proposed a round-by-round training approach to avoid being trapped in poor local optima. The results indicate that the proposed Neural ODE sound speed forecasting model is more effective in long-term forecasting than traditional models and can accurately predict sound speed at any time.
These relevant studies have demonstrated promising prediction results, thereby validating the applicability of machine learning methods. However, these approaches fall short in adequately accounting for multi-spatial coupling effects and spatio-temporal weights, which adversely impact predictive effectiveness.
Conv-LSTM is a specialized neural network architecture integrating LSTM’s ability to process time-series data and convolutional operations for extracting spatial features [40]. When handling time-series data, it retains long-term dependencies and responds dynamically to sequence changes [41]. This characteristic makes it effective in tasks like weather forecasting, traffic flow prediction, and demand analysis, which require consideration of the temporal dimension [42,43]. The spatio-temporal attention mechanism focuses on significant features in both dimensions, helping the model identify changes in research objects at different times and locations [44]. By assigning weights, the model can selectively emphasize critical input data to enhance performance and effectiveness [45]. Incorporating this mechanism into Conv-LSTM enhances its focus on key information. It enables a more accurate identification of historical information relevant to the current task when processing long sequences of sound speed data, improves the ability to capture long-term dependencies, and reduces the impact of noise and irrelevant information to enhance the model’s accuracy and robustness.
To investigate the multi-spatial-scale interactions of sound speed and achieve accurate predictions with limited data, we propose the STA-Conv-LSTM framework along with a multi-spatial-scale sound speed prediction method that integrates the coupling of spatial structures. This approach entails processing initial data through Conv-LSTM and subsequently passing the results to spatio-temporal attention modules to extract relevant feature information. The outputs from both Conv-LSTM and spatio-temporal attention modules are then concatenated to facilitate the integration of original features with attention-weighted features. It is important to note that when predicting sound speed in multi-spatial-scale structures, the design of the STA-Conv-LSTM adapts according to variations in data dimensions.
The contributions of this paper can be summarized as follows:
  • To address the inadequacies in accounting for multi-spatial coupling effects and spatio-temporal weights in ocean sound speed prediction, we introduced the STA-Conv-LSTM framework along with a multi-spatial-scale sound speed prediction method that integrates spatial structure coupling.
  • To validate the efficacy of STA-Conv-LSTM, we conducted experiments to assess the model’s accuracy in predicting ocean sound speed using the BOA_Argo dataset.
The remainder of the paper is structured as follows. In Section 2, we introduce the sources and preprocessing of the SSP dataset, followed by an explanation of the neural network architecture of the STA-Conv-LSTM prediction model. In Section 3, we compare the prediction results of the STA-Conv-LSTM model at different spatial scales with those of other models. Section 4 offers an analytical discussion that thoroughly evaluates the feasibility and effectiveness of the STA-Conv-LSTM model. Finally, the conclusions are drawn in Section 5.

2. Materials and Methods

2.1. Data

We used the Global Ocean 3D Gridded Dataset (BOA_Argo) [46] to compile our data for this investigation. This dataset uses a horizontal resolution of 1° × 1° in the global ocean range (180° W–180° E, 80° S–80° N), and its vertical resolution is described as follows. One layer is present every 10 m from 0 m to 180 m, one layer every 20 m from 180 m to 500 m, one layer every 50 m from 500 m to 1300 m, and one layer every 100 m from 1300 m to 2000 m, for a grand total of 58 layers. The variables used in the monthly average file are as follows: “lon” for longitude (360 × 160), “lat” for latitude (360 × 160), “pres” for pressure (58), “temp” for temperature (360 × 160 × 58), “salt” for salinity (360 × 160 × 58), “salt_scatter_error” for salinity scattering error (360 × 160 × 58), “mld_t” for isothermal layer depth (360 × 160), “mld_dens” for mixed layer depth (360 × 160), and “mld_composed” for synthetic mixed layer depth (360 × 160). This study uses the variables salinity, temperature, pressure, longitude, and latitude. All data used for the experiment were taken from the East Pacific coordinate position (24.5° N, 169.5° E) and the region (15.5° N–34.5° N, 160.5° E–179.5° E), as indicated in Figure 1. The dataset spans 234 months, from January 2004 to June 2023, and covers a depth range of 0–2000 m.
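For readers who want to reproduce the data handling, the short Python sketch below opens one monthly BOA_Argo NetCDF file with xarray and inspects the variables listed above; the file name is hypothetical, and the exact coordinate layout of the files should be checked against the dataset documentation.

```python
import xarray as xr

# Hypothetical file name; actual BOA_Argo monthly files may be named differently.
ds = xr.open_dataset("BOA_Argo_2023_06.nc")

# Variables described above: temperature, salinity, pressure levels, lon/lat.
print(ds["temp"].shape, ds["salt"].shape, ds["pres"].shape)
print(ds["lon"].values.min(), ds["lon"].values.max())
print(ds["lat"].values.min(), ds["lat"].values.max())
```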

2.2. Dataset Preprocessing

Before designing and optimizing the algorithm, it was necessary to preprocess and divide the dataset into different spatial scales. As noted above, the dataset used here was the BOA_Argo warm salt dataset. As an example, when predicting the SSP for a single coordinate position, the original dataset must be converted into a size that can be input to the STA-Conv-LSTM model. The steps are as follows:
  1. Data cropping: The ocean temperature and salinity data of the BOA_Argo dataset from January 2004 to June 2023 were downloaded and cropped to the study area using Python 3.11. Figure 1 shows that the study region extended from latitude 15.5° N to 34.5° N and from longitude 160.5° E to 179.5° E. The data originated from an area in the eastern Pacific Ocean that is well known for its marine habitat and its highly variable climate.
  2. Determining the speed of sound: There is a fixed quantitative relationship between the speed of sound in seawater and the temperature, salinity, and water depth. After converting pressure to a vertically oriented water depth value using the pressure-to-depth conversion method described by Saunders [31], the speed of sound at each point was computed using the following simplified empirical formula derived from Del Grosso [47]:
C = 1449.2 + 4.6T - 0.055T^{2} + 2.9 \times 10^{-5}T^{3} + (1.34 - 0.01T)(S - 35) + 0.016D \quad (1)
In this context, C, T, S, and D denote the sound speed, temperature, salinity, and depth of water at a certain position, respectively. For the coordinate position (24.5° N, 169.5° E) in the chosen area, Figure 2 shows the complete temperature–salinity profiles from 2004 to 2023 as well as the computed SSPs.
  3. Data partitioning: The sound velocity data were formatted as [191, 58, 1] for a single coordinate position (24.5° N, 169.5° E). In a specific area (15.5° N–34.5° N, 160.5° E–179.5° E), the data format for the SSP dataset was [191, 20, 20, 58]. The procedure for splitting the time-series data into training and validation datasets is illustrated in Figure 3. The training subset underwent a 4-fold cross-validation in which 25% of it was used as the validation set. As a result, the overall ratio of the training, validation, and test sets was 3:1:1.
  4. Data normalization: The input data must be linearly transformed to ensure that they are distributed within a specific range. This process helps balance the weights among different features and enhances both the training effectiveness and the generalization ability of the model. In our study, we employed min–max normalization, which scales all training data to the range of [0, 1]. The calculation for this normalization is as follows:
x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \quad (2)
where x' is the normalized value, which ranges from 0 to 1, x is the original data value, and x_{\max} and x_{\min} are the maximum and minimum values of the data, respectively.
  5. Slide sampling: We performed sliding-window sampling on the normalized SSP data using a window size of 32 and step sizes of 1, 6, 12, 18, 24, and 28. A minimal code sketch of steps 2, 4, and 5 is given after this list.
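To make steps 2, 4, and 5 concrete, the following Python sketch computes the sound speed with Equation (1), applies the min–max normalization of Equation (2), and builds sliding-window samples. It is an illustrative reconstruction under our own assumptions (a 24-month input window paired with the following month as the target, and placeholder arrays), not the authors' released preprocessing code.

```python
import numpy as np

def sound_speed(T, S, D):
    """Simplified empirical sound speed formula of Equation (1).
    T: temperature (deg C), S: salinity (PSU), D: depth (m)."""
    return (1449.2 + 4.6 * T - 0.055 * T**2 + 2.9e-5 * T**3
            + (1.34 - 0.01 * T) * (S - 35.0) + 0.016 * D)

def min_max_normalize(x):
    """Scale an array to [0, 1] as in Equation (2); return the extrema
    so predictions can later be mapped back to m/s."""
    x_min, x_max = x.min(), x.max()
    return (x - x_min) / (x_max - x_min), x_min, x_max

def sliding_windows(series, window=24, step=1):
    """Pair each `window`-month slice with the following month as the target."""
    xs, ys = [], []
    for start in range(0, len(series) - window, step):
        xs.append(series[start:start + window])
        ys.append(series[start + window])
    return np.stack(xs), np.stack(ys)

# Placeholder single-point example: 234 months x 58 depth layers.
T = 25.0 * np.random.rand(234, 58)          # placeholder temperature field
S = 34.0 + np.random.rand(234, 58)          # placeholder salinity field
D = np.linspace(0.0, 2000.0, 58)            # nominal layer depths in metres
ssp = sound_speed(T, S, D)                  # D broadcasts across the months
ssp_norm, c_min, c_max = min_max_normalize(ssp)
X, y = sliding_windows(ssp_norm, window=24, step=1)   # X: (n, 24, 58), y: (n, 58)
```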

2.3. Conv-LSTM Model

LSTM is a type of recurrent neural network (RNN) designed for time-series problems, and one of its applications is sound speed time-series prediction. LSTMs excel at resolving issues with long-term dependencies across time series [48]. However, it can be challenging for fully connected LSTM networks to accurately capture spatial information when making spatio-temporal predictions. Shi et al. [49] proposed the Conv-LSTM network as a solution to this problem; by incorporating convolutional operations into LSTM, it can capture structure in both the spatial and temporal domains, which allowed them to achieve their goal of neighborhood precipitation forecasting.
Conv-LSTM is a neural network that integrates convolutional neural networks with LSTM [50]. It introduces convolutional operations to traditional LSTMs, enabling the model to simultaneously process both sequential and spatial information. In Conv-LSTM, the computations for input, forget, and output gates, as well as the updating of cell states and hidden states, are all performed through convolutional operations. This architecture allows Conv-LSTM to capture local spatial features in the input data while preserving the long-term dependency processing capabilities inherent in LSTM [51].
Compared with traditional LSTMs, a key characteristic of Conv-LSTM is its replacement of fully connected operations with convolutional ones [52]. This modification ensures that when computing gated information and state updates, Conv-LSTM retains the spatial structure of the input data. Furthermore, because of the translation invariance afforded by convolutional operations, Conv-LSTM can more effectively manage spatial information and local features. Consequently, it demonstrates superior performance when processing sequential data with an underlying spatial structure [53]. As illustrated in Figure 4, during the training phase, Conv-LSTM receives a series of input data characterized by spatial features and learns spatio-temporal dependencies among these inputs [54]. During the prediction phase, given one or more initial inputs, Conv-LSTM generates predictions for future matrix states based on learned spatio-temporal characteristics.
The computational process in the Conv-LSTM network cell, as shown in Figure 4, is divided into five main steps:
(1) Determine the information to be filtered. The data transmitted from the previous cell are filtered by convolution in the forget gate as follows:
f_t = \sigma_f \left( W_{xf} * x_t + W_{hf} * h_{t-1} + W_{cf} * c_{t-1} + b_f \right) \quad (3)
Here, the input of the current cell x_t and the output of the previous layer h_{t-1} are multiplied by their respective weight coefficients W_{xf} and W_{hf}, while the state of the previous cell c_{t-1} of the current layer is convolved with its weight coefficient W_{cf}. In Equation (3), "*" denotes the convolution operation. The computed results are added to the bias vector b_f of this step. The final value f_t is the output of an activation function, e.g., the sigmoid activation function \sigma_f, and takes a value between 0 and 1, where 0 means that no information passes, and 1 means that all the information passes.
(2) Determine the information that must be retained for storage in the current cell using input gates similar to those used in LSTM networks, as follows:
i_t = \sigma_i \left( W_{xi} * x_t + W_{hi} * h_{t-1} + W_{ci} * c_{t-1} + b_i \right) \quad (4)
This step is roughly the same as the step for the forget gate. Here, the input data of the cell x_t and the output data of the previous layer h_{t-1} are multiplied by their respective input weight coefficients W_{xi} and W_{hi}. Moreover, the state of the previous cell c_{t-1} is convolved with the input weight coefficient W_{ci}, and the bias of the present step b_i is added; passing the sum through the activation function \sigma_i gives the final result of the input gate i_t.
(3) Calculate the current cell state, which is determined by both the forget and the input gates, as follows:
c_t = f_t \circ c_{t-1} + i_t \circ \tanh\left( W_{xc} * x_t + W_{hc} * h_{t-1} + b_c \right) \quad (5)
In this formula, the current cell state c_t consists of two parts. The first is the element-wise product of the forget gate output f_t with the previous cell state c_{t-1}; that is, the data are filtered through the forget gate. The second is the candidate state, obtained by convolving the input data x_t and the previous output h_{t-1} with the state weight coefficients W_{xc} and W_{hc}, adding the bias b_c, and applying the activation function \tanh; this candidate is weighted element-wise by the input gate result i_t, i.e., it is the part of the current cell value whose storage is determined by the input gate. Here, "\circ" denotes the element-wise (Hadamard) product. This step yields the current cell state c_t, which is also passed on through the network to serve as c_{t-1} in the next cell calculation.
(4) The output gate calculation then derives the output data of the current cell from the cell state just obtained, as follows:
o_t = \sigma_o \left( W_{xo} * x_t + W_{ho} * h_{t-1} + W_{co} * c_t + b_o \right) \quad (6)
In the formula above, the input data of the current cell x_t and the output data of the previous layer h_{t-1} are multiplied by their respective output weight coefficients W_{xo} and W_{ho}, the current cell state c_t (obtained in the previous step) is convolved with its output weight coefficient W_{co}, all the computed results are added to the bias vector b_o of the output gate, and the sum is passed through the activation function \sigma_o to give the final result of the output gate o_t.
(5) The output data of the current cell must be rescaled before they can be passed to the next layer in the multilayer neural network, as follows:
h_t = o_t \circ \tanh\left( c_t + b \right) \quad (7)
In this formula, the current cell state c_t is added to the bias b, activated by the \tanh function, and then multiplied element-wise by the output o_t of the output gate, which normalizes the final output to the range (−1, 1). This produces the final output h_t, which is passed to the next level of the cell. A bias vector b is added in all calculation steps.
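To make Equations (3)-(7) concrete, the hedged TensorFlow sketch below performs a single Conv-LSTM step from scratch. The weight and bias dictionary layout, kernel shapes, and function name are our own conventions for illustration; in practice, frameworks provide this cell directly (e.g., a Keras ConvLSTM2D layer), and implementations may apply the peephole terms as element-wise products rather than convolutions.

```python
import tensorflow as tf

def conv_lstm_step(x_t, h_prev, c_prev, W, b):
    """One Conv-LSTM step following Equations (3)-(7).
    x_t, h_prev, c_prev: tensors of shape (batch, height, width, channels).
    W: dict of convolution kernels (shape [k, k, in_channels, out_channels]);
    b: dict of bias tensors. The key names ('xf', 'hf', ...) are our convention."""
    conv = lambda z, w: tf.nn.conv2d(z, w, strides=1, padding="SAME")

    # Forget gate, Equation (3)
    f_t = tf.sigmoid(conv(x_t, W["xf"]) + conv(h_prev, W["hf"])
                     + conv(c_prev, W["cf"]) + b["f"])
    # Input gate, Equation (4)
    i_t = tf.sigmoid(conv(x_t, W["xi"]) + conv(h_prev, W["hi"])
                     + conv(c_prev, W["ci"]) + b["i"])
    # Cell state update, Equation (5)
    c_t = f_t * c_prev + i_t * tf.tanh(conv(x_t, W["xc"])
                                       + conv(h_prev, W["hc"]) + b["c"])
    # Output gate, Equation (6)
    o_t = tf.sigmoid(conv(x_t, W["xo"]) + conv(h_prev, W["ho"])
                     + conv(c_t, W["co"]) + b["o"])
    # Hidden state, Equation (7)
    h_t = o_t * tf.tanh(c_t + b["h"])
    return h_t, c_t
```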

2.4. STA-Conv-LSTM Model

Because it considers both spatial correlation and temporal fluctuation in the data, the Conv-LSTM network is a substantial step forward in the field of spatio-temporal prediction. However, because every time step and spatial location is treated as equally important, its prediction accuracy is limited, even though it lays the groundwork for further research in this area. Hence, as shown in Figure 5, we propose the STA-Conv-LSTM model, which incorporates an STA mechanism that assigns different weights to different time steps and spatial locations, thus improving the prediction accuracy for SSPs. The layers of the STA-Conv-LSTM model are briefly described below using the example of an SSP prediction for a single coordinate position:
  • The input layer receives the raw SSP sequence data and passes it to the subsequent layers. The input shape is (samples, time, height, width, channels), which specifies the number of samples, the number of time steps, the height of the input 2D matrix, the width of the input 2D matrix, and the number of channels, respectively.
  • The input layer is followed by two Conv-LSTM layers. The first Conv-LSTM layer uses 64 filters and a 7 × 7 convolution kernel. By convolving in time and space, this layer captures local features and temporal correlations of the input data. Nonlinearities are introduced to the model using the ReLU activation function, “padding” is set to “same” to keep the output size the same as the input size, and “return_sequences” is set to “True” to retain the output at all time steps. The second Conv-LSTM layer is similar to the first Conv-LSTM layer. This layer also uses 64 filters and 7 × 7 convolution kernels. It further extracts features from the input data to enhance the model’s ability to capture temporal and spatial information.
  • The output of the Conv-LSTM layer is input to the temporal attention module to focus on the importance of different time steps.
  • A spatial attention module is attached to the temporal attention module to better capture information about key spatial locations.
  • After the spatio-temporal features have been extracted, a concatenate layer is used to stitch together the original Conv-LSTM layer output and the output of the spatial attention module along the channel dimension. The purpose of this is to combine the original features with the attention-weighted features, so that the model can both retain the original information and capture the key information using the attention mechanism.
  • Finally, a 2D convolutional layer with one filter and a 7 × 7 convolutional kernel is used to map the spliced feature map into an SSP prediction as an output layer. The activation function of the output layer is linear by default, and the output value directly represents the prediction result. Here, “padding” is set to “same” to keep the output size the same as the input size, and “data_format” is set to “channels_last” to retain the order of the channels in the input data.
Unlike the SSP prediction for a single coordinate position, the SSP prediction for a 3D region uses a 6D tensor, which must be weighted and summed over the three dimensions of depth, latitude, and longitude; that is, one more dimension than is used in the SSP prediction for a single coordinate position. In addition, when performing 3D spatial SSP prediction, Conv-LSTM2D must be replaced by Conv-LSTM3D, and Conv2D must be replaced by Conv3D. The temporal and spatial attention modules included in this version of the algorithm are described in detail below.
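To show how these pieces fit together, the following hedged Keras sketch assembles the single-point model described above. The input shape, the identity placeholder attention helpers (fuller sketches follow Figures 6 and 7 below), and the way the time dimension is collapsed before the output convolution are our own illustrative assumptions rather than the authors' released implementation.

```python
from tensorflow.keras import layers, models

# Identity placeholders so this sketch runs on its own; fuller attention
# sketches are given after Figures 6 and 7 below.
def temporal_attention(x):
    return x

def spatial_attention(x):
    return x

def build_sta_conv_lstm(time_steps=24, height=58, width=1, channels=1):
    """Illustrative single-point STA-Conv-LSTM layout; shapes are assumptions."""
    inputs = layers.Input(shape=(time_steps, height, width, channels))

    # Two Conv-LSTM layers: 64 filters, 7x7 kernels, outputs kept per time step.
    x = layers.ConvLSTM2D(64, (7, 7), padding="same", activation="relu",
                          return_sequences=True)(inputs)
    x = layers.ConvLSTM2D(64, (7, 7), padding="same", activation="relu",
                          return_sequences=True)(x)

    # Spatio-temporal attention: weight time steps, then spatial locations.
    att = spatial_attention(temporal_attention(x))

    # Concatenate original and attention-weighted features along channels.
    merged = layers.Concatenate(axis=-1)([x, att])

    # Collapse the time dimension (our assumption: keep the last time step)
    # and map to the predicted profile with a 7x7 output convolution.
    last = layers.Lambda(lambda t: t[:, -1])(merged)
    outputs = layers.Conv2D(1, (7, 7), padding="same",
                            data_format="channels_last")(last)
    return models.Model(inputs, outputs)
```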
The structure of the temporal attention module is shown in Figure 6. First, the features of different time steps are extracted by a 1 × 1 convolutional layer using the following formula:
F_t = \mathrm{Conv2D}\left( x_t; W_f, b_f \right) \quad (8)
where x_t is the input feature at time step t, and W_f and b_f are the convolutional kernel weights and bias, respectively. Next, a fully connected layer with a sigmoid activation function is used to compute the gating factor, which determines the importance of the temporal attention weights, according to
G_t = \mathrm{Dense}\left( F_t; W_g, b_g \right) \quad (9)
where W_g and b_g are the weights and bias of the fully connected layer, respectively. The importance of each time step is then obtained by computing the attention weights through a 1 × 1 convolutional layer and a sigmoid activation function as follows:
\alpha_t = \mathrm{sigmoid}\left( \mathrm{Conv2D}\left( F_t; W_{\alpha}, b_{\alpha} \right) \right) \quad (10)
where W_{\alpha} and b_{\alpha} are the convolution kernel weights and bias, respectively. Next, the attention weights are normalized so that they sum to 1:
\alpha_t' = \frac{\alpha_t}{\sum_{t} \alpha_t} \quad (11)
where \sum_{t} \alpha_t denotes the sum of the weights over all time steps. Finally, the normalized attention weights are multiplied element by element by the input features, thereby enhancing the influence of the critical time steps. At the same time, the gating factor is multiplied element by element by the attention-weighted features to control the extent to which attentional information is passed on:
O_t = x_t * \alpha_t' * G_t \quad (12)
Here, "*" denotes element-by-element multiplication.
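The following Python sketch implements the temporal attention module of Equations (8)-(12) with Keras layers. Treat it as an illustrative reading of the figure: the choice of a single output channel for F_t, the small epsilon in the normalization, and the use of TimeDistributed wrappers are our assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

def temporal_attention(x):
    """Temporal attention over a Conv-LSTM feature sequence.
    x: tensor of shape (batch, time, height, width, channels)."""
    # Equation (8): 1x1 convolution extracting a feature map per time step.
    F = layers.TimeDistributed(layers.Conv2D(1, (1, 1)))(x)
    # Equation (9): gating factor from a dense layer with sigmoid activation.
    G = layers.Dense(1, activation="sigmoid")(F)
    # Equation (10): attention scores via a 1x1 convolution and sigmoid.
    alpha = layers.TimeDistributed(
        layers.Conv2D(1, (1, 1), activation="sigmoid"))(F)
    # Equation (11): normalize the weights over the time axis to sum to 1.
    alpha = layers.Lambda(
        lambda a: a / (tf.reduce_sum(a, axis=1, keepdims=True) + 1e-8))(alpha)
    # Equation (12): element-wise weighting of the input, modulated by the gate.
    return layers.Lambda(lambda t: t[0] * t[1] * t[2])([x, alpha, G])
```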
Figure 7 shows the spatial attention module's architecture, which is structurally similar to that of the temporal attention module; however, the input features, convolution kernel size, and attention weight computation differ. Features are first extracted from the various spatial locations by an initial 7 × 7 convolutional layer as follows:
F_{ij} = \mathrm{Conv2D}\left( x_{ij}; W_s, b_s \right) \quad (13)
where x_{ij} is the element of the input feature matrix in row i and column j, and W_s and b_s are the convolutional kernel weights and bias, respectively. Next, a fully connected layer with a sigmoid activation function is used to compute the gating factor, which determines the importance of the spatial attention weights. The specific formula is
G_{ij} = \mathrm{Dense}\left( F_{ij}; W_{gs}, b_{gs} \right) \quad (14)
where W_{gs} and b_{gs} are the weights and bias of the fully connected layer, respectively. The importance of each spatial location is then obtained by calculating the attention weights through a 7 × 7 convolutional layer and a sigmoid activation function as follows:
\beta_{ij} = \mathrm{sigmoid}\left( \mathrm{Conv2D}\left( F_{ij}; W_{\beta}, b_{\beta} \right) \right) \quad (15)
where W_{\beta} and b_{\beta} are the convolution kernel weights and bias, respectively. The attention weights are then normalized so that they sum to 1:
\beta_{ij}' = \frac{\beta_{ij}}{\sum_{ij} \beta_{ij}} \quad (16)
where \sum_{ij} \beta_{ij} denotes the sum of the weights over all spatial locations. Finally, the normalized attention weights are multiplied element by element by the output features of the temporal attention layer to enhance the influence of key spatial locations. Simultaneously, the gating factor is multiplied element by element by the attention-weighted features to control the degree to which attentional information is transmitted:
O_{ij} = x_{ij} * \beta_{ij}' * G_{ij} \quad (17)
Again, "*" denotes element-by-element multiplication.
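Analogously, the hedged sketch below implements the spatial attention module of Equations (13)-(17); the 7 × 7 kernels follow the description above, while the single-channel score maps and the epsilon are again our assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

def spatial_attention(x):
    """Spatial attention over the temporal attention output.
    x: tensor of shape (batch, time, height, width, channels)."""
    # Equation (13): 7x7 convolution extracting features per spatial location.
    F = layers.TimeDistributed(layers.Conv2D(1, (7, 7), padding="same"))(x)
    # Equation (14): gating factor from a dense layer with sigmoid activation.
    G = layers.Dense(1, activation="sigmoid")(F)
    # Equation (15): attention scores via a 7x7 convolution and sigmoid.
    beta = layers.TimeDistributed(
        layers.Conv2D(1, (7, 7), padding="same", activation="sigmoid"))(F)
    # Equation (16): normalize the weights over the spatial axes to sum to 1.
    beta = layers.Lambda(
        lambda w: w / (tf.reduce_sum(w, axis=[2, 3], keepdims=True) + 1e-8))(beta)
    # Equation (17): element-wise weighting of the input, modulated by the gate.
    return layers.Lambda(lambda t: t[0] * t[1] * t[2])([x, beta, G])
```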

2.5. Evaluation Indicators

In this research, we present experimental results that used three assessment indices, namely the relative error (RE), fitting accuracy (ACC), and root-mean-squared error (RMSE), to compare the prediction ability of several models. The following equations define each index:
\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{n} \left( x_{a,i} - x_{p,i} \right)^{2}}{n}} \quad (18)
\mathrm{ACC} = 1 - \frac{1}{n} \sum_{i=1}^{n} \frac{\left| x_{a,i} - x_{p,i} \right|}{x_{a,i}} \quad (19)
\mathrm{RE} = \frac{1}{n} \sum_{i=1}^{n} \left( x_{p,i} - x_{a,i} \right) \quad (20)
In these equations, x_{p,i} denotes the predicted sound velocity value in the predicted sequence, x_{a,i} denotes the sound velocity value in the actual sound velocity profile at the corresponding moment, and n is the number of layers into which the sound velocity profile is divided. In the experiments, the RMSE is used to quantify the absolute error, the RE is used to determine the average deviation from the actual values, and the ACC is used to determine the accuracy relative to the actual data. Small RMSE and RE values indicate that the model is performing well in terms of making predictions, whereas a high ACC value indicates that the predicted profile is more in line with the actual one. For the prediction of a single coordinate point, the average RMSE, ACC, and RE over all 58 water layers were used.
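As a compact reference, the hedged Python sketch below evaluates Equations (18)-(20) for one predicted profile; the function name and the array-based interface are our own. For a predicted SSP with 58 layers, `actual` and `predicted` would each hold 58 sound speed values in m/s.

```python
import numpy as np

def evaluation_metrics(actual, predicted):
    """RMSE, ACC, and RE of Equations (18)-(20) over the n profile layers."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    n = actual.size
    rmse = np.sqrt(np.sum((actual - predicted) ** 2) / n)          # Eq. (18)
    acc = 1.0 - np.sum(np.abs(actual - predicted) / actual) / n    # Eq. (19)
    re = np.sum(predicted - actual) / n                            # Eq. (20)
    return rmse, acc, re
```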

3. Results

3.1. SSP Prediction for a Single Coordinate Position

We initially performed prediction tests on estimating the SSP at a single coordinate position to ensure that the STA mechanism was successfully integrated into the STA-Conv-LSTM model for predicting the ocean sound speed at various spatial scales. Finding the best time step is the first step in making a time-series prediction. The number of months of historical data that should be used to obtain the best prediction is determined by the time step, a crucial parameter in the model. Before we tested STA-Conv-LSTM for SSP prediction at a single coordinate position, we tested it on SSP datasets with varying time steps to see how well it predicted changes in time. We chose 1, 6, 12, 18, 24, and 28 as our time steps to ensure that typical values were covered.
As Figure 8 shows, the SSP data were from location (24.5° N, 169.5° E), and comparing the prediction results for different time steps revealed that the results were very similar. This suggests that the time step did not significantly affect the SSP’s prediction accuracy.
The evaluation metrics of the STA-Conv-LSTM model’s prediction performance at different time steps are shown in Table 1. When the time step was 24, the best results were achieved, with a minimum RE of 0.7379 m/s, a minimum RMSE of 0.9878, and a maximum ACC of 95.12%. It may be inferred that using 24 months of continuous historical data to forecast the following month yields the most accurate results, with the smallest margin of error. Because the SSP is a time series, its distribution follows a proximity trend, and the speed of sound is strongly affected by the cyclical change of seasons. As a result, the SSPs of the same month within the past 24 months, i.e., the data for the past two years, can help capture the trend of the SSP distribution in the coming month. We believe that there is a strong correlation between the data for a particular future month and the data for the same month within the past two years. Therefore, we used 24 months of historical data to predict future data when conducting the experiments comparing the capabilities of different models and when predicting SSP data in three dimensions.
After determining the optimal time step of 24, we used the same dataset (24.5° N, 169.5° E) with a time step of 24 to conduct comparative prediction experiments using the RNN, LSTM, and Conv-LSTM models to assess the STA-Conv-LSTM model’s relative performance. The prediction results of the STA-Conv-LSTM model and its comparison models are displayed in Figure 9. The actual sound speed in June 2023 was much lower than the anticipated sound speed, and some of the RNN’s predicted values did not deviate greatly from the real values. We compared the sound speed prediction errors of the different model approaches, as shown in Figure 10. The line in the middle of each box represents the average value. The red part presents the sound speed structure prediction errors of our proposed STA-Conv-LSTM model, which contains 24 data points corresponding to different months. The other parts present the sound speed prediction errors of the RNN, LSTM, and Conv-LSTM models. Clearly, compared with the other three methods, our proposed STA-Conv-LSTM model achieved the lowest average RMSE for sound speed, and its error distribution was relatively dense. This result demonstrates the accuracy and stability of our proposed STA-Conv-LSTM model in predicting ocean sound speed. Table 2 shows the results of the comparison, with higher ACC values indicating that STA-Conv-LSTM is better suited to the real-world data. Additionally, STA-Conv-LSTM had the lowest RMSE among all the models, indicating that its prediction results agreed best with the real data. Therefore, we can conclude that the STA-Conv-LSTM network outperforms the other models, with accurate SSP prediction results at a single coordinate position, and that it can satisfy the requirements of research work. This model greatly improves the prediction quality of SSPs, as it satisfies the requirements of the sound speed correction standard proposed by the International Hydrographic Organization.

3.2. SSP Prediction in Three Dimensions

Although Figure 8, Figure 9 and Figure 10 show the SSP prediction results for individual coordinate positions, STA-Conv-LSTM can also predict spatial variations in the SSP. To predict the ocean sound speed in 3D space, a new dataset containing all the SSP data for the selected region was required. The format of each data point of the dataset was (20 × 20 × 58), i.e., 20 × 1° of longitude, 20 × 1° of latitude, and 58 layers of water depth. Figure 11 displays the STA-Conv-LSTM prediction results for the chosen area. Using data from 0, 4, 8, 16, and 100 m below the surface, we compared our predictions with the actual results.
The predicted data distribution was very close to the real one, as shown in Figure 11. For every assessment parameter, Table 3 directly compares the predicted and the actual outcomes. The range of sound velocity distributions was more concentrated in deep water than in shallow water. In shallow water, particularly in the surface layer of seawater, the ACC was lower (only approximately 90%), and the RE between the forecast and the actual results was about 5.7 m/s. By contrast, in deep water, especially at a depth of 1200 m, the RE was the smallest (less than 1.1 m/s), the ACC was the largest (nearly 96%), and the RMSE was the smallest at 0.1 m/s. Moreover, the average RMSE was as small as 0.0495 m/s. Because the RMSE was below 0.05, the accuracy reached was comparable to that of the individual coordinate position predictions. In other words, closer to the seawater surface, the deviation was larger, but at greater depths, the prediction was better. The experimental results demonstrate that the STA-Conv-LSTM network, when trained with 3D SSP data, can accurately predict the SSP at any given coordinate position in the region. The method’s accuracy met the standards set for this type of research task, making it a valuable tool for ocean observation.

3.3. Comparison of Multi-Spatial-Scale Sound Speed Profile Predictions

When the prediction object changed from a single point to a 3D area, the error curve of sound speed prediction using the STA-Conv-LSTM model decreased, indicating that the prediction accuracy gradually improved, as shown in Figure 12. This figure presents the prediction errors of the STA-Conv-LSTM model for different prediction objects. The blue line represents the RMSE for the single-point SSP prediction; the red line represents the RMSE for the 3D area SSP prediction. Throughout the year, the errors for both prediction objects were smaller from April to August and larger from January to March and from September to December. This is because the target sea area in this study was examined during the warming period from January to March and the rainy season from September to December. However, because it was affected by a monsoon, the sea area experienced strong winds and waves from September to April of the following year. The combination of these factors led to frequent fluctuations in seawater temperature and salinity. Therefore, the changes in ocean sound speed exhibited weak regularity. When a 3D-area SSP was predicted, the RMSE decreased from September to December, reflecting the advantages of 3D prediction over single-point SSP prediction, as it incorporates more spatial information and reduces the impact of local fluctuations.

4. Discussion

On the basis of Conv-LSTM and the fusion of temporal and spatial attention mechanisms, this paper proposes a multi-scale ocean sound speed prediction method called the STA-Conv-LSTM model. This method was employed to predict the SSP for a single coordinate position and in three dimensions. The results in Section 3.1, Section 3.2 and Section 3.3 reveal that this method exhibited several distinct characteristics.

4.1. Analysis of SSP Prediction at a Single Coordinate Position

After determining that the optimal prediction step length for the STA-Conv-LSTM model is 24 months, as shown in Figure 8, we further explored the importance of historical data in sound speed prediction. Historical data not only provided the basic training set for the model but also established a temporal relationship during the prediction process. The time series characteristics [55] of sound speed profiles allow 24 months of historical data to effectively capture periodic changes and long-term trends [56]. This approach effectively utilizes the time-varying characteristics of the environment, assisting the model in accurately predicting sound speed for the upcoming month.
In terms of model comparison, significant differences in prediction accuracy among different models reveal the strengths and weaknesses of various machine learning algorithms. For instance, while RNN [57] and LSTM [39] have certain advantages in time-series prediction, their performance is limited when dealing with complex environmental factors, such as fluctuations in temperature and salinity. Although Conv-LSTM accounts for the spatio-temporal complexities of sound speed [49], STA-Conv-LSTM, by integrating convolutional layers, LSTM layers, and spatio-temporal attention mechanisms [58], is able to better capture local features and global information, thus achieving higher accuracy in predicting sound speed in shallow water areas.
The comparison between deep and shallow water regions further emphasizes the impact of water depth on sound speed prediction accuracy. In shallow waters, the complexity of environmental factors [59,60,61,62], such as sunlight, wind, tides, and currents, significantly increases the challenges of prediction. We believe that future research could incorporate more environmental monitoring data, such as meteorological data [63] and ocean sensor data [64], to enhance the accuracy of sound speed predictions in shallow water regions.

4.2. Analysis of SSP Prediction in Three Dimensions

In the prediction of a three-dimensional SSP, the STA-Conv-LSTM model achieved an average accuracy of 93%, as demonstrated in Figure 10 and Figure 11. By utilizing a three-dimensional model, we could effectively capture the spatial variations in sound speed. This integration of spatial information, especially the changes at different depths, is crucial for understanding the behavior of sound wave propagation.
Comparing the prediction results between shallow and deep water regions, as shown in Table 3, further reveals the complexity of environmental factors [65,66]. In shallow water regions, the instability of sound speed is more pronounced due to the influences of seasonal variations [55] and biological activities [59,60,61,62]. Therefore, improving prediction accuracy in shallow waters relies on a detailed analysis of multiple environmental factors. In future research, employing machine learning and data fusion techniques [67,68], along with deep learning models, will provide an effective way to enhance the quality of sound speed predictions.

4.3. Analysis of Sound Speed Profile Prediction at Different Spatial Scales

When the prediction range for sound speed was extended to three-dimensional areas, as shown in Figure 12, the STA-Conv-LSTM model demonstrated exceptional effectiveness in handling complex spatial data. This efficiency stems from the model’s ability to effectively integrate multidimensional information, thereby limiting the influence of localized data fluctuations.
Regarding fluctuations caused by environmental factors [59,60,61,62], the model’s three-dimensional predictive capability [66] allows for a more comprehensive identification of and response to various changes. In practical applications, combining real-time data monitoring with model predictions can significantly enhance the real-time predictive performance of ocean sound speed. Future research could explore how to integrate this model with Internet of Things (IoT) technology [68] to allow for continuous real-time data updates, further improving response speed and accuracy so as to adapt to the dynamically changing marine environment.
In summarizing the above sections, we emphasize the broad application prospects of the STA-Conv-LSTM model in ocean sound speed prediction. As methods of data acquisition and processing continue to evolve, coupled with advanced machine learning techniques [68], this model has the potential to provide valuable predictive information across a broader range of marine research and applications, such as maritime navigation, fishery resource management, and acoustic detection.
Overall, the advantages of the STA-Conv-LSTM model lie not only in its high predictive accuracy but also in its scalability and adaptability. Future research incorporating multi-dimensional data [67] and interdisciplinary approaches is expected to drive further development and innovation in this field.

5. Conclusions

With the rapid advancement of computational power and continuous breakthroughs in deep-learning technologies, the predictive accuracy of spatio-temporal sequence prediction algorithms for ocean sound speed has significantly improved. In this paper, to investigate the multi-spatial-scale interactions of sound speed and achieve precise predictions with limited data, we introduced the STA-Conv-LSTM framework, which integrates the STA mechanism with Conv-LSTM. We validated the accuracy of our method through experiments on the BOA_Argo dataset. The main conclusions are as follows:
  • For predicting the SSP at a single coordinate location, the experimental study first evaluated the forecasting performance of the STA-Conv-LSTM model under different time step configurations, establishing the optimal time step as 24 months. This means that data for each future time point will be predicted using the monthly mean historical data of the past 24 months. This conclusion is justified by considering the temporal trends and periodicity associated with the distribution of the SSPs. To confirm the superiority of the model, a comparative analysis was conducted with RNN, LSTM, and Conv-LSTM networks. Obtained under the optimal parameter settings, the results showed that the STA-Conv-LSTM network achieved the highest prediction accuracy, with an RMSE of 0.8978 and an ACC exceeding 95%. Additionally, the deviation from the actual data was less than 1 m/s, demonstrating the effectiveness of the STA-Conv-LSTM network in predicting SSP data.
  • Three-dimensional SSP prediction was performed in the selected measurement area using the complete historical SSP data to achieve spatio-temporal forecasting. The predictions across different water depths from shallow to deep were compared with the actual data, revealing that the prediction accuracy was greater in deeper waters, achieving an RMSE of 0.0495 and an ACC exceeding 95%. By contrast, the predictive performance was slightly less effective in shallow water, where the RMSE remained around 0.1, with an ACC just below 90%. Overall, these results indicate that the STA-Conv-LSTM network, by effectively capturing the interrelationships among SSPs at various spatial points, achieved higher accuracy for 3D area SSP predictions than for SSP predictions at a single coordinate.
The experimental results demonstrate that the STA-Conv-LSTM network can perform SSP predictions in 3D spatial contexts within the measurement area with sufficient accuracy for practical applications. For future work, the prediction of future SSPs based on historical data could be applied to single-beam and multi-beam ocean observation methods, thereby reducing the workload and costs while improving measurement precision. The prediction of SSPs reflects the sound speed distribution law governing the multi-spatial-scale characteristics of the ocean. The proposed ocean sound speed prediction method based on the STA-Conv-LSTM model has better prediction performance when considering the coupling effect of multiple spatial scales. This study lays an effective foundation for the 3D development of marine sound speed prediction techniques.
However, at present, our summary of multi-spatial-scale coupling effects is preliminary and crude and does not accurately describe the interaction between the sound velocities at each spatial location. Moreover, the sound velocity prediction lacks consideration of the dynamic process of ocean temperature and salinity correlation.
In future research, we will focus on exploring the effects of multi-dimensional spatial coupling on sound speed prediction, with a view to obtaining spatially continuous sound speed predictions and improving the accuracy and efficiency of prediction. In addition, we plan to combine the STA-Conv-LSTM model with physical knowledge of ocean dynamic processes to improve its efficiency in handling physical relationships in space and time, as well as the interpretability of the model.

Author Contributions

S.W.: conceptualization, methodology, writing—original draft. Z.W.: conceptualization, methodology, supervision. S.J.: conceptualization, methodology, writing—original draft. D.Z.: methodology, supervision. J.S.: methodology, supervision. M.W.: validation, visualization, investigation. J.Z.: supervision. X.Q.: validation, visualization, investigation. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Key Research & Development Program of China [grant numbers: 2022YFC2806600 and 2022YFC2806605]; the Scientific Research Fund of the Second Institute of Oceanography, MNR [grant number: QNYC2403]; the National Natural Science Foundation of China [grant number: 42306210]; and the Oceanic Interdisciplinary Program of Shanghai Jiao Tong University [grant numbers: SL2022ZD205, SL2023ZD102 and SL2023ZD203].

Data Availability Statement

The data presented in this study are openly available in BOA_Argo at ftp://data.argo.org.cn/pub/ARGO/BOA_Argo/ (accessed on 5 December 2024).

Acknowledgments

We thank the anonymous reviewers for editing the English text of a draft of this manuscript.

Conflicts of Interest

The authors declare that there are no conflicts of interest for this manuscript.

References

  1. Akyildiz, I.F.; Pompili, D.; Melodia, T. Underwater acoustic sensor networks: Research challenges. Ad Hoc Netw. 2005, 3, 257–279. [Google Scholar] [CrossRef]
  2. Stojanovic, M.; Preisig, J. Underwater acoustic communication channels: Propagation models and statistical characterization. IEEE Commun. Mag. 2009, 47, 84–89. [Google Scholar] [CrossRef]
  3. Kinsler, L.E.; Frey, A.R.; Coppens, A.B.; Sanders, J.V. Fundamentals of Acoustics, 4th ed.; John Wiley and Sons: Hoboken, NJ, USA, 2000; 480p. [Google Scholar]
  4. Heidemann, J.; Stojanovic, M.; Zorzi, M. Underwater sensor networks: Applications, advances and challenges. Philos. Trans. R. Soc. A Math. Phys. Eng. Sci. 2012, 370, 158–175. [Google Scholar] [CrossRef] [PubMed]
  5. Chen, C.T.; Millero, F.J. Speed of sound in seawater at high pressures. J. Acoust. Soc. Am. 1977, 62, 1129–1135. [Google Scholar] [CrossRef]
  6. Mackenzie, K.V. Nine-term equation for sound speed in the oceans. J. Acoust. Soc. Am. 1981, 70, 807–812. [Google Scholar] [CrossRef]
  7. Storto, A.; Falchetti, S.; Oddo, P.; Jiang, Y.M.; Tesei, A. Assessing the impact of different ocean analysis schemes on oceanic and underwater acoustic predictions. J. Geophys. Res. Oceans 2020, 125, e2019JC015636. [Google Scholar] [CrossRef]
  8. Wu, Z.; Yang, F.; Tang, Y. High-Resolution Seafloor Survey and Applications; Springer & Science Press: Beijing, China, 2020; p. 625. ISBN 978-7-03-066031-9. [Google Scholar]
  9. Zhao, D.; Wu, Z.; Zhou, J.; Li, J.; Shang, J.; Li, S. A new method of automatic SVP optimization based on MOV algorithm. Mar. Geod. 2015, 38, 225–240. [Google Scholar] [CrossRef]
  10. Yang, F.; Li, J.; Wu, Z.; Jin, X.; Chu, F.; Kang, Z. A post-processing method for the removal of refraction artifacts in multibeam bathymetry data. Mar. Geod. 2007, 30, 235–247. [Google Scholar] [CrossRef]
  11. Liu, Y.Y.; Chen, Y.; Meng, Z.; Chen, W. Performance of single empirical orthogonal function regression method in global sound speed profile inversion and sound field prediction. Appl. Ocean Res. 2023, 135, 103598. [Google Scholar] [CrossRef]
  12. Munk, W.; Wunsch, C. Ocean acoustic tomography: A scheme for large scale monitoring. Deep. Sea Res. Part A Oceanogr. Res. Pap. 1979, 25, 123–161. [Google Scholar] [CrossRef]
  13. Skarsoulis, E.K.; Athanassoulis, G.A.; Send, U. Ocean acoustic tomography based on peak arrivals. J. Acoust. Soc. Am. 1996, 100, 797–813. [Google Scholar] [CrossRef]
  14. Tolstoy, A.; Diachok, O.; Frazer, L.N. Acoustic tomography via matched field processing. J. Acoust. Soc. Am. 1991, 89, 1119–1127. [Google Scholar] [CrossRef]
  15. Goncharov, V.V.; Voronovich, A.G. An experiment on matched-field acoustic tomography with continuous wave signals in the Norway Sea. J. Acoust. Soc. Am. 1993, 93, 1873–1881. [Google Scholar] [CrossRef]
  16. Candy, J.V.; Sullivan, E.J. Sound velocity profile estimation: A system theoretic approach. IEEE J. Ocean. Eng. 1993, 18, 240–252. [Google Scholar] [CrossRef]
  17. Carrière, O.; Hermand, J.P.; Candy, J.V. Inversion for time-evolving sound-speed field in a shallow ocean by ensemble Kalman filtering. IEEE J. Ocean. Eng. 2009, 34, 586–602. [Google Scholar] [CrossRef]
  18. Gerstoft, P.; Mecklenbräuker, C.F.; Seong, W.; Bianco, M. Introduction to compressive sensing in acoustics. J. Acoust. Soc. Am. 2018, 143, 3731–3736. [Google Scholar] [CrossRef]
  19. Bianco, M.; Gerstoft, P. Compressive acoustic sound speed profile estimation. J. Acoust. Soc. Am. 2016, 139, EL90–EL94. [Google Scholar] [CrossRef]
  20. Choo, Y.; Seong, W. Compressive sound speed profile inversion using beamforming results. Remote Sens. 2018, 10, 704. [Google Scholar] [CrossRef]
  21. Park, J.C.; Kennedy, R.M. Remote sensing of ocean sound speed profiles by a perceptron neural network. IEEE J. Ocean. Eng. 1996, 21, 216–224. [Google Scholar] [CrossRef]
  22. Jain, S.; Ali, M.M. Estimation of sound speed profiles using artificial neural networks. IEEE Geosci. Remote Sens. Lett. 2006, 3, 467–470. [Google Scholar] [CrossRef]
  23. Huang, J.; Luo, Y.; Shi, J.; Ma, X.; Li, Q.Q.; Li, Y.Y. Rapid modeling of the sound speed field in the South China Sea based on a comprehensive optimal LM-BP artificial neural network. J. Mar. Sci. Eng. 2021, 9, 488. [Google Scholar] [CrossRef]
  24. Li, B.Y.; Zhai, J.S. A novel sound speed profile prediction method based on the convolutional long-short term memory network. J. Mar. Sci. Eng. 2022, 10, 572. [Google Scholar] [CrossRef]
  25. Espeholt, L.; Agrawal, S.; Sønderby, C.; Kumar, M.; Heek, J.; Bromberg, C.; Gazen, C.; Carver, R.; Andrychowicz, M.; Hickey, J.; et al. Deep learning for twelve hour precipitation forecasts. Nat. Commun. 2022, 13, 5145. [Google Scholar] [CrossRef] [PubMed]
  26. Shao, Q.; Li, W.; Han, G.J.; Hou, G.C.; Liu, S.Y.; Gong, Y.T.; Qu, P. A deep learning model for forecasting sea surface height anomalies and temperatures in the South China Sea. J. Geophys. Res. Ocean. 2021, 125, e2021JC017515. [Google Scholar] [CrossRef]
  27. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444. [Google Scholar] [CrossRef]
  28. Ham, Y.G.; Kim, J.H.; Luo, J.J. Deep learning for multi-year ENSO forecasts. Nature 2019, 573, 568–572. [Google Scholar] [CrossRef]
  29. Xiao, C.J.; Chen, N.C.; Hu, C.L.; Wang, K.; Gong, J.Y.; Chen, Z.Q. Short and mid-term sea surface temperature prediction using time-series satellite data and LSTM-AdaBoost combination approach. Remote Sens. Environ. 2019, 333, 111858. [Google Scholar] [CrossRef]
  30. Andersson, T.R.; Hosking, J.S.; Pérez-Ortiz, M.; Paige, B.; Elliott, A.; Russell, C.; Law, S.; Jones, D.C.; Wilkinson, J.; Phillips, T.; et al. Seasonal Arctic sea ice forecasting with probabilistic deep learning. Nat. Commun. 2021, 12, 5124. [Google Scholar] [CrossRef]
  31. Saunders, P.M. Practical conversion of pressure to depth. J. Phys. Oceanogr. 1981, 11, 573–574. [Google Scholar] [CrossRef]
  32. Yu, X.K.; Xu, T.H.; Wang, J.T. Sound Velocity Profile Prediction Method Based on RBF Neural Network. In Proceedings of the China Satellite Navigation Conference (CSNC) 2020 Proceedings, Chengdu, China, 22–25 November 2020; Volume III, pp. 475–487. [Google Scholar]
  33. Zhang, Q.; Wang, H.; Dong, J.Y.; Zhong, G.; Sun, X. Prediction of Sea Surface Temperature Using Long Short-Term Memory. IEEE Geosci. Remote Sens. Lett. 2017, 14, 1745–1749. [Google Scholar] [CrossRef]
  34. Ali, A.; Fathalla, A.; Salah, A.; Bekhit, M.; Eldesouky, E. Marine Data Prediction: An Evaluation of Machine Learning, Deep Learning, and Statistical Predictive Models. Comput. Intell. Neurosci. 2021, 27, 8551167. [Google Scholar]
  35. Ou, Z.Y.; Qu, K.; Liu, C. Estimation of sound speed profiles using a random forest model with satellite surface observations. Shock Vib. 2022, 2022, 2653791. [Google Scholar] [CrossRef]
  36. Ou, Z.Y.; Qu, K.; Shi, M.; Wang, Y.F.; Zhou, J.B. Estimation of sound speed profiles based on remote sensing parameters using a scalable end-to-end tree boosting model. Front. Mar. Sci. 2022, 9, 1051820. [Google Scholar]
  37. Piao, S.C.; Yan, X.; Li, Q.Q.; Li, Z.L.; Wang, Z.W.; Zhu, J.L. Time series prediction of shallow water sound speed profile in the presence of internal solitary wave trains. Ocean Eng. 2023, 283, 115058. [Google Scholar]
  38. Wu, P.F.; Zhang, H.; Shi, Y.J.; Lu, J.J.; Li, S.J.; Huang, W.; Tang, N.; Wang, S.J. Real-time estimation of underwater sound speed profiles with a data fusion convolutional neural network model. Appl. Ocean Res. 2024, 150, 104088. [Google Scholar]
  39. Gao, C.; Cheng, L.; Zhang, T.; Li, J.L. Long-term Forecasting of Ocean Sound Speeds at Any Time via Neural Ordinary Differential Equations. In Proceedings of the OCEANS 2024—Singapore, Singapore, 15–18 April 2024; pp. 1–6. [Google Scholar]
  40. Xu, Y.H.; Hou, J.Y.; Zhu, X.J.; Wang, C.; Shi, H.D.; Wang, J.Y. Hyperspectral Image Super-Resolution with ConvLSTM Skip Connections. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–16. [Google Scholar]
  41. Abbass, M.J.; Lis, R.; Awais, M.; Nguyen, T.X. Convolutional Long Short-Term Memory (ConvLSTM)-Based Prediction of Voltage Stability in a Microgrid. Energies 2024, 17, 1999. [Google Scholar] [CrossRef]
  42. Zheng, L.; Lu, W.S.; Zhou, Q.Y. Weather image-based short-term dense wind speed forecast with a ConvLSTM-LSTM deep learning model. Build. Environ. 2023, 239, 110446. [Google Scholar] [CrossRef]
  43. He, R.; Liu, Y.B.; Xiao, Y.P.; Lu, X.Y.; Zhang, S. Deep spatio-temporal 3D densenet with multiscale ConvLSTM-Resnet network for citywide traffic flow forecasting. Knowl.-Based Syst. 2022, 250, 109054. [Google Scholar]
  44. Lv, Z.Q.; Ma, Z.B.; Xia, F.Q.; Li, J.B. A transportation Revitalization index prediction model based on Spatial-Temporal attention mechanism. Adv. Eng. Inform. 2024, 61, 102519. [Google Scholar] [CrossRef]
  45. Xu, C.Y.; Xu, C.Q. Local spatial and temporal relation discovery model based on attention mechanism for traffic forecasting. Neural Netw. 2024, 176, 106365. [Google Scholar] [PubMed]
  46. Li, H.; Xu, F.; Zhou, W.; Wang, D.; Wright, J.S.; Liu, Z.; Lin, Y. Development of a global gridded Argo data set with Barnes successive corrections. J. Geophys. Res. Ocean. 2017, 122, 866–889. [Google Scholar] [CrossRef]
  47. Del Grosso, V.A. New equation for the speed of sound in natural waters (with comparisons to other equations). J. Acoust. Soc. Am. 1974, 56, 1084–1091. [Google Scholar] [CrossRef]
  48. Wang, H.Y.; Xu, P.D.; Zhao, J.H. Improved KNN Algorithm Based on Preprocessing of Center in Smart Cities. Complexity 2021, 2021, 5524388. [Google Scholar]
  49. Shi, X.J.; Chen, Z.R.; Wang, H.; Yeung, D.Y.; Wong, W.K.; Woo, W.C. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In Proceedings of the 28th International Conference on Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; MIT Press: Cambridge, MA, USA, 2015; pp. 802–810. [Google Scholar]
  50. Lindemann, B.; Müller, T.; Vietz, H.; Jazdi, N.; Weyrich, M. A Survey on Long Short-Term Memory Networks for Time Series Prediction. Procedia CIRP 2021, 99, 650–655. [Google Scholar]
  51. Agga, A.; Abbou, A.; Labbadi, M.; Houm, Y.E. Short-Term Self Consumption PV Plant Power Production Forecasts Based on Hybrid CNN-LSTM, ConvLSTMModels. Renew. Energy 2021, 177, 101–112. [Google Scholar]
  52. Moishin, M.; Deo, R.C.; Prasad, R.; Rai, N.; Abdulla, S. Designing Deep-Based Learning Flood Forecast Model with ConvLSTM Hybrid Algorithm. IEEE Access 2021, 9, 50982–50993. [Google Scholar]
  53. Peng, Y.Q.; Tao, H.F.; Li, W.; Yuan, H.T.; Li, T.J. Dynamic Gesture Recognition Based on Feature Fusion Network and Variant ConvLSTM. IET Image Process. 2020, 14, 2480–2486. [Google Scholar]
  54. Guo, F.; Yang, J.; Li, H.; Li, G.; Zhang, Z. A ConvLSTM Conjunction Model for Groundwater Level Forecasting in a Karst Aquifer Considering Connectivity Characteristics. Water 2021, 13, 2759. [Google Scholar] [CrossRef]
  55. Liu, Y.; Chen, W.; Chen, W.; Chen, Y.; Ma, L.; Meng, Z. Reconstruction of ocean front model based on sound speed clustering and its effectiveness in ocean acoustic forecasting. Appl. Sci. 2021, 11, 8461. [Google Scholar] [CrossRef]
  56. Chen, C.; Lei, B.; Ma, Y.; Liu, Y.; Wang, Y. Diurnal fluctuation of shallow-water acoustic propagation in the cold dome off northeastern Taiwan in spring. IEEE J. Ocean. Eng. 2020, 45, 1099–1111. [Google Scholar] [CrossRef]
  57. Elman, J.L. Finding structure in time. Cogn. Sci. 1990, 14, 179–211. [Google Scholar] [CrossRef]
  58. Niu, Z.Y.; Zhong, G.Q.; Yu, H. A Review on The Attention Mechanism of Deep Learning. Neurocomputing 2021, 452, 48–62. [Google Scholar] [CrossRef]
  59. Chen, W.; Zhang, Y.; Liu, Y.; Ma, L.; Wang, H.; Ren, K.; Chen, S. Parametric model for eddies-induced sound speed anomaly in five active mesoscale eddy regions. J. Geophys. Res. Oceans 2022, 127, e2022JC018408. [Google Scholar] [CrossRef]
  60. Xiao, Y.; Li, Z.; Sabra, K.G. Effect of mesoscale eddies on deep-water sound propagation. J. Acoust. Soc. Am. 2018, 143, 1873–1874. [Google Scholar] [CrossRef]
  61. Shapiro, G.; Chen, F.; Thain, R. The effect of ocean fronts on acoustic wave propagation in the Celtic Sea. J. Mar. Syst. 2014, 139, 217–226. [Google Scholar] [CrossRef]
  62. Chen, C.; Yang, K.; Duan, R.; Ma, Y. Acoustic propagation analysis with a sound speed feature model in the front area of Kuroshio Extension. Appl. Ocean Res. 2017, 68, 1–10. [Google Scholar] [CrossRef]
  63. Navarra, A.; Simoncini, V. A Guide to Empirical Orthogonal Functions for Climate Data Analysis; Springer: Dordrecht, The Netherlands, 2010; pp. 39–67. [Google Scholar] [CrossRef]
  64. Qu, K.; Zou, B.; Zhou, J. Rapid environmental assessment in the South China Sea: Improved inversion of sound speed profile using remote sensing data. Acta Oceanol. Sin. 2022, 41, 78–83. [Google Scholar] [CrossRef]
  65. Duda, T.F.; Lavery, A.C.; Lin, Y.; Zhang, W. Sound propagation effects of near-seabed internal waves in shallow water. J. Acoust. Soc. Am. 2018, 143, 1975. [Google Scholar] [CrossRef]
  66. Lin, Y.; Lynch, J.F. Three-dimensional sound propagation and scattering in an ocean with surface and internal waves over range-dependent seafloor. J. Acoust. Soc. Am. 2017, 141, 3753. [Google Scholar] [CrossRef]
  67. Sarkar, P.; Janardhan, P.; Roy, P. Applicability of a long short-term memory deep learning network in sea surface temperature predictions. In Proceedings of the Earth 1st International Conference on Water Security and Sustainability, San Luis Potosí, Mexico, 28–30 October 2019. [Google Scholar]
  68. Sarkar, P.; Janardhan, P.; Roy, P. Prediction of sea surface temperatures using deep learning neural networks. SN Appl. Sci. 2020, 2, 1–14. [Google Scholar] [CrossRef]
Figure 1. Location of the selected area. The coordinate position (24.5° N, 169.5° E) is indicated by a red solid circle, and the study area (15.5° N–34.5° N, 160.5° E–179.5° E) is indicated by a blue box.
Figure 2. BOA_Argo profile data for point (24.5° N, 169.5° E). (a) Temperature profile; (b) salinity profile; (c) sound velocity profile.
Figure 3. Division of the data into training and validation datasets. “S” stands for dataset, “X_train” indicates the input data, “Y_train” indicates the output data, and t denotes the time step.
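To make the split in Figure 3 concrete, the sliding-window construction can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code; the function name, the NumPy dependency, and the placeholder array shapes are assumptions for the example.

```python
import numpy as np

def make_samples(ssp_series, time_step):
    """Slide a window over a monthly SSP series: each sample uses `time_step`
    consecutive months as input (X) and the following month as the target (Y)."""
    X, Y = [], []
    for start in range(len(ssp_series) - time_step):
        X.append(ssp_series[start:start + time_step])
        Y.append(ssp_series[start + time_step])
    return np.asarray(X), np.asarray(Y)

# Placeholder data: 234 monthly profiles sampled at 58 depth levels (shapes are illustrative).
ssp_series = np.random.rand(234, 58).astype("float32")
X_train, Y_train = make_samples(ssp_series, time_step=24)
print(X_train.shape, Y_train.shape)  # (210, 24, 58) (210, 58)
```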
Figure 4. Conv-LSTM neural unit structure, where i, f, c, and o denote the input gate, forget gate, control unit, and output gate, respectively; σ represents the nonlinear activation function; x_t signifies the input at time t; W_xi, W_hi, and W_ci are the weight matrices; “∗” denotes the convolution operator; “∘” indicates the Hadamard product; H_t represents the output value at time t; and o_t signifies the gated information at the output gate.
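For readers who want the equations behind the unit sketched in Figure 4, the widely used Conv-LSTM cell can be written as follows. This is the standard formulation consistent with the symbols in the caption; the bias terms b are included here for completeness and the paper's exact parameterization may differ slightly.

$$
\begin{aligned}
i_t &= \sigma\left(W_{xi} \ast x_t + W_{hi} \ast H_{t-1} + W_{ci} \circ c_{t-1} + b_i\right),\\
f_t &= \sigma\left(W_{xf} \ast x_t + W_{hf} \ast H_{t-1} + W_{cf} \circ c_{t-1} + b_f\right),\\
c_t &= f_t \circ c_{t-1} + i_t \circ \tanh\left(W_{xc} \ast x_t + W_{hc} \ast H_{t-1} + b_c\right),\\
o_t &= \sigma\left(W_{xo} \ast x_t + W_{ho} \ast H_{t-1} + W_{co} \circ c_{t-1} + b_o\right),\\
H_t &= o_t \circ \tanh(c_t),
\end{aligned}
$$

where ∗ is the convolution operator and ∘ the Hadamard product.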
Figure 5. Structure of the STA-Conv-LSTM network. The red block represents the convolutional neural network module, the yellow block represents the LSTM module, the green block represents the temporal attention mechanism module, and the blue block represents the spatial attention mechanism module.
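As a companion to Figure 5, the sketch below shows one way the four blocks can be composed in Keras for sequence-to-one prediction (24 input months, one predicted field). It is a simplified sketch under assumed shapes and layer sizes, not the authors' implementation; the attention blocks are reduced to compact stand-ins, and slightly fuller sketches follow the Figure 6 and Figure 7 captions below.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Illustrative shapes only: 24 monthly frames of a 20 x 20 grid with one channel.
T, H, W, C = 24, 20, 20, 1

inputs = layers.Input(shape=(T, H, W, C))

# Convolutional-recurrent backbone (the red and yellow blocks in Figure 5).
x = layers.ConvLSTM2D(32, kernel_size=3, padding="same", return_sequences=True)(inputs)

# Temporal-attention stand-in (green block): softmax weights over the T steps,
# followed by a weighted sum that collapses the time axis.
scores = layers.Dense(T)(layers.Lambda(lambda z: tf.reduce_mean(z, axis=[2, 3, 4]))(x))
alpha = layers.Reshape((T, 1, 1, 1))(layers.Softmax(axis=-1)(scores))
x = layers.Lambda(lambda zw: tf.reduce_sum(zw[0] * zw[1], axis=1))([x, alpha])

# Spatial-attention stand-in (blue block): a sigmoid mask over the grid.
mask = layers.Conv2D(1, kernel_size=7, padding="same", activation="sigmoid")(x)
x = layers.Lambda(lambda zm: zm[0] * zm[1])([x, mask])

# Output head: the predicted field for the following month.
outputs = layers.Conv2D(C, kernel_size=3, padding="same")(x)

model = models.Model(inputs, outputs)
model.compile(optimizer="adam", loss="mse")
model.summary()
```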
Figure 6. Structure of the temporal attention module. “Conv2D” represents the convolutional layer (1 × 1 convolutional kernel), “Dense” represents the fully connected layer, and “Lambda” denotes an anonymous function to make the code more concise.
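A minimal, self-contained sketch of a temporal attention module of the kind described in the Figure 6 caption (1 × 1 Conv2D, Dense, Lambda). The exact layout of the authors' module may differ; the pooling choice, layer sizes, and smoke-test shapes here are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

def temporal_attention(seq):
    """Re-weight the time steps of a (batch, T, H, W, C) feature sequence."""
    T = seq.shape[1]
    # 1x1 convolution applied frame by frame (the "Conv2D" box in Figure 6).
    frame_feat = layers.TimeDistributed(layers.Conv2D(1, kernel_size=1, padding="same"))(seq)
    # Collapse each frame to a single descriptor (the "Lambda" box).
    frame_desc = layers.Lambda(lambda z: tf.reduce_mean(z, axis=[2, 3, 4]))(frame_feat)  # (batch, T)
    # Score each step and normalise over time (the "Dense" box plus a softmax).
    weights = layers.Softmax(axis=-1)(layers.Dense(T)(frame_desc))     # (batch, T)
    weights = layers.Reshape((T, 1, 1, 1))(weights)                    # broadcast over H, W, C
    return layers.Lambda(lambda zw: zw[0] * zw[1])([seq, weights])

# Tiny smoke test on random data (shapes are illustrative only).
inp = layers.Input(shape=(24, 20, 20, 32))
model = tf.keras.Model(inp, temporal_attention(inp))
print(model(tf.random.normal((2, 24, 20, 20, 32))).shape)  # (2, 24, 20, 20, 32)
```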
Figure 7. Structure of the spatial attention module. “Conv2D” is the convolutional layer (7 × 7 convolutional kernel), “Dense” indicates a fully connected layer, and “Lambda” is an anonymous function that makes the code more concise.
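Likewise, a spatial attention block built around a 7 × 7 Conv2D, as named in the Figure 7 caption, can be sketched in a CBAM-like form. The channel-pooling recipe is an assumption based on common practice, and the Dense stage shown in Figure 7 is omitted here for brevity.

```python
import tensorflow as tf
from tensorflow.keras import layers

def spatial_attention(feat, kernel_size=7):
    """Apply a spatial attention mask to a (batch, H, W, C) feature map."""
    # Channel-wise average and maximum maps summarise "where" the signal is strong.
    avg_map = layers.Lambda(lambda z: tf.reduce_mean(z, axis=-1, keepdims=True))(feat)
    max_map = layers.Lambda(lambda z: tf.reduce_max(z, axis=-1, keepdims=True))(feat)
    stacked = layers.Concatenate(axis=-1)([avg_map, max_map])          # (batch, H, W, 2)
    # 7x7 convolution plus sigmoid produces a per-pixel weight in [0, 1].
    mask = layers.Conv2D(1, kernel_size, padding="same", activation="sigmoid")(stacked)
    return layers.Lambda(lambda zm: zm[0] * zm[1])([feat, mask])

# Tiny smoke test (shapes are illustrative only).
inp = layers.Input(shape=(20, 20, 32))
model = tf.keras.Model(inp, spatial_attention(inp))
print(model(tf.random.normal((2, 20, 20, 32))).shape)  # (2, 20, 20, 32)
```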
Figure 8. Prediction results for different time steps. (a) Time step = 1; (b) time step = 6; (c) time step = 12; (d) time step = 18; (e) time step = 24; and (f) time step = 28.
Figure 9. Comparison of the prediction results of different models (time step = 24).
Figure 10. Comparison of the prediction errors of different models. The line in the middle of the box represents the average value. The red, yellow, purple, and green boxes indicate the results for the STA-Conv-LSTM, Conv-LSTM, LSTM, and RNN models, respectively.
Figure 11. SSP prediction for different water depths in 3D space. (a) Water depth = 0 m, real values; (b) water depth = 0 m, predicted values; (c) water depth = 400 m, real values; (d) water depth = 400 m, predicted values; (e) water depth = 800 m, real values; (f) water depth = 800 m, predicted values; (g) water depth = 1200 m, real values; (h) water depth = 1200 m, predicted values; (i) water depth = 1600 m, real values; (j) water depth = 1600 m, predicted values.
Figure 12. Prediction error for different prediction objects of the STA-Conv-LSTM model. The blue line represents the RMSE for the single-point SSP prediction; the red line represents the RMSE for the 3D-area SSP prediction.
Table 1. Prediction results of STA-Conv-LSTM network with different time steps.
Time Step    RMSE      ACC (%)    RE (m/s)
1            1.8629    91.89      1.2448
6            2.2551    90.19      1.4880
12           1.4568    93.32      1.0122
18           2.9064    87.04      1.9679
24           0.8978    95.12      0.7379
28           2.8697    88.19      1.7968
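For reference, the RMSE reported in Tables 1–3 is the standard root-mean-squared error between predicted and observed sound speeds; a minimal computation is shown below. The mean-absolute-error helper is only an assumed illustrative companion and does not necessarily match the paper's ACC and RE definitions, which the paper specifies separately.

```python
import numpy as np

def rmse(pred, true):
    """Root-mean-squared error (m/s) over all predicted sound speed values."""
    pred, true = np.asarray(pred, dtype=float), np.asarray(true, dtype=float)
    return float(np.sqrt(np.mean((pred - true) ** 2)))

def mae(pred, true):
    """Mean absolute error (m/s); an illustrative companion metric only."""
    pred, true = np.asarray(pred, dtype=float), np.asarray(true, dtype=float)
    return float(np.mean(np.abs(pred - true)))

# Toy example with two depth levels.
print(rmse([1500.0, 1480.0], [1501.0, 1479.0]))  # 1.0
print(mae([1500.0, 1480.0], [1501.0, 1479.0]))   # 1.0
```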
Table 2. Predictions of sound velocity profiles for different models.
Model            RMSE      ACC (%)    RE (m/s)
RNN              1.9768    88.31      1.7400
LSTM             1.5410    90.79      1.4009
Conv-LSTM        1.0296    92.96      1.1644
STA-Conv-LSTM    0.8978    95.12      0.7379
Table 3. Predictions for different water depths in the selected area (15.5° N–34.5° N, 160.5° E–179.5° E).
Water Depth (m)    RMSE      ACC (%)    RE (m/s)
0                  0.1098    89.88      5.691
400                0.0815    91.38      2.018
800                0.0647    92.77      1.428
1200               0.0601    94.80      1.079
1600               0.0495    95.95      0.659