Article

Ship-Radiated Noise Separation in Underwater Acoustic Environments Using a Deep Time-Domain Network

School of Marine Science and Technology, Northwestern Polytechnical University, Xi’an 710072, China
* Author to whom correspondence should be addressed.
J. Mar. Sci. Eng. 2024, 12(6), 885; https://doi.org/10.3390/jmse12060885
Submission received: 10 April 2024 / Revised: 22 May 2024 / Accepted: 24 May 2024 / Published: 26 May 2024
(This article belongs to the Section Ocean Engineering)

Abstract

Ship-radiated noise separation is critical in both military and economic domains. However, the complexity of underwater environments, with multiple noise sources and reverberation, makes separating ship-radiated noise a significant challenge. Traditionally, underwater acoustic signal separation has employed blind source separation methods based on independent component analysis. More recently, it has been treated as a deep learning problem in which the features of ship-radiated noise are learned from training data. This paper introduces a deep time-domain network for ship-radiated noise separation that leverages parallel dilated convolution and group convolution. The separation layer employs parallel dilated convolution operations with varying expansion factors to better extract low-frequency features from the signal envelope while preserving detailed information. In addition, group convolution counteracts the growth in network size caused by the parallel convolution operations, allowing the network to keep a small size and low computational complexity while achieving good separation performance. Comprehensive comparisons on the DeepShip dataset show that the proposed approach outperforms other common networks.

1. Introduction

Accurately separating noises radiated by different ships provides an important basis for subsequent applications, such as underwater acoustic target recognition, acoustic communication, and analysis of underwater situations.
According to the number of sensors used in the algorithms, the methods for signal separation can be categorized into two types, namely the single-channel method and the multi-channel method. Classical single-channel methods include spectral subtraction, Wiener filtering, adaptive filtering, etc. Multi-channel methods include subspace-based methods, blind source separation, etc.
Spectral subtraction [1,2] is a technique originally used for speech enhancement. It estimates the background noise spectrum from the early frames of the mixed signal by averaging their magnitude or energy spectra, and then subtracts this estimate to recover the amplitude or energy spectrum of the desired acoustic signal. Spectral subtraction is simple and effective for reducing stationary noise in speech signals. However, its success depends on an accurate estimate of the background noise spectrum: an inaccurate estimate leads to distortion or excessive residual noise after subtraction. For non-stationary noise such as ship-radiated noise, whose characteristics may change over time, noise separation becomes considerably more challenging.
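For concreteness, the following is a minimal magnitude spectral-subtraction sketch; the frame length and the assumption that the first few frames are noise-only are illustrative choices, not part of the original method description.

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtraction(mixed, fs, n_fft=1024, noise_frames=10):
    # STFT of the mixed signal.
    _, _, X = stft(mixed, fs=fs, nperseg=n_fft)
    mag, phase = np.abs(X), np.angle(X)
    # Estimate the noise magnitude spectrum from the (assumed noise-only) early frames.
    noise_mag = mag[:, :noise_frames].mean(axis=1, keepdims=True)
    # Subtract and floor at zero; negative magnitudes cause "musical noise" artifacts.
    clean_mag = np.maximum(mag - noise_mag, 0.0)
    # Reconstruct using the noisy phase.
    _, enhanced = istft(clean_mag * np.exp(1j * phase), fs=fs, nperseg=n_fft)
    return enhanced
```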
Wiener filtering [3] is another commonly used method for signal separation and enhancement. It minimizes the mean squared difference between the filter output and the desired signal, which amounts to solving a system of linear equations with a Toeplitz correlation matrix. For instance, the signal-to-noise ratio can be used to estimate the power spectral density of the clean speech signal. Similarly, statistical characteristics of the signals, such as their means and covariance matrices, can serve as prior knowledge for the Wiener filter. However, such priors are often unavailable in real engineering practice.
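In the frequency domain the filter reduces to a per-bin gain; the sketch below assumes the noise power spectrum is known a priori, which is exactly the prior knowledge the text notes is rarely available in practice.

```python
import numpy as np

def wiener_gain(noisy_psd, noise_psd, eps=1e-12):
    # Estimated clean-signal PSD (floored at zero).
    signal_psd = np.maximum(noisy_psd - noise_psd, 0.0)
    # Classic Wiener gain G = S / (S + N), applied independently per frequency bin.
    return signal_psd / (signal_psd + noise_psd + eps)
```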
Adaptive filtering [4,5] is based on linear filtering such as Wiener and Kalman filters, enabling real-time parameter adjustments to adapt to changes in signals. Consequently, it is suitable for dynamic and non-stationary signal processing. However, its performance depends on the selection of the adaptive algorithm, filter structure, step size, and other parameters. Suboptimal settings can lead to inefficient convergence or poor separation performance.
The subspace-based method [6] typically involves constructing a model of the signal subspace and using techniques such as singular value decomposition (SVD) [7,8,9] or principal component analysis (PCA) [10] to extract the underlying sources from the observed mixtures. It relies on assumptions about the signal and noise characteristics, and deviations from these assumptions can lead to suboptimal separation performance.
In the field of acoustic signal separation, blind source separation (BSS) based on independent component analysis (ICA) [11,12] is widely used. By exploiting the statistical independence or differing statistical characteristics of the source signals, BSS can effectively separate mixed signals. The first attempts to apply it in an underwater environment were made in the 1990s. Current research concentrates primarily on processing sonar signals and underwater acoustic array signals and on analyzing ship-radiated noise.
In 1997, Gaeta et al. [13] used BSS to estimate the impulse response function of the hydroacoustic channel. In 2003, Kirsteins [14] employed BSS to study the impact of ocean surface multi-path effects on synthetic aperture sonar. In 2006, Mansour et al. [15] dealt with the application of ICA algorithms in passive acoustic tomography (PAT). In 2011, Kamal et al. [16] combined slow feature analysis (SFA) with BSS for hydroacoustic signals. In the same year, Zhang et al. [17] used BSS to reduce tug interference on hydroacoustic signals. In 2015, Tu et al. [18] separated hydroacoustic signals with the negentropy-based FastICA algorithm. In 2017, Li et al. [19] used a spatial filter with a hydrophone array to separate underwater near-field sources. However, BSS requires the number of observed acoustic signals to be greater than or equal to the number of sources; otherwise, the problem becomes underdetermined and separation effectiveness decreases.
In recent years, there have been significant advances in signal separation methods based on deep learning. These techniques typically use an end-to-end approach that takes mixed signals in the time or time–frequency domain as input, usually without explicit feature extraction or complex spectral processing. Research on convolutional neural networks (CNNs) [20], U-networks (UNets) [21], and other deep learning networks has attracted much attention. The deep complex UNet (DCUNet) [22] combines the benefits of deep complex networks and UNets to process complex spectrograms by estimating complex ratio masks (CRMs). The residual U-network (Res-UNet) [23], typically used for vocal extraction in music, addresses the difficulty of complex ideal ratio mask (CIRM) estimation, which arises because the real and imaginary parts of complex masks are sensitive to temporal displacement of the signal. Recurrent neural networks (RNNs) [24] can learn correlations between signal features by processing long sequences with recurrent connections, and the long short-term memory (LSTM) [25] network mitigates gradient vanishing and explosion during training. Furthermore, residual neural networks (ResNets) [26], generative adversarial networks (GANs) [27], visual geometry group networks (VGGs) [28], and others have gradually been adopted for acoustic signal separation, demonstrating exceptional efficiency and extending the range of available methods [29,30,31].
Most deep learning-based techniques for acoustic signal separation use complex masking models based on the time–frequency spectrum, which requires converting the time-domain waveform into a time–frequency representation. Performing acoustic signal separation directly in the time domain is an important alternative. For instance, the time-domain audio separation network (TasNet) [32] and the convolutional time-domain audio separation network (Conv-TasNet) [33] estimate masks directly from the time-domain waveform, preserving phase information and reducing network size with 1D convolutions.
The separation of single-channel ship-radiated noise is complicated by difficulties associated with low frequency, spectral overlap, and other related issues. This paper proposes a time-domain separation network model based on the classical Conv-TasNet. The proposed method performs one-dimensional convolution on the time-domain waveform signal during the design of the separation layer mask. During the convolution process, parallel dilated convolutions are used to extract features from longer time-domain signals. On one hand, it extracts more signal envelope information to enhance the low-frequency processing, catering to the low-frequency characteristics of ship-radiated noise. On the other hand, it relies on long-term characteristics to extract more phase information. Based on this framework, group convolution is used to reduce the number of convolution kernels and convolution times, thereby reducing the network size expansion caused by parallel convolution while maintaining accuracy. The separation verification experiments conducted under various conditions demonstrate that the proposed network produces more precise and dependable results.
The rest of this paper is structured as follows: Section 2 provides a detailed description of the characteristics of ship-radiated noise and the structure of the proposed network. Section 3 introduces evaluation and validation. Finally, conclusions are given in Section 4.

2. Deep Time-Domain Ship-Radiated Noise Separation Network

2.1. Characteristics of Ship-Radiated Noise

The radiated noise of a ship mainly consists of mechanical noise, propeller noise, and hydrodynamic noise [34]. The spectrum of the ship-radiated noise is primarily composed of broadband noise in the continuous spectrum as well as a line spectrum at discrete frequencies.
Mechanical noise generated inside the ship includes propulsion system noise and auxiliary equipment noise and is the primary source of ship-radiated noise. It is caused by the mechanical vibrations of various ship components during navigation and is transmitted through the hull into the seawater. Propulsion noise is the sound energy radiated into the ocean by the propulsion system due to unbalanced vibrations and friction during rotation. Unbalanced vibration generates a narrowband signal containing the system rotation frequency and its harmonic components, whereas the noise generated by friction is predominantly broadband, continuous-spectrum noise. The propulsion system is located at the stern of the ship, and its noise energy is mainly concentrated below 100 Hz; it combines weak continuous-spectrum and strong line-spectrum characteristics. Auxiliary equipment noise is mainly generated during the operation of the auxiliary machinery located amidships, with frequencies concentrated between 100 Hz and 1 kHz.
Propeller noise is the cavitation noise caused by the high-speed rotation of the propeller blades stirring the surrounding water flow. Its frequency is mainly concentrated between 1 kHz and 5 kHz and is mostly in the form of a continuous spectrum. The noise generated by propellers due to cavitation can be divided into two aspects. Firstly, the rotation of the propellers creates numerous bubbles that produce noise when they burst. The frequency spectrum of this noise is continuous. Secondly, the propellers stir a large number of bubbles, which generate periodic, forced vibrations that cause noise. The spectrum of this noise is a discrete line spectrum.
Hydrodynamic noise is the main source of ship-radiated noise during high-speed navigation. It is generated by the structural vibration caused by the water flow passing through the ship and its ancillary parts due to the effects of fluid dynamics.

2.2. Deep Time-Domain Ship-Radiated Noise Separation Network

Assuming that there are two sources in the ocean, the mixed ship-radiated noise can be expressed as follows:
$$m(t) = n_1(t) + n_2(t) \quad (1)$$

where $m(t)$ denotes the mixed ship-radiated noise, and $n_1(t)$ and $n_2(t)$ denote the noise emitted by the two different ships as sampled at the observation point. The purpose of ship-radiated noise separation is to obtain each independent noise component from the mixture.
From a time-domain perspective, the envelope of the ship-radiated noise signal waveform can reflect the characteristics of the low-frequency component [35]. The envelope has a longer time span compared to the detailed information in the signal waveform. In response to these characteristics, a time-domain ship-radiated noise separation network was developed based on an end-to-end Conv-TasNet model. This model constructs masks in the time-domain signals directly. In order to enhance the capture of the envelope information, this network introduces dilated convolution, which expands the convolution range during the convolution process and increases the length of the sampled frame in the time domain. Figure 1 illustrates the framework of the network.
The network consists of three parts: an encoder layer, a separation layer, and a decoder layer. The encoder layer initially performs a one-dimensional convolution of the input data, projecting the mixed noise signals into high-dimensional space to obtain a deep feature matrix. Based on the feature matrix output by the encoder layer, the separation layer constructs a mask matrix for the target ship-radiated noise signal using a mixed convolution module. The corresponding elements of the mask matrix are then multiplied with the encoded matrix to obtain the ship-radiated noise separation features. Finally, the separation features are input into the decoder layer to achieve signal recovery through deconvolution. This ultimately results in obtaining the separated time-domain ship-radiated noise signal.
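As a concrete reference, the following PyTorch skeleton mirrors this three-stage pipeline (encoder, mask estimation, decoder). The layer sizes and the single-convolution separator are simplifying assumptions for illustration; the paper's separation layer is the parallel dilated group-convolution stack described in Section 2.4.

```python
import torch
import torch.nn as nn

class TimeDomainSeparator(nn.Module):
    def __init__(self, n_kernels=512, kernel_len=16, stride=8, n_sources=2):
        super().__init__()
        self.n_sources = n_sources
        # Encoder: 1D convolution projecting the waveform into a feature matrix A.
        self.encoder = nn.Conv1d(1, n_kernels, kernel_len, stride=stride, bias=False)
        # Separation-layer stand-in: estimates one mask per source from A.
        self.separator = nn.Conv1d(n_kernels, n_kernels * n_sources, 1)
        # Decoder: transposed 1D convolution restoring the time-domain waveform.
        self.decoder = nn.ConvTranspose1d(n_kernels, 1, kernel_len, stride=stride, bias=False)

    def forward(self, mixture):                 # mixture: (batch, 1, samples)
        A = torch.relu(self.encoder(mixture))   # non-negative feature matrix A
        masks = torch.sigmoid(self.separator(A)).chunk(self.n_sources, dim=1)
        # Element-wise masking of A, then decoding, for each source.
        return [self.decoder(A * m) for m in masks]
```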

2.3. Encoder Layer and Decoder Layer

The proposed network is end-to-end: its input is the original time-domain signal of the ship-radiated noise, and its output is the separated time-domain signal. The mixed ship-radiated noise signal is first processed by the encoder layer, and the temporal signal is later restored by the decoder layer. In contrast to the time–frequency mask construction used in other networks, there is no need to apply a short-time Fourier transform to obtain a time–frequency spectrum. Instead, all operations are performed directly on the time-domain waveform, as illustrated in Figure 2.
The encoder layer extracts high-dimensional features from the mixed ship-radiated noise signal:

$$\mathbf{A} = H(m(t)) \quad (2)$$

where $H(\cdot)$ denotes the process by which the encoder layer extracts high-dimensional features from the mixed ship-radiated noise, $m(t)$ is the mixed ship-radiated noise, and $\mathbf{A}$ is the feature matrix output by the encoder layer.
Encoding is accomplished with the one-dimensional convolution widely used in deep learning, which slides a convolution kernel over the ship-radiated noise signal to extract features. Assume a ship-radiated noise signal of length $L_n$, and let the length and stride of the 1D convolutional kernel in the encoder layer be $L_k$ and $S$, respectively. Sliding the convolution kernel over the signal yields a feature row vector of length

$$L_f = \frac{L_n - L_k}{S} + 1 \quad (3)$$

When the number of convolutional kernels is $K$, the size of the output feature matrix $\mathbf{A}$ is $K \times L_f$. Furthermore, a rectified linear unit (ReLU) is applied to the output matrix to ensure non-negativity.
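The relation in Equation (3) can be checked directly against a framework implementation; the sample values below ($L_n = 24000$, $L_k = 16$, $S = 8$, $K = 512$) are illustrative assumptions.

```python
import torch
import torch.nn as nn

L_n, L_k, S, K = 24000, 16, 8, 512
encoder = nn.Conv1d(1, K, kernel_size=L_k, stride=S, bias=False)
A = torch.relu(encoder(torch.randn(1, 1, L_n)))  # non-negative feature matrix A
L_f = (L_n - L_k) // S + 1                       # Equation (3)
assert A.shape == (1, K, L_f)                    # A is K x L_f per batch element
```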
After the encoding operation, a mask matrix for the ship-radiated noise is constructed in the separation layer based on high-dimensional features:
$$\mathbf{m}_i = M(\mathbf{A}) \quad (4)$$

where $M(\cdot)$ denotes the process by which the separation layer constructs the separation mask of a ship-radiated noise signal, $\mathbf{m}_i$ is the constructed mask matrix of each ship-radiated noise signal, and $i = 1, 2$.
The encoder layer produces the feature matrix $\mathbf{A}$, which is then multiplied element-wise with the mask matrix $\mathbf{m}_i$ from the separation layer to obtain the intermediate matrix $\mathbf{I}_i$:

$$\mathbf{I}_i = \mathbf{A} \odot \mathbf{m}_i \quad (5)$$

where $i = 1, 2$ and $\odot$ denotes element-wise multiplication. The decoder layer then reduces the dimensionality of the intermediate matrix $\mathbf{I}_i$ through transposed convolution to obtain the separated ship-radiated noise signal:

$$\tilde{n}_i(t) = \bar{C}(\mathbf{I}_i) \quad (6)$$

where $\bar{C}(\cdot)$ is the deconvolution operation applied by the decoder layer, $\tilde{n}_i(t)$ is the separated ship-radiated noise, and $i = 1, 2$. The decoder layer comprises a one-dimensional transposed convolution, with padding chosen so that the separated ship-radiated noise signal has the same dimension as the input mixed ship-radiated noise signal.
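The length bookkeeping of Equations (5) and (6) can be verified with a short sketch; the mask here is random, and the sizes reuse the earlier illustrative values.

```python
import torch
import torch.nn as nn

L_n, L_k, S, K = 24000, 16, 8, 512
L_f = (L_n - L_k) // S + 1
A = torch.relu(torch.randn(1, K, L_f))       # stand-in for the encoder output
mask = torch.sigmoid(torch.randn_like(A))    # stand-in for a separation mask m_i
I = A * mask                                 # element-wise masking, Equation (5)
decoder = nn.ConvTranspose1d(K, 1, kernel_size=L_k, stride=S, bias=False)
n_hat = decoder(I)                           # Equation (6)
assert n_hat.shape[-1] == L_n                # separated waveform matches the input length
```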

2.4. Separation Layer Containing Parallel Dilated Convolution and Group Convolution

The separation layer plays a crucial role in constructing the separation mask matrix $\mathbf{m}_i$. It extracts deep features of the ship-radiated noise and estimates the separation mask matrix through convolution operations. To capture more detailed time-domain information during convolution, the separation layer is built around dilated convolutions. Dilated convolution [36,37,38] expands the effective size of the convolution kernel by setting an expansion factor. To improve the processing accuracy for ship-radiated noise with low-frequency characteristics, the convolution operation must extract the envelope information; dilated convolution widens the convolution range and increases the length of the sampled frame in the time domain. This effect can be further enhanced by placing dilated convolution modules in parallel, so we introduce parallel dilated convolutions [39,40] in the basic convolution unit.
Parallel dilated convolution effectively lengthens the convolution kernel, allowing the convolution to span longer time windows so as to handle low-frequency content while still extracting detailed information. However, it also increases the number of convolutions and thus the network size. To address this, we introduce group convolution [41,42,43], which divides the features into a predetermined number of groups along the channel dimension. Each group is convolved over an equal number of channels, reducing the number of convolution kernel parameters and keeping the network small while preserving performance. Figure 3 illustrates the proposed separation layer structure.
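The parameter saving from group convolution is easy to quantify: with $U$ groups, each kernel spans only $1/U$ of the input channels, so the weight count drops by a factor of $U$. The channel counts below are illustrative assumptions.

```python
import torch.nn as nn

C_in, C_out, k, U = 512, 512, 3, 8
standard = nn.Conv1d(C_in, C_out, k, bias=False)
grouped = nn.Conv1d(C_in, C_out, k, groups=U, bias=False)
n_std = sum(p.numel() for p in standard.parameters())  # 512 * 512 * 3 = 786,432
n_grp = sum(p.numel() for p in grouped.parameters())   # 512 * (512 / 8) * 3 = 98,304
print(n_std // n_grp)                                  # parameter count drops by a factor of U = 8
```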
The separation layer first normalizes the feature matrix $\mathbf{A}$ output by the encoder layer. A bottleneck layer then changes the number of channels of $\mathbf{A}$ via a $1 \times 1$ convolution with $K_1$ kernels of size 1.

Afterwards, for feature extraction, the data pass through $B$ convolutional blocks, each composed of $u$ convolutional units, and the skip-connection outputs are summed. The sum is passed through a non-linear activation, a $1 \times 1$ convolution with $K_2$ kernels of size 1, and another non-linear activation, yielding the separation mask matrix $\mathbf{m}_i$. The ReLU is used as the non-linear activation function in this module.
In the separation layer, to enhance the precision of feature extraction while minimizing the number of network parameters, we propose a parallel dilated group convolutional unit, as illustrated in Figure 4. To regularize the dimensions and ranges, the input data are passed through a $1 \times 1$ convolution (kernel size 1, $K_a$ kernels), followed by a non-linear activation function and global normalization, producing an output $\mathbf{V} \in \mathbb{R}^{K_a \times L_f}$. $\mathbf{V}$ is then split along the channel dimension into two parts, $\mathbf{V}_1 \in \mathbb{R}^{K_a/2 \times L_f}$ and $\mathbf{V}_2 \in \mathbb{R}^{K_a/2 \times L_f}$, which enter the subsequent dilated convolution operations in parallel. Because of the large parameter count of deep networks, overfitting often occurs during training, and training speed suffers. To reduce the network parameters without harming feature extraction, we use group convolution in the two parallel dilated convolution operations: in each, the input data are divided into $U$ groups, $\mathbf{V}_{11}, \mathbf{V}_{12}, \ldots, \mathbf{V}_{1U} \in \mathbb{R}^{\frac{K_a}{2U} \times L_f}$ and $\mathbf{V}_{21}, \mathbf{V}_{22}, \ldots, \mathbf{V}_{2U} \in \mathbb{R}^{\frac{K_a}{2U} \times L_f}$, and dilated convolution is performed on each group:

$$\mathbf{Y}_{ij} = D(\mathbf{V}_{ij}), \quad i = 1, 2; \; j = 1, 2, \ldots, U \quad (7)$$

where $D(\cdot)$ denotes the dilated convolution.

Here, dilated convolution can be seen as a generalization of one-dimensional convolution in which zero elements are inserted between the elements of the convolution kernel to “expand” it. This widens the range covered by a single convolution and thereby captures more global waveform information. If the dilation rate $d$ is used to expand the kernel, then $d - 1$ zeros are inserted between kernel elements. We perform two parallel dilated convolution operations: the first path keeps a fixed dilation factor $d_1 = 2^0 = 1$, i.e., no zeros are inserted, while the dilation factor of the second path is $d_2 = 2^i$, $i = 1, 2, \ldots, u$, according to the ordinal number of the convolution unit within the block. The results of the dilated convolutions of all groups are merged, and concatenating the outputs of the two parallel paths yields a feature matrix with the same dimensions as the original data.
The output of each convolutional unit includes two parts: residual path output and skip connection path output. The residual path output enters the next convolutional unit. After the feature extraction of all convolutional units is completed, the skip connection path outputs of all units are added together as the total output of the convolutional module to improve the richness of the extracted features. Based on this, a separation mask matrix can be obtained.
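A sketch of one such unit is given below, under stated assumptions: the channel counts follow Table 1, global normalization is implemented as a single-group GroupNorm, and the activation and padding choices are illustrative rather than a faithful reproduction of the paper's unit.

```python
import torch
import torch.nn as nn

class ParallelDilatedGroupUnit(nn.Module):
    def __init__(self, in_ch=128, hid_ch=512, kernel=3, groups=8, dilation=2):
        super().__init__()
        # 1x1 convolution, non-linear activation, and global normalization.
        self.entry = nn.Sequential(
            nn.Conv1d(in_ch, hid_ch, 1), nn.ReLU(), nn.GroupNorm(1, hid_ch))
        half = hid_ch // 2
        # Path 1: fixed dilation d1 = 1; path 2: dilation d2 = 2^i for unit i.
        self.path1 = nn.Conv1d(half, half, kernel, padding=(kernel - 1) // 2,
                               groups=groups)
        self.path2 = nn.Conv1d(half, half, kernel, padding=dilation * (kernel - 1) // 2,
                               dilation=dilation, groups=groups)
        self.res_out = nn.Conv1d(hid_ch, in_ch, 1)   # residual path output
        self.skip_out = nn.Conv1d(hid_ch, in_ch, 1)  # skip-connection path output

    def forward(self, x):
        v = self.entry(x)
        v1, v2 = v.chunk(2, dim=1)  # split V into V1, V2 along the channel dimension
        y = torch.cat([self.path1(v1), self.path2(v2)], dim=1)
        return x + self.res_out(y), self.skip_out(y)
```

In a full separation layer, the residual output feeds the next unit while the skip outputs of all units are summed, as described above.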

2.5. Training Objective

The training process uses the scale-invariant signal-to-noise ratio (SI-SNR) as the training objective [15], which is defined as follows:

$$n_{\mathrm{target}} = \frac{\langle \tilde{n}, n \rangle \, n}{\|n\|^2}, \quad e = \tilde{n} - n_{\mathrm{target}}, \quad \mathrm{SI\text{-}SNR} = 10 \log_{10} \frac{\|n_{\mathrm{target}}\|^2}{\|e\|^2} \quad (8)$$

where $n$ represents the ground truth, $\tilde{n}$ represents the separated ship-radiated noise signal, and $\|n\|^2 = \langle n, n \rangle$ denotes the signal power.
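A minimal PyTorch computation of Equation (8) is sketched below, assuming zero-mean 1-D tensors of equal length; training would minimize the negative SI-SNR.

```python
import torch

def si_snr(est, ref, eps=1e-8):
    # Project the estimate onto the reference to get the scaled target (Eq. 8).
    target = (torch.dot(est, ref) / (torch.dot(ref, ref) + eps)) * ref
    noise = est - target
    return 10 * torch.log10(target.pow(2).sum() / (noise.pow(2).sum() + eps))

# Example training loss: loss = -si_snr(separated, ground_truth)
```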

3. Evaluation and Validation

3.1. Dataset and Parameter Settings

In our evaluation, the dataset derives from the DeepShip dataset [44], which consists of recordings of underwater ship-radiated noise captured in a real oceanic environment. The dataset contains underwater ship-radiated noise signals from four kinds of ships, namely tankers, tugs, passenger ships, and cargo ships, with a total length of 33.34 hours.
We divided all ship-radiated noise signals into pieces with a duration of 3 s. Two random pieces of different signal types were then selected and mixed into a new signal with a signal-to-noise ratio (SNR) in the range of −5 to 5 dB, yielding a dataset of 5000 mixed signals. The dataset was divided into three parts with a ratio of 7:2:1, namely the training set, the validation set, and the test set. The training set was used for model training and fitting, determining parameters such as the weights and biases of the model. The validation set was used to adjust the hyperparameters of the model and conduct a preliminary assessment of its performance. The test set was used for model evaluation. The ship-radiated noise in this dataset was acquired in an oceanic region at some distance between the acquisition device and the ship, so the collected signals already contain channel information. Mixing two such signals additively therefore accounts for channel factors and reflects the complexity of different signals in the separation problem. Furthermore, the additive approach is commonly employed in constructing mixed signals and helps determine whether the features of the separated signals can be effectively extracted. Therefore, in this experiment, an additive method was used to construct the mixed ship-radiated noise signals.
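The mixing step can be sketched as follows; the scaling formula sets the power ratio of the two clips to the drawn SNR, and the clip-selection logic is an assumption for illustration.

```python
import numpy as np

def mix_at_snr(sig1, sig2, snr_db):
    # Scale sig2 so that sig1 is snr_db louder than sig2 in average power.
    p1, p2 = np.mean(sig1 ** 2), np.mean(sig2 ** 2)
    scale = np.sqrt(p1 / (p2 * 10 ** (snr_db / 10)))
    return sig1 + scale * sig2

rng = np.random.default_rng(0)
snr_db = rng.uniform(-5, 5)  # random mixing SNR in [-5, 5] dB per example
# mixture = mix_at_snr(clip_tug, clip_tanker, snr_db)  # clips: hypothetical 3 s segments
```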
In the network training, the number of epochs is 100, the initial learning rate is 0.001, and the network is trained using the Adam optimizer. During the network training, if the loss value of the validation set does not improve within three epochs, the learning rate will automatically decrease by half.
Figure 5 provides a detailed view of the model’s performance at each epoch. The training and validation loss curves follow similar trajectories, indicating a good fit. The proposed network converges quickly and maintains consistent convergence behavior on the training and validation sets, showing that it effectively avoids overfitting and underfitting. The model parameter configuration is shown in Table 1.

3.2. Evaluation Metrics and Comparison Models

The performance of the network is evaluated using various metrics. We use signal-to-noise ratio (SNR) as one of the objective measures:
$$\mathrm{SNR} = 10 \log_{10} \frac{\|n\|^2}{\|n - \tilde{n}\|^2} \quad (9)$$

where $n$ represents the ground truth and $\tilde{n}$ represents the separated ship-radiated noise signal. We also report the scale-invariant signal-to-noise ratio improvement (SI-SNRi) and the signal-to-noise ratio improvement (SNRi) [45,46] to evaluate separation accuracy. The segmental SNR (SegSNR) is used to evaluate frame-level separation accuracy: the separated ship-radiated noise signal is divided into frames, the SNR of each frame is calculated separately, and the average over all frames is taken.
$$\mathrm{SegSNR} = \frac{1}{f_l} \sum_{i=1}^{f_l} \mathrm{SNR}_{\mathrm{frame}}(i) \quad (10)$$

where $f_l$ denotes the number of frames and $\mathrm{SNR}_{\mathrm{frame}}(i)$ denotes the SNR value of each frame, which can be expressed as

$$\mathrm{SNR}_{\mathrm{frame}}(i) = 10 \log_{10} \frac{\sum_{j=0}^{M_s - 1} n^2(i M_s - j)}{\sum_{j=0}^{M_s - 1} \left[ \tilde{n}(i M_s - j) - n(i M_s - j) \right]^2} \quad (11)$$

where $M_s$ represents the number of samples per frame.
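The following is a minimal SegSNR sketch following Equations (10) and (11); the frame length is an illustrative assumption, and only full frames are scored.

```python
import numpy as np

def seg_snr(ref, est, frame_len=512, eps=1e-12):
    n_frames = len(ref) // frame_len
    scores = []
    for i in range(n_frames):
        r = ref[i * frame_len:(i + 1) * frame_len]
        e = est[i * frame_len:(i + 1) * frame_len]
        # Per-frame SNR, Equation (11).
        scores.append(10 * np.log10(np.sum(r ** 2) / (np.sum((e - r) ** 2) + eps) + eps))
    # Average over all frames, Equation (10).
    return np.mean(scores)
```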
On the same dataset, we compared against UNet [21] and Res-UNet [23], both recently proposed models that have achieved satisfactory results. All models were trained with the same settings, including the optimizer, the number of epochs, and the initial learning rate, and all of them converged.

3.3. Validation and Performance Analysis

To determine the roles of parallel dilated convolution and group convolution, we conducted ablation experiments using the four objective evaluation metrics described in Section 3.2; the results are shown in Table 2. Adding parallel dilated convolution improves the separation results, while the impact of group convolution on separation performance is minimal. However, group convolution reduces the model size from 4.43 M to 3.67 M and the GFLOPs from 21.1 to 19.8, demonstrating that it effectively reduces the size and computational complexity of the network.
To evaluate the proposed method against other common methods, a comparison was conducted with the methods described in the previous section. Table 3 summarizes the objective evaluation results; all scores are averages over the 500 test mixtures from the DeepShip dataset. The separation performance of the different models was examined using SNR, SegSNR, SNRi, and SI-SNRi. The proposed method outperforms both Res-UNet and UNet in terms of SNR, and its SegSNR scores are more than 1 dB higher than those of Res-UNet and UNet. The SNRi scores of Res-UNet and UNet are similar, but the proposed method achieves higher scores, and its SI-SNRi scores are markedly higher than those of Res-UNet and UNet, further confirming its superiority.
Table 4 shows the model size and arithmetic complexity of each method. Res-UNet has the largest model size, followed by UNet, and the proposed method has the smallest model size. Similarly, the proposed method has the lowest computational complexity.
Figure 6 shows the time–frequency plots, obtained by short-time Fourier transform (STFT), of the mixture, the ground truth, and the ship-radiated noise separated by the different models. The mixture combines a tug and a passenger ship from the DeepShip dataset. In Figure 6g–i, the separated passenger ship signals obtained by the different models all differ little from the ground truth. However, in the tug signals separated by Res-UNet and UNet, some residual components of the passenger ship signal remain. In Figure 6c, the tug signal separated by Res-UNet is stronger than the ground truth at approximately 2–5 kHz, and in Figure 6d, the result from UNet is stronger than the ground truth at approximately 2–4 kHz. In addition, the Res-UNet result lacks spectral energy at higher frequencies, approximately 8–14 kHz. In contrast, the result obtained by the proposed method in Figure 6e differs little from the ground truth. This confirms that the proposed method separates a mixture of two ship-radiated noise sources well when both separation results are expected to be satisfactory.
Figure 7 shows the generated masks in the separation process. The generated masks are weight matrices with the same dimensions and shapes as the time-domain features output by the encoder. The basis functions show the diversity of frequency and phase tuning. The encoder extracts time-domain features from the original time-domain signal. These features are then multiplied by the generated masks, which are used to select and emphasize the features that are relevant to a particular source. This process enables the separation of different sources. Consequently, the physical meaning of masks can be understood as a weighting of time-domain features.
Figure 8 displays the scatter plots of SNR, SegSNR, SNRi, and SI-SNRi for different methods, where color indicates density. Generally speaking, SNR and SegSNR scores are higher when the mixture SNR is higher. The SNRi scores are most concentrated at a mixture SNR of 0 dB. In Figure 8i, there is minimal scatter at scores below 0 dB for SNR, while in Figure 8a,e, there is some scatter. Similarly, in Figure 8j, there is minimal scatter at scores below 0 dB for SegSNR compared to Figure 8b,f. In Figure 8k, there is minimal scatter at scores below 10 dB for SNRi compared to Figure 8c,g. This indicates that the proposed method does not exhibit poor separation performance for certain samples. As for the SI-SNRi scores, the scatter in Figure 8l is more dispersed than the others, but most of these dispersed scatters have higher SI-SNRi scores, which leads to an increase in the average SI-SNRi score compared to the other methods. In addition, this indicates that the proposed method has better adaptability to certain samples. Compared to the two other models, the proposed method not only improves the average evaluation score but also reduces outlier cases, for example, test samples with SNR, SegSNR, and SNRi scores far from the dense central region.
Figure 9 displays the quantile–quantile plots of SNR, SegSNR, SNRi, and SI-SNRi for the different methods. In Figure 9i, over the mixture SNR range of −4 to 4 dB, the SNR score distribution follows a straight line, indicating that scores in this range conform well to the reference theoretical distribution. Over the mixture SNR range of −5 to −4 dB, the SNR score distribution falls below this line, meaning that the sample points have lower SNR scores than the theoretical distribution. In Figure 9a,e, however, the distributions fall below the line over a wider range and are more dispersed than in Figure 9i, indicating that the proposed method performs better than the other methods at low mixture SNR. Comparing Figure 9b,f,j, Figure 9c,g,k, and Figure 9d,h,l yields a similar view. In addition, in Figure 9l, the SI-SNRi scores have a higher distribution over the mixture SNR range of 4 to 5 dB, indicating that the proposed method also performs better at high mixture SNR.
Overall, the results indicate that the reference networks have good enhancement performance but fall short of the accuracy of the method proposed here. This difference may be related to the datasets the networks were designed around: both Res-UNet and UNet were originally developed on the MUSDB18 dataset, which contains separate vocals, accompaniment, bass, drums, and other instruments, whereas this paper trains all models on the DeepShip ship-radiated noise dataset described in Section 3.1. This change of domain may account for some of the differences in model performance.

4. Conclusions

To address the issue of separating ship-radiated noise in the underwater environment, we introduce parallel dilated convolution and group convolution to the classical Conv-TasNet. The parallel dilated convolution allows for longer time signal processing, enhances the processing of low-frequency noise, and reduces the loss of local information by using two-way dilated convolution operations with different expansion factors. Furthermore, group convolution addresses the disadvantage of network-scale expansion brought about by parallel dilated convolution.
The primary focus of our research has been the mixture of two types of ship-radiated noise. If a third signal, such as white noise, is present, the mixture can first be processed with a denoising algorithm and then separated. If the third signal originates from another ship, a new network model is required that constructs three separation masks while using the same basic network framework.
In summary, the proposed method maintains a high separation effect while guaranteeing a smaller network size. It has the potential to be an alternative method for engineering applications in the underwater environment.

Author Contributions

Q.H.: writing—original draft, methodology, and validation; H.W.: software; X.Z.: writing—review and editing; A.J.: visualization. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (grant No. 12074317).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Chen, Z.; Wang, R.; Yin, F.; Wang, B.; Peng, W. Speech dereverberation method based on spectral subtraction and spectral line enhancement. Appl. Acoust. 2016, 112, 201–210. [Google Scholar] [CrossRef]
  2. Xiao, K.; Wang, S.; Wan, M.; Wu, L. Radiated noise suppression for electrolarynx speech based on multiband time-domain amplitude modulation. IEEE/ACM Trans. Audio Speech Lang. Process. 2018, 26, 1585–1593. [Google Scholar] [CrossRef]
  3. Chen, J.; Benesty, J.; Huang, Y.; Doclo, S. New insights into the noise reduction Wiener filter. IEEE Trans. Audio Speech Lang. Process. 2006, 14, 1218–1234. [Google Scholar] [CrossRef]
  4. Erçelebi, E. Speech enhancement based on the discrete Gabor transform and multi-notch adaptive digital filters. Appl. Acoust. 2004, 65, 739–762. [Google Scholar] [CrossRef]
  5. Sayoud, A.; Djendi, M.; Medahi, S.; Guessoum, A. A dual fast NLMS adaptive filtering algorithm for blind speech quality enhancement. Appl. Acoust. 2018, 135, 101–110. [Google Scholar] [CrossRef]
  6. Surendran, S.; Kumar, T.K. Oblique Projection and Cepstral Subtraction in Signal Subspace Speech Enhancement for Colored Noise Reduction. IEEE/ACM Trans. Audio Speech Lang. Process. 2018, 26, 2328–2340. [Google Scholar] [CrossRef]
  7. Fattorini, M.; Brandini, C. Observation strategies based on singular value decomposition for ocean analysis and forecast. Water 2020, 12, 3445. [Google Scholar] [CrossRef]
  8. Zhao, S.X.; Ma, L.S.; Xu, L.Y.; Liu, M.N.; Chen, X.L. A Study of Fault Signal Noise Reduction Based on Improved CEEMDAN-SVD. Appl. Sci. 2023, 13, 10713. [Google Scholar] [CrossRef]
  9. Zhao, X.Z.; Nie, Z.G.; Ye, B.Y.; Chen, T.J. Number law of effective singular values of signal and its application to feature extraction. J. Vib. Eng. 2016, 29, 532–541. [Google Scholar] [CrossRef]
  10. Zou, H.; Xue, L. A selective overview of sparse principal component analysis. Proc. IEEE 2018, 106, 1311–1320. [Google Scholar] [CrossRef]
  11. Hao, J.; Lee, I.; Lee, T.W.; Sejnowski, T.J. Independent Vector Analysis for Source Separation Using a Mixture of Gaussians Prior. Neural Comput. 2010, 22, 1646–1673. [Google Scholar] [CrossRef]
  12. Ikeshita, R.; Nakatani, T. Independent Vector Extraction for Fast Joint Blind Source Separation and Dereverberation. IEEE Signal Process. Lett. 2021, 28, 972–976. [Google Scholar] [CrossRef]
  13. Gaeta, M.; Briolle, F.; Esparcieux, P. Blind separation of sources applied to convolutive mixtures in shallow water. In Proceedings of the IEEE Signal Processing Workshop on Higher-Order Statistics, Banff, AB, Canada, 21–23 July 1997; pp. 340–343. [Google Scholar] [CrossRef]
  14. Kirsteins, I.P. Blind separation of signal and multipath interference for synthetic aperture sonar. In Proceedings of the Oceans 2003. Celebrating the Past… Teaming Toward the Future (IEEE Cat. No. 03CH37492), San Diego, CA, USA, 22–26 September 2003; pp. 2641–2648. [Google Scholar] [CrossRef]
  15. Mansour, A.; Benchekroun, N.; Gervaise, C. Blind Separation of Underwater Acoustic Signals. In Proceedings of the International Conference on Independent Component Analysis and Blind Signal Separation: 6th International Conference, Charleston, SC, USA, 5–8 March 2006; pp. 181–188. [Google Scholar] [CrossRef]
  16. Kamal, S.; Supriya, M.H.; Pillai, P.R.S. Blind source separation of nonlinearly mixed ocean acoustic signals using Slow Feature Analysis. In Proceedings of the OCEANS 2011 IEEE-Spain, Santander, Spain, 6–9 June 2011; pp. 1–7. [Google Scholar] [CrossRef]
  17. Zhang, X.; Fan, W.; Xia, Z.; Kang, C. Tow ship interference cancelling based on blind source separation algorithm. In Proceedings of the International Conference on Awareness Science & Technology, Dalian, China, 27–30 September 2011; pp. 465–468. [Google Scholar] [CrossRef]
  18. Tu, S.; Chen, H. Blind Source Separation of Underwater Acoustic Signal by Use of Negentropy-Based Fast ICA Algorithm. In Proceedings of the IEEE International Conference on Computational Intelligence and Communication Technology, Ghaziabad, India, 13–14 February 2015; pp. 608–611. [Google Scholar] [CrossRef]
  19. Li, G.; Dou, M.; Zhang, L.; Wang, H. Underwater Near Field Sources Separation and Tracking with Hydrophone Array Based on Spatial Filter. In Proceedings of the Chinese Automation Congress (CAC), Jinan, China, 20–22 October 2017; pp. 5274–5278. [Google Scholar] [CrossRef]
  20. Park, S.R.; Lee, J.W. A fully convolutional neural network for speech enhancement. In Proceedings of the International Speech Communication Association (INTERSPEECH 2017), Stockholm, Sweden, 20–24 August 2017; pp. 1465–1468. [Google Scholar] [CrossRef]
  21. Jansson, A.; Humphrey, E.; Montecchio, N.; Bittner, R.; Kumar, A.; Weyde, T. Singing voice separation with deep u-net convolutional networks. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR 2017), Suzhou, China, 23–27 October 2017; pp. 745–751. [Google Scholar]
  22. Choi, H.S.; Kim, J.H.; Huh, J.; Kim, A.; Ha, J.W.; Lee, K. Phase-Aware Speech Enhancement with Deep Complex U-Net. In Proceedings of the International Conference on Learning Representations (ICLR 2019), New Orleans, LA, USA, 6–9 May 2019. [Google Scholar] [CrossRef]
  23. Kong, Q.; Cao, Y.; Liu, H.; Choi, K. Decoupling Magnitude and Phase Estimation with Deep ResUNet for Music Source Separation. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR 2021), Virtual, 7–12 November 2021; pp. 342–349. [Google Scholar] [CrossRef]
  24. Isik, Y.Z.; Roux, J.L.; Chen, Z.; Watanabe, S.; Hershey, J.R. Single-Channel Multi-Speaker Separation Using Deep Clustering. In Proceedings of the International Speech Communication Association (INTERSPEECH 2016), San Francisco, CA, USA, 8–16 September 2016; pp. 545–549. [Google Scholar] [CrossRef]
  25. Chen, J.; Wang, D. Long short-term memory for speaker generalization in supervised speech separation. J. Acoust. Soc. Am. 2017, 141, 4705–4714. [Google Scholar] [CrossRef] [PubMed]
  26. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
  27. Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Nets. Adv. Neural Inf. Process. Syst. 2014, 27, 1–9. [Google Scholar] [CrossRef]
  28. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar] [CrossRef]
  29. Liu, Y.Z.; Wang, D.L. Divide and Conquer: A Deep CASA Approach to Talker-Independent Monaural Speaker Separation. IEEE/ACM Trans. Audio Speech Lang. Process. 2019, 27, 2092–2102. [Google Scholar] [CrossRef] [PubMed]
  30. Šarić, Z.; Subotić, M.; Bilibajkić, R.; Barjaktarović, M.; Stojanović, J. Supervised speech separation combined with adaptive beamforming. Comput. Speech Lang. 2022, 76, 101419. [Google Scholar] [CrossRef]
  31. Tan, K.; Chen, J.; Wang, D. Gated Residual Networks with Dilated Convolutions for Monaural Speech Enhancement. IEEE/ACM Trans. Audio Speech Lang. Process. 2019, 27, 189–198. [Google Scholar] [CrossRef] [PubMed]
  32. Luo, Y.; Mesgarani, N. TasNet: Time-domain audio separation network for real-time, single-channel speech separation. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2018), Calgary, AB, Canada, 15–20 April 2018; pp. 696–700. [Google Scholar] [CrossRef]
  33. Luo, Y.; Mesgarani, N. Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation. IEEE/ACM Trans. Audio Speech Lang. Process. 2019, 27, 1256–1266. [Google Scholar] [CrossRef]
  34. Urick, R.J. Principles of Underwater Sound, 3rd ed.; McGraw-Hill Book Company: New York, NY, USA, 1983. [Google Scholar]
  35. Purushothaman, A.; Sreeram, A.; Kumar, R.; Ganapathy, S. Dereverberation of autoregressive envelopes for far-field speech recognition. Comput. Speech Lang. 2022, 72, 101277. [Google Scholar] [CrossRef]
  36. Lei, X.; Pan, H.; Huang, X. A Dilated CNN Model for Image Classification. IEEE Access 2019, 7, 124087–124095. [Google Scholar] [CrossRef]
  37. Zhang, Z.; Wang, X.; Jung, C. DCSR: Dilated Convolutions for Single Image Super-Resolution. IEEE Trans. Image Process. 2019, 28, 1625–1635. [Google Scholar] [CrossRef] [PubMed]
  38. Ren, Z.; Kong, Q.; Han, J.; Plumbley, M.D.; Schuller, B.W. Attention-Based Atrous Convolutional Neural Networks: Visualisation and Understanding Perspectives of Acoustic Scenes. In Proceedings of the 44th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019. [Google Scholar] [CrossRef]
  39. Chen, L.C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking Atrous Convolution for Semantic Image Segmentation. arXiv 2017, arXiv:1706.05587. [Google Scholar] [CrossRef]
  40. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818. [Google Scholar] [CrossRef]
  41. Ni, J.; Gao, J.; Li, J.; Yang, H.; Hao, Z.; Han, Z. E-AlexNet: Quality evaluation of strawberry based on machine learning. J. Food Meas. Charact. 2021, 15, 4530–4541. [Google Scholar] [CrossRef]
  42. Lee, Y.; Park, J.; Lee, C.O. Two-level group convolution. Neural Netw. 2022, 154, 323–332. [Google Scholar] [CrossRef] [PubMed]
  43. Mirchandani, G.; Foote, R.; Rockmore, D.N.; Healy, D.; Olson, T. A wreath product group approach to signal and image processing. II. Convolution, correlation, and applications. IEEE Trans. Signal Process. 2000, 48, 749–767. [Google Scholar] [CrossRef]
  44. Irfan, M.; Jiangbin, Z.; Ali, S.; Iqbal, M.; Masood, Z.; Hamid, U. DeepShip: An underwater acoustic benchmark dataset and a separable convolution based autoencoder for classification. Expert Syst. Appl. 2021, 183, 115270. [Google Scholar] [CrossRef]
  45. Vincent, E.; Gribonval, R.; Févotte, C. Performance measurement in blind audio source separation. IEEE Trans. Audio Speech Lang. Process. 2006, 14, 1462–1469. [Google Scholar] [CrossRef]
  46. Taal, C.H.; Hendriks, R.C.; Heusdens, R.; Jensen, J. An evaluation of objective measures for intelligibility prediction of time-frequency weighted noisy speech. J. Acoust. Soc. Am. 2011, 130, 3013–3027. [Google Scholar] [CrossRef]
Figure 1. The framework of the network.
Figure 2. The process of encoding and decoding: (a) the process of encoding; (b) the process of decoding.
Figure 3. The design of the separation layer.
Figure 4. The design of a parallel dilated group convolutional unit.
Figure 5. The performance of the model in each epoch.
Figure 6. The time–frequency plots of the mixture, the ground truth, and the separated results: (a) mixture; (b–e) the ground truth of the tug signal and the separated results obtained from Res-UNet, UNet, and the proposed method, respectively; (f–i) the ground truth of the passenger ship signal and the separated results obtained from Res-UNet, UNet, and the proposed method, respectively.
Figure 7. The generated masks in the separation process: (a) the generated mask used for separating the tug signal; (b) the generated mask used for separating the passenger ship signal.
Figure 8. Scatter plots of SNR, SegSNR, SNRi, and SI-SNRi for different methods in the DeepShip dataset: (a–d) Res-UNet; (e–h) UNet; (i–l) the proposed method.
Figure 9. Quantile–quantile plots of SNR, SegSNR, SNRi, and SI-SNRi for different methods in the DeepShip dataset; the red dotted line represents the direct correspondence between the two distributions: (a–d) Res-UNet; (e–h) UNet; (i–l) the proposed method.
Table 1. The model parameter configuration.

| Parameter | Parameter Description | Value |
| --- | --- | --- |
| $K$ | Number of convolutional kernels in the encoder layer | 512 |
| $L_k$ | Convolutional kernel size in the encoder layer | 16 |
| $K_1$ | Number of convolutional kernels in the separation layer | 128 |
| $K_2$ | Number of convolutional kernels in the separation layer (generating masks) | 512 |
| $B$ | Number of convolutional blocks in the separation layer | 3 |
| $u$ | Number of convolutional units in a convolutional block | 8 |
| $K_a$ | Number of input 1 × 1 convolutional kernels in a convolutional unit | 512 |
| $U$ | Number of groups in each dilated convolution of the convolutional unit | 8 |
| $P$ | Convolution kernel size for group convolution in a convolutional unit (dilated convolution) | 3 |
| $K_e$ | Number of residual-connection 1 × 1 convolutional kernels in a convolutional unit | 128 |
| $K_s$ | Number of skip-connection 1 × 1 convolutional kernels in a convolutional unit | 128 |
Table 2. Average SNR, SegSNR, SNRi, and SI-SNRi for Conv-TasNet, Conv-TasNet with parallel dilated convolution, Conv-TasNet with group convolution, and the proposed method in the DeepShip dataset.

| Methods | SNR (dB) | SegSNR (dB) | SNRi (dB) | SI-SNRi (dB) |
| --- | --- | --- | --- | --- |
| Conv-TasNet | 16.6928 | 17.4112 | 16.9565 | 1.9447 |
| Using parallel dilated convolution | 16.8952 | 17.6931 | 17.2388 | 2.6442 |
| Using group convolution | 16.7148 | 17.5242 | 16.9841 | 1.9843 |
| Proposed method | 16.8609 | 17.7508 | 17.3023 | 2.5261 |
Table 3. Average SNR, SegSNR, SNRi, and SI-SNRi for different models in the DeepShip dataset.

| Methods | SNR (dB) | SegSNR (dB) | SNRi (dB) | SI-SNRi (dB) |
| --- | --- | --- | --- | --- |
| Res-UNet | 15.3931 | 16.3755 | 15.5706 | 0.0788 |
| UNet | 15.6725 | 16.6569 | 15.8283 | 0.1509 |
| Proposed method | 16.8609 | 17.7508 | 17.3023 | 2.5261 |
Table 4. Model size and GFLOPs for different models.

| Methods | Model Size (M) | GFLOPs |
| --- | --- | --- |
| Res-UNet | 103.0 | 28.5 |
| UNet | 33.4 | 33.7 |
| Proposed method | 3.67 | 19.8 |