1. Introduction
As a key technology for making underwater acoustic equipment systems more intelligent, the recognition of underwater acoustic target-radiated noise is one of the most important research directions in underwater acoustic signal processing [1]. One reason is that underwater environments are complex and changeable, making it difficult to identify underwater objects [2,3]. In addition, although the number of water vessels has increased in recent years, the acquisition of large-scale labeled data has been hindered by confidentiality and cost. A further drawback concerns data quality: background noise and interference from other noise sources are inevitably present.
Generally, Underwater Acoustic Target Recognition (UATR) is performed by well-trained sonar operators, whose judgments can be inaccurate after long working hours and may be affected by weather conditions [4]. Hence, developing a robust recognition system to replace the human identification of ship-radiated noise is of great importance. From a technical perspective, efforts are consistently made to improve classification accuracy in two aspects: feature extraction and classifier training [5].
Much work attempts to extract hand-crafted features from ship-radiated noise and feed them into different kinds of classifiers. On the one hand, in traditional machine learning pipelines, Support Vector Machines (SVM) [6,7] and Principal Component Analysis (PCA) [8] are widely used. For example, Meng et al. propose a method that directly uses the wave structure with an SVM classifier [6]. Wei et al. [7] present an extraction method based on the 1½D spectrum and PCA. Features derived from the Mel filters of Mel Frequency Cepstral Coefficients (MFCC) and the Log-Mel Spectrogram (LM) are two widely used features in Environment Sound Classification (ESC) tasks [9,10], with acceptable performance. Although such features originate from the speech and sound fields, the effectiveness of MFCC and its first-order and second-order differential features has been proven for underwater acoustic target recognition [8]. Besides, a considerable number of studies indicate that fusion features give a more comprehensive representation of environmental sounds [11]. For the recognition of underwater acoustic targets, Meng et al. [6] exploit a fusion feature of zero-crossing wavelength, peak-to-peak amplitude, and zero-crossing-wavelength difference. On the other hand, the design of the neural network plays an important role in achieving competitive performance together with optimized feature extraction. For example, a time-delay neural network (TDNN) and a convolutional neural network (CNN) are introduced for UATR in [12]. Testolin et al. [13] present an innovative method that accurately detects and tracks underwater moving targets from the reflections of an active acoustic emitter; the system is based on a computationally and energy-efficient preprocessing stage carried out by a deep convolutional denoising autoencoder (CDA), whose output is fed to a probabilistic tracking method based on the Viterbi algorithm. Testolin et al. [14] also prove that transfer learning can be a viable approach in scenarios where tagged data are often lacking, and they evaluate the feasibility of Recurrent Neural Network (RNN) models for scenarios requiring online processing of the reflection sequence. Shen et al. [15] introduce auditory-inspired convolutional neural networks trained on raw underwater acoustic signals.
For ShipsEar [16], a machine learning method using a GMM-based classifier trained with the standard expectation maximization (EM) algorithm can serve as a baseline, whose best classification rate is 75.4%. As the results obtained by typical machine learning methods are not very high, methods based on deep learning models are worth investigating. Li et al. [5] introduce a feature optimization approach with Deep Neural Networks (DNN) and an optimized loss function and achieve an accuracy of 84%. Yang et al. [17] propose a so-called competitive Deep Belief Net (cDBN) for UATR. Luo et al. [18] present a UATR method based on the Restricted Boltzmann Machine (RBM), which achieves an accuracy of 93.17% on ShipsEar. Ke et al. [4] propose a novel recognition method with four steps, including preprocessing, pretraining, fine-tuning, and recognition, which achieves a recognition accuracy of 93.28%.
In this paper, we propose three-dimensional fusion features, combined with the SpecAugment data augmentation strategy and an 18-layer Residual Network (ResNet18) containing an embedding layer with the center loss function, to achieve good accuracy. The remaining parts of this paper are organized as follows. Section 2 introduces the framework of the classification method for UATR and describes the feature aggregation scheme of the three-dimensional fusion features and the 18-layer Residual Network (ResNet18). Section 3 presents the experimental results on ShipsEar with the proposed method and other methods. Section 4 gives the conclusion.
3. Experiments and Analysis
3.1. Dataset Description and Preparation
A detailed description of the ship-radiated noise dataset ShipsEar (available at http://atlanttic.uvigo.es/underwaternoise/), which contains a total of 91 records covering 11 vessel types and one background noise class, is presented in [16]. During 2012 and 2013, the researchers recorded the sounds of many different classes of ships on the Spanish Atlantic coast. The recordings were made with autonomous digitalHyd SR-1 acoustic recorders, manufactured by MarSensing Lda (Faro, Portugal). This compact recorder includes a hydrophone with a nominal sensitivity of −193.5 dB re 1 V/1 μPa and a flat response in the 1 Hz to 28 kHz frequency range.
To keep consistent with other classification methods, the 11 vessel types are merged into four experimental classes, and the background noise type forms a fifth class, as shown in Table 1. By truncating the original records, which were recorded at different sample rates, 1956 sound clips with a duration of 5 s each are obtained. During preprocessing, each sound clip is further separated into 41 frames with zero overlap.
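As a rough illustration of this preparation step, the truncation and framing could be implemented as below. This is a minimal sketch; the common resampling rate (16 kHz) and the helper name are assumptions, not values taken from the paper.

```python
import numpy as np
import librosa

TARGET_SR = 16000      # assumed common resampling rate; the records use various rates
CLIP_SECONDS = 5       # duration of each truncated clip
N_FRAMES = 41          # non-overlapping frames per clip

def clips_and_frames(wav_path):
    """Truncate one record into 5 s clips, then split each clip into 41 frames."""
    signal, _ = librosa.load(wav_path, sr=TARGET_SR)   # resample to a common rate
    clip_len = TARGET_SR * CLIP_SECONDS
    n_clips = len(signal) // clip_len                  # discard the trailing remainder
    clips = signal[: n_clips * clip_len].reshape(n_clips, clip_len)
    frame_len = clip_len // N_FRAMES                   # zero overlap between frames
    frames = clips[:, : N_FRAMES * frame_len].reshape(n_clips, N_FRAMES, frame_len)
    return clips, frames
```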
3.2. Experimental Result
The proposed method is verified on a computer with four Nvidia GeForce RTX 2080Ti GPUs and a Core i7-6900K CPU. The deep learning framework of the proposed model is implemented using Keras 2.2.4 with TensorFlow 1.12.0 as the backend.
When training the model, the batch size and the maximum number of epochs (each epoch includes one training cycle over all training data) are set to 128 and 200, respectively. To accelerate the training process, an early stopping strategy is adopted: training is stopped if the validation loss does not decrease by more than 0.00005 over 20 successive epochs. Besides, an adaptive learning rate strategy is adopted, where the initial value is 0.001 and the rate is reduced to 60% of its former value every 20 epochs. Using this strategy, the actual number of epochs is 88. As shown in Figure 5a, the training loss and validation loss decrease rapidly within about the first 20 epochs and then decrease gradually without overfitting, owing to the design of the adaptive learning rate. Meanwhile, Figure 5b shows that the accuracy improves at a varying speed: only small improvements are obtained after the turning point, and the best accuracy on the validation data is 0.946. Figure 5a,b thus describe the same training process.
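For concreteness, the training strategy described above maps onto standard Keras callbacks. The following is a hedged sketch rather than the authors' code; the model and data objects are placeholders.

```python
from tensorflow.keras.callbacks import EarlyStopping, LearningRateScheduler

# Stop when the validation loss has not improved by more than 5e-5
# for 20 successive epochs (the strategy described above).
early_stop = EarlyStopping(monitor="val_loss", min_delta=5e-5, patience=20)

BASE_LR = 1e-3

def schedule(epoch):
    # 0.001 initially, multiplied by 0.6 every 20 epochs.
    return BASE_LR * (0.6 ** (epoch // 20))

lr_schedule = LearningRateScheduler(schedule)

# model, x_train, y_train, x_val, y_val are placeholders for the
# ResNet18 model and the prepared feature tensors.
history = model.fit(x_train, y_train,
                    validation_data=(x_val, y_val),
                    batch_size=128, epochs=200,
                    callbacks=[early_stop, lr_schedule])
```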
We evaluate our ResNet18 model with the three-dimensional feature. Twenty percent of the data is used as the validation set and 10% as the test set, while the network is trained on the remaining 70% of the dataset with augmentation.
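The augmentation follows SpecAugment's idea of masking random frequency and time bands in the spectrogram-like input. A minimal NumPy sketch is given below, where the maximum mask widths are illustrative assumptions rather than the paper's settings.

```python
import numpy as np

def spec_augment(feature, max_freq_mask=8, max_time_mask=8, rng=None):
    """Zero out one random frequency band and one random time band.

    feature: 2-D array of shape (time, frequency), e.g., a log-Mel map.
    The maximum mask widths are illustrative, not the paper's values.
    """
    rng = rng or np.random.default_rng()
    out = feature.copy()
    n_t, n_f = out.shape

    f = rng.integers(0, max_freq_mask + 1)        # frequency mask width
    f0 = rng.integers(0, max(1, n_f - f))
    out[:, f0:f0 + f] = 0.0

    t = rng.integers(0, max_time_mask + 1)        # time mask width
    t0 = rng.integers(0, max(1, n_t - t))
    out[t0:t0 + t, :] = 0.0
    return out
```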
Table 2 presents the detailed results of the classification system in terms of precision, recall, and F1-score, where the support denotes the number of test samples in each class. It is clear that the recognition accuracy of every class is higher than 0.90, and the average precision, recall, and F1-score are all 0.943. For convenience of comparison, classifier performance is measured by the classification accuracy, defined as the average precision. The ability of the described classifier to identify different vessels is indicated by the fact that there is no confusion between the background noise class and the four vessel classes [16]. The classes with the best results are A (background noise) and B (fishing boats, trawlers, mussel boats, tugboats, and the dredger), with classification rates of 0.970 and 0.958, respectively. The poorest results are obtained for C (motorboats, pilot boats, and sailboats). Although the acoustic dataset contains heavy background noise recorded in shallow water, the overall performance is still satisfactory.
Figure 6 shows the confusion matrix of the proposed method on the ShipsEar dataset, where classes 0–4 denote classes A–E, respectively. Values along the diagonal indicate the number of samples classified correctly for each class. The matrix shows that class C is the hardest class for the proposed classifier, while all other classes are well separated.
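Metrics of this kind are conveniently produced with scikit-learn. A brief sketch, assuming y_true and y_pred hold the integer test labels (0–4) and the model's predictions:

```python
from sklearn.metrics import classification_report, confusion_matrix

# y_true, y_pred: integer labels 0-4 for classes A-E on the test set.
print(classification_report(y_true, y_pred,
                            target_names=["A", "B", "C", "D", "E"],
                            digits=3))         # precision, recall, f1, support
print(confusion_matrix(y_true, y_pred))        # rows: true class, cols: predicted
```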
It is worth noting that the network parameters, such as the number of filters, the filter size, and the number of layers, and the training hyperparameters, such as the batch size, the initial learning rate, and the patience in early stopping, are chosen according to the training and validation process in the experiment. Selecting other values is also feasible; however, different network parameters can cause a performance loss, while different training hyperparameters have no significant impact and only exhibit slightly inferior performance.
3.3. Experimental Analysis
The effectiveness of three parts of the proposed approach is assessed, i.e., (1) the advantage of the feature extraction method, (2) the advantage of the ResNet18 model, and (3) the contribution of the embedding layer with the center loss function and softmax. Besides, the performance of the described method is compared with that of methods in the literature.
3.3.1. Experiment A: The Advantage of the Feature Extraction Method
This experiment is designed to verify that the three-dimensional features used in the ResNet18 model yield better results than other feature extraction methods. As mentioned in [17], MFCC features have been experimentally proven to be the best hand-crafted features for this recognition task, so we use MFCC features together with their deltas (Δ) and delta-deltas (Δ²) for comparison. Besides, considering that LM is a simple but effective feature, we also analyze its performance. The results of the different feature extraction methods are shown in Table 3.
From the results of MFCC and MFCC + Δ + Δ², we can see that the deltas and delta-deltas do not contribute to improving the average accuracy, even though the number of parameters increases due to the two extra dimensions. Using the LM feature extracted from the raw signal, we achieve an average accuracy of 0.906; such a feature therefore discriminates between classes about as well as the MFCC features. Compared with MFCC and LM alone, the recognition accuracy of LM + MFCC improves by 0.027 and 0.005, respectively. An important remark is that LM + MFCC + CCTZ achieves the highest recognition accuracy of 0.943, which surpasses the average accuracy of LM + MFCC by 0.032, although extracting this three-dimensional feature increases the computation time.
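To make the fusion concrete, the sketch below stacks per-clip LM and MFCC maps as channels of one input tensor using librosa. The sample rate, bin counts, and hop length are illustrative assumptions, and the third channel is a stand-in (MFCC deltas), since the CCTZ feature is defined elsewhere in the paper.

```python
import numpy as np
import librosa

def fused_feature(clip, sr=16000, n_mels=64, n_mfcc=64, hop=2000):
    """Stack LM and MFCC maps as channels of one input tensor.

    The shapes are illustrative; the paper's third feature (CCTZ)
    is not reproduced here, so MFCC deltas stand in for it.
    """
    lm = librosa.power_to_db(
        librosa.feature.melspectrogram(y=clip, sr=sr, n_mels=n_mels,
                                       hop_length=hop))
    mfcc = librosa.feature.mfcc(y=clip, sr=sr, n_mfcc=n_mfcc, hop_length=hop)
    third = librosa.feature.delta(mfcc)           # placeholder third channel
    return np.stack([lm, mfcc, third], axis=-1)   # shape: (bins, frames, 3)
```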
3.3.2. Experiment B: The Advantage of the ResNet18 Model
This experiment is designed to compare the performance of the described classifier with other typical models. Different networks, namely CNN-1, CNN-2, LSTM, CRNN, and ResNet18, are exploited for comparison.
As shown in Table 4, CNN-1 and CNN-2 share the same network structure, while their input features are MFCC and LM + MFCC + CCTZ, respectively. The first layer exploits 32 filters of size 3 × 3 with the hyperbolic tangent (tanh) activation function, followed by 2 × 2 max-pooling. The second layer uses 64 filters of size 3 × 3 with tanh activation, followed by 2 × 2 max-pooling. The last convolutional layer utilizes 128 filters of size 3 × 3 with tanh activation, followed by 2 × 2 max-pooling and batch normalization. Afterward, the fully connected dense layers consist of 1024, 128, and 5 nodes, respectively. The optimizer is Adam with a learning rate of 0.001 and learning-rate decay. The batch size and the number of epochs are set to 128 and 50, respectively. With CNN-2 fed by the optimized LM + MFCC + CCTZ feature, we achieve an average accuracy of 0.906, which surpasses the 0.845 of CNN-1.
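A minimal Keras sketch of this baseline CNN follows. The input shape and the dense-layer activations are assumptions for illustration, and the learning-rate decay is omitted because its value is not given here.

```python
from tensorflow.keras import layers, models, optimizers

def build_cnn(input_shape=(64, 41, 3), n_classes=5):
    """CNN-1/CNN-2 baseline as described: three tanh conv blocks, then dense layers."""
    m = models.Sequential([
        layers.Conv2D(32, (3, 3), activation="tanh", input_shape=input_shape),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="tanh"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(128, (3, 3), activation="tanh"),
        layers.MaxPooling2D((2, 2)),
        layers.BatchNormalization(),
        layers.Flatten(),
        layers.Dense(1024, activation="tanh"),    # dense activations assumed
        layers.Dense(128, activation="tanh"),
        layers.Dense(n_classes, activation="softmax"),
    ])
    # The paper also applies learning-rate decay; its value is unspecified here.
    m.compile(optimizer=optimizers.Adam(learning_rate=0.001),
              loss="categorical_crossentropy", metrics=["accuracy"])
    return m
```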
The LSTM network with the LM feature achieves an average accuracy of only 0.852. Its first two LSTM layers have 256 and 128 output units, each followed by batch normalization, and the dropout rate is set to 0.2. Afterward, the fully connected part consists of 256 and 5 nodes, respectively. The optimizer is Adam.
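A corresponding sketch of the LSTM baseline, again with an assumed input shape of (time steps, feature bins) and an assumed dense activation:

```python
from tensorflow.keras import layers, models

def build_lstm(input_shape=(41, 64), n_classes=5):
    """LSTM baseline as described: two LSTM layers with batch norm, then dense layers."""
    m = models.Sequential([
        layers.LSTM(256, return_sequences=True, dropout=0.2,
                    input_shape=input_shape),
        layers.BatchNormalization(),
        layers.LSTM(128, dropout=0.2),
        layers.BatchNormalization(),
        layers.Dense(256, activation="relu"),     # dense activation assumed
        layers.Dense(n_classes, activation="softmax"),
    ])
    m.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
    return m
```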
As for the CRNN network, the first layer is convolutional with 64 filters of size 5 × 5, L2 regularization with a lambda of 0.01, ReLU activation, batch normalization, and a dropout rate of 0.25. The second and third layers are LSTM layers, each with 64 units and L2 regularization with a lambda of 0.01; the recurrent dropout rate is 0.5, the activation function is tanh, and the dropout rate is 0.25. The fourth and fifth layers are time-distributed dense layers with 128 and 64 nodes, respectively, with ReLU activation and a dropout rate of 0.25. The sixth layer is a dense layer with 5 nodes. The optimizer is Adam with a learning rate of 0.001 and learning-rate decay. This method achieves an average accuracy of 0.885.
From Table 4, we can gain better insight into the performance from the detailed per-class accuracy. ResNet18 with the proposed feature extraction method has a clear advantage over the other methods.
Besides, we also compare the proposed method with methods from the literature. The baseline is based on the study [16], which shows that the basic machine learning method achieves an accuracy of 0.754. The accuracy achieved by the ResNet18 model, as well as that achieved by the state-of-the-art approaches RBM + BP [18] and RSSD [4], is presented in Table 5. Our method achieves an accuracy of 0.943. These results indicate that our ResNet18 with three-dimensional features as input achieves a significant improvement in UATR. Given that the split of the dataset may differ between methods, the comparison is not rigorous; however, to our knowledge, the best accuracy reported for ShipsEar is obtained by the proposed method.
3.3.3. Experiment C: The Contribution of the Embedding Layer with the Center Loss Function and Softmax
In this section, the effect of different loss functions is investigated. Although softmax is the most frequently used cost function, using softmax alone makes the model prone to overfitting: during training, it tends to increase the magnitude of the outputs fed to softmax in order to reduce the cost. To penalize such behavior, another loss function, named the "uniform loss," can be chosen, which also tries to fit the uniform distribution and is defined as:

$$\mathcal{L} = \mathcal{L}_{\mathrm{softmax}} + \lambda \, \mathcal{L}_{\mathrm{uniform}},$$

where the first term denotes the softmax loss and the second term penalizes the deviation of the predicted distribution from the uniform distribution. Note that $\lambda$ is a balance parameter within [0, 1] and is set to 0.3 in all the classification experiments.
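Under this reconstruction, the uniform loss admits a short TensorFlow sketch. Interpreting the second term as the cross-entropy between the uniform distribution and the prediction is an assumption consistent with the description above, not necessarily the paper's exact form.

```python
import tensorflow as tf

def uniform_loss(y_true, y_pred, lam=0.3, n_classes=5):
    """Softmax cross-entropy plus a term pulling predictions toward uniform.

    y_pred holds softmax probabilities; lam is the balance parameter
    (0.3 in all classification experiments). The second term is the
    cross-entropy between the uniform distribution and the prediction.
    """
    ce = tf.keras.losses.categorical_crossentropy(y_true, y_pred)
    uniform = tf.fill(tf.shape(y_pred), 1.0 / n_classes)
    to_uniform = tf.keras.losses.categorical_crossentropy(uniform, y_pred)
    return ce + lam * to_uniform
```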
Table 6 summarizes the impact of the different loss functions on the mean accuracy. The results indicate that selecting a good loss function gives better classification results: using the center loss leads to improvements of 3.3% and 0.9% in the average accuracy compared with softmax and the uniform loss, respectively. It can also be seen in Table 6 that, although the accuracy of class A under the model with the center loss is slightly worse than under the others, the average performance is better. We can conclude that the performance is improved by utilizing the embedding layer with the center loss function and softmax.
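To illustrate the embedding-layer design, the sketch below implements the standard center loss of Wen et al. (penalizing the distance between each embedding and its class center). Whether this matches the paper's exact variant is an assumption, and the embedding size, update rate, and loss weight are placeholders.

```python
import tensorflow as tf

class CenterLoss(tf.keras.layers.Layer):
    """Maintains one center per class and returns the mean squared
    distance between embeddings and the centers of their labels."""

    def __init__(self, n_classes=5, embed_dim=128, alpha=0.5, **kwargs):
        super().__init__(**kwargs)
        self.alpha = alpha                      # center update rate (assumed)
        self.centers = self.add_weight(
            name="centers", shape=(n_classes, embed_dim),
            initializer="zeros", trainable=False)

    def call(self, embeddings, labels):
        centers = tf.gather(self.centers, labels)   # center of each sample's class
        diff = embeddings - centers
        # Move the used centers a small step toward their samples.
        self.centers.assign(tf.tensor_scatter_nd_sub(
            self.centers, tf.expand_dims(labels, 1), self.alpha * diff))
        return tf.reduce_mean(tf.reduce_sum(tf.square(diff), axis=1))

# Total objective (a sketch): softmax cross-entropy on the logits plus a
# weighted center loss on the embedding layer, e.g.
#   loss = ce(y_true, y_pred) + 0.1 * center_loss(embeddings, labels)
# where the 0.1 weight is a placeholder.
```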