1. Introduction
There are two types of sonar systems: passive and active. In passive sonar, a hydrophone records ambient sound within the ocean; in active sonar, a pulse of sound, or ping, is transmitted toward a target of interest in an attempt to determine information about the target. This research is specific to active sonar. Active sonar target recognition and classification has numerous maritime applications, such as harbor monitoring, autonomous underwater vehicle vision, and seabed characterization. However, classification suffers from feature uncertainties due to unpredictable or unknown environmental parameters (salinity, temperature, sound speed profile, etc.) and target parameters (size, shape, orientation, etc.) [1]. Different forms of obstruction, such as fish or bubbles, or oceanic noise may also be present within a sonar's return path and can further entangle a received response [2]. These effects combine to degrade the target-specific informative features used for discrimination.
Machine learning algorithms are commonly used to perform classification of sonar data [3,4,5,6,7,8,9,10,11]. Many of these classification pipelines employ convolutional neural networks (CNNs). Williams demonstrated classification of sonar images using a 10-layer CNN [3]. Wang et al. initialized a CNN with weights found using a deep belief network, replacing the randomly initialized weights, to perform classification of various sonar images [12]. CNNs were used for feature extraction rather than classification by Zhu et al., who used AlexNet to extract sonar image features prior to classification with a support vector machine [7]. Sonar image feature extraction through edge detection was additionally proposed by Wang et al., who created three different CNNs with skip connections and demonstrated their ability to find continuous edges [13]. However, many of these networks are deep and rely on a large number of samples for training. A large amount of public domain experimental field data for training is challenging and costly to obtain [9]. Many approaches, such as feature engineering or extraction [10], transfer learning [4], pretrained networks [6,7], synthetic data generation [4,5,6], and the employment of many tiny classifiers [11], have been used to mitigate this challenge. This work combines synthetic data generation, two time-frequency representations, and moderately sized classifiers that have been optimized for various sonar target recognition tasks. The signals used throughout this work are simulated from known models, giving complete control over the ground truth and simulation options. This allows the informative features used by the classifier for discrimination to be interpreted and related back to the physical domain.
Signal segmentation is performed prior to the projection of simulated target backscattered responses onto two two-dimensional representations. The use of two-dimensional (2D) analysis in conjunction with machine learning techniques is common within the underwater community [10,14,15,16,17,18,19]. An in-depth review of feature extraction and classification methods is provided in [20]. Choi et al. employed cross-spectral density matrices as inputs to a variety of classifiers trained to discriminate between submerged and surface ships; in most cases, CNNs provided increased classification accuracy, with a lowest binary misclassification rate of 0.92% [15]. Power cepstra were used as inputs to CNNs trained to perform detection and ranging of vessels across a variety of SNRs [14]. Other reported research describes a novel chirp wavelet [16], a three-channel spectrogram CNN classifier [17], and binary features extracted from acoustic spectra [10,18]. An interesting approach was taken by Luo et al., who used multiple spectrograms at various resolutions as a three-channel input in conjunction with a ResNet-inspired network, achieving up to a 96.32% classification accuracy in ship noise classification [19]. This research differs in its use of simulated signals with a known ground truth, its comparison of the short-time Fourier transform and continuous wavelet transform representations and their impact on classification, and its examination of the classifiers post training to explain the classifier choices.
Deep learning techniques are considered state of the art and provide increased classification accuracy when compared with traditional approaches [21]. However, these networks suffer from a lack of explainability and interpretability of their results due to their complexity [22]. This creates a lack of trust and transparency in a classifier, which is sometimes referred to as a 'black box'. When incorrect classification may result in harmful real-world outcomes (as is the case in sonar classification), there exists a need for explainable artificial intelligence (XAI) [23]. XAI is an emerging area of study aimed at increasing the interpretability of machine learning choices. It attempts to build transparency and increase trust in a classifier algorithm by making the classifier's choices interpretable to humans [24]. There are many different approaches to XAI; the one employed in this work is gradient-weighted class activation mapping (Grad-CAM), which uses the gradients of a class score, computed as test images are sent through the trained network, to determine which features are the most important [25]. This technique can be used to find discriminative, class-specific features and has been used to explore trained networks in the medical field [26,27].
Spectrograms and scalograms are two-dimensional representations that describe how the frequency content of a signal changes over time. These representations were chosen because they are two standard, commonly used representations. The goal of this research is not to find the optimal time-frequency distribution for classification but rather to interpret the classifier choices and relate the decisions back to the physical domain. A spectrogram has a fixed window size that forces a constant time-frequency resolution, while a scalogram has a scaling parameter that allows the contraction and dilation of the window, providing a variable resolution. The fixed resolution makes spectrograms an adequate representation of stationary signals but a poor representation of non-stationary signals; the variable resolution associated with scalograms allows them to represent non-stationary signals well. A comparison between the two representations through classification accuracy is reported within this work. This work is a continuation and expansion of the comparison of spectrogram and scalogram representations of simulated backscattered responses reported in [28]. This research differs from the previous iteration through updated signal segmentation, the inclusion of additional simulated targets, the use of a convolutional neural network that has been optimized for classification, and the examination of the trained classifier to explain its choices and relate them to the physical domain.
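The resolution trade-off described above can be illustrated with a minimal numpy sketch (not the processing chain used in this work): a fixed-window STFT in which every frame shares one time-frequency resolution, versus a Morlet-style CWT whose wavelet contracts and dilates with scale. The toy signal, window length, and scale range are illustrative choices, not the parameters used for the simulated backscatter.

```python
import numpy as np

def stft_mag(x, win_len=64, hop=32):
    """Fixed-window STFT magnitude: every frame shares one resolution."""
    win = np.hanning(win_len)
    frames = [x[i:i + win_len] * win
              for i in range(0, len(x) - win_len + 1, hop)]
    return np.abs(np.fft.rfft(frames, axis=1)).T  # (freq bins, time frames)

def cwt_mag(x, scales, w0=6.0):
    """Morlet-style CWT magnitude: the wavelet dilates with scale,
    giving fine time resolution at small scales and fine frequency
    resolution at large scales."""
    out = np.empty((len(scales), len(x)))
    for i, s in enumerate(scales):
        t = np.arange(-4 * s, 4 * s + 1)
        wavelet = (np.exp(1j * w0 * t / s)
                   * np.exp(-(t / s) ** 2 / 2) / np.sqrt(s))
        out[i] = np.abs(np.convolve(x, wavelet, mode="same"))
    return out

fs = 1000.0
t = np.arange(0, 1, 1 / fs)
# toy "backscatter": an early broadband click plus a later narrowband resonance
x = np.exp(-((t - 0.1) / 0.005) ** 2) + 0.5 * np.sin(2 * np.pi * 150 * t) * (t > 0.5)

S = stft_mag(x)                          # constant resolution everywhere
W = cwt_mag(x, scales=np.arange(1, 31))  # resolution varies with scale
print(S.shape, W.shape)
```

In a representation like `S`, the short click smears across one full window, while in `W` the small-scale rows localize it sharply, which is the behavior exploited when representing non-stationary signals.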
A large amount of experimental sonar field data for training is challenging and costly to obtain, but advances in feature extraction and machine learning have creatively mitigated this problem. However, classifiers still lack interpretability in terms of the physical domain. This work uses simulated data with a known ground truth and a CNN for classification. The hyperparameters are tuned using Bayesian optimization. After training, the networks are examined to determine the important features used for classification. This work is an examination and comparison of two common time-frequency representations, both preceding and following classification. The networks were trained to perform classification for a variety of tasks in order to determine any dependencies between the classification task and the optimized hyperparameters. The post-classification examination is performed using an XAI technique to reveal the target-specific features that a CNN uses for discrimination. The key contribution of this work is the examination of networks trained on common time-frequency representations and the explanation of the network choices in the physical domain. This post hoc analysis allows informed decisions to be made for future classification pipelines that intuitively bias networks by forcing them to prioritize influential features. The analysis can additionally be used to examine the failure modes of classifiers and create mitigation strategies.
3. Results
The overall classification results for the material type, geometry, and interior fill are reported for the different SNRs. Next, confusion matrices of interest are examined. Lastly, the trained classifiers are examined using Grad-CAM to explain the network choices and relate the informative features to the physical domain.
3.1. Overall Accuracy
The overall accuracy and standard deviation for the CNNs trained on the simulated data to classify target geometry are in Table 9, while those for the interior material are shown in Table 10 and those for the material type in Table 11. The reported results were generated by training a network using the hyperparameters listed in Table 4, Table 5 and Table 6 on the training and validation data used for the cross-validation Bayesian optimization. Testing was performed using the disjoint 20% data subset. The reported results are aggregated across 10 random seeds to provide a statistical description of the network.
Typically, a higher SNR results in a higher average classification accuracy and a lower standard deviation. The scalograms tend to have a higher average classification accuracy; however, there was no discernible trend in the corresponding standard deviations. Both trends were expected, as the increased SNR provided a cleaner representation for the network to classify, and the scalogram representation provided increased localization of the robust features due to the contraction and dilation of the wavelets. There was a negligible difference between the classification accuracies at the two SNRs, accounting for the standard deviation across the random seeds. This demonstrates the CNNs' ability to perform noise suppression across a difference of 15 dB. In future work, this can be taken to the extreme, and analysis across further SNR levels can be used to determine the failure mode of the classifier. Given these results, and leveraging the knowledge gained in the hyperparameter optimization, the scalogram representation is the more economical choice: while scalograms provide only a negligible increase in classification accuracy over their spectrogram counterparts, the scalogram networks typically have fewer layers and filters, which decreases the number of learned parameters, the complexity of the network, and the network training time.
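The SNR conditions discussed above can be made concrete with a short sketch that scales additive white Gaussian noise to hit a target SNR in dB. This is a common construction and is not claimed to be the exact noise model used in the simulations; the test signal is an arbitrary sinusoid.

```python
import numpy as np

def add_awgn(signal, snr_db, seed=None):
    """Add white Gaussian noise so the result has the requested SNR in dB."""
    rng = np.random.default_rng(seed)
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / 10 ** (snr_db / 10)  # SNR = P_sig / P_noise
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise

x = np.sin(2 * np.pi * 50 * np.linspace(0, 1, 4000))
y = add_awgn(x, snr_db=20, seed=0)
# verify the realized SNR against the request
measured = 10 * np.log10(np.mean(x ** 2) / np.mean((y - x) ** 2))
print(round(measured, 1))  # close to 20 dB
```

The same helper, called with 5 dB and 20 dB, would reproduce the 15 dB spread examined in this section.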
3.2. Confusion Matrices
The confusion matrices for the highest average accuracies reported in Table 9, Table 10 and Table 11 are shown in Figure 5, Figure 6 and Figure 7, respectively. The confusion matrices list the classification results across the 10 random seeds and have been row-normalized so that the percentage of correct or incorrect classifications per class is easily determined. For example, in Figure 5, 3.3% of the shells were misclassified as solid spheres.
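Row normalization of a confusion matrix reduces to dividing each row of counts by its sum. In the sketch below, the counts are hypothetical, chosen only so the shell row reproduces a 3.3%-style off-diagonal entry; they are not the actual values behind Figure 5.

```python
import numpy as np

def row_normalize(cm):
    """Convert raw confusion-matrix counts to per-class percentages."""
    cm = np.asarray(cm, dtype=float)
    return 100.0 * cm / cm.sum(axis=1, keepdims=True)

# hypothetical counts: rows = true class (shell, solid), columns = predicted
counts = np.array([[290, 10],
                   [  6, 294]])
pct = row_normalize(counts)
print(pct)  # each row sums to 100%
```

Each row then reads directly as the per-class recall, which is why row-normalized matrices make class-specific error rates easy to compare across seeds.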
The highest classification accuracy occurred when the network was trained to discriminate based on the target geometry (solid or shell). The lowest accuracy occurred when the network was trained to discriminate between the interior materials (air, oil, or solid). These trends were a result of the simulated shells having similar acoustic spectra, regardless of the internal fluid, when the shells were thick. We recall that the radii of the shells were randomly split across the training and testing datasets, as described in Section 2.4, so these classification results represent an average across the randomly separated radii. However, the classification results may still be explained through examination of the different feature representations. An example of this similar response can be seen in Figure 8, where the spectrograms were generated for a thick shell made from tungsten carbide. Figure 8a was generated using air as the internal fluid, while Figure 8b used octane oil. There is little visual difference between the two spectra, which in turn confused the classifier. When the shells had a reduced thickness, as seen when comparing the images in Figure 3a,b, the classifier was able to distinguish between the interior fluids. The thickness at which the visually similar acoustic spectra began to differ occurred sooner in the scalogram representation. This gave the network more variability in the spectra, providing the increased accuracy when the network was trained on the scalograms. A classifier trained to discriminate between a solid and a shell will automatically group the air- and octane oil-filled shells into one class, resulting in the increased classification accuracy seen in Table 9.
3.3. Grad-CAM
Grad-CAM was used to determine what the CNN selected as the most influential features for each classification task. The Grad-CAM results are presented as heat maps that show the spatial locations of large gradients. The axes are in units of pixels and represent the time and frequency axes. The color bar was normalized between 0 and 1. The heat maps are accompanied by the corresponding input images. Note that the input images are not on a decibel scale, as the networks were trained using normalized linear units. The advantage of using Grad-CAM is that the locations of class-specific influence on the spectra, i.e., the locations of the largest gradients, can be found. This can provide insight into feature extraction algorithms (i.e., intuitively bias the classifier by forcing it to focus on informative features) and aid in classifier debugging. Examination of the Grad-CAM heat maps for the simulated data allows relations to be drawn between the important features and physical scattering mechanisms.
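The Grad-CAM computation itself reduces to a few array operations. The numpy sketch below follows the standard formulation of [25]: channel weights from globally averaged gradients, a weighted sum of feature maps, ReLU, and normalization to [0, 1]. In practice the feature maps and gradients come from backpropagation through the trained CNN; here they are synthetic stand-ins.

```python
import numpy as np

def grad_cam(feature_maps, gradients):
    """Grad-CAM: weight each feature map by its spatially averaged gradient,
    sum over channels, apply ReLU, then normalize the heat map to [0, 1]."""
    # feature_maps, gradients: (K, H, W) activations and d(score)/d(activation)
    alphas = gradients.mean(axis=(1, 2))              # one weight per channel
    cam = np.tensordot(alphas, feature_maps, axes=1)  # weighted sum over K
    cam = np.maximum(cam, 0.0)                        # keep positive influence
    if cam.max() > 0:
        cam = cam / cam.max()                         # normalize to [0, 1]
    return cam

# toy example: two 4x4 feature maps with synthetic gradients
rng = np.random.default_rng(1)
A = rng.random((2, 4, 4))
dA = rng.standard_normal((2, 4, 4))
heat = grad_cam(A, dA)
print(heat.shape)
```

The ReLU step is why the heat maps in this section highlight only positively influential regions: features that argue against the chosen class are zeroed out.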
Grad-CAM was performed on the CNNs for the reported confusion matrices. The resulting heat maps were visually similar to the input images for the classification of the interior material and target geometry. This was due to the initial convolution layers detecting semantically meaningful objects, and both networks were one convolutional layer deep. An example of these heat maps and the corresponding test images is shown in Figure 9. The heat maps depicted are for a correctly classified air-filled shell (Figure 9a), a correctly classified octane oil-filled shell (Figure 9b), and an air-filled shell that was incorrectly classified as an octane oil-filled shell (Figure 9c). The corresponding input test images are depicted in Figure 9d–f, respectively. The corresponding classification scores are listed in the titles of the Grad-CAM heat maps.
The CNN depends on the first feature component, typically due to specular or geometric scattering, when choosing the air class, and on higher-order, more complex resonances when choosing the octane oil-filled shell class. This is evident in the comparison of the heat maps in Figure 9a,b. Insight into the CNN model's failure in misclassification can be gained by examining Figure 9c. Figure 9f depicts the corresponding input test image. This test image was incorrectly classified as an octane oil-filled shell and had a classification score of 0.560, which was split between the two classes. The CNN found a similar oil structure within the input image but also relied on features external to the resonances, as is evident from the highlighted features on the exterior of the image. To mitigate this failure mode of the CNN, a segmentation processing step could be included in the classification pipeline. This would likely increase the classification accuracy and decrease the training time, since the network mainly relies on semantic clues within the image.
Figure 10 depicts the heat map for the CNN trained on spectrograms to recognize the target geometry. The input test image was the same as the octane oil input image in Figure 9e. A comparison between Figure 9b and Figure 10 provides insight into the features that are important to different classification tasks. The classifier trained to discriminate between interior materials depends on higher-order, more complex resonances (Rayleigh or Lamb surface waves), while the classifier trained to discriminate between target geometries relies on the shape of the specular scattering and the first resonance. This can be leveraged when designing a classification pipeline, as the classification task must be taken into account. If the target geometry is being classified, then the feature representation can focus on the start of the signal and the specular reflection rather than the signal's entirety. This provides an automatic dimensional reduction through signal truncation, resulting in decreased classifier complexity and training time.
To determine how the varying random seed impacted the classification accuracy, Grad-CAM was performed on the CNNs trained on the scalograms with a 20 dB SNR to classify the material type. The resulting heat maps for the first five random seeds and the test image are shown in Figure 11. Figure 11a–e depict the Grad-CAM heat maps for the various random seeds, and Figure 11f depicts the corresponding test image. The remaining five seeds were omitted due to space constraints. We recall that this network consisted of two convolutional layers with eight feature filters, meaning the network still relied on semantically meaningful objects. In all cases, the CNN depended on the tungsten carbide specular reflection for discrimination. The CNN additionally used the resonances caused by Rayleigh or Lamb surface waves, but they were not as influential. In some cases, the CNN relied on additional low-frequency features (Figure 11a,c), while in others it depended on localized resonances (Figure 11b,d,e). An additional processing step could be included in the classification pipeline to highlight the high-frequency resonances and force the CNN to depend on these features, which are known from this analysis to be consistent across random seeds and to impact classification.
Throughout the examination of the Grad-CAM results, features relevant to classification were identified; various forms of augmentation could be used to emphasize these features, increase the dataset size, and improve the robustness of the network. Additionally, the insight gained from the heat maps allows recommendations to be made when designing different classifiers. For example, if the target geometry is to be classified, then the majority of the information relevant to the network is within the specular scattering, as seen in Figure 10. This places less relevance on the end of the return, and the signal detector can be adjusted to focus on the initial return. This would automatically decrease the input dimensionality and focus the network beforehand on the relevant features, thereby intuitively and positively biasing the results and decreasing the training time.
4. Conclusions
Simulated backscattered responses of various materials, shapes, and sizes were generated prior to projection onto two time-frequency representations: spectrograms generated using the short-time Fourier transform and scalograms generated using the continuous wavelet transform. Multiple convolutional neural networks were trained to classify the material type, target geometry, and interior material of the target. Bayesian optimization was used to determine the number of layers, the number of feature maps per layer, and the initial learning rate. This resulted in classifiers that were optimized for the specified classification task. The trained networks were examined using an explainable artificial intelligence technique, gradient-weighted class activation mapping (Grad-CAM), to determine the post-training features used for classification. The Grad-CAM results were depicted as heat maps representing the spatial locations of large positive gradients.
The scalogram representation provided a negligible increase in average classification accuracy over the spectrograms. The networks trained to discriminate between target geometries had the highest classification accuracy, while the networks trained to discriminate between the targets' interior materials had the lowest. The main feature highlighted when examining the CNNs trained to classify the interior material was the specular reflection, with a small portion of the resonances also being used for classification. This network contained one convolutional layer. Typically, the initial layers of a network lock onto spatially important features (such as contours), while deeper layers separate out features that are not visually apparent. The CNN used to classify the material type contained two convolutional layers. These CNNs highlighted the resonances for discrimination, but the specular component was also still used. The analysis performed throughout this investigation can be leveraged when designing classification pipelines by amplifying the meaningful scattering mechanisms and suppressing the less informative features. Possible classifier failure mode mitigation techniques were discussed, and recommendations for how to intuitively and positively bias classifiers were provided.
Future work can further relate the network-determined features to the physical domain by first simulating the modal rigid and soft residuals. The Bayesian optimization search space can be expanded to follow best practices in CNN topology, such as increasing the number of filters with each additional convolutional layer, and may yield increased classification accuracy. Additional complexities can be included in the classification pipeline, such as coating the shells, simulating cylinders, and investigating different additive noise models at different SNRs; the latter analysis would further investigate the impact of the SNR on the network analysis and the explainability technique. The thickness of the shells and its relation to the classification results can be further investigated. Deeper networks can be trained and examined to see if separation between the different scattering mechanisms occurs. Lastly, the analysis of the trained networks can be expanded. The results presented were qualitative, based on visual examination of the Grad-CAM heat maps. An image similarity measurement, such as the structural similarity index measure (SSIM) [48] or the feature similarity index (FSIM) [49], could be used to quantify intra-class features and compare them with inter-class features.
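As a sketch of how such a quantitative comparison could work, the following computes a single-window (global) SSIM for images scaled to [0, 1]. The standard SSIM [48] slides a Gaussian window over the image and averages the local scores; that windowing is omitted here for brevity, and the heat maps are random stand-ins.

```python
import numpy as np

def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Global (single-window) SSIM for images scaled to [0, 1]."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx ** 2 + my ** 2 + c1) * (vx + vy + c2)
    )

rng = np.random.default_rng(0)
a = rng.random((32, 32))  # stand-in heat map A
b = rng.random((32, 32))  # stand-in heat map B
print(ssim(a, a))         # identical images score exactly 1
print(ssim(a, b))         # dissimilar images score below 1
```

Applied to the Grad-CAM heat maps, intra-class pairs would be expected to score higher than inter-class pairs, turning the visual comparisons of Figure 11 into numbers.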