Article
Peer-Review Record

An HMM-DNN-Based System for the Detection and Classification of Low-Frequency Acoustic Signals from Baleen Whales, Earthquakes, and Air Guns off Chile

Remote Sens. 2023, 15(10), 2554; https://doi.org/10.3390/rs15102554
by Susannah J. Buchan 1,2,3,4,*, Miguel Duran 4, Constanza Rojas 2, Jorge Wuth 3, Rodrigo Mahu 3, Kathleen M. Stafford 5 and Nestor Becerra Yoma 3
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 29 December 2022 / Revised: 21 March 2023 / Accepted: 31 March 2023 / Published: 13 May 2023
(This article belongs to the Section Ocean Remote Sensing)

Round 1

Reviewer 1 Report

1. In lines 72-79, only machine learning algorithms are introduced. Could you add more references to work that employed deep learning methods?

2. I agree that HMM-GMM is an appropriate baseline for the performance comparison. However, I think more comparisons are necessary.

For example, an LSTM or related sequence models (bi-LSTM, Transformer, ...) could be employed. You can fix the window length of the LSTM and stride the window until the whole input signal is covered.

3. The F1-scores for rare classes (S22, SEP, ...) are too low because of class imbalance. Could you apply augmentation approaches, such as SMOTE, SMOTE++, or others, to balance the class distribution?

4. To perform sound event detection (SED), the model must distinguish target sounds from non-target sounds. However, the dataset contains only a small amount of non-target sound (UND class), so any non-target sound can be misclassified into one of the target classes (whales, earthquakes, ...). This is an out-of-distribution (OOD) problem. Could you discuss this?
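
The windowing suggested in comment 2 (a fixed window length, strided until the whole input signal is covered) can be sketched as follows. This is only a minimal illustration: the function name `sliding_windows`, the zero-padding of the last window, and the example sizes are assumptions, not anything taken from the manuscript.

```python
from typing import List, Sequence


def sliding_windows(signal: Sequence[float], window: int, stride: int) -> List[List[float]]:
    """Split `signal` into fixed-length windows, advancing by `stride` samples.

    The final window is zero-padded (an assumption) so the whole input is covered.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if len(signal) == 0:
        return []
    frames: List[List[float]] = []
    start = 0
    while True:
        chunk = list(signal[start:start + window])
        # Pad the last chunk so every frame has exactly `window` samples.
        chunk += [0.0] * (window - len(chunk))
        frames.append(chunk)
        if start + window >= len(signal):
            break  # the end of the signal has been reached
        start += stride
    return frames
```

Each returned frame would then be one fixed-length input sequence for the recurrent model; overlapping frames (stride smaller than the window) trade extra computation for finer time resolution.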

Author Response

Please find a detailed answer in the attached document.

Author Response File: Author Response.docx

Reviewer 2 Report

This manuscript describes a new system for the detection and classification of low-frequency biotic and abiotic sounds in the marine environment. This system performs well and represents a step forward in the ability to efficiently analyse acoustic recordings. The paper is well written and comprehensive, although a bit long. The methods section in particular is quite long and detailed. It is well presented but, in many places, results are included. This makes the methods section longer than it needs to be. Please make sure that results are only given in the results section. I have some detailed comments below, but once they have been addressed, I recommend this manuscript for publication.

Line 61: recommend removing the word ‘knowing’ from this sentence.

Line 62: automation of these processes is extremely helpful, but I would not say essential. Could you please adjust this statement?

Line 71: Please include some citations as examples here.

Lines 75-76: please change the wording to reflect the fact that your examples are only some of the methods that are used and there are many more.

Lines 78-79: many existing systems fail to integrate detection and classification, but some do. Please change this sentence to reflect this.

Line 81: this is the first time you have referred to Hidden Markov Models. Please spell it out here.

Lines 81-84: Please provide a reference here.

Line 91: Please change to ‘In [39] the author…’

Line 105: please define DNN here.

Line 111: Please change ‘which allows to capture’ to ‘which makes it possible to capture’

Lines 113-115: why does combining HMM with DNNs require fewer trainable parameters than other deep learning approaches?

Line 135: please change to ‘were divided into 150 s long…’

Line 138: sequence of events – what is an event? Is that an individual whale call, airgun pulse or earthquake?

Line 169: yielded better results than what?

Line 171: could you add a brief explanation of word error rate?

Lines 175-176: this wording is difficult to understand. Do you mean ‘the more numerous events being the ones that contribute most to classification accuracy’?

Lines 194-197: how many hours were selected from each month? How were these selections made? Were hours randomly selected or were they selected based on the presence of target signals?

Line 195: Did you run any inter-observer reliability or ground-truthing to ensure that the analyst was annotating accurately and completely?

Line 196 and Table 1: The number of target events per class belongs in the results section, not the methods

Line 198: What spectrogram settings were used?

Lines 22-225: this is a little confusing. Is the data-driven transformation the additional step you are referring to here?

Lines 314-315: better results than what? Please provide a reference here.

Lines 376-386: There are a lot of results mixed in here with the methods. Please separate out the results and move them to the results section.

Lines 390-391: did you train the HMM-GMM system to provide the initial set of target states required for the HMM-DNN system? Please make this clear.

Lines 393-395: This belongs in the results, not the methods.

Lines 399-400: This last sentence can be removed. It does not belong in the methods section.

Line 412: please change to: we divided the database into three…

Lines 416-417: This last sentence can be removed as it is results.

Lines 419-431: your methods section is quite long and detailed. This is a paragraph that could be condensed to make the methods more concise. Please condense this a little bit.

Equation 6: What is N?

Lines 476-477: This sentence is worded awkwardly. Please re-word to make it clearer

Lines 484-494: It would be much easier for the reader to understand this section if you used the actual class names (e.g. Antarctic blue whale, seismic airgun) with the abbreviations in brackets here.

Lines 524-526: when you ran the HMM-DNN system with 500 events, did you have the same number of classes as you did when you ran it with 70 events? If you did, then the comparison of accuracy is a fair one. If you had fewer classes when you used 500 events, then you would expect a higher accuracy just by chance so doing a straight comparison of WER doesn’t seem like a fair comparison. Could you please clarify how many classes you had in each analysis and then address this issue if the different analyses included different numbers of classes?

Line 525: the word considerable should be considerably.

Line 532: it would be helpful to state how many classes you had in your system here. That will make it easier to compare to the 11 species system you mention in the next sentence.

Line 540: I think that the word ‘were’ should be removed here. Either that or there is a word missing after ‘were’.

Line 551: What do you mean by complexity? Do you mean the higher variability in these signals? Also, the term song call is not the best term to use. It would be better to just say song.

Lines 572-573: What kind of decision making? Please expand on this a little bit.

Line 575: Please define STA/LTA.

Line 579: You say multiple times through your manuscript that your dataset has relatively high levels of noise, but you do not quantify that anywhere. Could you please define what you mean by ‘relatively high levels of noise’? This would be best done the first time you make that statement.

Line 604: Your DNN did outperform the GMM, but by 1.9% and 6.72% for event-level accuracy. This is an improvement, but I would not call it vast. Please reword this sentence slightly.

Line 608: Please include a brief discussion of the impacts of losing 50% of events when only looking at high SNR. What does this mean in practical terms? For some applications, it is more desirable to have higher confidence in the classifications than it is to detect/classify all events. But in other applications it is important to detect/classify a large proportion of events. Please discuss this a little bit.

Lines 611-613: I think this is a huge benefit of your system and should be emphasized earlier on.
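
For reference on the Line 171 comment: in its standard speech-recognition form, word error rate is the minimum number of substitutions, deletions, and insertions needed to turn the recognized sequence into the reference sequence, divided by the reference length. A minimal sketch follows, assuming that this standard definition applies and that "words" here are event labels; the function name `word_error_rate` is hypothetical, and the manuscript's exact scoring may differ.

```python
from typing import Sequence


def word_error_rate(reference: Sequence[str], hypothesis: Sequence[str]) -> float:
    """WER = (substitutions + deletions + insertions) / len(reference)."""
    if not reference:
        raise ValueError("reference must contain at least one label")
    # prev[j] = edit distance between the reference prefix so far and hypothesis[:j]
    prev = list(range(len(hypothesis) + 1))
    for i, ref_label in enumerate(reference, start=1):
        curr = [i]  # empty hypothesis prefix: delete all i reference labels
        for j, hyp_label in enumerate(hypothesis, start=1):
            cost = 0 if ref_label == hyp_label else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(reference)
```

For example, with the reference sequence SEP, S22, UND, SEP and the recognized sequence SEP, UND, SEP, one event is deleted, so the WER is 1/4 = 0.25. Because insertions are counted, the WER can exceed 1.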

Author Response

Please find a detailed answer in the attached document.

Author Response File: Author Response.docx

Reviewer 3 Report

In this paper, the authors presented an approach to automatically detect and classify a variety of low-frequency signals including baleen whales, earthquakes, and airguns based on a hybrid architecture containing an HMM stage and a neural network.

The authors also compared the performance with other methods such as GMMs and showed a performance improvement. What is interesting is that the proposed method also reports the timestamp at which each sound event occurs, which makes the whole procedure automatic and ready for potential industrial applications.

There are a few comments that the authors could address.

1. The proposed method uses a DNN structure. Did the authors compare it with a convolutional neural network (CNN) structure? It seems that most powerful speech-processing or acoustic classification approaches use CNNs, such as EfficientNet.

2. The filter-bank-based feature extraction approach in the paper is supposed to outperform the MFCC, which is designed for human voice applications. However, the log Mel spectrogram or other features such as the chromagram are often used in environmental sound classification. If possible, the authors could consider comparing with the Mel spectrogram to confirm whether there is some performance improvement.

3. Did the authors use any signal augmentation when training the network? Data augmentation can heavily affect the performance. If so, please state in the paper the specific method used in training, e.g., amplitude adjustment, time stretching, or spectral masking.

4. The Transformer (multi-head self-attention mechanism) is the state-of-the-art architecture for a number of NLP and image processing applications. The authors are encouraged to investigate how to use the Transformer in low-frequency signal classification.

5. The Viterbi algorithm is a well-known sequential detection approach. An alternative is to use the soft-output Viterbi algorithm (SOVA) or the BCJR algorithm, provided there is any a priori information. Is it possible to incorporate such information in the HMM stage in the paper?
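
To make comment 5 concrete, the standard hard-output Viterbi recursion can be sketched as below, with a priori information entering only through the initial-state log-prior. This is a minimal illustration under stated assumptions: the names `viterbi` and `to_log`, the dictionary-based tables, and the toy states in the usage note are hypothetical and not taken from the manuscript. It is not SOVA or BCJR, both of which additionally produce soft (per-state reliability) outputs.

```python
import math
from typing import Dict, List, Sequence


def to_log(table: Dict) -> Dict:
    """Convert a (possibly nested) table of strictly positive probabilities to logs."""
    return {k: (to_log(v) if isinstance(v, dict) else math.log(v))
            for k, v in table.items()}


def viterbi(observations: Sequence[str],
            states: Sequence[str],
            log_prior: Dict[str, float],
            log_trans: Dict[str, Dict[str, float]],
            log_emit: Dict[str, Dict[str, float]]) -> List[str]:
    """Return the most likely state sequence for `observations` (hard decisions)."""
    if not observations:
        raise ValueError("observations must not be empty")
    # best[s] = log-probability of the best path that ends in state s
    best = {s: log_prior[s] + log_emit[s][observations[0]] for s in states}
    backpointers: List[Dict[str, str]] = []
    for obs in observations[1:]:
        new_best: Dict[str, float] = {}
        step: Dict[str, str] = {}
        for s in states:
            # Choose the predecessor that maximizes the path score into s.
            prev_state = max(states, key=lambda p: best[p] + log_trans[p][s])
            new_best[s] = best[prev_state] + log_trans[prev_state][s] + log_emit[s][obs]
            step[s] = prev_state
        best = new_best
        backpointers.append(step)
    # Backtrack from the best final state.
    path = [max(states, key=lambda s: best[s])]
    for step in reversed(backpointers):
        path.append(step[path[-1]])
    path.reverse()
    return path
```

As a usage illustration with two hypothetical states ("call", "noise") and a single "high" observation that is more likely under "call", a uniform prior decodes "call", while a prior of 0.99 on "noise" flips the decision to "noise". This is the simplest way prior knowledge can influence the decoded sequence.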

Author Response

Please find a detailed answer in the attached document.

Author Response File: Author Response.docx

Round 2

Reviewer 1 Report

Only Q3 was answered clearly.

I think additional experiments using deep-learning-based methods (LSTM, Transformer, and others) are required in this manuscript, not left to future research.
