Article
Peer-Review Record

Recognition of Western Black-Crested Gibbon Call Signatures Based on SA_DenseNet-LSTM-Attention Network

Sustainability 2024, 16(17), 7536; https://doi.org/10.3390/su16177536
by Xiaotao Zhou 1,†, Ning Wang 1,†, Kunrong Hu 1,*, Leiguang Wang 2,3, Chunjiang Yu 1, Zhenhua Guan 1,4, Ruiqi Hu 1, Qiumei Li 5 and Longjia Ye 1
Reviewer 1: Anonymous
Reviewer 2:
Reviewer 3: Anonymous
Submission received: 31 May 2024 / Revised: 19 August 2024 / Accepted: 27 August 2024 / Published: 30 August 2024

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

My review is attached

Comments for author File: Comments.pdf

Comments on the Quality of English Language


Author Response

Dear reviewers,
       First of all, thank you very much for your valuable suggestions on our paper. We have organized our responses to your suggestions and questions in the following Word document; please check it. Thank you again for your suggestions.

Kind regards,

Xiaotao Zhou

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

This paper reports an interesting application of a variation of deep-learning approaches to classify Western black-crested gibbon calls/songs. The authors propose the use of a DenseNet-LSTM-Attention based recognition network to classify calls recorded using a call monitoring system that recorded calls of multiple species. The improvement in the model, though small, can prove to be useful in parsing out different call types despite small but important acoustic variations and their existence within a noisy environment.

Call classification is a non-trivial problem given the inherent variation in utterances of the same call type emitted by multiple individuals. Acoustic variation is an inherent property of all communication sounds. This variation provides information, not only of the call type, but also the individual emitting the call and the mood of the emitter, which is reflected in each utterance of the sound. The methodology developed in this study is not completely original, building on established deep-learning methodologies. However, the network constitution is appropriately conducted and tested. There are, however, ways in which the presentation of the work can be improved to make it appealing and useful to a wider scientific community. 

1). What is the level of variation captured for different call types? Please provide a box-and-whisker plot for the recognized call types for which spectrograms are provided to show that call types are not being overly parsed. Variation in acoustic parameters is an inherent property of calls used for acoustic communication. These are important factors to consider for actual improvement in the accuracy, precision, and recall.

2). It is recommended that the authors establish the point at which the DenseNet configuration fails, in order to determine the upper bound of acoustic variation and noise levels that the model can tolerate when used for call classification.

3). The usage of call taxonomy and classification scheme needs to be clarified for the model to be implemented appropriately and realize its usefulness. Extensive work has been done on this in multiple species. For example, a classification methodology built on spectrographic structure would be most relevant since behavioral data are not available here and a classification scheme based on anthropomorphic categorization of animal vocalizations is in general not recommended. Thus, vague usage of "song" vs. "call" should be avoided. Both terms have clear definitions. Calls used for social communication in animals have been shown to consist of simple sounds or “syllables” that may be emitted either individually as calls, as appears to be the case in the gibbon communication sounds shown in figure 2, or combined to form “composites” (Lin et al., 2016). Song is a structured sequencing of syllables and composites with frequent repetitions (Payne and McVay, 1971; Whaling et al., 1997; Behr and von Helversen, 2004; Bohn et al., 2009, 2013). Thus, it will be useful to know if the DenseNet configuration is effective in recognizing monosyllabic calls and composites (shown), or phrases and songs as well, and if so, how the success rates vary for the different acoustic structures. Even for monosyllabic calls, there may be a variation in the success/accuracy rates based on the duration of a call type. This is important information to have for future application of the proposed network configuration.

Suggestions to improve figure 2: Label the x-axis (time) and y-axis of all spectrograms and use the same scale of 10 K, with, e.g., major ticks every 2 K and minor ticks every 0.5 K, so that call structure and differences are clear. Calls appear to be (a) short, Quasi-CF train, (b) Sinusoidal FM, (c) Dipped-CF pair, and (d) upward FM-downward FM composite. If a song was observed and recognized by the network, please provide a spectrogram for that.

4). Based on the above information and the spectrograms provided in figure 2, the title of the paper should say “gibbon call recognition” rather than “gibbon song recognition”, unless the spectrograms constitute a song, which they don’t appear to. Also, the term “network” or “network configuration” appears to be missing at the end of the title.

5). Please include a spectrogram to give a sense of the level of background acoustic noise in the recordings so readers can estimate the usefulness of the DenseNet-LSTM-Attention network in call recognition in other instances of such recordings. 

6). The text in some of the figures is too small. Please check and adjust font size as needed. 

Bibliography

Behr O, von Helversen O. Bat serenades—complex courtship songs of the sac-winged bat (Saccopteryx bilineata). Behav Ecol Sociobiol 2004; 56: 106–15.

Bohn KM, Schmidt-French B, Schwartz C, Smotherman M, Pollak GD. Versatility and stereotypy of free-tailed bat songs. PLoS ONE 2009; 4: e6746.

Bohn KM, Smarsh GC, Smotherman M. Social context evokes rapid changes in bat song syntax. Animal Behaviour 2013; 85: 1485–91.

Lin A, Jiang T, Feng J, Kanwal JS. Acoustically diverse vocalization repertoire in the Himalayan leaf-nosed bat, a widely distributed Hipposideros species. J Acoust Soc Am 2016; 140: 3765.

Payne RS, McVay S. Songs of humpback whales. Science 1971; 173: 585–97.

Whaling CS, Solis MM, Doupe AJ, Soha JA, Marler P. Acoustic and neural bases for innate recognition of song. Proc Natl Acad Sci USA 1997; 94: 12694–8.

Comments on the Quality of English Language

There are a few places where the grammar can be improved and some typos corrected. In a few other places, sentence structure can be improved for clarity. Otherwise, the English appears to be fine. 

Author Response

Dear reviewers,
       First of all, thank you very much for your valuable suggestions on our paper. We have organized our responses to your suggestions and questions in the following Word document; please check it. Thank you again for your suggestions.

Kind regards,

Xiaotao Zhou

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

For the species western black crested gibbon, the authors proposed an audio recognition model based on SA_DenseNet-LSTM-Attention, which focuses on several aspects. The first is to explore the effectiveness of data augmentation methods, then the design of this integrated attention mechanism of the recognition network, and finally the comparison and verification of methods. After reading this paper, I would like to offer the following suggestions for the authors to consider:

 

1. I think the biggest problem with this paper is that the model proposed by the authors looks more like a stack of existing methods, and there is not enough innovation from the authors. The whole analysis about the model also looks more like an experimental report than a paper.

 

2. Table 6 shows the improvement of the authors' proposed algorithm compared to other methods, but the comparison here is not comprehensive. The authors should add more algorithms specifically for audio signal recognition for comparison to better illustrate the effectiveness of this method.

 

3. Figure 3 describes the overall framework. In my opinion, the data augmentation module here should not be regarded as part of the method, for two reasons: (1) Data augmentation is described in the Materials section rather than the Methods section and is generally considered a general-purpose technique for enhancing model performance. (2) From the authors' description of the data augmentation method used, it is a combination of existing methods rather than a newly proposed one. Of course, if the authors insist that it is part of the proposed method, then the description should be moved to the Methods section.

 

4. In the introduction section, the authors should better summarize their own work and innovations, for example in the form: "The major contributions can be summarized as follows..."

 

5. Figure 5 describes the bottleneck Layer, but this structure is too simple to be explained in a single figure. I suggest that the author remove it or merge it with another figure.

 

6. Figure 7 is about comparison of accuracy after augmentation by different data augmentation methods, but the bar chart is not necessary here. This information can be expressed in a short table and is more readable.

 

7. The paper is not well structured, like "In order to solve this problem, a lot of research has been conducted by many people[44-47]..." For such a discussion of existing methods, a separate section named related works should be added instead of discussing it in the methods section, which will cause confusion.

 

8. How can a description of the method appear in the results section, such as: "To improve the identification of the four distinct call types exhibited by the western black-collared gibbon, we concurrently implemented the temporal attention mechanism within the FC-LSTM network module...". There are other similar cases that the author himself should check.

 

9. Authors should also not comment excessively on other people's work in the discussion section, such as: "Expansion of the dataset can effectively improve the accuracy of the model, e.g.,  the literature [22,23,40,41] is used to improve the classification accuracy of the model through data expansion...". 

Comments on the Quality of English Language


Author Response

Dear reviewers,
       First of all, thank you very much for your valuable suggestions on our paper. We have organized our responses to your suggestions and questions in the following Word document; please check it. Thank you again for your suggestions.

Kind regards,

Xiaotao Zhou

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

Thank you to the authors for making the edits to the manuscript. I think it is much improved. The abstract still needs to be rewritten to clearly describe what the findings of the project were. Following that, I am happy to recommend the article for publication.

 

Abstract:

Firstly, to address the problem of the lack of four different call types of the western black crested gibbon, this paper explores 10 different data expansion methods to process all the datasets, and then all the sound data are converted into Mel spectrograms for the input of the model.

I'm not sure what the ‘problem of the lack of four different call types’ means. Please clarify.

 

The abstract should still state that this classifier assigns sounds to 13 sound types, four of which are gibbon calls. This is a highlight of the paper, yet the abstract suggests that it only classifies among four gibbon call types.

Comments on the Quality of English Language

I applaud the authors for communicating so well in the English language. The manuscript could benefit from one more read-through with an eye for English.

Author Response

Dear reviewers,
       First of all, thank you very much for your valuable suggestions on our paper. We have organized our responses to your suggestions and questions in the following Word document; please check it. Thank you again for your suggestions.

Kind regards,

Xiaotao Zhou

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

While the authors showed some changes in response to the reviewer’s comments, they failed to incorporate these within the manuscript. Some of the points raised earlier are critical, and the data provided to the reviewer raise some concern, for example regarding the stereotypic nature of the call spectrograms that are classified using the methodology proposed in this manuscript.

It is unclear why the suggested changes in the spectrogram were not incorporated in the figure in the paper. The changes suggested were not simply to respond to the reviewer, but to improve the manuscript for other readers. The box and line diagrams provided to the reviewer show that the spectrograms captured by the DenseNet are highly stereotypic and do not capture any of the variation in the various parameters that is inherent in all communication sounds. Therefore, for the reader, this can be misleading. In fact, it calls into question the purpose of using this methodology for call recognition. The authors need to honestly indicate to the reader what the algorithm can or cannot do. As it stands now, the call classification is more like template-matching, which creates doubts as to the practical usefulness of the proposed methodology.
It is also unclear why the authors state, “there will be a lot of blank parts in the spectrogram after incorporating the suggested scaling to 10 K” for frequency on the y-axis. The spectrograms provided to the reviewer are definitely improved, but it is unclear why the authors chose not to include them in the manuscript. The reason provided does not appear to be valid because a scaling of 15 and 20 K leaves even more blank space in the spectrograms.

 

In summary, it is important to realize that a classification system is only useful if it tackles the realistic nature of and variation within spectrograms of calls. Spectrographic variation is a hallmark of communication sounds. Not being able to capture this variation does not help and can be misleading in terms of estimating the number of different call types that are emitted. If the goal is only to detect a specific variant of a call type, then this should be clearly indicated in the title by replacing “call recognition” with “call signature identification”.

The goal of the methodology is to contribute to the field and resolve call discrimination, but by ignoring the work of others and failing to recognize the difficulty of this problem, the authors’ contribution does not benefit the AI field, the field of auditory communication, or applications to ecology. Either a call naming scheme adopted by previous authors working with gibbons (e.g., similar work done by Klinck and Clink, 2020, is completely ignored) or a naming scheme for spectrograms based on acoustic structure alone should be adopted, and appropriate references cited.

 

“Song” is still used in one or more places. 

 

Large sections of the text have been modified in several places without any explanation, but suggested changes have been largely ignored. This reflects poorly on the authors and the journal.

 

I am at a loss to understand why the authors would not take advantage of the feedback to improve the manuscript and provide unfounded reasons not to do so. 

Comments on the Quality of English Language

Appears to be mostly OK.

Author Response

Dear reviewers,
       First of all, thank you very much for your valuable suggestions on our paper. We have organized our responses to your suggestions and questions in the following Word document; please check it. Thank you again for your suggestions.

Kind regards,

Xiaotao Zhou

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

The revision is still unsatisfactory.

Comments on the Quality of English Language

Minor editing of English language required.

Author Response

Dear reviewers,
       First of all, thank you very much for your valuable suggestions on our paper. We have organized our responses to your suggestions and questions in the following Word document; please check it. Thank you again for your suggestions.

Kind regards,

Xiaotao Zhou

Author Response File: Author Response.pdf

Round 3

Reviewer 2 Report

Comments and Suggestions for Authors

The manuscript is much improved and more accurately represents the findings and potential applications, greatly increasing its credibility. A few minor enhancements are noted below. 
A slightly modified title would be better because “call recognition” is a loaded phrase that can be misleading and does not clarify whether it refers to individual calls or call types, which is a critical issue here. A safer and better title is:

“Recognition of Western black-crested gibbon call signatures based on SA_DenseNet-LSTM-Attention network”

In Figure 2, please include in each spectrogram label the name of the call type (e.g., “aa”, modulated, etc.) used in Figure 3 and Table 1, assuming these correspond to those shown. Otherwise, please make clear that they don’t, because no other spectrograms are shown and the reader is left wondering.

Under Call type in Table 1, does “figure” need to be noted?

Please move the paragraph starting on line 271 and referencing Figure 3 to a location before Figure 3. Also, please change “variation” to “differences”. In fact, the plots show a lack of variation in the extracted data even though call types do have inherent variation. The DenseNet is able to extract signatures of different call types, which could still be useful, without capturing the entire range of variation. To capture that range, the algorithm likely needs to be trained on the multiparametric distribution or range of variation for each call type. In fact, in doing so, the DenseNet may have the potential to outperform other algorithms by a greater margin because this is a more difficult, largely unsolved, problem. The reason is that upper bounds of tolerable call variation are difficult to establish at a behavioral or cognitive level (one cannot easily ask animals) and therefore also at the spectrographic level. All this can be clearly stated to avoid misunderstanding.

 

Comments on the Quality of English Language

Improvements can be made in a few locations. Please check.

Figure 2: I believe each spectrogram represents one call type. Delete plural “types”.

For example, in the last sentence, comparison is implied and “in comparison” is best deleted.

Author Response

Dear reviewers,
       First of all, thank you very much for your valuable suggestions on our paper. We have organized our responses to your suggestions and questions in the following Word document; please check it. Thank you again for your suggestions.

Kind regards,

Xiaotao Zhou

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

Revised as required.

Comments on the Quality of English Language

Moderate editing of English language required.

Author Response

Dear reviewers,
       First of all, thank you very much for your valuable suggestions on our paper. We have organized our responses to your suggestions and questions in the following Word document; please check it. Thank you again for your suggestions.

Kind regards,

Xiaotao Zhou
