Article
Peer-Review Record

An Initial Machine Learning-Based Victim’s Scream Detection Analysis for Burning Sites

Appl. Sci. 2021, 11(18), 8425; https://doi.org/10.3390/app11188425
by Fairuz Samiha Saeed 1,*, Abdullah Al Bashit 2, Vishu Viswanathan 1 and Damian Valles 1,*
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 11 August 2021 / Revised: 26 August 2021 / Accepted: 9 September 2021 / Published: 10 September 2021
(This article belongs to the Section Environmental Sciences)

Round 1

Reviewer 1 Report

 

# General points
* The evaluation methodology needs to be seriously reconsidered. Given the context, I would expect the priority in this study to be maximising recall rather than precision, since we want to ensure that we absolutely find all victims. A number of evaluation metrics are compared, but their true meaning needs to be considered and a single evaluation metric used.
* Why is the LSTM trained using different feature inputs to the SVM? 
* The k-fold splits and training procedure are not well explained. Is it k=7 fold cross-validation on the training+validation set (70% of the data)? Are the reported training and validation results the best scores or the mean scores?
* Further evaluation is needed to identify whether the varying SNR values impact the accuracy and recall scores. This could also provide some excellent insight into this work and how SNR will affect real-world results.

 

# Specific Points

## Introduction
"Data shows that hundreds of firefighters have lost their lives on duty." - citation needed.

"Researchers continually improve the firefighting technique with the latest technologies where Artificial Intelligence (AI) comes into play" - citation needed 

"Since human nature is to scream and shout for help in such emergencies, these audio sources can play a vital role in detecting the victim in fire emergencies." - it is worth acknowledging that not all victims will scream and shout; quiet or unconscious individuals will still need help, so this technology should be used in addition to other approaches.


## Methodology

Please discuss the synthesis of the dataset further - the varying SNRs are very interesting, but perhaps -10 dB is not low enough for detecting a scream in a burning building? I would expect these to be very noisy environments.
Is there any data preprocessing or denoising step before audio feature extraction?

Eq2. Is the \sigma intended to be the scale factor from Eq1.?

3.2 Audio Feature Extraction
"For SVM, ZCR, RMSE, and Spectral Centroid, one value is extracted per frame..." The list of acronyms has not been defined, and this phrasing makes it look like SVM is another feature, so consider rephrasing.

Table 2 - It is not clear what the column "Functionals (SVM)" means. This could be explained further.

Towards the end of this section, it became clearer that the audio feature extraction section only really applies directly to the SVM. Consider generalising this section and adding the SVM-specific details to the SVM section.

3.4 Classifiers
It is not clear why SVM and LSTM were the two selected approaches, or why the two were provided very different features. Would a fully connected MLP have been more successful than SVM?

Sections 3.4.1 and 3.4.2 go into a lot of detail regarding the approaches, which does not always seem relevant.

"However, RNN suffers from vanishing gradient problems" -> "However, RNNs can suffer from vanishing gradient problems" - not all RNNs suffer from vanishing gradients, as LSTM is a subset of RNN designed to solve this problem.

"Therefore, it is important for all the audio files to be of equal duration to match the same size and number of frames per audio file" - I don't think this is correct for LSTM. The recurrent nature means that RNNs can take inputs of variable lengths. This should also mean that there is no need to clip off the end of some recordings.

3.5 Performance Metrics
How was the k-fold technique applied? Was it k=7, on the 70% training+validation data?

Section 4.
Table 5 - Why is Scream in bold?
Figure 7, 11 and 15 - Which axis is the true labels? Which is the predicted labels?

"For generating the ROC curve, we have used the one-versus-all approach." - Please explain the one-versus-all approach.

Figure 8.a - What is the blue diagonal line?
Figure 8.b What are the iterations used for SVM? And the axis is labelled iterations over SVC? What is the SVC in this case?

"It can be observed that all the features are contributing to classification; however, MFCC [4], MFCC [8], MFCC [6], and RMSE stand out" - with this formatting, the MFCCs look like references. Please clarify.

Figure 9 - It is not clear what the purpose or value of this feature importance calculation is. Is this designed to remove features and reduce computation?

Section 4.3 t-SNE

The t-SNE analysis is very interesting, but should possibly be placed before the machine learning approaches, as a data visualisation approach, rather than retrospectively.

Please cite the t-SNE paper, and fix the legend labels in Figure 14.

Section 4.4
Figure 15 - This implies that no screams were correctly identified. This would suggest a considerable problem with the system, and that the scream data synthesis approach is flawed. Consider revising the data synthesis approach.

Great to see that the GitHub repository and dataset are available, but please link these in the paper, perhaps in a footnote.

 

Author Response

Reviewer 1.

# Specific Points

## Introduction

"Data shows that hundreds of firefighters have lost their lives on duty." - citation needed.

Response: citation added (line 38)

 

"Researchers continually improve the firefighting technique with the latest technologies where Artificial Intelligence (AI) comes into play" - citation needed.

Response: citation added (line 44)

 

"Since human nature is to scream and shout for help in such emergencies, these audio sources can play a vital role in detecting the victim in fire emergencies." - it is worth acknowledging that not all victims will scream and shout, quiet or unconscious individuals will still need help, so this tech should be used in addition to other approaches.

Response: Rephrased (line 53-54)

 

## Methodology

Please discuss the synthesis of the dataset further - the varying SNRs are very interesting, but perhaps -10 dB is not low enough for detecting a scream in a burning building? I would expect these to be very noisy environments.

Response: The noise at a burning site can be worse than -20 dB. However, the features that we used for our model are susceptible to noise. Since we did not apply any denoising technique to the audio files, we decided to discard SNRs lower than -10 dB for now.
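For concreteness, here is a minimal numpy sketch of mixing a clean scream with background noise at a target SNR. This is only an illustration of the general technique, not the authors' synthesis pipeline; `mix_at_snr`, the toy sinusoid "scream", and the 16 kHz sample rate are all assumptions.

```python
import numpy as np

def mix_at_snr(signal, noise, snr_db):
    """Scale `noise` so the signal-to-noise ratio of the mix equals
    `snr_db` (SNR = 10*log10(P_signal / P_noise)), then add it."""
    p_signal = np.mean(signal ** 2)                    # signal power
    p_noise = np.mean(noise ** 2)                      # current noise power
    target_p_noise = p_signal / (10 ** (snr_db / 10))  # noise power needed
    scale = np.sqrt(target_p_noise / p_noise)
    return signal + scale * noise

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 16000, endpoint=False)   # 1 s at an assumed 16 kHz
scream = np.sin(2 * np.pi * 800 * t)           # toy stand-in for a scream
noise = rng.standard_normal(16000)             # toy stand-in for fire noise
mix = mix_at_snr(scream, noise, snr_db=-10)    # heavily noisy condition
```

At -10 dB the noise carries ten times the power of the signal, which illustrates why noise-sensitive features degrade quickly below this level.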

 

Is there any data preprocessing or denoising step before audio feature extraction?

Response: No data preprocessing or denoising step has been applied before feature extraction. The term 'raw audio file' emphasizes that no denoising techniques or data preprocessing have been applied (line 249).

 

Eq2. Is the \sigma intended to be the scale factor from Eq1.?

Response: Yes. Since \sigma conventionally represents the standard deviation, we changed the symbol to k and updated the equations. (lines 296, 300)

 

3.2 Audio Feature Extraction

"For SVM, ZCR, RMSE, and Spectral Centroid, one value is extracted per frame..." The list of acronyms has not been defined, and this phrasing makes it look like SVM is another feature, so consider rephrasing.

Response: Deleted this part to generalize the audio feature extraction section

 

Table 2 - It is not clear what the column "Functionals (SVM)" means. This could be explained further.

Response: Deleted this part to generalize the audio feature extraction section since the functionals only apply to SVM

 

Towards the end of this section, it became clearer that the audio feature extraction section only really applies directly to the SVM. Consider generalising this section and adding the SVM-specific details to the SVM section.

Response: All parts referring to the SVM have been removed from this section and added to the SVM section. The feature extraction section has been rearranged to discuss the features used for both LSTM and SVM.

 

3.4 Classifiers

It is not clear why SVM and LSTM were the two selected approaches, and both were provided very different features. Would a fully connected MLP have been more successful than SVM?

Response: Based on previous research on scream detection, we decided to use SVM and LSTM. We added three more papers supporting the choice of LSTM and SVM for this application (including in the background section). (lines 199-201, 397-404)

Regarding MLP: previous research on scream detection reported an average accuracy of 86.3% for SVM, whereas MLP achieved an average accuracy of 76%.

 

Sections 3.4.1 and 3.4.2 go into a lot of detail regarding the approaches, which does not always seem relevant.

Response: Removed the non-relevant parts from this section and added the extra feature extraction step for SVM here. (lines 418-424)

 

"Therefore, it is important for all the audio files to be of equal duration to match the same size and number of frames per audio file" - I don't think this is correct for LSTM. The recurrent nature means that RNNs can take inputs of variable lengths. This should also mean that there is no need to clip off the end of some recordings.

Response: Yes, that is true for frame-to-frame detection. Since we are not making frame-to-frame predictions (the prediction is given at the very last time step), we have used audio files of the same length. The paragraph has been rephrased to avoid misconception. (lines 448-451)
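The fixed-length preprocessing this implies can be sketched in a few lines of numpy; the function name and sample rate below are illustrative assumptions, not the authors' code:

```python
import numpy as np

def to_fixed_length(audio, target_len):
    """Zero-pad or truncate a 1-D audio array to exactly `target_len`
    samples, so every file produces the same number of frames."""
    if len(audio) >= target_len:
        return audio[:target_len]                       # clip long files
    return np.pad(audio, (0, target_len - len(audio)))  # pad short files

sr = 16000                                # assumed sample rate
clip = np.ones(3 * sr)                    # a 3-second recording
fixed = to_fixed_length(clip, 5 * sr)     # padded out to 5 seconds
```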

 

3.5 Performance Metrics

How was the k-fold technique applied? Was it k=7, with the 70% training+validation data?

Response: We selected k=10 folds on 70% of the total dataset (training+validation) and took the average accuracy over all ten folds. (lines 514-518, 552-556)
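The described procedure can be sketched with plain numpy as follows; the fold generator, dataset size, and placeholder accuracy are illustrative assumptions, and the paper's actual training loop is not shown:

```python
import numpy as np

def kfold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    folds = np.array_split(idx, k)       # k disjoint validation folds
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

n_total = 1000                       # hypothetical dataset size
n_trainval = int(0.7 * n_total)      # 70% goes to training+validation
accuracies = []
for train_idx, val_idx in kfold_indices(n_trainval, k=10):
    # model.fit(X[train_idx], y[train_idx])  # train on 9 folds
    acc = 1.0                        # placeholder for this fold's accuracy
    accuracies.append(acc)
mean_accuracy = np.mean(accuracies)  # average over all ten folds
```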

 

Section 4.

Tables 5 and 7 - Why is Scream in bold?

Response: Corrected 

 

Figures 7, 11 and 15 - Which axis shows the true labels, and which the predicted labels?

Response: Updated the x- and y-axis labels to predicted and true labels, respectively (Figure 6 and Figure 10).

 

"For generating the ROC curve, we have used the one-versus-all approach." - Please explain the one-versus-all approach.

Response: The one-versus-rest approach is now mentioned in the performance metrics section (ROC); it is typically used for multiclass classification. In the results section, the indicated lines (507-509, 568-571) refer to the OvR approach.
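For readers unfamiliar with the term: in one-versus-rest, each class in turn is treated as the positive class and all others as negative, giving one binary ROC curve per class. A minimal numpy sketch follows; the `roc_points` helper and the toy labels/scores are hypothetical, not taken from the paper:

```python
import numpy as np

def roc_points(scores, labels):
    """FPR/TPR pairs for a binary problem, sweeping the decision threshold."""
    order = np.argsort(-scores)         # highest score first
    labels = labels[order]
    tps = np.cumsum(labels)             # true positives as threshold drops
    fps = np.cumsum(1 - labels)         # false positives likewise
    tpr = tps / labels.sum()
    fpr = fps / (1 - labels).sum()
    return fpr, tpr

# Toy 3-class problem: true labels and per-class scores (rows sum to 1)
y_true = np.array([0, 1, 2, 1, 0, 2])
y_score = np.array([[0.8, 0.1, 0.1],
                    [0.2, 0.7, 0.1],
                    [0.1, 0.2, 0.7],
                    [0.3, 0.6, 0.1],
                    [0.6, 0.3, 0.1],
                    [0.2, 0.2, 0.6]])

curves = {}
for c in range(3):                          # one ROC curve per class
    positives = (y_true == c).astype(int)   # class c vs. the rest
    curves[c] = roc_points(y_score[:, c], positives)
```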

 

Figure 8.a - What is the blue diagonal line?

Response: Added discussion of the blue diagonal line (line 571-572)

 

Figure 8.b - What are the iterations used for SVM? And the axis is labelled iterations over SVC? What is the SVC in this case?

Response: The iteration count is 500 (line 535). SVC refers to Support Vector Classifier, which is treated as the same as SVM here. To avoid confusion regarding this term, we have removed SVC and updated it to SVM (Figure 7b).

 

"It can be observed that all the features are contributing to classification; however, MFCC [4], MFCC [8], MFCC [6], and RMSE stand out" - with this formatting, the MFCCs look like references. Please clarify.

Figure 9. It is not clear what the purpose or value of this feature importance calculation is. Is this designed to remove features and reduce computation?

Response: The clarification has been added (line 586-588), and the format has been changed (line 590-591)

 

Section 4.3 t-SNE

The t-SNE analysis is very interesting, but should possibly be placed before the machine learning approaches, as a data visualisation approach, rather than retrospectively.

Response: Moved t-SNE analysis from the ML result section to here as a data visualization part (line 318-336)

 

Please cite the t-SNE paper, 

Response: Added the reference (line 320)

 

and fix the legend labels in Figure 14

Response: The figure has been deleted as the real-time data was not recorded correctly. 

 

Section 4.4

Figure 15 - This implies that no screams were correctly identified. This would suggest a considerable problem with the system, and that the scream data synthesis approach is flawed. Consider revising the data synthesis approach.

Response: We decided to remove this section. It was initial work, and the data was not correctly recorded. We are still working to fix this issue. However, we decided to exclude this part for now.

 

# General points

* The evaluation methodology needs to be seriously reconsidered. Given the context, I would expect the priority in this study to be maximising recall rather than precision, since we want to ensure that we absolutely find all victims. A number of evaluation metrics are compared, but their true meaning needs to be considered and a single evaluation metric used.

Response: We agree with this suggestion. The result and comparison sections have been rearranged based on recall. (lines 542-543, 551-552, 620-622, 663-674)
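Per-class recall is read directly off the confusion matrix as TP / (TP + FN). A small numpy sketch with hypothetical counts (the matrix below is illustrative, not the paper's results):

```python
import numpy as np

# Rows = true labels, columns = predicted labels (hypothetical counts)
# Classes: 0 = scream, 1 = speech, 2 = background noise
cm = np.array([[45,  3,  2],
               [ 4, 40,  6],
               [ 1,  5, 44]])

# Recall per class: true positives divided by all actual members of the class
recall = cm.diagonal() / cm.sum(axis=1)
scream_recall = recall[0]   # the metric to maximise for victim detection
```

Maximising scream recall accepts more false alarms (lower precision) in exchange for missing fewer victims.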

 

* Why is the LSTM trained using different feature inputs to the SVM?

Response: We have used all 16 features for both SVM and LSTM. However, extracting the 16 features per frame gives 1,760 feature values on average per audio file. This high number of features would have caused the SVM to overfit. Thus, we applied an average function to each feature over the entire audio file, which reduced the input to 16 values per audio file for the SVM. For the LSTM, this step was not required since the features are fed in at every time step/frame.
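The difference between the two input pipelines can be sketched in a few lines of numpy. The shapes follow the response above (16 features per frame and roughly 110 frames per file, giving the quoted ~1,760 values); the random matrix is a stand-in for real extracted features:

```python
import numpy as np

n_frames, n_features = 110, 16   # ~110 frames x 16 features = 1,760 values
frame_features = np.random.default_rng(0).standard_normal((n_frames, n_features))

# LSTM input: the whole sequence, one 16-dim feature vector per time step
lstm_input = frame_features                  # shape (110, 16)

# SVM input: average each feature over the file -> 16 functionals per file
svm_input = frame_features.mean(axis=0)      # shape (16,)
```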

 

* The k-fold splits and training procedure are not well explained. Is it k=7 fold cross-validation on the training+validation set (70% of the data)? Are the reported training and validation results the best scores or the mean scores?

Response: We selected k=10 folds on 70% of the total dataset (training+validation) and took the average accuracy over all ten folds (lines 514-518, 552-556). The results on training and validation are the best scores found after hyperparameter tuning.

 

* Further evaluation is needed to identify whether the varying SNR values impact the accuracy and recall scores. This could also provide some excellent insight into this work and how SNR will affect real-world results.

Response: This is a great suggestion. We have added subsection 4.3 describing the effect of SNRs on the models. (line 649-674)

Author Response File: Author Response.docx

Reviewer 2 Report

This paper aims to detect victims trapped in fire emergencies from their screams, using a machine learning model and an autonomous vehicle.

 

However, I cannot find any comparison of test results with previous research related to scream detection analysis.

 

Also, in Section 4, the test results should be described for the various SNRs.

 

The paper is only a comparison of basic machine learning algorithms; it is recommended to reinforce the contents and the experiments.

 

 

Author Response

This paper aims to detect victims trapped in fire emergencies from their screams, using a machine learning model and an autonomous vehicle.

However, I cannot find any comparison of test results with previous research related to scream detection analysis.

Response: The background section on scream detection discusses other research on scream detection and audio event detection for surveillance applications where scream is one of the classes.

To address this review, we have added a table in the results subsection (4.4) comparing the scream detection-focused papers.

 

Also, in Section 4, the test results should be described for the various SNRs.

Response: The models were trained on the combined dataset, including nine SNR levels (-10 dB to 30 dB); therefore, the results discussed in that section were based on the combined SNR levels. A new subsection 4.3 has been added to the results section, describing the effect of the different SNR levels on both models. (lines 649-674)

 

The paper is only a comparison of basic machine learning algorithms; it is recommended to reinforce the contents and the experiments.

Response: This work is an initial analysis of scream detection at burning sites with autonomous vehicles. We are currently expanding the content with additional real-time implementation results on a Jetson Nano, using a transfer learning model, a noise-robust feature extraction technique, and methods for handling an unbalanced dataset with few screams and a large amount of no-scream data.

Author Response File: Author Response.docx

Round 2

Reviewer 2 Report

All comments have been well addressed.
