Peer-Review Record

ODIN112–AI-Assisted Emergency Services in Romania

Appl. Sci. 2023, 13(1), 639; https://doi.org/10.3390/app13010639

by Dan Ungureanu¹, Stefan-Adrian Toma², Ion-Dorinel Filip¹, Bogdan-Costel Mocanu¹, Iulian Aciobăniței², Bogdan Marghescu², Titus Balan³,⁴, Mihai Dascalu¹,⁵,*, Ion Bica² and Florin Pop¹,⁶

Reviewer 1: Artur Janicki

Reviewer 2: Bernardo Breve

Reviewer 3: Anonymous

Reviewer 4: Angelito Silverio

Reviewer 5: Aymen Akremi

Submission received: 2 December 2022 / Revised: 24 December 2022 / Accepted: 27 December 2022 / Published: 3 January 2023

(This article belongs to the Special Issue Artificial Intelligence for Sustainable Services, Applications and Education)

Round 1

Reviewer 1 Report

The authors propose an AI-based architecture to handle 112 emergency calls with NLP tools for Romanian. The article is well-structured and nicely written.

However, some issues must be clarified before a potential publication:

- Are you sure that the Full Rate GSM codec choice is justified? Which version did you use? Did you check which codecs in mobile telephony are most popular nowadays?

- No metric is given in Table 3. I guess it is WER? Did you ensure that train and test datasets were entirely disjoint? What were the sizes of each of the used corpora?

- Did you perform any tests in noisy conditions? Could you please comment on the performance in the presence of noise, which is very likely for emergency calls?

- In Fig. 4, you discuss sentiment recognition, while later (e.g., Fig. 5), you continue with emotion recognition. They are not the same. Please correct.

- Where exactly is the ASR module going to be placed in the ODIN architecture? Maybe it should be added to Fig. 4?

- How is the text-based analysis going to be performed?

Editing comments:

- It is good that the authors introduced acronyms for Automatic Speech Recognition (ASR), Gaussian Mixture Models (GMMs), etc.; however, they rarely use them later in the text. Please correct.

- In contrast, WER is heavily used, even though it was not introduced. Please check all acronyms.

- In several bibliographic items, the names are confused with the surnames, e.g., in [22], partially [19].

- space is needed between a word/number and a reference, e.g., DeepSpeech 2[9] => DeepSpeech 2 [9], operator[23] => operator [23].

- no space is required between a number and the % sign, e.g., 50 % => 50%

- a missing dot? "interfacesThose" => "interfaces. Those"

- usually, one uses "step" or "analysis step" instead of "stride."

Author Response

The authors propose an AI-based architecture to handle 112 emergency calls with NLP tools for Romanian. The article is well-structured and nicely written.
Response: Thank you kindly for your thorough feedback, which helped us further improve the manuscript! All major edits are marked in blue.

However, some issues must be clarified before a potential publication:

- Are you sure that the Full Rate GSM codec choice is justified? Which version did you use? Did you check which codecs in mobile telephony are most popular nowadays?
Response: Thank you for your comment! Indeed, AMR and AMR-WB are currently widely used. However, we consider the Full Rate GSM codec as a worst-case scenario. The speech quality (measured with MOS) of decoded AMR at the lowest possible bit rate is at least equal to, if not greater than, the quality of Full Rate GSM decoded speech. Therefore, if our algorithm can correctly classify emotional speech samples coded/decoded with the GSM Full Rate codec, it will also work with better codecs.

- No metric is given in Table 3. I guess it is WER? Did you ensure that train and test datasets were entirely disjoint? What were the sizes of each of the used corpora?
Response: A "%WER" label was added to all columns, and a better description was added to the label. Yes, all the datasets were disjoint (in terms of recordings and transcripts). The sizes were roughly 10% of the entire dataset.

- Did you perform any tests in noisy conditions? Could you please comment on the performance in the presence of noise, which is very likely for emergency calls?
Response: The test datasets also include some noisy recordings, but most of the recordings usually have high quality with very little noise. In future developments, we will consider adding noise to test recordings.
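As an illustration of what such a noise test could look like, the sketch below mixes white Gaussian noise into a clean signal at a chosen signal-to-noise ratio (SNR). It is a minimal example and not the authors' procedure: the function names, the use of white noise (background noise in real emergency calls is far more varied), and the 10 dB target in the usage lines are all assumptions made for illustration.

```python
import math
import random


def add_white_noise(samples, snr_db, seed=0):
    """Return a copy of `samples` with white Gaussian noise at roughly `snr_db` dB SNR."""
    rng = random.Random(seed)
    signal_power = sum(s * s for s in samples) / len(samples)
    # SNR(dB) = 10 * log10(P_signal / P_noise)  =>  P_noise = P_signal / 10^(SNR/10)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in samples]


def measured_snr_db(clean, noisy):
    """Estimate the SNR actually obtained, given the clean and noisy signals."""
    signal_power = sum(s * s for s in clean) / len(clean)
    noise_power = sum((n - s) ** 2 for s, n in zip(clean, noisy)) / len(clean)
    return 10 * math.log10(signal_power / noise_power)


# Hypothetical test signal: one second of a 440 Hz tone sampled at 8 kHz.
clean = [math.sin(2 * math.pi * 440 * n / 8000) for n in range(8000)]
noisy = add_white_noise(clean, snr_db=10)
```

Sweeping `snr_db` over a range of values and re-running the evaluation on the noisy copies would show how quickly recognition quality degrades as the noise level rises.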

- In Fig. 4, you discuss sentiment recognition, while later (e.g., Fig. 5), you continue with emotion recognition. They are not the same. Please correct.
Response: Thank you for your valuable observation. We have updated both Figures 4 and 5 for consistency.

- Where exactly is the ASR module going to be placed in the ODIN architecture? Maybe it should be added to Fig. 4?
Response: Thank you for your valuable observation, and we regret this omission. We have updated Figure 4 to include the ASR module.

- How is the text-based analysis going to be performed?
Response: We introduced brief details and made it clear that these components are out of the scope of this study.

Editing comments:

- It is good that the authors introduced acronyms for Automatic Speech Recognition (ASR), Gaussian Mixture Models (GMMs), etc.; however, they rarely use them later in the text. Please correct.
Response: We checked the acronyms across the entire paper and we hope everything is in order now.

- In contrast, WER is heavily used, even though it was not introduced. Please check all acronyms.
Response: We added the corresponding definition in section 4.1 and the motivation behind using WER.
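For readers unfamiliar with the metric, word error rate (WER) is conventionally the word-level edit distance between the reference transcript and the recognizer output, divided by the number of reference words. The sketch below shows that standard computation; it is an illustration only, not the evaluation code used in the paper, and the function name and example sentences are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words processed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution and one deletion against four reference words.
print(word_error_rate("a b c d", "a x c"))  # 0.5
```

Because insertions are counted, WER can exceed 100% when the hypothesis is much longer than the reference.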

- In several bibliographic items, the names are confused with the surnames, e.g., in [22], partially [19].
Response: Thank you! We corrected both references.

- space is needed between a word/number and a reference, e.g., DeepSpeech 2[9] => DeepSpeech 2 [9], operator[23] => operator [23].
Response: The issue was addressed, thank you for the suggestion.

- no space is required between a number and the % sign, e.g., 50 % => 50%
Response: The issue was addressed, thank you for the suggestion.

- a missing dot? "interfacesThose" => "interfaces. Those"
Response: The issue was addressed, thank you for the suggestion.

- usually, one uses "step" or "analysis step" instead of "stride."
Response: We prefer to stick to stride as in the PyTorch documentation. We hope it is fine with you as well.
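To make the terminology concrete, the sketch below assumes that "stride" means the number of samples by which the analysis window advances between consecutive frames, which is the sense the reviewer's "analysis step" suggests. The function name and the window and stride values are hypothetical and are not taken from the paper.

```python
def frame_signal(samples, window, stride):
    """Split `samples` into frames of `window` samples, advancing by `stride` each time."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    # Only full frames are kept; a trailing partial frame is dropped.
    return [
        samples[start:start + window]
        for start in range(0, len(samples) - window + 1, stride)
    ]


frames = frame_signal(list(range(10)), window=4, stride=2)
print(frames)  # [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```

A stride smaller than the window makes consecutive frames overlap, here by two samples.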

Reviewer 2 Report

- The proposal needs to be better formalized by going into more detail about the advancement of the state of the art.

- In order to improve the self-explainability of the work, terms such as ETSI and NG112 need to be better described and clarified, right from the introduction.

- Figures' captions are not centered.

- "interfacesThose" -> "interfaces. Those"

- Results are very poorly detailed. Section 4 requires a better discussion. Furthermore, a statement on the achievement of results similar to the SOTA should be accompanied by a proper comparative evaluation, where the considered models are tested under the same, specified conditions.

- The abbreviation "BERT" is never used throughout the paper.

Author Response

The authors propose a decision support system for Romanian emergency number call centers which, by exploiting a combination of Natural Language Processing and Automatic Speech Recognition, is able to analyze audio input from a phone, determining the urgency of the emergency call.
Response: Thank you kindly for your thorough feedback, which helped us further improve the manuscript! All major edits are marked in blue.

- The proposal needs to be better formalized by going into more detail about the advancement of the state of the art.
Response: Based on this recommendation, we have extended the Introduction paragraph with examples of previous work that included AI-based emergency call handling and optimization.

- In order to improve the self-explainability of the work, terms such as ETSI and NG112 need to be better described and clarified, right from the introduction.
Response: The acronyms were explained in their first occurrence throughout the text.

- The contribution made by the speech dataset collected by the authors themselves should be more highlighted, specifying how the items were collected, how the participants were recruited, what age group they belong to, and so on.
Response: We added a paragraph that describes how we collected the entries, the profile of the volunteers, as well as statistics regarding their age and gender.

- Figures' captions are not centered.
Response: We have used the LaTeX template, and it does not center any of the captions. Moreover, we have checked other articles, and none had the figures’ captions centered. If anything is out of order, we are confident that the MDPI editors will take care of this.

- "interfacesThose" -> "interfaces. Those"
Response: The edit was performed.

- The abbreviation "BERT" is never used throughout the paper.
Response: The abbreviation was removed, and we clarified how sentiment analysis on the text is performed.

Reviewer 3 Report

The authors should also clarify whether the experiments are the outcome of the "Decision module" rather than the output of every single module producing the "ODIN data structures". This makes the contribution somewhat unclear: if the authors are describing a whole pipeline, the readers expect that the experiments result from the Decision module. On the other hand, the authors should clarify which system component was tested in the experimental setting, and the same name should also appear in the architecture description (Figure 4 and 5). Furthermore, the authors should also clarify whether their proposed experimental setting is just an implementation of already-existing architectures or whether each described component had some innovative elements if compared to state-of-the-art. In the former scenario and the case that the overall experiments are not reflecting the outcome of the decision model, the scientific contribution of the paper becomes less evident. If, on the other hand, the innovation is also in the AI architecture of each single component (as I might expect from the current manuscript setting), the authors should indulge in comparative experiments with competing approaches [22-29], which are now just engulfed and referenced in the text with no explicit experimental comparison (§2.2.2. and §4.2. are some examples: «The EMO-IIT results surpass previous experiments with accuracies of around 85%.»). The authors should compare with these systems while providing the results of such comparisons explicitly in the tables; for the cases when competing approaches were not tested with the same dataset as the one in the present paper, the authors should try to retrieve the original system and re-do the experiments on this other setting, thus better-remarking the pros and cons of the proposed approach, if any. 
Generally speaking, the result section is a bit short for a journal paper, where we might expect to discuss better which part of the benchmarks benefits from the proposed pipeline while comparing the same result achieved by other competitors. This might be more evident by better discussing the features of different datasets while linking how different features affect the competitors' results. By setting the experimental setting as such, the authors will further corroborate and validate the need for the proposed approach. Furthermore, the authors did not elaborate on how the training and the hyperparameter tuning were performed, which is an essential requirement and description of the experimental setup. This is relevant for improving results' reproducibility. Some examples:

* What is the definition of an iVector, and what process was required to generate them?

* What is the exact configuration of "nnet3"?

The paper's clarity might be improved by further elaborating on the authors' use case example. E.g., both §3.2 and §3.6. should be moved to the Introduction and compacted into one cohesive view, thus describing the use case scenario of interest from the paper. Contextually, Figure 4 should be slightly modified, thus explicitly remarking the pipeline's input and output, taking particular care to highlight the different types of classification results provided by the pipeline. The text should also explicitly refer back to Figure 4 to remark on which pipeline output might be exploited by the human 112 operators and for which reason. The authors should also remark which are going to be the benefits of such an automated system.

The paper is very well written and easy to read. Although there is always room for improving writing style, this is not the primary concern of the paper. Because further expanding the experimental section might require extra time, I would advocate accepting the paper after revising the experimental setting.

Minor Comments and Typos:

* If this paper is an invited extended paper, please exploit the \conference macro to specify the original information [5]. Please double-check whether your journal policies require you to quote such a paper on the first page of the present manuscript.

* For better clarity and internationalising the paper, I'd suggest rephrasing 112 with either "National Emergency Response Service" or its associated acronym.

* The pipeline is alternately referred to as ODIN or ODIN112. Please be consistent with your denomination and pick one of the two.

Suggested edits:

* Line 250: interfacesThose --> interfaces. Those

* Line 281/282: "and a recipe that can ingest our training data" --> "and designing a framework ingesting our training data".

Author Response

The authors propose ODIN112, a pipeline combining Automatic Speech Recognition and Speech Emotion Recognition capabilities for assisting workers handling emergency calls. After an extensive and all-embracing description of the literature review, the authors describe the pipeline's architecture for each component.
Response: Thank you kindly for your thorough feedback, which helped us further improve the manuscript! All major edits are marked in blue.

On the other hand, the authors should clarify which system component was tested in the experimental setting, and the same name should also appear in the architecture description (Figure 4 and 5).
Response: We appreciate your valuable insights. We have updated Figures 4 and 5 in order to be consistent and to clarify any ambiguous aspects regarding the components and modules of our solution.

Furthermore, the authors should also clarify whether their proposed experimental setting is just an implementation of already-existing architectures or whether each described component had some innovative elements if compared to state-of-the-art. In the former scenario and the case that the overall experiments are not reflecting the outcome of the decision model, the scientific contribution of the paper becomes less evident. If, on the other hand, the innovation is also in the AI architecture of each single component (as I might expect from the current manuscript setting), the authors should indulge in comparative experiments with competing approaches [22-29], which are now just engulfed and referenced in the text with no explicit experimental comparison (§2.2.2. and §4.2. are some examples: «The EMO-IIT results surpass previous experiments with accuracies of around 85%.»).
Response: We clarified the scope and provided extensive comparisons in the new Discussion section.

The authors should compare with these systems while providing the results of such comparisons explicitly in the tables; for the cases when competing approaches were not tested with the same dataset as the one in the present paper, the authors should try to retrieve the original system and re-do the experiments on this other setting, thus better-remarking the pros and cons of the proposed approach, if any.
Response: We have referenced two of the best ASR models for the Romanian language. The referenced papers used the same RSC-eval dataset for evaluation purposes.

Generally speaking, the result section is a bit short for a journal paper, where we might expect to discuss better which part of the benchmarks benefits from the proposed pipeline while comparing the same result achieved by other competitors. This might be more evident by better discussing the features of different datasets while linking how different features affect the competitors' results. By setting the experimental setting as such, the authors will further corroborate and validate the need for the proposed approach. Furthermore, the authors did not elaborate on how the training and the hyperparameter tuning were performed, which is an essential requirement and description of the experimental setup. This is relevant for improving results' reproducibility. Some examples:
Response: Indeed, this was a weak point of the paper; we introduced new Results and a Discussion section.

* What is the definition of an iVector, and what process was required to generate them?
Response: We added the corresponding definition in section 3.3 and the motivation behind using it.

* What is the exact configuration of "nnet3"?
Response: nnet3 is the latest neural network implementation of Kaldi. We added references and clarified parameters for the models.

The paper's clarity might be improved by further elaborating on the authors' use case example. E.g., both §3.2 and §3.6. should be moved to the Introduction and compacted into one cohesive view, thus describing the use case scenario of interest from the paper.
Response: Thank you for your valuable suggestions. We merged the content of Section 3.6 into Section 3.2. We consider that the information presented in Section 3.2 fits better in the Method section because it presents the architectural details of the ODIN112 system and its use case.

Contextually, Figure 4 should be slightly modified, thus explicitly remarking the pipeline's input and output, taking particular care to highlight the different types of classification results provided by the pipeline. The text should also explicitly refer back to Figure 4 to remark on which pipeline output might be exploited by the human 112 operators and for which reason. The authors should also remark which are going to be the benefits of such an automated system.
Response: Thank you for your meaningful observation. We have updated Figure 4. Also, we added a remark on the advantages of our approach.

Minor Comments and Typos:

* To further improve the replicability of the results, the authors should consider providing a link to the codebase or describing the explicit outcome of the training (as the hyperparameters are returned).
Response: Unfortunately, we cannot release the trained models or the code because it belongs to the Romanian Service for Special Telecommunications. We added the hyperparameters for SER training in Table 5.

* For better clarity and internationalising the paper, I'd suggest rephrasing 112 with either "National Emergency Response Service" or its associated acronym.
Response: We replaced “112” and added a paragraph in the introduction that presents 112.

* The pipeline is alternately referred to as ODIN or ODIN112. Please be consistent with your denomination and pick one of the two.
Response: We replaced all occurrences of “ODIN” or “ODIN-112” with “ODIN112”.

Suggested edits:

* Line 250: interfacesThose --> interfaces. Those

* Line 281/282: "and a recipe that can ingest our training data" --> "and designing a framework ingesting our training data".
Response: Thank you for the suggested edits, both issues were resolved, and we reviewed the entire document one more time.

Reviewer 4 Report

Please indicate the approximate response time of your system for defined test cases emulating actual situations, to validate the claim of rapid response.

Author Response

Please indicate the approximate response time of your system for defined test cases emulating actual situations, to validate the claim of rapid response.
Response: Thank you for your review, and we appreciate your meaningful observation. We added a remark on the response time of the system and compared it to automated caption systems for live streaming.

Reviewer 5 Report

Review of MDPI paper: ODIN112 – AI-Assisted Emergency Services in Romania

Point 1:

Section 1: authors mentioned that the data is available to the public at the following link: https://echo.readerbench.com/, however, the website enables only the browsers to record or listen to their recorded voices. No link to download the collected dataset. Since the authors mentioned that the dataset is available, it is mandatory to provide a working downloadable link or join the dataset as supplement material within the submission process.

Point 2:

Section 1: The introduction must include, at the end, a paragraph detailing the remainder of the paper sections. The reader must know from the beginning what they are about to read. It also makes the paper's contributions easier to follow.

Point 3:

Section 2.1: there are a lot of affirmative sentences such as “The optimization process was laborious, and it sometimes reached the local optimum for each component instead of the global optimum.”, it is very important to cite references confirming those affirmations.

Point 4:

Section 2.1.1: The authors depict all Romanian Speech datasets summing up to around 300 hours of recordings. My question: How does the newly released dataset (150 hours) outperform the existing datasets since all are related to Romanian accents? Both contain too small recordings and unknown sources. I can figure from the recording website (https://echo.readerbench.com/) that the dataset contains records lasting just 1 second and even anonymous persons could record their voices.

Point 5:

Section 2.2.1: This section is hard to follow and readability must be improved. For instance, it is better to talk first about the EmoIIt dataset (from lines 179 to 183) and then about the selected one for your experiments (from lines 176 to 178).

Point 6:

Section 3.1.1: “Previous research has proven it to be very efficient; also, it has the added advantage of not requiring additional data.”, authors must include supporting references.

Point 7:

Section 3.2: The proposed ODIN architecture uses bidirectional TCP streams to communicate audio calls. The authors state they achieve real-time processing. However, TCP (a connection-oriented protocol) is not a real-time protocol for multimedia content; instead, UDP (a connectionless protocol) is usually deployed to stream real-time multimedia content, although UDP does not guarantee data quality preservation. How do the authors achieve real-time processing using TCP as a communication protocol?

Point 8:

Section 3.2: The Authors adopt Kafka to manage the different entities. However, Raft as an alternative is better in terms of real-time processing and requires less latency time. Any explanation of your choice?

Point 9:

Section 4.1: Table 3 presents the WER of the different experiments. Authors should mention explicitly that the results are about WER since nothing in the table or within describing paragraph mentioned what the values and percentages mean.

Point 10:

Section 4.1: It is clear from the results that the use of the released dataset to train the model does not perform better than the existing one, although they state that they record high-quality voices. Authors must discuss the table result and interpret them. Also, there is no comparison between the proposed model performance and related works since as stated in section 2.1.2 some existing results for the Romanian language speech recognition vary between 3.8% WER and 23.17% WER.

Point 11:

Section 4.2: Authors must interpret, compare and discuss the obtained results.

Author Response

The paper proposes an automatic speech recognition model to transcribe calls automatically to enhance Romanian emergency services. They release the largest high-quality speech dataset of more than 150 hours for Romanian. The dataset is of great importance to help even other sectors to recognize voice calls.
Response: Thank you kindly for your thorough feedback, which helped us further improve the manuscript! All major edits are marked in blue.

Point 1:

Section 1: authors mentioned that the data is available to the public at the following link: https://echo.readerbench.com/, however, the website enables only the browsers to record or listen to their recorded voices. No link to download the collected dataset. Since the authors mentioned that the dataset is available, it is mandatory to provide a working downloadable link or join the dataset as supplement material within the submission process.
Response: We added a new page that describes how to obtain a download link to the dataset: https://echo.readerbench.com/download

Point 2:
Section 1: The introduction must include, at the end, a paragraph detailing the remainder of the paper sections. The reader must know from the beginning what they are about to read. It also makes the paper's contributions easier to follow.
Response: Thank you for the suggested edit! The issue has been resolved.

Point 3:
Section 2.1: there are a lot of affirmative sentences such as “The optimization process was laborious, and it sometimes reached the local optimum for each component instead of the global optimum.”, it is very important to cite references confirming those affirmations.
Response: The advantages of DNNs over HMMs have been discussed in various papers over time, especially in those that first described DNNs for ASR, such as DeepSpeech. We have added a citation, but we consider this more a matter of general knowledge.

Point 4:
Section 2.1.1: The authors depict all Romanian Speech datasets summing up to around 300 hours of recordings. My question: How does the newly released dataset (150 hours) outperform the existing datasets since all are related to Romanian accents? Both contain too small recordings and unknown sources. I can figure from the recording website (https://echo.readerbench.com/) that the dataset contains records lasting just 1 second and even anonymous persons could record their voices.
Response: Our dataset consistently outperforms the other datasets because it is much more diverse in terms of speakers and transcripts, and as a result the trained model can generalize better. We further expanded this idea in the Discussion section.

Point 5:
Section 2.2.1: This section is hard to follow and readability must be improved. For instance, it is better to talk first about the EmoIIt dataset (from lines 179 to 183) and then about the selected one for your experiments (from lines 176 to 178).
Response: Further details have been added, along with a new Discussion section.

Point 6:
Section 3.1.1: “Previous research has proven it to be very efficient; also, it has the added advantage of not requiring additional data.”, authors must include supporting references.
Response: The original SpecAugment paper supports that. We rephrased that paragraph to make this aspect clearer.
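The point about not requiring additional data can be illustrated with the masking operations commonly associated with SpecAugment: new training examples are derived from existing spectrograms by blanking a random band of frequency bins and a random run of time frames. The sketch below is a simplified illustration under several assumptions and is not the authors' configuration: it omits time warping, masks with zeros, applies one mask of each kind, and uses hypothetical parameter names and default values.

```python
import random


def spec_augment(spectrogram, max_freq_mask=2, max_time_mask=3, seed=None):
    """Return a masked copy of `spectrogram` (rows = frequency bins, columns = time frames)."""
    rng = random.Random(seed)
    n_freq = len(spectrogram)
    n_time = len(spectrogram[0])
    out = [row[:] for row in spectrogram]  # copy, so the input is left untouched

    # Frequency mask: zero out f consecutive frequency bins starting at f0.
    f = rng.randint(0, min(max_freq_mask, n_freq))
    f0 = rng.randint(0, n_freq - f)
    for i in range(f0, f0 + f):
        out[i] = [0.0] * n_time

    # Time mask: zero out t consecutive time frames starting at t0.
    t = rng.randint(0, min(max_time_mask, n_time))
    t0 = rng.randint(0, n_time - t)
    for row in out:
        for j in range(t0, t0 + t):
            row[j] = 0.0
    return out


# Hypothetical 4-bin by 6-frame spectrogram filled with ones.
spec = [[1.0] * 6 for _ in range(4)]
augmented = spec_augment(spec, seed=1)
```

Calling the function with a different seed on each training epoch yields a differently masked view of the same recording, which is how the technique enlarges the effective training set without new audio.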

Point 7:
Section 3.2: The proposed ODIN architecture uses bidirectional TCP streams to communicate audio calls. The authors state they achieve real-time processing. However, TCP (a connection-oriented protocol) is not a real-time protocol for multimedia content; instead, UDP (a connectionless protocol) is usually deployed to stream real-time multimedia content, although UDP does not guarantee data quality preservation. How do the authors achieve real-time processing using TCP as a communication protocol?
Response: Thank you for your insightful observation. We have added a paragraph that states our goal to process online audio streams. Our solution is different from real-time communication (such as VoIP), because we use TCP sockets to continuously feed the processing modules with new parts of the audio call.
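The distinction drawn in this response, feeding a processing module with successive parts of a call over a TCP socket rather than carrying a live conversation, can be sketched as follows. This is a minimal illustration under stated assumptions and not the ODIN112 implementation: the function names, the loopback setup, and the 320-byte chunk size (20 ms of 8 kHz, 16-bit mono audio) are all hypothetical.

```python
import socket
import threading

CHUNK_BYTES = 320  # hypothetical chunk size: 20 ms of 8 kHz, 16-bit mono audio


def stream_audio(sock, audio, chunk_bytes=CHUNK_BYTES):
    """Sender side: push the call audio over the TCP stream in small chunks."""
    with sock:
        for start in range(0, len(audio), chunk_bytes):
            sock.sendall(audio[start:start + chunk_bytes])


def consume_audio(sock, handle_chunk, chunk_bytes=CHUNK_BYTES):
    """Receiver side: hand data to the processing callback as soon as it arrives."""
    with sock:
        while True:
            data = sock.recv(chunk_bytes)
            if not data:  # an empty read means the sender closed the stream
                break
            handle_chunk(data)


def demo(audio):
    """Stream `audio` over a loopback TCP connection and return what was received."""
    server = socket.create_server(("127.0.0.1", 0))  # port 0: let the OS pick one
    port = server.getsockname()[1]
    received = []

    def receiver():
        conn, _ = server.accept()
        consume_audio(conn, received.append)

    worker = threading.Thread(target=receiver)
    worker.start()
    stream_audio(socket.create_connection(("127.0.0.1", port)), audio)
    worker.join()
    server.close()
    return b"".join(received)
```

TCP delivers the bytes in order and without loss, so the receiver reconstructs the audio exactly; the trade-off is that a retransmission delays every chunk queued behind it, which matters for interactive calls but less for a downstream analysis module.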

Point 8:
Section 3.2: The Authors adopt Kafka to manage the different entities. However, Raft as an alternative is better in terms of real-time processing and requires less latency time. Any explanation of your choice?
Response: Thank you for your valuable observation. We added our motivation for choosing Kafka. We base our argumentation on the fault tolerance properties of the Kafka framework.

Point 9:
Section 4.1: Table 3 presents the WER of the different experiments. Authors should mention explicitly that the results are about WER since nothing in the table or within describing paragraph mentioned what the values and percentages mean.
Response: We added a label to each column and changed the table’s caption to reflect that too.

Point 10:
Section 4.1: It is clear from the results that the use of the released dataset to train the model does not perform better than the existing one, although they state that they record high-quality voices. Authors must discuss the table results and interpret them. Also, there is no comparison between the proposed model performance and related works since as stated in section 2.1.2 some existing results for the Romanian language speech recognition vary between 3.8% WER and 23.17% WER.
Response: We added a Discussion section that covers this point in detail.

Point 11:
Section 4.2: Authors must interpret, compare and discuss the obtained results.
Response: We have added a new Discussion section where we analyze and compare results with other similar systems.

Round 2

Reviewer 2 Report

The authors have fulfilled my comments. The overall quality of the manuscript has been significantly improved.

One note: caption of Table 5, "finetunning" -> "finetuning"

Reviewer 5 Report

The authors improved the manuscript and responded to the requested changes. Still, I am not convinced that their created dataset is better in terms of diversity than existing datasets since, according to Table 3 and the description of the dataset, almost all the participants are university students aged between 20 and 24 years. So, the dataset does not contain sufficient records for older persons, who are the ones who mostly need to call the 112 emergency number. That's why results show a consistent accuracy between 3 and 5% since almost all records correspond to young people.

Generally, the dataset is original and provides valuable outcomes to other sectors. Also, it will help, alongside other datasets, to better model the Romanian accents.
