Article
Peer-Review Record

Efficient and Robust Arabic Automotive Speech Command Recognition System

Algorithms 2024, 17(9), 385; https://doi.org/10.3390/a17090385
by Soufiyan Ouali * and Said El Garouani
Reviewer 1:
Reviewer 2: Anonymous
Submission received: 30 June 2024 / Revised: 17 August 2024 / Accepted: 21 August 2024 / Published: 2 September 2024
(This article belongs to the Special Issue Artificial Intelligence and Signal Processing: Circuits and Systems)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

Contributions:

This paper presents a Moroccan Arabic automotive speech recognition system. My comments are as follows:

(Line 69 on page 2) The authors state the main contribution 1:

  1. This paper only presents an application in command recognition. This is not a challenging task.
  2. The application and method are classical and not novel.
  3. This study can be regarded as a project report.
  4. (Line 69 on page 2) The authors state that we constructed the first automotive speech recognition system in the Arabic language in the Arab world. It’s not fair. As I know, some studies have conducted speech recognition for the Arabic language, for example:

Salima Harrat, Karima Meftouh, Kamel Smaili, “Machine translation for Arabic dialects (survey),” Information Processing & Management, Vol. 56, No. 2, pp. 262–273, 2019.

 

  1. (Page 4) The recognition topic is collecting significant commands suitable for car driving. Only 20 commands are recognized. This is not a challenging research task.
  2. (Page 10) The authors only used well-known machine learning and deep learning models. However, the transformer is a novel model. The performance should be compared.
  3. (Page 11) The statement for BiLSTM-CNN is too brief.
  4. (Page 11) The input feature to the CNN layer is not two-dimension.
  5. (Page 11) The output has 20 units. It is not adequate. If the input utterance is not included in the 20 commands, it will be one of the specified commands. The result is wrong.
  6. (Line 105 on page 3) “… employed the Kalman filter introduced by K. Paliwal and A. Basu [12].” should be revised as “… employed the Kalman filter introduced in [12].”. Please check the usage throughout the paper.
  7. (Page 4) The title and sub-grid lines should be removed in Fig. 2.
  8. (Page 5) Table 1 should be moved to an appendix.
  9. (Page 5) Sub-section 3.1.2-3 should be moved to the experiments.
  10. (Page 7) The title should be removed and moved to the caption in Fig. 4. Please check all figures.
  11. (Page 8) The method for speech feature extraction is well known. Sub-section 3.2 should be shortened. Figures 6 and 7 can be removed.
  12. (Line 487 on page 15) “Whit” should be “with”.
  13. Many state-of-the-art methods are missed in the reference.

 

 

Comments on the Quality of English Language

The quality of English language should be further improved.

Author Response

Comments 1,2, and 3: 

  1. This paper only presents an application in command recognition. This is not a challenging task.
  2. The application and method are classical and not novel.
  3. This study can be regarded as a project report.

Responses 1,2, and 3:

Thank you for your insightful comments and valuable thoughts on our manuscript. We agree with your observation and have accordingly deleted the word "novel" from many phrases throughout the text. For example, on page 2, line 73, the phrase "We developed a novel hybrid deep learning model that" has become "We developed a hybrid deep learning model that". However, we would like to emphasize that the contribution of our article lies in the development of a new Moroccan Arabic dataset specifically for the automotive field. Additionally, we have conducted various experiments to build an efficient model capable of recognizing drivers' commands in both clean and noisy settings, thereby reducing distractions and improving road safety.

 

Comments 4:

(Line 69 on page 2) The authors state that we constructed the first automotive speech recognition system in the Arabic language in the Arab world. It’s not fair. As I know, some studies have conducted speech recognition for the Arabic language, for example:

Salima Harrat, Karima Meftouh, Kamel Smaili, “Machine translation for Arabic dialects (survey),” Information Processing & Management, Vol. 56, No. 2, pp. 262–273, 2019.

Response 4:

Thank you for raising this point. We agree with the comment that Automatic Speech Recognition (ASR) in Arabic is not a new research field, and there have been many advancements in the state-of-the-art (SOTA). However, before writing this article, we conducted a thorough investigation to identify the main gaps in the Arabic ASR field.

Our research revealed that while many Arabic ASR models exist in the literature (as stated on page 2 line 83 to line 87 ), they are each oriented towards specific domains ( page 2, section 2 Literature Review ). We discovered that none have been tailored specifically for smart cars using speech recognition technology, particularly for dialectical Arabic. Considering the importance of this technology in enhancing driving safety and efficiency, we decided to develop an effective Moroccan Arabic speech recognition system specifically for the automobile field.

This is why we have stated that our work represents the first speech recognition system in the automotive domain, specifically focusing on Moroccan Arabic.

 

Comment 5: 

 (Page 4) The recognition topic is collecting significant commands suitable for car driving. Only 20 commands are recognized. This is not a challenging research task.

Response 5: 

Thank you for this comment. We agree that building a SR model with 20 output commands may not be considered a particularly challenging task in isolation. However, the decision to focus on 20 specific commands was not arbitrary. It was guided by the results of a survey we conducted, which aimed to identify the most essential commands required by drivers for an effective in-car speech recognition system.

Our goal was to develop a model that fulfills only the key needs of drivers, ensuring practicality and usability in real-world scenarios. The details of this survey, including the methodology and the rationale behind the selection of these commands, can be found in Section 3.1.1, titled "Command Choice."

 

Comments 6: 

(Page 10) The authors only used well-known machine learning and deep learning models. However, the transformer is a novel model. The performance should be compared.

Response 6: 

Thank you for your insightful comment. We acknowledge that our study primarily utilized well-known machine learning and deep learning models. While the Transformer model is indeed novel and has shown promising results in various fields, we did not include it in our current study.

However, we recognize the importance of comparing our approach with the Transformer model. As mentioned on page 17, line 549, this is an area we plan to explore in our future work. In these upcoming studies, we will evaluate the performance of the Transformer model alongside the models we have already tested, allowing for a more comprehensive analysis of the different methodologies.

 

Comments 7:  

(Page 11) The statement for BiLSTM-CNN is too brief.

Response 7:

Thank you for pointing this out. We have corrected the issue, and the revisions can be found highlighted in yellow in the article (page 11, lines 373 to 377).

 

Comments 8:  

(Page 11) The input feature to the CNN layer is not two-dimension.

Response 8:

Thank you for your observation. In our study, we utilized 1D CNNs to process features extracted from audio data, using techniques like MFCCs.

The choice of using 1D CNNs was based on the nature of the features and the task at hand. MFCCs, for instance, represent a sequence of features over time, and this sequential data is naturally suited for 1D convolutional processing. By feeding these 1D sequences into our CNN, we were able to capture temporal patterns and dependencies effectively.

In our approach, the MFCCs are organized as 1D arrays where each element represents a feature vector at a specific time frame. This allows the 1D CNN to focus on the temporal relationships between these features, which is crucial for tasks like speech recognition where understanding how features evolve over time is important.

Using 2D CNNs in this context would have been less appropriate since our input data does not inherently represent a 2D structure like a spectrogram. The 1D approach allows us to efficiently process the sequential nature of the features and learn meaningful patterns relevant to the speech recognition task.

We hope this clarifies our choice of using 1D CNNs and demonstrates its suitability for our specific feature representation and application.

We apologize for not including this information earlier. We have now added details about the 1D nature of the CNN model in the revised version of the manuscript. You can find this information highlighted in yellow on page 11, lines 360 to 366.
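For readers less familiar with 1D convolutions over audio features, the processing described above can be sketched as follows. This is a minimal NumPy illustration with made-up shapes and random weights, not the authors' actual model:

```python
import numpy as np

# Hypothetical shapes: 98 time frames, 13 MFCC coefficients per frame.
n_frames, n_mfcc = 98, 13
mfcc_seq = np.random.randn(n_frames, n_mfcc)

# A 1D convolutional layer: 32 filters, each spanning 5 consecutive frames.
# Every filter sees all 13 coefficients (the "channels") at each time step.
n_filters, kernel = 32, 5
weights = np.random.randn(n_filters, kernel, n_mfcc) * 0.1
bias = np.zeros(n_filters)

def conv1d(x, w, b):
    """Valid 1D convolution along the time axis, followed by ReLU."""
    t_out = x.shape[0] - w.shape[1] + 1
    out = np.empty((t_out, w.shape[0]))
    for t in range(t_out):
        window = x[t:t + w.shape[1]]  # (kernel, n_mfcc) slice in time
        out[t] = np.tensordot(w, window, axes=([1, 2], [0, 1])) + b
    return np.maximum(out, 0.0)

features = conv1d(mfcc_seq, weights, bias)
print(features.shape)  # (94, 32): one 32-dim feature vector per time window
```

The convolution slides only along the time axis, so each output step summarizes a short span of frames; this is the sense in which a 1D CNN captures temporal patterns in the MFCC sequence.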

 

Comments 9:  

(Page 11) The output has 20 units. It is not adequate. If the input utterance is not included in the 20 commands, it will be one of the specified commands. The result is wrong.

Response 9:

Thank you for your comment. We understand the concern regarding the output layer having 20 units, which corresponds to the number of predefined commands in our model.

In our design, the model is trained to recognize and classify utterances into one of these 20 specific commands. If an input utterance does not correspond to any of the predefined commands, the model may still assign it to one of the existing categories. This limitation is due to the fixed number of command classes that the model was trained on.

To address this issue in future work, we plan to explore the following approaches:

  1. Out-of-Vocabulary (OOV) Handling: Implement mechanisms to detect and handle utterances that fall outside the predefined set of commands. This may involve incorporating an "unknown" or "OOV" class or using techniques such as confidence thresholds to identify when an utterance is not confidently mapped to any of the 20 commands.
  2. Extended Command Set: Expand the number of commands to cover a broader range of possible utterances, which could improve the model's ability to handle variations in input.
  3. Alternative Models: Investigate other models or architectures that might better accommodate dynamic or unforeseen inputs.

We appreciate your feedback and will consider these strategies in our ongoing research to improve the robustness and accuracy of our speech recognition system.
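The confidence-threshold idea in point 1 could look like the following sketch; the threshold value and command labels here are hypothetical placeholders, not part of the paper:

```python
import numpy as np

COMMANDS = [f"command_{i}" for i in range(20)]  # placeholder labels
THRESHOLD = 0.70  # would be tuned on a validation set

def softmax(logits):
    e = np.exp(logits - np.max(logits))
    return e / e.sum()

def decode(logits):
    """Map the model's 20 output logits to a command, or reject as OOV."""
    probs = softmax(np.asarray(logits, dtype=float))
    best = int(np.argmax(probs))
    if probs[best] < THRESHOLD:
        return "unknown"  # utterance not confidently matched to any command
    return COMMANDS[best]

confident = np.zeros(20)
confident[3] = 8.0                 # one clearly dominant class
print(decode(confident))           # -> "command_3"
print(decode(np.zeros(20)))        # uniform probabilities -> "unknown"
```

A rejection rule of this kind leaves the trained 20-way classifier unchanged and only post-processes its output, which is why it is a common first step for out-of-vocabulary handling.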

 

Comments 10:  

(Line 105 on page 3) “… employed the Kalman filter introduced by K. Paliwal and A. Basu [12].” should be revised as “… employed the Kalman filter introduced in [12].”. Please check the usage throughout the paper.

Response 10:

Thank you for raising this point. We have corrected the mistake. The revised information can be found highlighted in yellow on page 3, line 105 of the article.

 

Comments 11:  

(Page 4) The title and sub-grid lines should be removed in Fig. 2.

Response 11: 

Thank you for raising this point. We have corrected this. 

 

Comments 12: 

(Page 5) Table 1 should be moved to an appendix.

Response 12:  

Thank you for raising this point. We have corrected this. 

 

Comments 13: 

(Page 5) Sub-section 3.1.2-3 should be moved to the experiments.

Response 13:  

Thank you for your comment. We have made the necessary correction by moving Sub-sections 3.1.2 and 3.1.3 to the Experiments section. You can now find the revised Sub-section 3.1.2 in Sub-section 4.1.

 

Comments 14: 

(Page 7) The title should be removed and moved to the caption in Fig. 4. Please check all figures.

Response 14:  

Thank you for raising this point. We have made the correction and moved the titles to the captions for all figures, including Figure 4.

 

Comments 15: 

(Page 8) The method for speech feature extraction is well known. Sub-section 3.2 should be shortened. Figures 6 and 7 can be removed.

Response 15:  

Thank you for your comment. We have made corrections to Sub-section 3.2 to eliminate unnecessary or supplementary information. Additionally, we have removed Figure 6 as suggested. However, we retained Figure 7 because it helps readers understand how WMFCCs work. If you still find it unnecessary, please let us know, and we will remove it.

 

Comments 16: 

(Line 487 on page 15) “Whit” should be “with”.

Response 16:  

Thank you for raising this point. We have corrected this. 

 

Comments 17: 

Many state-of-the-art methods are missed in the reference.

Response 17:  

Thank you for your feedback. We have made every effort to include all important references and have thoroughly investigated all Arabic command recognition systems to ensure that our work is comprehensive. In our future work, we plan to build a model using these state-of-the-art methods and conduct a detailed comparison.

Reviewer 2 Report

Comments and Suggestions for Authors

This paper examines automotive speech recognition for Arabic.  It created a speech dataset of 20 commonly used car commands. It consists of 5600 instances collected from Moroccan contributors in clean and noisy environments.

 

Standard features (MFCC, Weighted MFCC, and Spectral Subband Centroids (SSC)) are used for feature extraction. A hybrid architecture of BiLSTM and CNNs is used.  Most of the paper uses common ASR methods; so there is very little novel, except for the automotive application for Arabic.

 

A major motivation for this work is a supposed lack of Arabic ASR, but

as it is, there have been many Arabic ASR papers over the last 20 years, e.g., in ICASSP-2003 (Kirchhoff et al., “Novel approaches to Arabic speech recognition”).

 

The listed contributions are: 1) automotive speech recognition system in Arabic (however, there is little distinctive about automotive ASR vs. Regular ASR), 2) a novel hybrid deep learning model (however, the proposed combination of BiLSTM-CNN is typical of much research).

The literature review is a sequence of 5 short paragraphs, each describing a different system with widely diverging performance.  There is no summary or comparison to help readers understand.

There is much discussion about regional dialects in section 3.1.2, which is not relevant to ASR design, even though it is of some linguistic interest.

A large number of the citations are not from the ASR or signal processing mainstream, but instead from minor sources. In addition, most of the chosen sources (both conferences and journals) are not focused on speech applications. The best speech articles are instead found in sources that use knowledgeable reviewers for the area. Several references have no listed source at all, e.g., refs. 8, 13. Ref. 41 deals with a very common area now (Speech Recognition using Convolution Deep Neural Networks), and yet chooses a journal I have never heard of. The same journal is cited twice (refs. 25–26) for a very broad area of speech processing.

 

Given the limited novelty of this work and many other flaws noted, I cannot recommend it.

 

Specific points:

 

..Nowadays, Cars are equipped .. ->

..Nowadays, cars are equipped ..

 

..Speech recognition (SR), a branch of .. ->

..Automatic speech recognition (ASR), a branch of ..

 

..commands without being distracted, .. - while allowing eyes on the road (there is still distraction in thinking and speaking commands)

 

..research in this field for the different dialects of the Arabic language .. - the text appears to state a lack of Arabic ASR work, but cites nothing

 

.. dataset .. that is representative, robust, .. - how can a dataset be robust?  A system can be robust (e.g., ASR), not a dataset

 

..systems existed for a long time. ->

..systems have existed for a long time.

 

..miss recorded audio. ->

..mis-recorded audio.

 

..been Equipped with more .. ->

..been equipped with more ..

 

 

.. voice tone and pith which .. ->

.. voice tone and pitch, which ..

 

..which simulates the status of a driver’s voice, .. - how does such a change do this?

 

..models are trained by feeding them numerical data [22]. - very trite and obvious; also, you likely mean “digital,” as numerical includes irrational and real values

 

..it is initial to convert.. =>

..one must first convert..

 

..Many techniques are proposed for this task such as analog-to-digital conversion (ADC) and Pulse Code Modulation (PCM) [23] which uses sampling, quantization, and encoding [24]  - this is standard, and should not be discussed

 

 

The references often misuse capital letters; e.g., both Kalman and kalman; refs. 14 and 30 have all capital letters.

 

..ultimate question posed by researchers interested in SR is: What features or characteristics distinguish a sound ..  - poorly phrased; this is a major objective, but not an “ultimate question.”

 

..series of preprocessing is conducted .. - no, most ASR does not do this

 

..a thorough investigation was conducted. Therefore, we used MFCC, WMFCC, and SSC. - who did this investigation? Using what criteria?  You are also using acronyms before defining them.

 

 

..40 feature coefficients of MFCC, .. - this suggests a serious misunderstanding by the authors; no one uses 40 MFCCs; 13 is standard.  If one instead uses mel-cepstra (not MFCCs), then 40 and more is common.

 

Comments on the Quality of English Language

OK

Author Response

Comment 1: 

Standard features (MFCC, Weighted MFCC, and Spectral Subband Centroids (SSC)) are used for feature extraction. A hybrid architecture of BiLSTM and CNNs is used.  Most of the paper uses common ASR methods; so there is very little novel, except for the automotive application for Arabic.

Response 1: 

Thank you for your feedback and comments. We agree that Arabic speech recognition is not a new field, and there is extensive research in the literature. However, the novelty of our contribution lies in developing a new dialectal Arabic dataset and building an effective speech recognition system for the automotive field. This system is capable of recognizing voice commands in both noisy and clean environments with high accuracy.

 

Comment 2:

A major motivation for this work is a supposed lack of Arabic ASR, but as it is, there have been many Arabic ASR papers over the last 20 years, e.g., in ICASSP-2003 (Kirchhoff et al., “Novel approaches to Arabic speech recognition”).

Response 2: 

Thank you for your comment. We agree with your observation. However, a study conducted by Amira Dhouib et al. (2022), titled "Arabic Automatic Speech Recognition: A Systematic Literature Review" (DOI: https://doi.org/10.3390/app12178898), reveals that out of 37 research papers in the Arabic SR field, only 10 were focused on dialectal Arabic. Given that the majority of Arabic speakers do not actually speak in Modern Standard Arabic (MSA)—with only a few countries like Saudi Arabia predominantly using MSA—the majority communicate in their local dialects. More information on this can be found in the Introduction and Subsection 4.1. Therefore, despite the respectable efforts in the field, there remains a significant gap in Arabic speech recognition, particularly for dialectal Arabic.

 

Comment 3:

The listed contributions are: 1) automotive speech recognition system in Arabic (however, there is little distinctive about automotive ASR vs. Regular ASR), 2) a novel hybrid deep learning model (however, the proposed combination of BiLSTM-CNN is typical of much research).

Response 3:

Thank you for your insightful comments and valuable thoughts on our manuscript. We agree with your observation and have accordingly deleted the word "novel" from many phrases throughout the text. For example, on page 2, line 73, the phrase "We developed a novel hybrid deep learning model that" has become "We developed a hybrid deep learning model that".

 

Comment 4:

The literature review is a sequence of 5 short paragraphs, each describing a different system with widely diverging performance.  There is no summary or comparison to help readers understand.

Response 4: 

Thank you for bringing this to our attention. We have made the correction, and the revision can be found on page 3, line 115.

 

Comment 5: 

There is much discussion about regional dialects in section 3.1.2, which is not relevant to ASR design, even though it is of some linguistic interest.

Response 5: 

This section is essential for readers interested in understanding the linguistic complexities involved in building a representative dataset that meets the needs of all users, especially those who may wish to pursue further research in this area. It provides a thorough insight into the methodology behind the dataset construction, emphasizing the challenges of capturing the diverse dialects spoken by Arabic speakers. This detailed exploration not only highlights the significance of developing a dataset that can effectively cater to the linguistic diversity of Arabic but also offers a comprehensive understanding of the strategies employed to address these challenges.

 

Comment 6:

A large number of the citations are not from the ASR or signal processing mainstream. 

Response 6:

Thank you for your comment. The reason why a significant number of citations are not from the ASR or signal processing mainstream is that our paper covers a broad range of topics, including challenges in Natural Language Processing (NLP) in Arabic, linguistic complexities, and statistics about Arabic speakers. These areas are crucial for providing context and understanding the broader challenges involved in developing effective speech recognition systems for Arabic.

 

Comment 7:

Several references have no listed source at all, e.g., refs. 8, 13.

Response 7: 

Thank you for bringing this to our attention. We have made the necessary correction, and the revision can be found in the article, highlighted in yellow.

 

Comment 8:

..Nowadays, Cars are equipped .. -> ..Nowadays, cars are equipped ..

Response 8:

Thank you for bringing this to our attention. We have made the necessary correction.

 

Comment 8:

..Speech recognition (SR), a branch of .. -> ..Automatic speech recognition (ASR), a branch of ..

Response 8:

Thank you for bringing this to our attention. We have made the necessary correction.

 

Comment 9: 

..commands without being distracted, .. - while allowing eyes on the road (there is still distraction in thinking and speaking commands)

Response 9:

Yes, we agree with your comment. However, the distraction time is reduced with our approach. Alternative options typically require drivers to use their eyes and hands, which increases overall distraction.

 

Comments 10:

..research in this field for the different dialects of the Arabic language .. - the text appears to state a lack of Arabic ASR work, but cites nothing

Response 10 : 

Thank you for bringing this to our attention. We have made the necessary correction: we have cited an article supporting our statement; it is reference [44].

 

Comment 11: 

.. dataset .. that is representative, robust, .. - how can a dataset be robust?  A system can be robust (e.g., ASR), not a dataset

Response 11:

Thank you for raising this issue. We have revised the text and agree with your comment: while a system can be robust, a dataset cannot be inherently so. We have corrected the text accordingly and now refer to it as "a representative dataset." If you have any further comments, please let us know.

 

Comment 12:

..systems existed for a long time. -> ..systems have existed for a long time.

..miss recorded audio. -> ..mis-recorded audio.

..been Equipped with more .. -> ..been equipped with more ..

.. voice tone and pith which .. -> .. voice tone and pitch, which ..

..it is initial to convert.. => ..one must first convert..

Response 12: 

Thank you for bringing this to our attention. We have made the necessary corrections.


Comment 13:

..which simulates the status of a driver’s voice, .. - how does such a change do this?

Response 13:

 


Drivers' voices are often impacted by external noise and their physical state, such as fatigue. To ensure our dataset accurately captures these variations, we employed two different techniques. First, to account for noise, we collected half of the dataset in a noisy environment. Second, to simulate the effects of fatigue and boredom, we asked contributors to record the same audio command 10 times. This repetition led to significant changes in voice characteristics, as illustrated in Figure 7 of the revised article. More details about the dataset collection process can be found in Sub-section 4.1 and Sub-section 3.1.

 

Comment 14: 

..Many techniques are proposed for this task such as analog-to-digital conversion (ADC) and Pulse Code Modulation (PCM) [23] which uses sampling, quantization, and encoding [24]  - this is standard, and should not be discussed

Response 14:

Thank you for raising this issue. We have revised the text and, in agreement with your comment, deleted this paragraph.

 

Comment 15:

The references often misuse capital letters; e.g., both Kalman and kalman; refs. 14 and 30 have all capital letters.

Response 15:

Thank you for your insightful comment and valuable thoughts on the subject. We have made the necessary corrections.

 

Comment 16: 

..a thorough investigation was conducted. Therefore, we used MFCC, WMFCC, and SSC. - who did this investigation? Using what criteria?  You are also using acronyms before defining them.

Response 16:

Thank you for bringing this to our attention. We have made the necessary corrections. The revised sentence can be found on page 5, line 210.

 

Comment 17:

..40 feature coefficients of MFCC,

Response 17: 

Thank you for your observation. We understand that 13-dimensional MFCCs are standard; however, our comparison between 13-dimensional and 40-dimensional MFCCs, as presented in Tables 1 and 2, showed that the latter significantly enhanced recognition accuracy across all models. For instance, accuracy for the CNN improved from 71.44% to 83.33%, and for the BiLSTM-CNN, from 73.55% to 81.37%. These improvements demonstrate the effectiveness of using 40-dimensional MFCCs, which is why we opted for this approach.
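To make the 13-vs-40 distinction concrete: the number of MFCCs is simply how many DCT coefficients of the log mel spectrum are retained, so a 40-coefficient vector keeps finer spectral detail than the conventional 13. The following is a toy NumPy sketch using synthetic data, not the authors' actual extraction pipeline:

```python
import numpy as np

def dct_ii(x):
    """Orthonormal DCT-II of a 1D log-mel vector (the final MFCC step)."""
    n = len(x)
    idx = np.arange(n)
    basis = np.cos(np.pi * (2 * idx[None, :] + 1) * idx[:, None] / (2 * n))
    scale = np.full(n, np.sqrt(2.0 / n))
    scale[0] = np.sqrt(1.0 / n)
    return scale * (basis @ x)

# Toy log mel spectrum for a single frame (40 mel bands, synthetic values).
rng = np.random.default_rng(0)
log_mel = np.log(rng.uniform(0.1, 1.0, size=40))

cepstrum = dct_ii(log_mel)
mfcc_13 = cepstrum[:13]  # conventional: coarse spectral envelope only
mfcc_40 = cepstrum[:40]  # as in the paper: all detail coefficients kept
print(mfcc_13.shape, mfcc_40.shape)  # (13,) (40,)
```

The first 13 coefficients are a strict prefix of the 40, so the larger vector adds information rather than replacing it; whether the extra coefficients help is an empirical question, which is what the comparison in Tables 1 and 2 addresses.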

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

In my previous comment 9: The output has 20 units. It is not adequate. If the input utterance is not included in the 20 commands, it will be one of the specified commands. The result is wrong. The authors acknowledged that this is inappropriate; however, they did nothing about it.

 

In my previous comments 1-3: 1. This paper only presents an application in command recognition. This is not a challenging task. 2. The application and method are classical and not novel. 3. This study can be regarded as a project report. The authors understood it. However, the authors only removed the word “novel” from the paper. It is not adequate. The authors should address the novelty of this study. I still think this paper lacks novelty.

 

In my previous comment 6: The authors only used well-known machine learning and deep learning models. However, the transformer is a novel model. The performance should be compared. However, the authors did not conduct experiments for performance comparison.

Comments on the Quality of English Language

The quality of English language is acceptable.

Author Response

Comments 1: 

  • The output has 20 units. It is not adequate. If the input utterance is not included in the 20 commands, it will be one of the specified commands. The result is wrong.

Responses 1:

Thank you for your insightful comments and valuable thoughts on our manuscript. We agree with your observation and have accordingly handled these issues. The answer can be found on page 9, section 3.4, line 382 to line 404.

 

Comments 2:

  • This paper only presents an application in command recognition. This is not a challenging task. 2. The application and method are classical and not novel. 3. This study can be regarded as a project report. The authors understood it. However, the authors only removed the word “novel” from the paper. It is not adequate. The authors should address the novelty of this study. I still think this paper lacks novelty.

Response 2:

Thank you for your comment. While we acknowledge that constructing speech recognition models using traditional methods may not be particularly challenging or novel, the innovation in our work lies in two key areas. Firstly, we have developed a new Moroccan Arabic speech dataset. Secondly, we have built an effective and robust model specifically for the automotive field, achieving notable results in both clean and noisy conditions. In this latest revised version, we have also compared the performance of our proposed architecture with a state-of-the-art SR method, the Wav2Vec2 Transformer. Our proposed architecture not only outperformed existing research in Arabic command recognition systems but also surpassed the performance of this state-of-the-art method.

 

Comment 3: 

  • The authors only used well-known machine learning and deep learning models. However, the transformer is a novel model. The performance should be compared.

Response 3: 

Thank you for your insightful comment. We agree with your observation that Transformers represent the current state of the art (SOTA) in speech recognition. Consequently, we have conducted a comparison between our proposed model (BiLSTM-CNN) and the Wav2Vec2 Transformer model. Detailed information on this comparison can be found in the revisions highlighted in yellow in the article, specifically in Sections 3.3.3 and 3.3.3.1, and on page 17, lines 605 to 629. The summarized comparison results are also presented in Figure 10.

Thank you once again for your time and effort. Please feel free to share any further comments or suggestions that could help us further enhance our manuscript. We value your feedback and we are ready to make any necessary improvements.

Reviewer 2 Report

Comments and Suggestions for Authors

Authors have responded well to my earlier comments.

Author Response

Comment 1: 

  • Authors have responded well to my earlier comments.

Response 1: 

Thank you once again for your time and effort in reviewing our work. We greatly appreciate your feedback. If you have any additional comments or suggestions that could help us further improve our manuscript, we would be grateful to hear them.

Round 3

Reviewer 1 Report

Comments and Suggestions for Authors

The authors have improved the quality of this paper. I think it can be accepted for publication.

Comments on the Quality of English Language

The quality of English language is acceptable.
