Article

Efficient and Robust Arabic Automotive Speech Command Recognition System

Department of Computer Science, Faculty of Science, Université Sidi Mohamed Ben Abdellah, Fez 30000, Morocco
*
Author to whom correspondence should be addressed.
Algorithms 2024, 17(9), 385; https://doi.org/10.3390/a17090385
Submission received: 30 June 2024 / Revised: 17 August 2024 / Accepted: 21 August 2024 / Published: 2 September 2024
(This article belongs to the Special Issue Artificial Intelligence and Signal Processing: Circuits and Systems)

Abstract

The automotive speech recognition field has become an active research topic, as it enables drivers to activate various in-car functionalities without being distracted. However, research on Arabic remains nascent compared to English, French, and German. Therefore, this paper presents a Moroccan Arabic automotive speech recognition system. Our system aims to make the driving experience comfortable and safe while also assisting individuals with disabilities. We created a speech dataset comprising 20 commonly used car commands. It consists of 5600 instances collected from Moroccan contributors and recorded in clean and noisy environments to increase its representativity. We used MFCC, weighted MFCC (WMFCC), and Spectral Subband Centroids (SSC) for feature extraction, as they have demonstrated promising results in noisy settings. For classifier construction, we proposed a hybrid architecture combining a Bidirectional Long Short-Term Memory (Bi-LSTM) network and a Convolutional Neural Network (CNN). Training our proposed model with WMFCC and SSC features achieved an accuracy of 98.48%, outperforming all the baseline models we trained as well as the existing solutions in the state-of-the-art literature. Moreover, it shows promising results in both clean and noisy environments and remains resilient to additive Gaussian noise while using few computational resources.

1. Introduction

Nowadays, cars are equipped with various features, such as air conditioning and navigation systems, that are supposed to make the driving experience more comfortable, easier, and safer. However, activating these features requires human interaction, mainly by pressing physical buttons. Researchers have found that this action increases the probability of drivers being distracted from the road. Moreover, the risk increases with the number and position of buttons in the car [1]. According to research in [2], the probability of a car crash/near-crash increases by 70% when the driver’s eyes are off the road for more than 2 s. Similarly, researchers in [3] found a significant increase in accident risk when drivers take their hands off the wheel while driving.
Touchscreens were proposed as an alternative to buttons. However, a study by Fredrik Diits Vikström (2022) [4] confirms that touchscreens are more distracting than buttons as they increase the eye-off-road time.
Automatic speech recognition (ASR), a branch of artificial intelligence, focuses on developing technologies capable of recognizing spoken language and converting it into written text or executing specific actions based on verbal commands. Considering the advancements in this field, many researchers have suggested developing ASR systems for automobiles. This technology enables drivers to activate various in-car functionalities through voice commands without being distracted, providing a safe and comfortable driving experience. Additionally, it facilitates easy driving for individuals with disabilities. Accordingly, almost all new cars are equipped with a voice command control option. This advancement was primarily in English, French, and German languages. However, research in this field for the Arabic language’s different dialects, including the classical one, is still in its early stages [5].
The Arabic language is the fifth-most widely spoken language globally, with 453 million speakers, and it is the official language of 22 countries [6]. It is classified into three types: Classical Arabic (CA), which is the language of the Quran; Modern Standard Arabic (MSA), which is used in professional settings; and Dialectal Arabic, an informal form used in daily-life communication. These dialects depend on regional geography, and they differ significantly from one another in vocabulary and pronunciation. Commonly, five Arabic dialect groups are distinguished: Levantine, Iraqi, Gulf, Egyptian, and Maghrebi Arabic.
Considering the impact of the automotive speech recognition field in people’s lives, in this paper, we present our efforts to construct an Arabic Speech Command Recognition system for automobiles based on deep-learning architectures. To achieve this, we have focused on building a Maghrebi dialect dataset, more specifically the Moroccan dialect, and an end-to-end hybrid deep-learning architecture utilizing a combination of Convolutional Neural Network (CNN) and Bidirectional Long Short-Term Memory (BI-LSTM) model.
To the best of our knowledge, this is the first project dedicated to automotive speech recognition of the Arabic language in the Arab world. Therefore, the main contributions of this paper are summarized as follows:
  • We constructed the first automotive speech recognition system in the Arabic language in the Arab world.
  • We built the first Moroccan Arabic dataset in the automotive speech recognition field that is representative and simulates real-time conditions.
  • We developed a hybrid deep-learning model that combines Bidirectional Long Short-Term Memory architecture and a Convolutional Neural Network into one unified architecture, Bi-LSTM-CNN, for driver command recognition in the Arabic language.
  • We developed a model that outperformed the proposed method in the state-of-the-art literature.
This paper is organized as follows. Section 2 presents related works. In Section 3, the methodology of building the dataset and the model is described. Section 4 presents and discusses the results of the experiments. Finally, the conclusion of the paper and the scope of future work are highlighted in Section 5.

2. Literature Review

Speech recognition systems have existed for a long time. They have been applied across various sectors and proven to be highly functional, contributing to the development of robust systems in many industries [7]. These significant advancements have primarily been in English. However, recently there has been a growing interest directed towards the Arabic language.
Ghandoura et al. [8] constructed an Arabic dataset comprising 40 commands derived from the Google Speech Command (GSC) dataset [9]. The constructed dataset is not tailored to a specific domain; rather, it comprises general commands. To benchmark the dataset, the authors employed MFCC for feature extraction along with a CNN for classification, achieving an accuracy of 97.97%.
Ibrahim and Saad (2022) [10] developed an Egyptian Arabic dialect SR system to control a mobile assistant robot by recognizing five commands (stop, left, right, forward, backward). They compared various feature-extraction techniques and machine-learning (ML) models. The best result was achieved by training the SVM classifier with a combination of MFCC, spectral centroid, and signal power coefficient, achieving an accuracy of 95.48%.
Hamza et al. (2009) [11] developed an Arabic system that detects spotted words in continuous speech to control a manipulator arm (i.e., the TR45 robot arm). To reduce additive noise, the authors employed the Kalman filter introduced in [12]. For feature extraction, MFCC, MFCC delta, MFCC double delta, and energy coefficients were utilized, and a Hidden Markov Model was used for command classification. The authors improved the recognition ratio by employing high-quality microphones and combining HMM with Gaussian mixture models [13].
Abed and Jasim (2016) [14] constructed an Arabic SR system to control the motion of an intelligent automated mobile robot via voice commands by recognizing five commands (stop, left, right, forward, backward). Features are extracted using MFCC, and the Dynamic Time Warping technique was used to classify the commands. As a result, they achieved an accuracy rate of 87%.
While there have been commendable contributions to the field of Arabic speech recognition, there remains a noticeable gap, particularly due to the linguistic diversity within the Arabic language. The development of robust speech recognition systems is challenging because Arabic is not a monolithic language; it consists of numerous dialects that vary significantly across regions. Consequently, more research is necessary to bridge this gap and create systems that can effectively cater to the wide range of dialects spoken by Arabic speakers.
Furthermore, much of the existing research has been predominantly focused on Modern Standard Arabic (MSA). Although MSA is the formal version of the language, it is not commonly spoken in daily life. The reality is that the vast majority of Arabic speakers communicate in their regional dialects, which differ considerably from MSA. This creates a discrepancy between the speech recognition models being developed and the actual linguistic needs of the users.
Even within the research that addresses dialectal Arabic, many systems are limited to recognizing basic commands, such as “left”, “right”, “start”, and “stop”. While these efforts are respectable, they do not fully address the complexities and nuances of dialectal Arabic in more specialized applications.
Our work sets itself apart by focusing on the creation of a comprehensive dialectal Arabic speech command recognition dataset and system specifically tailored for the automotive field. This system is designed to handle the intricacies of dialectal Arabic, providing more accurate and reliable speech recognition in real-world, noisy environments where drivers may issue commands in their native dialects rather than MSA. This approach not only advances the field but also addresses a critical need to make speech recognition technology more accessible and functional for Arabic speakers across different regions.

3. Methodology

The proposed methodology, as shown in Figure 1, is divided into three main steps. We began with the construction of the Moroccan Arabic Speech Command Dataset (MASCD), which involved gathering audio samples from a diverse group of participants. Subsequently, we pre-processed the collected data (e.g., excluding any misrecorded audio). Following this, the feature-extraction process was initiated. Finally, utilizing the constructed dataset, we trained and tested the performance of machine-learning and deep-learning models. In the following subsections, we provide a detailed description of each step of our methodology.

3.1. Dataset Building

3.1.1. Command Choice

In order to collect only significant commands suitable for car driving, we surveyed 12 participants (eight male, four female) with at least 6 years of driving experience, posing the following questions:
  • Q1: Do you think a voice-controlled option will improve the driving experience?
  • Q2: If you would use voice control in your car, what commands feel safe to use?
Seven participants (n = 7) agreed that this would make the driving experience easier by offering an alternative to button options that could cause distraction while driving. Contrastingly, five participants (n = 5) opposed the idea, expressing concerns about potential accidents if primary functions like controlling lights or adjusting speed were misunderstood. However, they acknowledged that voice commands for secondary functions could enhance the driving experience. In addition, the 12 surveyed participants were tasked with selecting from a comprehensive list of 14 commands the ones they deemed safe for use. This inclusive list encompassed essential primary functions, including speed and brake control, as well as auxiliary functions such as utilizing voice commands for navigation system control or adjusting the media system to alter radio stations and adjust sound volume levels. As illustrated in Figure 2, the acceptance rate of commands can be classified into three distinct categories:
  • High Acceptance Rate: Navigation, Phone, and Media commands were highly accepted, receiving 100% acceptance rates, while climate control commands had an 83.33% acceptance rate and interior lights received a slightly higher rate of 91.67%. Additionally, the acceptance rate for window commands was 75%.
  • Low Acceptance Rate: Commands concerning seat positions, exterior lights, mirror adjustments, and gear shifting received low acceptance rates.
  • Zero Acceptance Rate: Speed, Brakes, and Clutch controlling commands received 0% acceptance rates.
Based on the findings shown in Figure 2, the highly accepted commands predominantly comprised secondary functions aimed at activating various in-car features unrelated to driving maneuvers. These commands typically enhance driver comfort and luxury, adding value to the driving experience. Commands with low and zero acceptance rates are designed to execute actions that may interrupt the driving process. This outcome is anticipated, as these actions are critical for safe driving and require the driver’s full attention and control. Thus, they should not be controlled through external commands.
Combining the results of our survey with well-established commands used in cars, such as in [15,16,17], we have selected 20 commands designed to perform simple tasks making driving comfortable and easier without compromising the driver’s safety. Table A1 lists the 20 commands included in our paper with their translation into English and pronunciation in the international phonetic alphabets.

3.1.2. Final Dataset Description

The MASCD dataset was created to develop a robust SR system capable of recognizing in-car voice commands in Moroccan Arabic. The final dataset consists of 20 classes and 5600 labeled audio files. Each audio file is around 2 s in length, contains a single command, and is sampled at 16 kHz with 16 bits per sample in a mono channel. Each command was recorded 10 times by 28 contributors. Consequently, we have a total of 280 audio files for each command, resulting in 5600 audio files in total (28 contributors × 10 repetitions × 20 commands = 5600). The entire process of data collection, processing, labeling, and organizing took us ~54 days. The dataset size is ~340 MB. The MASCD dataset is publicly available online for research purposes [18].

3.2. Speech Feature Extraction

The main idea of feature extraction is that after the conversion of an Analog signal (sound wave) to a digital signal, a series of preprocessing is conducted to extract relevant and distinguishable characteristics from the speech signal. Many feature-extraction techniques are proposed by researchers, including Linear Predictive Coding, MFCC, and relative spectral processing [19,20].
In the pursuit of selecting a well-established and efficient feature-extraction technique for noisy environments, we conducted a thorough investigation. We reviewed and analyzed numerous articles and literature reviews that conducted comparative studies to ensure we selected the most effective approach [5,21,22,23,24,25,26,27,28]. Therefore, we used Mel-Frequency Cepstral Coefficient (MFCC), Weighted Mel-Frequency Cepstral Coefficient (WMFCC), and Spectral Subband Centroids (SSC).

3.2.1. Mel-Frequency Cepstral Coefficient (MFCC)

While numerous techniques are available for feature extraction, MFCC is widely adopted. In comparison to alternative features, it has been demonstrated to achieve notable results [21,22]. Additionally, it performs well in noisy environments [23].
MFCCs are the coefficients that collectively represent the short-term power spectrum of a sound. Its analysis begins by applying the Fourier Transform on the frame sequence to obtain certain parameters. The process involves converting the power spectrum to a Mel-frequency spectrum, taking the logarithm of that spectrum, and computing its Inverse Fast Fourier transform (IFFT). More detailed information can be found in [29].
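As a minimal illustration of this step (not the authors' exact pipeline), the sketch below extracts MFCCs from one of the 16 kHz command recordings using the librosa library; the number of coefficients can be set to the 13 or 40 used in the experiments, while the fixed frame count used for padding is our own assumption to allow batching of equal-sized inputs.

```python
import numpy as np
import librosa

def extract_mfcc(path, n_mfcc=40, sr=16000, max_frames=63):
    """Load a ~2 s command recording and return a fixed-size MFCC matrix."""
    signal, _ = librosa.load(path, sr=sr, mono=True)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, frames)
    # Pad (or trim) along the time axis so every sample has the same shape.
    if mfcc.shape[1] < max_frames:
        mfcc = np.pad(mfcc, ((0, 0), (0, max_frames - mfcc.shape[1])))
    return mfcc[:, :max_frames]
```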

3.2.2. Weighted Mel-Frequency Cepstral Coefficient (WMFCC)

In addition to MFCC, which provides static features, researchers demonstrated that considering dynamic features will further enhance the performance of SR systems [30]. Dynamic features are extracted from MFCC derivatives. The first-order derivative of MFCC is a delta coefficient, which provides information about the temporal variations in the speech signal. The second-order derivative is a double-delta coefficient, which holds the acceleration or rate change of the delta coefficients over time.
However, the inclusion of these features will be expensive in terms of computational complexity as it creates a higher-dimensional feature vector. For instance, if we extracted 40 feature coefficients of MFCC, delta, and double delta, we would end up with a feature vector of 120 dimensions. Therefore, we use the innovation of the WMFCC method. As illustrated in Figure 3, WMFCC is a method that averages the MFCC coefficients with its derivatives, thus containing both static and dynamic information of the signal while maintaining the initial vector dimension (40 in our example). Furthermore, researchers in [24,25,26] conducted several experiments comparing MFCC and WMFCC features where they found that WMFCC performs better in noisy environments. Therefore, this prompted us to include it in our experiment.
We extracted 40-dimensional WMFCC coefficients using the equation proposed in [24]:
wc(n) = c(n) + p · Δc(n) + q · ΔΔc(n),  q < p < 1
where c(n) is the 40-dimensional MFCC feature vector, Δc(n) is the MFCC delta feature vector, and ΔΔc(n) is the MFCC double-delta feature vector. p and q represent the weights assigned to the delta and double-delta features, respectively.
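The equation above translates directly into a few lines of code. The following is one possible implementation (not the authors' code), assuming MFCCs are already available as a (coefficients × frames) matrix and using librosa's delta estimator; p = 0.5 and q = 0.1 correspond to the optimal weights reported in Section 4.5.

```python
import librosa

def wmfcc(mfcc, p=0.5, q=0.1):
    """Weighted MFCC: wc(n) = c(n) + p*delta(n) + q*double_delta(n).

    Folds the dynamic (delta and double-delta) information into the static
    coefficients while keeping the original feature dimension.
    """
    delta = librosa.feature.delta(mfcc, order=1)
    delta2 = librosa.feature.delta(mfcc, order=2)
    return mfcc + p * delta + q * delta2
```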

3.2.3. Spectral Subband Centroids (SSCs)

Despite spectral features (e.g., MFCC) having provided reasonable recognition performance, they are known for their sensitivity to additive noise distortion, which is considered a major downside [31]. Therefore, K. K. Paliwal (1997) [32] suggested a complementary feature, SSCs, which are well correlated with formant frequencies and offer information that has not been captured by spectral features. Additionally, they enhance the model’s robustness to additive Gaussian noise. Researchers in [27,28] demonstrated that using SSC in noisy conditions increased the recognition accuracy. Moreover, Thian et al. (2004) [33] demonstrated that the combination of SSC with spectral features yielded more robust results compared to using spectral features solely. Therefore, we decided to include this feature in our experiments.
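Since the paper does not detail the exact subband layout, the sketch below shows one common SSC formulation: the power-weighted centroid frequency of each of several equal-width linear subbands, computed per frame. The choice of 10 subbands is an assumption matching the 10-dimensional SSC features used in Section 4.5, and the FFT parameters are illustrative.

```python
import numpy as np
import librosa

def spectral_subband_centroids(signal, sr=16000, n_subbands=10,
                               n_fft=512, hop_length=256):
    """Per-frame centroid frequency of each equal-width spectral subband."""
    power = np.abs(librosa.stft(signal, n_fft=n_fft, hop_length=hop_length)) ** 2
    freqs = librosa.fft_frequencies(sr=sr, n_fft=n_fft)          # (n_bins,)
    edges = np.linspace(0, len(freqs), n_subbands + 1, dtype=int)
    centroids = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        band = power[lo:hi]                                      # (bins, frames)
        num = (freqs[lo:hi, None] * band).sum(axis=0)
        den = band.sum(axis=0) + 1e-10                           # avoid divide-by-zero
        centroids.append(num / den)
    return np.stack(centroids)                                   # (n_subbands, frames)
```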

3.3. Classifier Selection

For our experiment, to select the best-performing model for our dataset, we built and tested both ML and DL models.

3.3.1. Machine-Learning Models

For machine-learning models, we evaluated four algorithms that perform multi-class classification, all of which have displayed promising results in the field of speech recognition. These algorithms are K-Nearest Neighbors (KNN), Random Forest (RF), Support Vector Machine (SVM), and Logistic Regression (LR). In the process of model tuning and hyperparameter selection, we manually identified candidate parameter values that we believed would benefit our models. Subsequently, we employed grid search to systematically choose the optimal ones. The resulting parameter values are as follows: for KNN, we utilized 3 neighbors; for RF, we employed 150 estimators with a maximum depth of 250; for SVM, we set the regularization parameter C to 1 and utilized the linear kernel function; and, finally, for the LR model, the regularization parameter C was set to 3, with a maximum of 100 iterations. Additionally, we utilized L2 (Ridge) regularization and employed the lbfgs solver, as it is more suitable for larger datasets.
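As an illustration of this tuning procedure, the following sketch uses scikit-learn's GridSearchCV with the SVM classifier; the candidate values listed here are examples rather than the full grids we searched, and the best settings reported above (e.g., C = 1 with a linear kernel) were selected in this way.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Candidate values are identified manually, then searched exhaustively.
param_grid = {"svc__C": [0.1, 1, 10], "svc__kernel": ["linear", "rbf"]}
pipeline = make_pipeline(StandardScaler(), SVC())
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
# search.fit(X_train, y_train)   # X_train: feature vectors, y_train: command labels
# print(search.best_params_)     # e.g., {'svc__C': 1, 'svc__kernel': 'linear'}
```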

3.3.2. Deep-Learning Models

For deep-learning models, we tested the performance of four strong models, namely CNN, LSTM, Bi-LSTM, and our proposed architecture, Bi-LSTM-CNN. All of these models use categorical cross-entropy loss with a SoftMax output layer of 20 units, the Adam optimizer, a batch size of 32, and a learning-rate scheduler that multiplies the learning rate by 0.2 when training plateaus for three consecutive epochs. All models are trained for 40 epochs.
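A minimal Keras sketch of this shared training setup is shown below (assuming a model that already ends in the 20-unit SoftMax layer and feature arrays shaped as (samples, time, features)); the scheduler mirrors the described behaviour of multiplying the learning rate by 0.2 after three epochs without improvement.

```python
import tensorflow as tf

def compile_and_train(model, X_train, y_train, X_val, y_val):
    """Shared training configuration used for all the DL baselines."""
    model.compile(optimizer=tf.keras.optimizers.Adam(),
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    # Multiply the learning rate by 0.2 when validation loss plateaus
    # for three consecutive epochs.
    lr_scheduler = tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss",
                                                        factor=0.2, patience=3)
    return model.fit(X_train, y_train,
                     validation_data=(X_val, y_val),
                     epochs=40, batch_size=32,
                     callbacks=[lr_scheduler])
```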

Convolutional Neural Networks (CNN)

CNNs have been recently used widely as they have been shown to improve SR performance [34]. In addition to their simple structure and the use of fewer parameters to train the model (relatively), CNNs are known for their ability to learn complex patterns and recognize local and global input-data characteristics, making them suitable for SR applications [35].
The architecture of the built 1D CNN model consists of three stacked hidden layers with 512, 256, and 128 units, respectively. Each layer is followed by Batch Normalization and Max-Pooling layers. The 1D CNN processes sequences of features, such as Mel-Frequency Cepstral Coefficients (MFCCs), where each feature represents a 1D array of temporal data. Finally, a fully connected SoftMax layer with 20 units (corresponding to the number of commands) completes the model. This design allows the network to effectively capture and learn temporal patterns in the sequential feature data.

Long Short-Term Memory Networks (LSTM)

LSTMs are artificial recurrent neural network architectures proposed to solve the gradient-vanishing problem [36]. They have proven to be effective in processing sequential data due to their ability to capture long-term dependencies, which is crucial in SR where the input data is a sequence of audio frames.
The architecture of the LSTM model consists of three stacked hidden layers with 512, 256, and 128 units, respectively. Each layer is followed by a dropout layer with a 0.20 omission probability. Finally, the network ends with a fully connected SoftMax output layer of 20 units.

Bidirectional Long Short-Term Memory Networks (Bi-LSTM)

Bi-LSTMs are an advanced form of LSTM networks that enhance the capacity to capture contextual information from both preceding and succeeding sequences in sequential data. By analyzing the input data in both forward and reverse directions, Bi-LSTMs are especially useful in applications such as speech recognition, where comprehending the context of audio frames from both directions is essential for precise predictions.
The architecture of the Bi-LSTM model consists of three stacked hidden layers with 256, 128, and 64 units, respectively. Each layer is followed by a Batch-Normalization layer. Finally, the network ends with a fully connected SoftMax output layer of 20 units.

BiLSTM-CNN Architecture

To leverage the complementary advantages of CNN and Bi-LSTM, capturing complex patterns, short-term dependencies, and long-term dependencies, we built a hybrid model that combines CNN and Bi-LSTM layers into one unified architecture. As illustrated in Figure 4, our proposed architecture consists of two stacked hidden Bi-LSTM layers with 128 and 64 units, respectively, which are responsible for extracting long-dependency features. Each Bi-LSTM layer is followed by a Batch-Normalization layer. These Bi-LSTM layers are followed by two stacked hidden CNN layers with 256 and 256 units, respectively. The CNN layers are responsible for extracting complex patterns and local features. Each CNN layer is followed by a Batch-Normalization layer, a Max-Pooling layer, and a dropout layer with a 0.20 omission probability. This is followed by a fully connected Dense layer of 128 units and a SoftMax output layer of 20 units.
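A Keras sketch of this architecture is given below. The layer counts and unit sizes follow the description above; the convolution kernel size, the activation functions, and the Flatten layer before the Dense head are our own assumptions, as they are not specified in the text.

```python
from tensorflow.keras import layers, models

def build_bilstm_cnn(time_steps, n_features, n_classes=20):
    """Proposed Bi-LSTM-CNN hybrid (sketch following the description above)."""
    return models.Sequential([
        layers.Input(shape=(time_steps, n_features)),
        # Bi-LSTM block: long-range dependencies in both directions.
        layers.Bidirectional(layers.LSTM(128, return_sequences=True)),
        layers.BatchNormalization(),
        layers.Bidirectional(layers.LSTM(64, return_sequences=True)),
        layers.BatchNormalization(),
        # CNN block: complex local patterns on top of the recurrent features.
        layers.Conv1D(256, kernel_size=3, padding="same", activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling1D(pool_size=2),
        layers.Dropout(0.20),
        layers.Conv1D(256, kernel_size=3, padding="same", activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling1D(pool_size=2),
        layers.Dropout(0.20),
        # Classification head.
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(n_classes, activation="softmax"),
    ])
```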

3.3.3. Transformers

Transformers, a specific kind of deep neural network with an attention mechanism, were originally developed for natural language processing [37] to build large language models. They have rapidly gained prominence in various fields, including audio classification [38]. The core innovation behind transformers is their multi-head self-attention mechanism [39], which enables the model to weigh the importance of different parts of the input sequence, capturing useful global-context information and complex dependencies [40]. Moreover, they can work with variable input sequence lengths without the constraints of traditional recurrent architectures. Furthermore, in audio classification, transformers excel due to their ability to model long-range dependencies in audio signals. Unlike traditional convolutional or recurrent neural networks, transformers can process entire audio sequences simultaneously, allowing them to capture local and global patterns in the data. This capability is particularly beneficial for tasks where the context and temporal structure of the audio are crucial, such as sound-event detection, speaker identification, and audio classification across diverse conditions and applications [41].
Pre-trained transformer models like Wave to Vector (wav2vec2) [42], Hidden-Unit BERT (HuBERT) [43], Deep Speech [44], and Whisper [45] have been fine-tuned for ASR tasks, demonstrating state-of-the-art performance on various benchmarks. These models leverage large-scale, self-supervised learning on massive datasets, enabling them to generalize effectively to both new and noisy audio. Therefore, in our study, we will fine-tune the Wav2vec2 models on our dataset and conduct a comparative analysis between its results and those of the machine- and deep-learning models we built.

Fine-Tuning the Wav2vec2 Model

Wav2Vec2 is a speech recognition framework developed by Facebook that is pre-trained in a self-supervised manner on unlabeled audio and then fine-tuned with labeled data [42]. The choice of Wav2Vec2 over other pre-trained transformer models is motivated by the fact that it leverages a transformer-based architecture to learn the structure of speech directly from raw audio. Unlike traditional models that require large amounts of labeled data, Wav2Vec2 uses a self-supervised learning approach, allowing it to learn meaningful speech representations from unlabeled audio by predicting masked segments of the audio waveform, similar to how models like Bidirectional Encoder Representations from Transformers (BERT) [46] predict masked words in text. Furthermore, the model showed promising results in its original experiments: trained with just 10 min of transcribed speech and 53,000 h of unlabeled speech, wav2vec 2.0 achieves a word-error rate (WER) of 8.6% on noisy speech and 5.2% on clean speech on the standard LibriSpeech benchmark [42].
The Wav2Vec2 model starts by feeding the original audio waveform through a convolutional neural network (CNN) to extract initial feature representations. These representations are latent audio features, each spanning 25 ms, which are subsequently processed by a transformer network. This network leverages self-attention, enabling the model to capture important dependencies and contextual nuances throughout the entire audio sequence. This design empowers Wav2Vec2 to effectively model extensive relationships within audio data, enhancing its accuracy and resilience in diverse speech recognition tasks, especially in challenging environments with noise or limited resources [42].
Fine-tuning is a common technique for improving a pre-trained model’s performance on a specific task by training the pre-trained model further on a smaller dataset specific to the proposed use case. To fine-tune our model for the specialized task of ASR of Moroccan Arabic automotive commands, we utilized the facebook/wav2vec2-base model available in the HuggingFace library [42]. The fine-tuning was conducted with a batch size of 32 using the Adam optimizer, with a learning rate of 0.0002.
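The sketch below outlines one way to set up this fine-tuning with the HuggingFace transformers library, casting command recognition as 20-way audio classification on top of facebook/wav2vec2-base. The classification head, the output directory name, and the number of epochs are our assumptions; only the batch size of 32 and the learning rate of 0.0002 come from the text.

```python
from transformers import (AutoFeatureExtractor, Trainer, TrainingArguments,
                          Wav2Vec2ForSequenceClassification)

extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2ForSequenceClassification.from_pretrained(
    "facebook/wav2vec2-base", num_labels=20)  # one label per command

args = TrainingArguments(
    output_dir="wav2vec2-mascd",        # hypothetical output directory
    per_device_train_batch_size=32,
    learning_rate=2e-4,
    num_train_epochs=10,                # illustrative; not specified in the text
)
# `train_ds`/`eval_ds` would hold 16 kHz waveforms mapped through `extractor`.
# trainer = Trainer(model=model, args=args,
#                   train_dataset=train_ds, eval_dataset=eval_ds)
# trainer.train()
```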

3.4. Out-of-Corpus Command Handling

To build a robust command recognition system, one of the main problems to address is out-of-corpus commands, which occur when the user utters a command that is not covered by the system or when a command is not predicted with high confidence. To handle this problem, we set a prediction probability threshold.
We chose this method because our model is designed to predict one of 20 specific voice commands. Due to this relatively limited command set, it is crucial that the model not only correctly identifies the spoken command but does so with a high degree of certainty. To achieve this, we have set the threshold at 75%. If the model predicts a command with a probability lower than 75%, it suggests that the model is uncertain, which could lead to incorrect or unintended actions being triggered. By implementing this threshold, we aim to filter out low-confidence predictions, thereby reducing the likelihood of errors in command recognition.
The decision to set the threshold at 75% rather than a higher value, like 90%, is informed by the conditions in which our model is expected to operate. Specifically, our system needs to function effectively in both noisy and clean environments. In noisy conditions, even correctly predicted commands might have slightly lower confidence levels due to background interference. Setting the threshold too high, such as at 90%, could result in the system rejecting too many valid commands, causing unnecessary frustration for the user. By choosing a 75% threshold, we strike a balance that allows the model to perform reliably across varying conditions while still maintaining a reasonable level of confidence in its predictions. This approach enhances the system’s robustness, ensuring that it can handle a wide range of scenarios without compromising on accuracy or user experience.
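At inference time, this rejection rule reduces to a few lines. The sketch below assumes the SoftMax probability vector produced by the classifier and an ordered list of the 20 command labels; the unknown-command label is a hypothetical placeholder.

```python
import numpy as np

UNKNOWN = "unknown_command"  # hypothetical label for rejected utterances

def decide_command(probabilities, labels, threshold=0.75):
    """Return the predicted command only if the model is confident enough."""
    best = int(np.argmax(probabilities))
    if probabilities[best] < threshold:
        return UNKNOWN       # reject low-confidence (likely out-of-corpus) input
    return labels[best]
```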

4. Experiment Results and Discussion

Multiple experiments were conducted to benchmark our newly introduced MASCD dataset.
We used Google Colab to build our models, as it offers free access to CPU and GPU units, provides a large amount of RAM, and comes with many built-in libraries, such as scikit-learn and Keras, that save time. We utilized the TensorFlow and Keras libraries to construct the models. Our code is publicly available online for research purposes [18].

4.1. Dataset Collection Procedure

To build a natural and representative speech dataset, a rigorous methodology was designed. Subsequently, we identified the essential contributor characteristics and optimal recording environment for our dataset.
Many civilizations, languages, religions, and ancient cultures have crossed into Morocco, such as Phoenician, Latin, Jewish, Christian, Islamic, Amazigh, and Arabic, as well as French, Spanish, and Portuguese [47]. While some of these cultures have diminished over time, their traces have merged with Moroccan culture and blended with it to form its unique identity. This is evident in many civilizational aspects, such as linguistic interaction between Arabic and Amazigh, alongside the Hassani oral culture, which blends the culture of Arab Moroccans with that of Mauritania. Additionally, the presence of foreign languages like French, Spanish, and English further enriches Morocco, making it a country rich in dialects where each region is characterized by its dialect and lexicon. Despite homogenization, the original dialect remains apparent through pronunciation, intonation, and letter articulation [48,49].
As shown in Figure 5, a linguistic study conducted by A. Boukous [50] aimed to categorize the various Moroccan dialects, revealing that there are seven main dialects:
  • Northern dialect/Jebli.
  • Northeastern dialect/Rif Amazigh language/Tarifite.
  • Eastern Moroccan dialect/Oriental Moroccan Arabic/Bedwi.
  • Northwestern dialect/Standard Moroccan Arabic dialect/Rubi.
  • Languages of the Middle Atlas/Tamazight.
  • Languages of the Souss/Tachelhite.
  • Sahara Arabic dialects/Hassani Arabic/Ribi.
To construct a representative dataset that captures the variations in speech patterns, accents, intonation, and dialects among individuals, we decided to select four contributors from each distinct region representing the seven dialects (as shown in Figure 6). As a result, the set of participants consists of 28 contributors, with an equal distribution of 14 males and 14 females. The age range within the group is varied, with eight individuals under the age of 25, 14 between 25 and 40, and six over 40. In addition to the seven dialect categorizations, we discovered that variations exist in the way certain sounds or letters are pronounced or used within the same dialect category. For instance, in some areas, the sound represented by the letter “k” might be replaced by the sound represented by “g”, or the sound represented by “kh” might be replaced by “h”. Subsequently, to collect and include all dialect differences, we would require considering over 100 distinct variations, which is beyond the scope of our research. Therefore, we opted for a standardized language that is comprehensible to all Moroccan Arabic speakers. However, we did consider some significant dialectal differences, particularly those where entire words are changed rather than just a single letter or pronunciations, ensuring that the dataset reflects meaningful linguistic distinctions without becoming overly complex. For example, the command “lock the door” is articulated differently across dialects. In the Hassani dialect, it is “قفل الباب—qafala al-bab”, while in the Northeastern dialect, it is “بلع الباب—bala al-bab”, and in other dialects, it is “سد الباب—sadd al-bab”. Similarly, the command “it’s hot” varies across dialects. In the Tamazight dialect, it is كاين الشوم—kayn ash-shum”, in the Northeastern dialect, it is “كاين الحمان—kayn al-haman”, and in other dialects, it is “كاين الحرارة—kayn al-harara”.
Regarding the recording environment, to ensure the representativity of our dataset, half of the contributors were tasked with recording commands in a controlled, clean environment, and the other half in a moving car (on the passenger side). This strategy introduces variability by simulating real-world scenarios where commands might be issued under different conditions. By capturing data in both static and dynamic environments, we aim to create a comprehensive dataset that reflects the diverse circumstances in which these commands are likely to be used. Figure 6 presents a comparison of recordings of the same command by one contributor in clean and noisy environments. The x-axis represents time, while the y-axis shows the intensity of the audio signal. In a clean environment, the signal remains stable, indicating minimal background noise. In contrast, the noisy environment shows fluctuating intensity levels due to background noise, affecting the clarity of the recorded command.
Since phones have recently been equipped with more capable microphones, most of the recordings were made with phone microphones; out of 14 contributors, two recorded with laptop microphones.
To ensure a smooth and stress-free recording process for contributors, each contributor received a list of 20 commands, clear instructions, and an example recording to follow.
Generally, a driver’s voice is impacted by many factors, such as fatigue and stress. To capture these factors, we instructed the contributors to read the provided list 10 times in a row (a recording round). As shown in Figure 7, due to fatigue or boredom during the recording process, we noticed a change in voice tone and pitch, which simulates the state of a driver’s voice. This process enhances the dataset’s representativity. In each recording round, the contributor reads the 20 commands consecutively with a 2 s pause in between. After each round, the file is saved. This process is repeated 10 times by each contributor. The average time spent collecting the data from each contributor was about 3 h. As a result of this process, we collected 10 recordings from each contributor. Each recording is ~5 min long and contains 20 commands with 2 s pauses in between. Following this, we divided each recording into smaller audio segments, each lasting 2 s and containing only one command. To ensure accurate splitting, we manually divided the audio segments using Audacity (https://www.audacityteam.org/), free and open-source software for audio processing. Subsequently, each audio segment was saved in WAV format, labeled, and categorized into one of the 20 classes. The audio labels follow this format: Contributor-ID_Repetition-number_Command-number.wav (e.g., contributor_2_Repetition_9_command_1.wav).

4.2. Data Split

In our dataset, each command is recorded by the same contributor 10 times, resulting in significant similarity and dependence between samples. Consequently, using classical data-splitting methods (like random splitting) could lead to a high overfitting rate, as the training and validation sets might end up containing very similar samples.
To ensure that all recordings from the same contributor are within the same subset, and to measure the model’s ability to generalize to new individuals outside the dataset, we used the Stratified Group K-Fold technique available in the scikit-learn library for train-test splitting.
In the Stratified Group K-Fold method, the dataset is split into K folds. Each fold contains groups of samples. One fold is used for validation, and the remaining K-1 folds are used for training. This process is repeated K times so that each fold serves as the validation set once. The overall accuracy is measured by calculating the average accuracy obtained across all the folds. In our case, we have split the dataset into five folds. In each iteration, four folds are used for training (representing 80% of data), and one fold (representing 20% of data) is used for validation.
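A minimal sketch of this evaluation loop is shown below, assuming NumPy arrays X (features), y (command labels), and groups (contributor IDs), plus any callable that trains a model and returns its validation accuracy.

```python
import numpy as np
from sklearn.model_selection import StratifiedGroupKFold

def grouped_cv_accuracy(X, y, groups, train_and_score, n_splits=5):
    """Average validation accuracy over Stratified Group K-Fold splits.

    `groups` holds the contributor ID of each sample, so all recordings from
    a given contributor land entirely in either the training or validation fold.
    """
    sgkf = StratifiedGroupKFold(n_splits=n_splits, shuffle=True, random_state=42)
    scores = []
    for train_idx, val_idx in sgkf.split(X, y, groups):
        scores.append(train_and_score(X[train_idx], y[train_idx],
                                      X[val_idx], y[val_idx]))
    return float(np.mean(scores))
```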

4.3. Performance Measures

Since we have a balanced dataset, our objective is to construct an efficient model that prevents overfitting, i.e., the model learns to perform well on the training data but does not generalize well to new, unseen data. Therefore, we utilized training and validation accuracy, as well as loss and validation loss, to evaluate the models’ performance.

4.4. Preliminary Results

After the dataset construction, splitting, and feature extraction, we applied the standardization scaling method to transform the dataset, aiming to enhance the stability and performance of the models. Furthermore, to establish an initial baseline for the classification complexity, we tested the eight classifiers using only 13-dimensional MFCC coefficients as input with minimal hyperparameter tuning. Table 1 presents the results of this first experiment on the dataset.
In this preliminary experiment, our proposed architecture BiLSTM-CNN achieved the best result, with a validation accuracy of 73.55%. However, this result falls below the expected threshold for consideration. The results suggest that the model struggles to learn distinct command patterns, which is reasonable since many commands share similar rhythm and pitch characteristics, as illustrated in Figure 8. Consequently, the model’s performance was adversely affected. Furthermore, evidence of overfitting is apparent, with a significant gap of approximately 26% observed between training and validation accuracy, alongside a validation loss exceeding 1.00.

4.5. Data Augmentations

To improve the models’ performance, we applied two preprocessing techniques. First, we extracted 40-dimensional MFCC coefficients, enabling the model to capture various audio characteristics effectively. Following that, we applied data-augmentation techniques to enhance the size and complexity of the dataset. The data-augmentation process involves creating new synthetic data samples by introducing small perturbations to the initial training set through the injection of various effects. The effect used in our dataset is time stretching. In practical environments, speaking speeds vary. Therefore, two versions of each original recording were created: one with the speed multiplied by 1.25, and another with the speed multiplied by 0.85. These values were chosen carefully to augment the data while preserving the original sense of the recording. This augmentation process enabled us to construct a dataset consisting of 16,800 utterances, with 840 utterances representing each command.
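The sketch below shows how the two time-stretched copies can be generated with librosa and soundfile; the output file-naming scheme is illustrative, not the one used in MASCD.

```python
import librosa
import soundfile as sf

def augment_with_time_stretch(path, out_prefix, rates=(1.25, 0.85), sr=16000):
    """Write one time-stretched copy of a recording per rate in `rates`."""
    signal, _ = librosa.load(path, sr=sr, mono=True)
    for rate in rates:
        stretched = librosa.effects.time_stretch(signal, rate=rate)
        sf.write(f"{out_prefix}_x{rate}.wav", stretched, sr)  # hypothetical naming
```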
The use of 40-dimensional MFCCs, instead of the 13 coefficients, contributed to enhancing the performance of both ML and DL models, particularly SVM, CNN, and Bi-LSTM models, as demonstrated in Table 2 (the first rows). For instance, accuracy for CNN improved from 71.44% to 83.33%, and for BiLSTM-CNN, from 73.55% to 81.37%. Furthermore, the data-augmentation process, as shown in Table 3, significantly contributed to enhancing the models’ performance, resulting in notable results.
After this process, we assessed the performance of the models using different feature-extraction techniques, i.e., 40-dimensional MFCCs, 40-dimensional WMFCCs, and 10-dimensional SSCs, individually and in combinations.
To train our models using WMFCC coefficients, we conducted multiple experiments to identify the optimal values for delta and double-delta weights (p and q, as detailed in Section 3.2.2). The optimal values were determined to be p = 0.5 and q = 0.1.
Table 2 presents the results of the models before data augmentation. The RF classifier achieved the best result among the ML models, with a validation accuracy of 87.79% when utilizing a combination of WMFCC and SSC coefficients. Our proposed architecture yielded the best result among the DL models, achieving a validation accuracy of 93.67% when trained on WMFCC and SSC features. It is noteworthy that across almost all models, the combination of WMFCC and SSC coefficients consistently outperformed other coefficient combinations. This observation underscores the effectiveness of these features in extracting relevant information, including both static and dynamic characteristics, which serve to differentiate each audio sample. Moreover, this combination requires fewer coefficients, which explicitly reduces the computational resources.
Table 3 presents the results of the models after data augmentation. We used the same settings as before but with the augmented dataset. The best result was achieved through a combination of four factors. Firstly, the increase in feature dimensionality enabled most models to capture additional characteristics from the audio. Secondly, dataset augmentation played a crucial role in enhancing model performance and reducing overfitting. Thirdly, the combination of WMFCC and SSC features as input contributed significantly. Finally, our BiLSTM-CNN architecture outperformed the other three DL classifiers, achieving a validation accuracy of 98.48%.
Figure 9 presents the learning curve of the best model (BiLSTM-CNN). Considering the dataset size and the characteristics of the command recordings (e.g., some commands containing three words, as shown in Table A1), along with the fact that 50% of the audio files in the dataset were recorded in a noisy environment, our model performed well, achieving a notable validation accuracy of 98.48%.
Figure 10 presents a comparison between the BiLSTM-CNN and the fine-tuned Wav2Vec model based on their performance in different environments: overall, clean, and noisy. Both models showed strong performance in a clean environment, with BiLSTM-CNN slightly outperforming Wav2Vec (99.0% versus 96.27%). However, the difference in performance becomes more pronounced in noisy environments. The BiLSTM-CNN model maintained a high accuracy of 97.90%, while Wav2Vec dropped significantly to 86.05%. Furthermore, the BiLSTM-CNN model outperformed Wav2Vec in overall accuracy, achieving 98.48% compared to Wav2Vec’s 91.16%. This indicates that the BiLSTM-CNN model is more effective at generalizing across different environments, making it a more robust option for the ASR of Moroccan Arabic commands.
Several factors contribute to the superior performance of the BiLSTM-CNN model over the Wav2Vec model in Moroccan Arabic command recognition. The Wav2Vec model was primarily trained on MSA datasets, which differs significantly from Moroccan Arabic regarding phonetic and lexical properties. The Moroccan Arabic dialect has unique phonemes that do not exist in other Arabic dialects. Because the Wav2Vec model has not been exposed to these specific phonemes during training, it struggled to recognize commands correctly in this dialect. Therefore, the lack of extensive, high-quality training data specific to Moroccan Arabic led to poorer performance in this context. Moreover, as shown in Figure 8, some commands are phonetically similar, especially in noisy environments. The Wav2Vec model struggled to differentiate between such commands, contributing to its lower performance than the BiLSTM-CNN model. Furthermore, fine-tuning the Wav2Vec model to learn the properties and characteristics of the Moroccan dialect requires a large and diverse dataset. However, the MASCD is relatively small, limiting the model’s ability to learn the nuances of the dialect effectively. Consequently, the model did not generalize well to Moroccan commands, leading to lower accuracy.
Since Arabic command recognition is a complex task due to the complexity of the language, many words share the same rhythm. Therefore, as illustrated in Table 4, almost all previous works cover only a few simple commands (e.g., start, stop) that are easy to recognize. Only the study by Ghandoura et al. [8] achieved a good result with 40 command classes. However, the dataset used in that research consists of simple commands, primarily formed from single words and digits that are easily recognized. While our dataset is not large and some commands consist of three words, the model we proposed achieved notable results, outperforming other models in both clean and noisy environments.
It is worth noting that for a fair comparison, we investigated the possibility of testing the models in the literature and our model on a standardized dataset. However, despite the efforts, we found that the existing research shared only the results of the model performances and did not provide access to either the models or the dataset used. As a result, the comparison could not be conducted under a standardized condition.

5. Conclusions and Future Work

In this paper, we introduce MASCD, our newly constructed Moroccan Arabic speech dataset encompassing 20 of the most commonly used commands in automobiles. Moreover, we designed an end-to-end hybrid deep-learning-based BiLSTM-CNN command classifier and compared it with strong ML and DL models, namely LR, SVM, RF, KNN, CNN, LSTM, and Bi-LSTM. Different feature-extraction techniques were investigated to extract the most pertinent information from the data. Accordingly, MFCC, WMFCC, and SSC features were used and compared, individually and in combination.
To establish a baseline understanding of the complexity of the classification task, we conducted an initial experiment using the original dataset (without augmentation) and 13-dimensional MFCC coefficients as input. In this experiment, all four models achieved a validation accuracy below 73%. Subsequently, we conducted additional experiments using different features and models with the augmented dataset, which led to a performance enhancement of approximately 20% accuracy across the models. The experimental results of this study reveal that combining MFCC with SSC features slightly outperformed MFCC features alone. On the other hand, combining WMFCC with SSC features significantly outperformed both MFCC and the combination of MFCC and SSC. This combination yielded good and robust results, even in noisy conditions, as it combines static and dynamic features and adapts well to additive noise, all while using a minimal number of vector dimensions, which implicitly reduces computational complexity. Furthermore, the results show that our BiLSTM-CNN model outperformed the other baseline classifiers and the existing solutions in the state-of-the-art literature, achieving a notable accuracy rate of 98.48%.
To our knowledge, this paper presents the first research in the field of automotive speech recognition for dialectal Arabic. Consequently, it establishes a robust foundation for research and innovation to bridge the gap between English and Arabic research. The primary challenge faced in this work was the collection of a natural and representative dataset. For future work, we intend to expand our dataset by incorporating more commands and contributors to create a massive and representative dataset. Moreover, we intend to include all the Arabic dialects by recruiting contributors from other Arab countries. Additionally, we aim to explore and build more performant architectures using attention mechanisms and transformer architectures, given their promising results in the field of speech recognition.

Author Contributions

Conceptualization, S.O. and S.E.G.; methodology, S.O.; software, S.O.; validation, S.E.G.; writing—original draft preparation, S.O. and S.E.G.; writing—review and editing, S.O.; data curation, S.O.; supervision, S.E.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Appendix A

Table A1. The 20 selected commands with their translation into English and pronunciation.
# | Keyword | Translation | Pronunciation
Com-1 | حل البيبان | Unlock doors | Ħall alabibaːn
Com-2 | سد البيبان | Lock doors | Sadd alabibaːn
Com-3 | حل الكوفر | Open the backdoor | Ħall alakawfar
Com-4 | سد الكوفر | Close the backdoor | Sadd alakawfar
Com-5 | شعل الراديو | Turn radio on | ʃaʕal alaɾaːdijuː
Com-6 | طفي الراديو | Turn radio off | tˤɑfiː alaɾaːdijuː
Com-7 | بدل الإذاعة | Change the radio station | Badal ala ʔiðaːʕa
Com-8 | نقص الصوت | Decrease volume | naqɑsˤɑ al sˤɑwwat
Com-9 | زيد الصوت | Increase volume | zaid ala sˤɑwwat
Com-10 | جاوب ابيل | Answer the call | ǯaːwab ʔabi:l
Com-11 | قطع ابيل | Decline the call | qɑtˤɑʕ ʔabi:l
Com-12 | شعل الضو | Turn light on | ʃaʕal dˤɑwʔ
Com-13 | طفي الضو | Turn light off | tˤɑfiː dˤɑwʔ
Com-14 | هبط السرجم | Lower the side window | Habit saɾaǯam
Com-15 | طلع السرجم | Raise the side window | tˤɑlaʕ saɾaǯam
Com-16 | كاين الحرارة | It’s hot/release cold air | kaːin ala ħaɾaːɾa
Com-17 | كاين البرد | It’s cold/release hot air | kaːin ala bard
Com-18 | خدم السويكالاص | Activate windshield wiper | Xadam s-suː ka-las
Com-19 | وقف السويكالاص | Deactivate windshield wiper | waqɑf s-suː ka-las
Com-20 | حل كابو | Open the bonnet | Ħall ka:po

References

  1. Dukic, T.; Hanson, L.; Holmqvist, K.; Wartenberg, C. Effect of button location on driver’s visual behaviour and safety perception. Ergonomics 2005, 48, 399–410. [Google Scholar] [CrossRef] [PubMed]
  2. Simons-Morton, B.G.; Guo, F.; Klauer, S.G.; Ehsani, J.P.; Pradhan, A.K. Keep Your Eyes on the Road: Young Driver Crash Risk Increases According to Duration of Distraction. J. Adolesc. Health 2014, 54, S61–S67. [Google Scholar] [CrossRef] [PubMed]
  3. Cades, D.; Arndt, S.; Kwasniak, A.M. Driver distraction is more than just taking eyes off the road. ITE J.-Inst. Transp. Eng. 2011, 81, 26–28. [Google Scholar]
  4. Vikström, F.D. Physical Buttons Outperform Touchscreens in New Cars, Test Finds. Available online: https://www.vibilagare.se/english/physical-buttons-outperform-touchscreens-new-cars-test-finds (accessed on 3 January 2024).
  5. Dhouib, A.; Othman, A.; El Ghoul, O.; Khribi, M.K.; Al Sinani, A. Arabic Automatic Speech Recognition: A Systematic Literature Review. Appl. Sci. 2022, 12, 8898. [Google Scholar] [CrossRef]
  6. Arab Countries/Arab League Countries 2024. Available online: https://worldpopulationreview.com/country-rankings/arab-countries (accessed on 21 February 2024).
  7. Huang, X.; Baker, J.; Reddy, R. A historical perspective of speech recognition. Commun. ACM 2014, 57, 94–103. [Google Scholar] [CrossRef]
  8. Ghandoura, A.; Hjabo, F.; Al Dakkak, O. Building and benchmarking an Arabic Speech Commands dataset for small-footprint keyword spotting. Eng. Appl. Artif. Intell. 2021, 102, 104267. [Google Scholar] [CrossRef]
  9. Warden, P. Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv 2018, arXiv:1804.03209. [Google Scholar]
  10. Ibrahim, A.E.B.; Saad, R.S.M. Intelligent Categorization of Arabic Commands Utilizing Machine Learning Techniques with Short Effective Features Vector. Int. J. Comput. Appl. 2022, 184, 25–32. [Google Scholar] [CrossRef]
  11. Hamza, A.; Fezari, M.; Bedda, M. Wireless voice command system based on kalman filter and HMM models to control manipulator arm. In Proceedings of the 2009 4th International Design and Test Workshop, IDT 2009, Riyadh, Saudi Arabia, 15–17 November 2009. [Google Scholar] [CrossRef]
  12. Paliwal, K.; Basu, A. A speech enhancement method based on Kalman filtering. In Proceedings of the ICASSP ‘87, IEEE International Conference on Acoustics, Speech, and Signal Processing, Dallas, TX, USA, 6–9 April 1987; Volume 12, pp. 177–180. [Google Scholar] [CrossRef]
  13. El-emary, I.M.M.; Fezari, M.; Attoui, H. Hidden Markov model/Gaussian mixture models (HMM/GMM) based voice command system: A way to improve the control of remotely operated robot arm TR45. Sci. Res. Essays 2011, 6, 341–350. [Google Scholar]
  14. Abed, A.A.; Jasim, A.A. Design and implementation of wireless voice controlled mobile robot. Al-Qadisiyah J. Eng. Sci. 2016, 9, 135–147. [Google Scholar]
  15. Hyundai. Available online: http://webmanual.hyundai.com/STD_GEN5_WIDE/AVNT/EU/English/voicerecognitionsystem.html (accessed on 27 October 2023).
  16. Toyota. Available online: https://toyota-en-us.visteoninfotainment.com/how-to-voice-recognition (accessed on 26 October 2023).
  17. Acura. Available online: https://www.acurainfocenter.com/the-latest/rdx-voice-commands-made-easy (accessed on 28 October 2023).
  18. Ouali, S.; El Gerouani, S. Automotive Moroccan Arabic Speech Dataset. Available online: https://github.com/SoufiyaneOuali/Automative-Morrocan-Arabic-Speech-Command-Datset (accessed on 23 February 2024).
  19. Hibare, R.; Vibhute, A. Feature Extraction Techniques in Speech Processing: A Survey. Int. J. Comput. Appl. 2014, 107, 975–8887. [Google Scholar] [CrossRef]
  20. Mohanty, A.; Cherukuri, R.C. A Revisit to Speech Processing and Analysis. Int. J. Comput. Appl. 2020, 175, 1–6. [Google Scholar] [CrossRef]
  21. Bhandari, A.Z.; Melinamath, C. A Survey on Automatic Recognition of Speech via Voice Commands. Int. J. New Innov. Eng. Technol. 2017, 6, 1–4. [Google Scholar]
  22. Kurzekar, P.K.; Deshmukh, R.R.; Waghmare, V.B.; Shrishrimal, P.P. A Comparative Study of Feature Extraction Techniques for Speech Recognition System. Int. J. Innov. Res. Sci. Eng. Technol. 2007, 3297, 2319–8753. [Google Scholar] [CrossRef]
  23. Këpuska, V.Z.; Elharati, H.A. Robust Speech Recognition System Using Conventional and Hybrid Features of MFCC, LPCC, PLP, RASTA-PLP and Hidden Markov Model Classifier in Noisy Conditions. J. Comput. Commun. 2015, 3, 1–9. [Google Scholar] [CrossRef]
  24. Chapaneri, S.V. Spoken Digits Recognition using Weighted MFCC and Improved Features for Dynamic Time Warping. Int. J. Comput. Appl. 2012, 40, 6–12. [Google Scholar] [CrossRef]
  25. Mukhedkar, A.S.; Alex, J.S.R. Robust feature extraction methods for speech recognition in noisy environments. In Proceedings of the 1st International Conference on Networks and Soft Computing, ICNSC 2014—Proceedings, Guntur, India, 19–20 August 2014; pp. 295–299. [Google Scholar] [CrossRef]
  26. Gupta, S.; Shukla, R.S.; Shukla, R.K. Weighted Mel frequency cepstral coefficient based feature extraction for automatic assessment of stuttered speech using Bi-directional LSTM. Indian J. Sci. Technol. 2021, 14, 457–472. [Google Scholar] [CrossRef]
  27. Kinnunen, T.; Zhang, B.; Zhu, J.; Wang, Y. Speaker verification with adaptive spectral subband centroids. In Advances in Biometrics; Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Springer: Berlin/Heidelberg, Germany, 2007; Volume 4642, pp. 58–66. [Google Scholar] [CrossRef]
  28. Přibil, J.; Přibilová, A.; Matoušek, J. GMM-based speaker age and gender classification in Czech and Slovak. J. Electr. Eng. 2017, 68, 3–12. [Google Scholar] [CrossRef]
29. Majeed, S.A.; Husain, H.; Samad, S.A.; Idbeaa, T.F. Mel frequency cepstral coefficients (MFCC) feature extraction enhancement in the application of speech recognition: A comparison study. J. Theor. Appl. Inf. Technol. 2015, 79, 38–56. Available online: www.jatit.org (accessed on 22 February 2024).
  30. Tyagi, V.; McCowan, I.; Misra, H.; Bourlard, H. Mel-Cepstrum Modulation Spectrum (MCMS) features for robust ASR. In Proceedings of the 2003 IEEE Workshop on Automatic Speech Recognition and Understanding, ASRU 2003, St. Thomas, VI, USA, 30 November–4 December 2003; pp. 399–404. [Google Scholar] [CrossRef]
  31. Dev, A.; Bansal, P. Robust Features for Noisy Speech Recognition using MFCC Computation from Magnitude Spectrum of Higher Order Autocorrelation Coefficients. Int. J. Comput. Appl. 2010, 10, 975–8887. [Google Scholar] [CrossRef]
  32. Paliwal, K.K. Spectral subband centroids as features for speech recognition. In Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings, Santa Barbara, CA, USA, 17 December 1997; pp. 124–131. [Google Scholar] [CrossRef]
  33. Thian, N.P.H.; Sanderson, C.; Bengio, S. Spectral subband centroids as complementary features for speaker authentication. In Biometric Authentication; Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Springer: Berlin/Heidelberg, Germany, 2004; Volume 3072, pp. 631–639. [Google Scholar] [CrossRef]
  34. Abdel-Hamid, O.; Mohamed, A.R.; Jiang, H.; Deng, L.; Penn, G.; Yu, D. Convolutional neural networks for speech recognition. IEEE Trans. Audio Speech Lang. Process. 2014, 22, 1533–1545. [Google Scholar] [CrossRef]
  35. Alsobhani, A.; Alabboodi, H.M.A.; Mahdi, H. Speech Recognition using Convolution Deep Neural Networks. J. Phys. Conf. Ser. 2021, 1973, 012166. [Google Scholar] [CrossRef]
  36. Noh, S.H. Analysis of Gradient Vanishing of RNNs and Performance Comparison. Information 2021, 12, 442. [Google Scholar] [CrossRef]
37. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. arXiv 2017, arXiv:1706.03762. [Google Scholar]
  38. Zaman, K.; Sah, M.; Direkoglu, C.; Unoki, M. A Survey of Audio Classification Using Deep Learning. IEEE Access 2023, 11, 106620–106649. [Google Scholar] [CrossRef]
  39. Turner, R.E. An Introduction to Transformers. arXiv 2024, arXiv:2304.10557v5. [Google Scholar]
  40. Zhang, Y.; Li, B.; Fang, H.; Meng, Q. Spectrogram transformers for audio classification. In Proceedings of the 2022 IEEE International Conference on Imaging Systems and Techniques (IST), Kaohsiung, Taiwan, 21–23 June 2022; pp. 1–6. [Google Scholar] [CrossRef]
  41. Wyatt, S.; Elliott, D.; Aravamudan, A.; Otero, C.E.; Otero, L.D.; Anagnostopoulos, G.C.; Smith, A.O.; Peter, A.M.; Jones, W.; Leung, S.; et al. Environmental sound classification with tiny transformers in noisy edge environments. In Proceedings of the 2021 IEEE 7th World Forum on Internet of Things (WF-IoT), New Orleans, LA, USA, 14 June–31 July 2021; pp. 309–314. [Google Scholar] [CrossRef]
  42. Baevski, A.; Zhou, H.; Mohamed, A.; Auli, M. wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. arXiv 2020, arXiv:2006.11477. [Google Scholar]
  43. Hsu, W.-N.; Bolte, B.; Tsai, Y.-H.H.; Lakhotia, K.; Salakhutdinov, R.; Mohamed, A. HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units. arXiv 2021, arXiv:2106.07447. [Google Scholar] [CrossRef]
44. Hannun, A.; Case, C.; Casper, J.; Catanzaro, B.; Diamos, G.; Elsen, E.; Prenger, R.; Satheesh, S.; Sengupta, S.; Coates, A.; et al. Deep Speech: Scaling up end-to-end speech recognition. arXiv 2014, arXiv:1412.5567. [Google Scholar]
  45. Radford, A.; Kim, J.W.; Xu, T.; Brockman, G.; McLeavey, C.; Sutskever, I. Robust Speech Recognition via Large-Scale Weak Supervision. arXiv 2022, arXiv:2212.04356. [Google Scholar]
  46. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  47. Pennell, C.R. Morocco: From Empire to Independence; Oneworld Publications: London, UK, 2009. [Google Scholar]
48. Hachimi, A. Dialect Leveling, Maintenance and Urban Identity in Morocco: Fessi Immigrants in Casablanca; University of Hawai’i at Manoa: Honolulu, HI, USA, 2005. [Google Scholar]
49. Horizons de France. Maroc, Atlas Historique, Géographique, Economique. 1935. Available online: https://www.cemaroc.com/t147-maroc-atlas-historique-geographique-economique-1935 (accessed on 9 May 2024).
50. Boukous, A. Revitalisation de l’amazighe: Enjeux et stratégies. Lang. Soc. 2013, 143, 9–26. [Google Scholar] [CrossRef]
Figure 1. Schematic diagram of our proposed system.
Figure 2. In-car voice-command acceptance rate.
Figure 3. WMFCC-extraction steps.
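Figure 3 summarizes the WMFCC-extraction pipeline. As a point of reference only, the sketch below shows one plausible implementation, assuming WMFCC is formed as a weighted combination of the static MFCCs and their first- and second-order delta features, in the spirit of the weighted-MFCC formulations in [24,26]; the weight values and frame handling are illustrative assumptions, not the paper's exact settings.

```python
# Illustrative sketch only: one plausible WMFCC computation, assuming a weighted
# combination of static MFCCs and their delta features (cf. [24,26]).
# The weights kappa and lam are assumptions, not values taken from the paper.
import librosa
import numpy as np

def extract_wmfcc(path, n_mfcc=40, kappa=0.6, lam=0.4, sr=16000):
    """Return an (n_mfcc, frames) weighted-MFCC matrix for one utterance."""
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # static coefficients
    d1 = librosa.feature.delta(mfcc, order=1)                # first-order dynamics
    d2 = librosa.feature.delta(mfcc, order=2)                # second-order dynamics
    return mfcc + kappa * d1 + lam * d2                      # weighted combination

# Example: average over frames to obtain a fixed-length 40-dimensional vector,
# matching the 40 WMFCC coefficients reported in Tables 2 and 3.
# features = extract_wmfcc("command_01.wav").mean(axis=1)
```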
Figure 4. Proposed BiLSTM-CNN architecture.
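Figure 4 depicts the proposed hybrid of bidirectional LSTM and convolutional layers. The Keras sketch below shows one way such a BiLSTM-CNN can be assembled for 20-command classification; the layer widths, kernel sizes, and dropout rate are assumptions for illustration and are not claimed to reproduce the authors' exact configuration or parameter count.

```python
# Illustrative sketch of a BiLSTM-CNN hybrid for 20-command classification.
# Layer sizes, kernel widths, and dropout are assumptions, not the paper's exact values.
from tensorflow.keras import layers, models

def build_bilstm_cnn(n_frames, n_features, n_classes=20):
    inputs = layers.Input(shape=(n_frames, n_features))                          # (time, feature) sequence
    x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(inputs)     # temporal context in both directions
    x = layers.Conv1D(64, kernel_size=3, padding="same", activation="relu")(x)   # local spectral-temporal patterns
    x = layers.MaxPooling1D(pool_size=2)(x)
    x = layers.Conv1D(128, kernel_size=3, padding="same", activation="relu")(x)
    x = layers.GlobalAveragePooling1D()(x)                                        # collapse time axis
    x = layers.Dropout(0.3)(x)
    outputs = layers.Dense(n_classes, activation="softmax")(x)
    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# model = build_bilstm_cnn(n_frames=100, n_features=50)  # e.g., 40 WMFCC + 10 SSC per frame
```

Placing the bidirectional recurrent layer before the 1D convolutions lets the model first encode temporal context and then let the convolutional layers summarize local patterns, which is one common way to build such a hybrid.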
Figure 5. Linguistic map of Morocco and contributors’ geographical location.
Figure 6. Comparison of command recordings in clean and noisy environments. (a) Spectrogram of Command 1, Contributor 1 in a clean environment. (b) Spectrogram of Command 1, Contributor 1 in a noisy environment.
Figure 7. Waveform variation in Command 14 produced by Contributor 3. A comparison between Repetition 1 (a) and Repetition 10 (b) reveals a decrease in amplitude from 0.50 (Rep 1) to 0.20 (Rep 10), accompanied by an elevated noise ratio.
Figure 8. Comparison of Mel-spectrogram characteristics between commands. Command 14 (a) and Command 19 (b) are pronounced by different contributors yet exhibit nearly identical characteristics.
Figure 9. Learning curve of the proposed BiLSTM-CNN model (best model) using WMFCC and SSC features as input, reaching a validation accuracy of 98.48%.
Figure 10. Model-accuracy comparison: overall, clean, and noisy environments.
Table 1. Command classification rates for the ML and DL models using 13 MFCC features of the MASCD dataset without data augmentation (DL models trained for 40 epochs; the last column gives the total number of parameters in each model).

Model | Train Acc (%) | Val Acc (%) | Train Loss | Val Loss | Params
KNN | 79.99 | 51.96 | – | – | –
RF | 98 | 58.912 | – | – | –
SVM | 40.29 | 36.72 | – | – | –
LR | 37.57 | 36.54 | – | – | –
CNN | 98.62 | 71.44 | 0.0450 | 1.0101 | 765,844
LSTM | 97.30 | 56.86 | 0.1648 | 1.7204 | 512,020
Bi-LSTM | 95.98 | 66.31 | 0.0274 | 1.3029 | 1,386,004
BiLSTM-CNN | 99.26 | 73.55 | 0.0112 | 1.0070 | 813,588
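Table 1 pairs classical ML baselines with DL models on 13 MFCC coefficients. The minimal sketch below shows how such a baseline can be set up, assuming each utterance is reduced to a single 13-dimensional vector by averaging the MFCC frames; the aggregation step and classifier hyperparameters are assumptions, not the paper's exact pipeline.

```python
# Sketch of a classical-ML baseline on 13 MFCCs per frame, averaged over time
# into one vector per utterance. The frame-averaging step is an assumption;
# the paper may aggregate frames differently.
import numpy as np
import librosa
from sklearn.svm import SVC

def utterance_vector(path, n_mfcc=13, sr=16000):
    """Return a fixed-length 13-dimensional MFCC summary for one recording."""
    y, _ = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)                       # average over frames

# Usage (hypothetical wav_paths and labels):
# X = np.stack([utterance_vector(p) for p in wav_paths])
# clf = SVC(kernel="rbf").fit(X, labels)
```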
Table 2. Command classification rates of the models using 40 MFCC, 40 WMFCC, and 10 SSC coefficients as input (without data augmentation; DL models trained for 40 epochs; performance metric: validation accuracy).

Features / Model | KNN (%) | RF (%) | SVM (%) | LR (%) | CNN (%) | LSTM (%) | BiLSTM (%) | BiLSTM-CNN (%)
MFCC | 57.48 | 70.85 | 52.05 | 45.27 | 83.33 | 55.97 | 79.95 | 81.37
SSC | 42.07 | 55.53 | 26.56 | 24.24 | 58.56 | 41.44 | 58.73 | 62.30
WMFCC | 65.86 | 85.83 | 57.75 | 46.35 | 92.06 | 75.85 | 90.96 | 91.62
MFCC + SSC | 57.66 | 71.66 | 55.08 | 49.73 | 81.46 | 54.19 | 77.90 | 78.07
WMFCC + SSC | 65.33 | 87.79 | 70.94 | 59.36 | 92.16 | 78.52 | 91.05 | 93.67
MFCC + WMFCC | 64.94 | 84.48 | 81.18 | 73.51 | 91.44 | 69.85 | 87.24 | 89.65
MFCC + WMFCC + SSC | 63.99 | 84.76 | 83.96 | 73.53 | 89.75 | 67.02 | 90.46 | 90.55
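Table 2 combines MFCC, WMFCC, and SSC features in different configurations. For readers unfamiliar with SSC, the sketch below computes subband centroid frequencies following the general definition in [32]; the use of 10 linearly spaced subbands and the power exponent are assumptions made for illustration, not the paper's exact parameters.

```python
# Minimal sketch of Spectral Subband Centroid (SSC) features, following the
# subband-centroid definition in [32]. The 10 linearly spaced subbands, FFT
# size, hop length, and gamma exponent are illustrative assumptions.
import numpy as np
import librosa

def extract_ssc(y, sr=16000, n_subbands=10, n_fft=512, hop=160, gamma=1.0):
    """Return an (n_subbands, frames) matrix of subband centroid frequencies (Hz)."""
    S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop)) ** 2   # power spectrogram
    freqs = librosa.fft_frequencies(sr=sr, n_fft=n_fft)             # frequency of each FFT bin
    edges = np.linspace(0, sr / 2, n_subbands + 1)                  # subband boundaries
    centroids = []
    for m in range(n_subbands):
        band = (freqs >= edges[m]) & (freqs < edges[m + 1])         # bins in subband m
        P = S[band] ** gamma                                        # (optionally warped) band power
        num = (freqs[band][:, None] * P).sum(axis=0)                # frequency-weighted power
        den = P.sum(axis=0) + 1e-10                                 # total band power per frame
        centroids.append(num / den)                                 # centroid frequency per frame
    return np.vstack(centroids)
```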
Table 3. Command classification rates of the models using 40 MFCC, 40 WMFCC, and 10 SSC coefficients as input (with data augmentation; DL models trained for 40 epochs; performance metric: validation accuracy).

Features / Model | KNN (%) | RF (%) | SVM (%) | LR (%) | CNN (%) | LSTM (%) | BiLSTM (%) | BiLSTM-CNN (%)
MFCC | 74 | 85 | 50 | 42 | 88.99 | 77.58 | 91.49 | 92.08
SSC | 65 | 69 | 23 | 22 | 67.29 | 64.30 | 73.93 | 76.06
WMFCC | 79 | 88 | 53 | 43 | 93.27 | 83.96 | 93.72 | 97.17
MFCC + SSC | 75 | 87 | 54 | 47 | 89.30 | 70.72 | 90.42 | 93.82
WMFCC + SSC | 77 | 89 | 65 | 53 | 94.13 | 84.00 | 94.63 | 98.48
MFCC + WMFCC | 78 | 88 | 78 | 67 | 93.05 | 80.26 | 94.30 | 96.66
MFCC + WMFCC + SSC | 78 | 89 | 80 | 69 | 93.00 | 79.41 | 94.16 | 95.75
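Table 3 reports classification rates after data augmentation. As an illustration of one widely used augmentation step for noise robustness, the sketch below adds white Gaussian noise to a waveform at a chosen signal-to-noise ratio; the SNR value and the overall augmentation policy are assumptions and are not claimed to match the paper's recipe.

```python
# Sketch of an additive-Gaussian-noise augmentation step at a target SNR.
# The SNR value and augmentation policy are assumptions for illustration.
import numpy as np

def add_gaussian_noise(y, snr_db=20.0, rng=None):
    """Return y with white Gaussian noise added at the requested SNR (dB)."""
    rng = rng or np.random.default_rng()
    signal_power = np.mean(y ** 2)                       # average signal power
    noise_power = signal_power / (10 ** (snr_db / 10))   # power needed for the target SNR
    noise = rng.normal(0.0, np.sqrt(noise_power), size=y.shape)
    return y + noise

# Example: create one augmented copy per clean recording.
# y_aug = add_gaussian_noise(y, snr_db=15.0)
```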
Table 4. A comparison between state-of-the-art command recognition systems in the literature and our proposed system.

Article Ref | Hamza et al. (2009) [11] | Hamza et al. (2009) [11] | Ibrahim and Saad (2022) [10] | Ghandoura et al. (2021) [8] | Our Model
Dataset Size | 4800 | 4800 | 720 | 12,000 | 5600
Number of Commands | 12 | 12 | 5 | 40 | 20
Model Used | HMM | HMM & GMM | SVM | CNN | BiLSTM-CNN
Performance (Accuracy) | 76% | 82% | 95.48% | 97.97% | 98.48%
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
