Article

Spoken Language Identification System Using Convolutional Recurrent Neural Network

Computer Engineering Department, College of Computer and Information Sciences, King Saud University, Riyadh 11543, Saudi Arabia
* Author to whom correspondence should be addressed.
Appl. Sci. 2022, 12(18), 9181; https://doi.org/10.3390/app12189181
Submission received: 8 August 2022 / Revised: 7 September 2022 / Accepted: 9 September 2022 / Published: 13 September 2022
(This article belongs to the Special Issue Recent Trends in Natural Language Processing and Its Applications)

Abstract
Following recent advancements in deep learning and artificial intelligence, spoken language identification applications play an increasingly significant role in our day-to-day lives, especially in the domain of multi-lingual speech recognition. In this article, we propose a spoken language identification system that operates on sequences of feature vectors. The proposed system uses a hybrid Convolutional Recurrent Neural Network (CRNN), which combines a Convolutional Neural Network (CNN) with a Recurrent Neural Network (RNN), to identify seven spoken languages, including Arabic, chosen from subsets of the Mozilla Common Voice (MCV) corpus. The proposed system exploits the advantages of both the CNN and RNN architectures to construct the CRNN architecture. At the feature extraction stage, it compares the Gammatone Cepstral Coefficient (GTCC) feature and the Mel Frequency Cepstral Coefficient (MFCC) feature, as well as a combination of both. Finally, the speech signals are represented as frames and used as input to the CRNN architecture. The experimental results indicate higher performance with combined GTCC and MFCC features compared to GTCC or MFCC features used individually. The average accuracy of the proposed system was 92.81% in the best spoken language identification experiment. Furthermore, the system can learn language-specific patterns in various filter size representations of speech files.

Graphical Abstract

1. Introduction

Several issues have impeded the development of Automatic Speech Recognition (ASR) systems. The most important of these is spoken Language Identification (LID) in multi-lingual ASR systems [1]. The purpose of spoken LID systems is to classify the spoken language from a given audio sample. As such, in the absence of automatic language detection, speech utterances cannot be correctly processed, and grammatical rules cannot be applied [2]. Language has no geographical boundaries in the contemporary era. More than five thousand languages are spoken worldwide, and each has distinct properties at different levels, from acoustics to semantics [3,4,5]. Western countries have made significant progress in using applications based on spoken language recognition. However, these applications have not gained much traction in multi-lingual and multi-dialectal Arabic countries, including Saudi Arabia, due to the complexity of the native Arabic language [6,7]. Spoken LID systems seek to determine and classify the language that is spoken within a speech utterance [8]. The problem of identifying the spoken language of a given audio file is of considerable interest to a range of multi-lingual speech processing applications, including speech translation [9], multi-lingual speech recognition [10,11,12], spoken document retrieval [13], and defense and surveillance applications [14]. This article's main contribution lies in the identification of seven spoken languages, including Arabic, using a public audio speech corpus and the joint use of MFCC and GTCC features under the CRNN framework. The contributions can be summarized as follows: (1) We present a hybrid CRNN that combines a CNN's descriptive capabilities with an LSTM architecture's capacity to capture sequential data; (2) We compare the effect of fine-tuning the features; (3) We conduct comprehensive tests and experiments with the proposed architecture, demonstrating its applicability in a variety of contexts, as well as its extension to new languages; and (4) We use system errors to infer the degree of similarity between the considered languages.
The article is structured as follows: in Section 2, we provide a literature review in the field of spoken LID systems; in Section 3, the chosen speech corpora are presented; Section 4 showcases the proposed system; in Section 5, we evaluate the proposed system using extensive experiments; in Section 6, we present and discuss the results; and finally, in Section 7, we present concluding remarks and recommendations for future research.

2. Literature Review

A comprehensive survey of the literature on state-of-the-art spoken LID, focusing especially on speech features and architectures, was conducted by Shen et al. [15]. Most of the architectures used spectral and acoustic features for spoken LID [16,17,18]. However, there have been few reported attempts at spoken LID in which Arabic is among the set of target languages. In 1999, Nofal et al. [19] proposed a spoken LID system for identifying just two languages: Arabic and English. Zazo et al. [20] presented an LSTM-Recurrent Neural Network (LSTM-RNN) architecture for a spoken LID system. The system outperformed a reference I-vector system by 26.00% on a subset of the National Institute of Standards and Technology (NIST) Language Recognition Evaluation (LRE) 2009 corpus for eight languages. Furthermore, the results demonstrated that an LSTM-RNN architecture could identify languages with an accuracy of 70.90%. Kumar [16] proposed Fourier Parameter (FP) features for a spoken LID system. The performance of the FP features was analyzed and compared with the legacy Mel Frequency Cepstral Coefficient (MFCC) features. They used the Indian Institute of Technology Kharagpur Multi-lingual Indian Language Speech Corpus (IITKGP-MLILSC) and the Oriental Language Recognition Speech Corpus (AP18-OLR) to identify the spoken language. Finally, they developed three architectures with the extracted FP and MFCC features: Support Vector Machine (SVM), feed-forward Artificial Neural Network (ANN), and Deep Neural Network (DNN). The results demonstrated that the proposed FP features effectively recognize different spoken languages from speech signals. A performance improvement was also observed when combining FP and MFCC.
Draghici et al. [21] investigated previously proposed architectures for spoken LID systems based on CNN and CRNN architectures, using a set of seven languages. Despite the increasing complexity of the architectures, the approach was successful, achieving an accuracy of 71.00% for the CNN architecture and 83.00% for the CRNN architecture. Kim and Park [11] proposed a spoken LID system and investigated another feature that characterizes language-specific properties: speech rhythm. They evaluated two corpora published by different organizations: the Speech Information Technology and Industry Promotion Center (SiTEC) corpus and the Mozilla Common Voice (MCV) corpus. Despite the low computational complexity of the system, the results were either poorer than or similar to the performance of traditional approaches, with an error rate of up to 65.66%. Guha et al. [22] developed a hybrid Feature Selection (FS) algorithm using the versatile Harmony Search (HS) algorithm and the Naked Mole-Rat (NMR) algorithm for Human-Computer Interaction (HCI)-based applications, attempting to classify various languages from three corpora: the Collection of Single Speaker Speech corpus (CSS10) for ten languages, the VoxForge corpus for six languages, and the IIT-Madras (IIT-M) speech corpus for ten languages. The study extracted Mel spectrogram and Relative Spectral Transform-Perceptual Linear Prediction (RASTA-PLP) features, which were used as input for five classifiers: SVM, k-Nearest Neighbor (k-NN), Multi-layer Perceptron (MLP), Naïve Bayes (NB), and Random Forest (RF). Accuracies of 99.89% on CSS10, 98.22% on VoxForge, and 99.75% on the IIT-Madras speech corpus were achieved using RF. Sangwan et al. [23] presented a system that used hybrid features and ANN learning architectures for spoken LID of four languages. First, they used RASTA-PLP features as well as MFCC features combined with RASTA-PLP features. The classifier selected for the spoken LID system was a Feed-Forward Back Propagation Neural Network (FFBPNN). The results indicate better performance with combined MFCC and RASTA-PLP features compared to the individual use of RASTA-PLP features. The proposed classification yielded an accuracy of 95.30%. Garain et al. [24] suggested a deep learning-based system called FuzzyGCP for spoken LID. This system integrates the classification principles of Deep Convolutional Neural Networks (DCNN), a Deep Dumb Multi-Layer Perceptron (DDMLP), and Semi-Supervised Generative Adversarial Networks (SSGAN). They tested the system on four corpora: two Indic and two foreign-language corpora. The system achieved its best accuracy of 98.00% on the Multi-lingual Corpus of Sentence-Aligned Spoken Utterances (MaSS) and its worst performance of 67.00% on the VoxForge corpus. Shen et al. [25] proposed an RNN Transducer (RNN-T) architecture for spoken LID. The system exploits phonetically-aware acoustic features and explicit linguistic features for LID tasks. Experiments were conducted on eight languages from the Multi-lingual LibriSpeech (MLS) corpus. The results showed that the proposed system significantly improved the performance on spoken LID tasks, with relative improvements of 12.00% to 59.00% and 16.00% to 24.00% on in-domain and cross-domain corpora, respectively.
Spoken LID systems have been widely investigated using multiple architectures in terms of feature extraction and classifier learning. However, coverage remains limited for most human languages [26,27]. Additionally, while many researchers have successfully used deep learning techniques to solve spoken LID problems, as presented in [22,23,24], the central problem remains that the Arabic language has received only limited attention in the spoken LID research community [19,28,29]. Therefore, this article seeks to improve the performance of spoken LID tasks for six spoken languages from industrial countries in conjunction with Arabic [30]. These seven languages are among the ten most in-demand foreign languages across the globe [31]. A summary of the literature on spoken LID systems is presented in Table 1.

3. Selected Speech Corpus

The corpus selected for this article is from the Mozilla Common Voice (MCV) corpora. MCV is a multi-lingual speech corpus intended for speech technology research and development. It is designed for ASR purposes and is also well known for being useful in other domains, e.g., LID and gender classification [34]. Each entry in the corpus consists of an individual MP3 file and a corresponding text file [35]. The most recent release includes 87 languages, and the publishers periodically add more voices and languages. Over 200,000 male and female speakers have participated, resulting in 18,243 h of collected audio.
To achieve scale and sustainability, the MCV project uses crowdsourcing for data collection and validation [36]. As an example use case for MCV, we present our spoken LID experiments using subsets of MCV for seven target languages: Arabic, German, English, Spanish, French, Russian, and Chinese. These are the first experiments undertaken on most of these languages for spoken LID. The aim was to compare the spoken LID performance and outcomes of Arabic with other languages. Table 2 presents the characteristics of the MCV corpus. Table 3 summarizes the experimental corpus.

4. Proposed Spoken Language Identification System

This section outlines the motivation for the article and describes the proposed spoken LID system presented in Figure 1.

4.1. Motivation

The motivation for this article is to present a novel spoken LID system for seven languages, including Arabic, and to investigate and analyze spoken LID problems in the context of the most advanced systems available.

4.2. Preprocessing

Data preprocessing is critical for the success of deep learning systems [37,38]. In this article, data preprocessing was done by detecting the boundaries of speech. In particular, speech signals were segmented and silent parts removed based on the detected speech boundaries using the detectSpeech function of MATLAB R2020b [39]. This was done to prevent the silent parts from influencing the classification task. Figure 2 shows plots of the detected speech boundaries.
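A minimal MATLAB sketch of this step is given below. It uses the Audio Toolbox detectSpeech function named above; the file name and the trimming loop are illustrative, not the authors' exact script.

```matlab
% Minimal preprocessing sketch: detect speech boundaries and drop silent parts.
% The file name is illustrative; MCV clips are distributed as MP3 files.
[x, fs] = audioread("common_voice_clip.mp3");
idx = detectSpeech(x, fs);                 % N-by-2 matrix of [start end] sample indices
segments = cell(size(idx, 1), 1);
for k = 1:size(idx, 1)
    segments{k} = x(idx(k, 1):idx(k, 2));  % keep only the detected speech regions
end
speechOnly = vertcat(segments{:});         % silence-free signal used for feature extraction
```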

4.3. Selected Features

To reduce signal redundancy and improve accuracy in the spoken LID system, we propose combining MFCC and GTCC features [40].

4.3.1. MFCC

The Mel Frequency Cepstral Coefficient (MFCC) is the most popular feature extraction method in speech processing [41] and is regarded as one of the best feature extraction techniques [42]. The process for building an MFCC feature vector is as follows.
First, a pre-emphasis filter is applied to the signal. The signal is then divided into frames, and a windowing function is applied to each frame. The Discrete Fourier Transform (DFT) of each frame is computed, and the amplitude spectrum is mapped to the Mel domain through a filter bank (Mel scale); this scale approximates the frequency resolution of a human listener. After the logarithm is taken, the Discrete Cosine Transform (DCT) is applied to the resulting vector, and the desired number of coefficients per frame is retained [43].

4.3.2. GTCC

The Gammatone Cepstral Coefficient (GTCC) is also referred to as the Gammatone Frequency Cepstral Coefficient (GFCC). The Gammatone filter is biologically inspired: the response of the human auditory filter in the cochlea is very similar to the magnitude response of a Gammatone filter [44]. The method of extracting GTCC is similar to that of MFCC, with two main differences. The first is the frequency scale used: GTCC provides greater resolution at low frequencies than MFCC because it is based on the equivalent rectangular bandwidth scale, while MFCC is based on the Mel scale. The second is that GTCC uses the cubic root for compression, whereas MFCC employs the logarithm.
Due to its high accuracy and low complexity, MFCC is frequently used in sound processing systems; however, it is not highly resistant to noise. Recent investigations have revealed that GTCC, by contrast, is particularly resistant to noise and acoustic change [45]. With these considerations in mind, the primary goal is to combine the characteristics of GTCC and MFCC to improve the overall testing accuracy of the system.
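As an illustration of this combined front end, the following MATLAB sketch extracts 13 GTCC and 13 MFCC coefficients per frame and buffers them into the fixed-length sequences used later as network inputs. It reuses speechOnly and fs from the preprocessing sketch in Section 4.2, assumes the 30 ms Hamming window and 20 ms step reported in Section 5.2 and the 50-frame/47-overlap framing of Section 4.5, and is a sketch of the idea rather than the authors' exact code.

```matlab
% Combined GTCC + MFCC front end (13 + 13 = 26 coefficients per frame).
winLen = round(0.030 * fs);                   % 30 ms analysis window
hopLen = round(0.020 * fs);                   % 20 ms window step
afe = audioFeatureExtractor('SampleRate', fs, ...
    'Window', hamming(winLen, 'periodic'), ...
    'OverlapLength', winLen - hopLen, ...
    'mfcc', true, 'gtcc', true);
setExtractorParams(afe, 'mfcc', 'NumCoeffs', 13);   % make the 13 + 13 split explicit
setExtractorParams(afe, 'gtcc', 'NumCoeffs', 13);
F = extract(afe, speechOnly);                 % T-by-26 matrix of frame-level features

% Buffer the frames into sequences of 50 frames with a 47-frame overlap (hop of 3),
% giving the [26 50 1] inputs described in Section 4.5.
seqLen = 50; seqHop = seqLen - 47;
starts = 1:seqHop:(size(F, 1) - seqLen + 1);
sequences = cell(numel(starts), 1);
for k = 1:numel(starts)
    sequences{k} = reshape(F(starts(k):starts(k) + seqLen - 1, :).', [26 seqLen 1]);
end
```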

4.4. Proposed CRNN Architecture

CNNs have been used widely in various application areas, including image classification and speech recognition, where they have achieved state-of-the-art levels of performance. The convolutional layer in a CNN performs feature extraction using convolution and filtering operations: a sliding window (filter) is applied to the input, and the convolution product between the window and the considered input portion is computed. The applied filter represents the characteristic extracted by the network, with its parameters estimated during network training. The objective of the subsampling layers is to reduce the dimensionality of the characteristics resulting from the convolution without significant loss of information; standard methods are max-pooling and average-pooling.
In the RNN family, the LSTM network is a unique subclass [46]. This network has been shown to yield remarkable results for learning long-time dependencies. Recursion is the critical feature of recurrent networks. In particular, this network creates loops in the network diagram that can preserve information. As a result, they can essentially “remember” previous states and use that knowledge to determine what will happen next. This property makes them ideal for working with time series data. LSTM units are used for the construction of recurrent networks. Each LSTM unit is a cell with four gates: the input gate, the external input gate, the forget gate, and the output gate. The most important aspect of the cell is its internal state, which enables it to store and preserve values over time.
This article proposes a CRNN architecture that combines a CNN with an LSTM, as described above. The CNN is an excellent network for feature extraction, while the LSTM has proved its ability to identify the language from sequential data. Figure 3 shows the architecture of the proposed CRNN.
The proposed CRNN architecture, as shown in Figure 3, contains CNN layers for feature extraction on the input data, as well as an LSTM to support long-term temporal dynamics. The novelty of the proposed CRNN architecture lies in using three layers to convert the data representation between the two network types used in the CRNN: a sequence folding layer, a sequence unfolding layer, and a flatten layer.
The sequence folding layer converts the sequences of images to an array of images [47] so that the CNN can extract spatial features from the array. The sequence unfolding layer converts this array of images back to image sequences [48], and the flatten layer converts the image sequences to feature vectors [49] for input to the LSTM layer. Finally, the resulting vectors are passed through the fully connected and output layers of the prediction block to carry out the final prediction [50].

4.5. Layers Description

(1)
The Sequence Input Layer for 2-D image sequence input takes a vector of three elements. With the parameters used in this article, there are 26 feature dimensions (13 GTCC and 13 MFCC coefficients) and 50 feature vectors per sequence. (Because the features extracted from the segments yield different numbers of feature vectors depending on speech file length, framing is needed to buffer the feature vectors into fixed sequences of 50 frames with an overlap of 47 frames.) Hence, the input size is [26 50 1], where 26, 50, and 1 correspond to the height, width, and number of channels of the image, respectively.
(2)
To use Convolutional Layers to extract features (that is, to apply convolutional operations to each speech frame independently), we use a Sequence Folding Layer followed by convolutional layers.
(3)
The Batch Normalization Layer follows the Convolution Layer; Batch Normalization helps the convergence of the learned representations. An ELU Layer then follows.
(4)
The 2-D Average Pooling Layer uses a 1 × 1 pooling region with a stride of [10 10] and padding of [0 0 0 0]. It performs downsampling by dividing the input into rectangular pooling regions and computing the average value of each region.
(5)
We use a Sequence Unfolding Layer and a Flatten Layer to restore the sequence structure and reshape the output to vector sequences.
(6)
We include the LSTM Layers to classify the resulting vector sequences.
(7)
The Dropout Layer, with a dropout probability of 40%, follows every LSTM layer.
(8)
The Fully Connected Layer contains seven neurons.
(9)
The Softmax Layer applies a softmax function to the input.
(10)
The Classification Output Layer acts as the output layer for the proposed system. A minimal layer-by-layer sketch of this stack is given after this list.
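The MATLAB sketch below assembles the layer list above into a layer graph. It uses the Experiment 1 filter settings from Table 4 (filter size 5, 12 filters) and 128 LSTM hidden units, but only a single convolutional block for brevity, whereas the reported experiments use three to five convolutional layers; it is an illustration of the structure, not the authors' full network.

```matlab
% CRNN sketch following Figure 3: fold -> CNN block -> unfold -> flatten -> LSTM.
layers = [
    sequenceInputLayer([26 50 1], 'Name', 'input')       % 26 coeffs x 50 frames x 1 channel
    sequenceFoldingLayer('Name', 'fold')                  % image sequences -> array of images
    convolution2dLayer(5, 12, 'Name', 'conv1')            % filter size 5, 12 filters (Exp. 1)
    batchNormalizationLayer('Name', 'bn1')
    eluLayer('Name', 'elu1')
    averagePooling2dLayer(1, 'Stride', [10 10], 'Name', 'pool')
    sequenceUnfoldingLayer('Name', 'unfold')               % array of images -> image sequences
    flattenLayer('Name', 'flatten')                        % image sequences -> feature vectors
    lstmLayer(128, 'OutputMode', 'last', 'Name', 'lstm')   % 128 hidden units
    dropoutLayer(0.4, 'Name', 'drop')                      % 40% dropout after the LSTM
    fullyConnectedLayer(7, 'Name', 'fc')                   % seven target languages
    softmaxLayer('Name', 'softmax')
    classificationLayer('Name', 'output')];

lgraph = layerGraph(layers);
% The unfolding layer needs the mini-batch size captured by the folding layer.
lgraph = connectLayers(lgraph, 'fold/miniBatchSize', 'unfold/miniBatchSize');
```

Connecting fold/miniBatchSize to unfold/miniBatchSize is the standard MATLAB pattern for applying 2-D convolutions to each element of an image sequence before restoring the sequence structure for the LSTM.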

5. Experiments

This section provides detail regarding performance metrics and the different proposed experiments.

5.1. Performance Evaluation

Two options are presented in this article for evaluating the system's performance: the first is evaluation at the sequence level, and the second is evaluation at the file level. Each speech file contains a different number of sequences; therefore, to compute the accuracy for a complete speech file, as in [51], the accuracy of each sequence was calculated first. Choosing the right metrics is crucial for accurately measuring the performance of trained architectures on a testing corpus, and computing several complementary metrics gives a fuller picture of system behavior. This article used four performance metrics, accuracy (A), precision (P), recall (R), and F-measure (F1) [26], to determine the effectiveness of the CRNN architecture for the spoken LID problem. For all metrics, TP represents all true positives, TN all true negatives, FP all false positives, and FN all false negatives [32].
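The sketch below illustrates the two evaluation levels. The paper does not state the exact rule used to turn sequence-level predictions into a file-level decision, so the majority vote shown here is an assumption for illustration; all variable names are hypothetical placeholders for the classifier outputs and the ground truth.

```matlab
% predictedSeq, trueSeq: numeric class indices (1..7) per sequence;
% seqFileIndex: file index of each sequence; trueFile: class index per file.
seqAccuracy = 100 * mean(predictedSeq == trueSeq);        % per-sequence accuracy (%)

numFiles = max(seqFileIndex);
filePred = zeros(numFiles, 1);
for k = 1:numFiles
    filePred(k) = mode(predictedSeq(seqFileIndex == k));  % majority vote over the file's sequences
end
fileAccuracy = 100 * mean(filePred == trueFile);          % per-file accuracy (%)
```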

5.1.1. Accuracy (A)

As shown in Equation (1), accuracy is the ratio of correctly predicted labels to the total number of labels.
A = \frac{TP + TN}{TP + TN + FP + FN} \times 100    (1)

5.1.2. Precision (P)

As shown in Equation (2), precision is the fraction of true positive predicted labels among all positive predicted labels. A low precision value indicates a high false-positive rate and vice versa.
P = \frac{TP}{TP + FP} \times 100    (2)

5.1.3. Recall (R)

As shown in Equation (3), recall (also known as sensitivity) is the fraction of true positive predicted labels among all actual positive labels. A low recall value suggests a high false-negative rate and vice versa.
R = \frac{TP}{TP + FN} \times 100    (3)

5.1.4. F-Measure (F1)

As shown in Equation (4), the F-measure is the harmonic mean of precision and recall.
F1 = \frac{2 \times P \times R}{P + R}    (4)
The variables P and R represent precision and recall in this equation.
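The four metrics can be computed per class from a confusion matrix in which rows are actual classes and columns are predicted classes, as in Tables 7 and 8. The short MATLAB sketch below is our illustration of Equations (1)-(4), not code from the paper.

```matlab
% Per-class metrics from a 7-by-7 confusion matrix C (rows = actual, columns = predicted).
TP = diag(C);
FP = sum(C, 1).' - TP;                          % predicted as the class, actually another class
FN = sum(C, 2) - TP;                            % actually the class, predicted as another class
TN = sum(C(:)) - TP - FP - FN;

A  = 100 * (TP + TN) ./ (TP + TN + FP + FN);    % Equation (1), per class
P  = 100 * TP ./ (TP + FP);                     % Equation (2)
R  = 100 * TP ./ (TP + FN);                     % Equation (3)
F1 = 2 * P .* R ./ (P + R);                     % Equation (4)
overallAccuracy = 100 * sum(TP) / sum(C(:));    % overall accuracy reported per experiment
```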

5.2. Experimental Setup

The proposed system contains adjustable parameters, including the filter size, the number of filters, the number of layers, and the number of hidden units (for the LSTM), which may be tuned to reduce the prediction error. However, varying these parameters also changes the complexity of the proposed CRNN architecture. We performed 12 experiments with different parameters to study which combination of variables would achieve the highest accuracy for the system. Table 4 provides a snapshot of the configurations and results of a selection of the experiments undertaken. GTCC and MFCC were generated with MATLAB for each audio file in the database using Hamming windows with a window length of 30 ms and a window step of 20 ms. This resulted in [26 50 1] matrices, which were fed into the architecture as inputs. An NVIDIA GeForce RTX 2080 Ti graphics processing unit with 11 GB RAM was employed for training, with a batch size of 128 samples for 15 epochs and a learning rate of 0.0009, using the Adam adaptive gradient descent method as the optimizer.
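A training-configuration sketch consistent with these settings is shown below. The option names come from MATLAB's Deep Learning Toolbox; the shuffling and plotting choices and the variable names trainSequences, trainLabels, and lgraph are our assumptions, not the authors' published script.

```matlab
% Training options matching the reported hyperparameters (Adam, 15 epochs,
% mini-batch of 128, learning rate of 0.0009) on a single GPU.
options = trainingOptions('adam', ...
    'MaxEpochs', 15, ...
    'MiniBatchSize', 128, ...
    'InitialLearnRate', 0.0009, ...
    'Shuffle', 'every-epoch', ...
    'ExecutionEnvironment', 'gpu', ...
    'Plots', 'training-progress', ...
    'Verbose', false);

% lgraph is the CRNN layer graph sketched in Section 4.5; trainSequences/trainLabels
% are the buffered [26 50 1] sequences and their language labels (hypothetical names).
net = trainNetwork(trainSequences, trainLabels, lgraph, options);
```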

6. Results and Discussion

6.1. Feature Comparison Results

In this article, a comparison of the accuracy of GTCC and MFCC was undertaken for spoken LID in the proposed system. The accuracy of GTCC compared with MFCC was evaluated in Experiment 1 of Table 4 for all files using a considered MCV corpus with the execution of five runs. Table 5 shows the experimental results for GTCC and MFCC, as well as combined features.
The highest average testing accuracy was achieved when concatenating GTCC and MFCC as the system features. The average precision of GTCC for spoken LID in Arabic was 94.88%, Chinese 94.20%, English 91.30%, French 90.58%, German 74.84%, Russian 92.98%, and Spanish 89.86%. In contrast, the average precision of MFCC for spoken LID in Arabic was 94.82%, Chinese 94.88%, English 89.76%, French 92.14%, German 76.78%, Russian 95.32%, and Spanish 91.72%. The results indicate that GTCC outperformed MFCC in Arabic and English spoken LID. However, the most effective results were obtained when GTCC was combined with MFCC. These results are illustrated in Figure 4. Nevertheless, Table 5 shows that computation time is a major issue in spoken LID systems when more than one feature is used, especially in real-time settings.
The average precision when combining GTCC with MFCC in Experiment 1 for spoken LID in Arabic was 95.78%, Chinese 94.47%, English 92.35%, French 92.49%, German 80.95%, Russian 95.25%, and Spanish 93.49%. Considering both testing accuracy and execution time, Experiments 1 and 11 offered the best trade-offs between the two, as shown in Table 4. The proposed architecture's accuracy in Experiment 1 reached 92.21% in around 163 min, while in Experiment 11 it reached 93.00% in 221 min. Each experiment was performed for five runs, evaluated both per sequence and per whole file.

6.2. Spoken Language Identification Results

Table 6 summarizes the results for Experiments 1 and 11, as listed in Table 4. Table 6 shows that the overall testing accuracy per file was better than per sequence, suggesting a level of similarity between the languages at the sequence level. The average per-file accuracy over five runs was 91.76% in Experiment 1 and 92.81% in Experiment 11. The highest accuracy per sequence was 83.73% in Experiment 1 and 84.08% in Experiment 11, while the worst accuracy was 82.18% in Experiment 1 and 83.58% in Experiment 11. On the other hand, the highest per-file accuracy was 92.21% in Experiment 1 and 93.11% in Experiment 11, while the lowest accuracy was 91.25% in Experiment 1 and 92.46% in Experiment 11. As shown in Table 6, the standard deviation is low, especially in Experiment 11, which means that the accuracies of all runs are very close to the average accuracy; this also indicates that the system's classification is stable and robust.
The choice between accuracy and time is a well-known dilemma, and many studies have discussed the interpretability vs. performance trade-off. It is not always true that more complex models produce more accurate results; in particular, this assumption can fail when the given data is structured and has meaningful features. The statement "models that are more complicated are more accurate" can be valid when the function being approximated is complex, the given data is widely distributed among suitable values for each variable, and the given data is adequate for building a complex model. The trade-off between interpretability and performance is evident in this circumstance [52,53].
In our work, the Experiment 11 results were favorable compared to the Experiment 1 results in terms of the proposed system's accuracy, while Experiment 1 outperformed Experiment 11 in terms of execution speed. That is, Experiment 1 is around 0.9% less accurate ((93 − 92.21)/92.21) but 27% faster ((221 − 162)/221) than Experiment 11. Figure 5 illustrates the language identification accuracy results for the two experiments per sequence and per file. The highest per-language accuracy was achieved for Russian, followed by Arabic, while the lowest was for Spanish.
In particular, two conventional architectures have also revealed difficulties in discriminating between English and Spanish due to their similarity [11]. The Spanish language is also similar to the German language [54], while French is similar to Spanish [21]. German and English are more likely to be confused, but English has a slightly stronger bias toward French [2]. All in all, the representations learned by the architecture are quite distinctive for each language. For more details, Table 7 and Table 8 present the confusion matrices of the average of five runs per file for Experiments 1 and 11, together with the average accuracy, precision, recall, and F-measure results.

6.3. Discussion

From the Experiment 1 results in Table 7, five Arabic test files were identified as Chinese and eight as German. Fourteen Chinese test files were identified as English, nine as French, fifteen as German, five as Russian, and six as Spanish. Sixteen English test files were identified as German and six as Spanish. Moreover, twenty-one French test files were identified as German and five as Spanish. In addition, six Spanish test files were identified as Chinese, eight as English, seven as French, twenty-six as German, and six as Russian. This experiment confuses Chinese and Spanish, English and Spanish, and Spanish and French. As mentioned previously, these are similar languages. Additionally, based on the results of the confusion matrices in Table 7, we can conclude that the German language is confused with Arabic, Chinese, English, French, and Spanish, as in the predicted German column. This implies a considerable similarity between German and these languages. We can also conclude that German is the only language that serves as a source of confusion for Arabic, as in the actual Arabic row. However, Arabic causes minimal confusion for all other languages, as in the predicted Arabic column.
From the Experiment 11 results in Table 8, seven Arabic test files were identified as German. Ten Chinese test files were identified as English, eight as French, and fourteen as German. Sixteen English test files were identified as German and five as Spanish. Twenty-one French test files were identified as German. Seven Spanish test files were identified as English, eight as French, and twenty-four as German. As the results indicate, this experiment still confuses English and Spanish. In contrast, the system was effective at distinguishing between Spanish and Chinese, as well as between French and Spanish. We can draw the same conclusions from Experiment 11 as from Experiment 1. In particular, German was the main source of confusion for all six other languages, as seen in Table 8. In addition, Arabic was not a source of confusion for any of the other six languages.
From Table 7 and Table 8, German achieved the lowest P values (80.95% and 81.87% in Experiments 1 and 11, respectively), the highest R values (96.45% and 97.55%, respectively), and the lowest F1 values (88.02% and 89.02%, respectively). The impact of the German language is clear from the confusion matrices in these tables, where most spoken LID error files of any language, and of Spanish in particular, were linked to German. Interpreting the relationships between these languages requires further linguistic research. Nevertheless, Table 7 and Table 8 indicate that the proposed system is well suited to identifying spoken Arabic and distinguishing it from the other six languages. Moreover, the system can discriminate among similar languages with high accuracy. Table 9 presents the studies that used CRNN, CNN, or LSTM architectures, or the Mozilla corpus, for comparison with our proposed system.
As shown in Table 9, we list the features, corpus, languages, classifiers, and number of classes used in each study to provide a general comparison of the available studies in the spoken LID field. Compared to these studies, our proposed system, applied to languages of industrial countries together with Arabic, obtained the best results.

7. Conclusions

In this article, we proposed a spoken LID system using a CRNN architecture that works on combined GTCC and MFCC features of speech signals. At the feature extraction stage, the GTCC and MFCC features were compared, as well as a combination of both. The average precision of GTCC for spoken LID in Arabic was 94.88%, Chinese 94.20%, English 91.30%, French 90.58%, German 74.84%, Russian 92.98%, and Spanish 89.86%. In Experiment 1, by contrast, the average precision of MFCC for spoken LID in Arabic was 94.82%, Chinese 94.88%, English 89.76%, French 92.14%, German 76.78%, Russian 95.32%, and Spanish 91.72%. The results indicate that GTCC outperformed MFCC in Arabic and English spoken LID. We performed extensive experiments with different parameters and reported a state-of-the-art result in that the system in Experiment 11 achieved the highest overall average accuracy of 92.81% for identifying the seven spoken languages considered in the selected corpus.
The average precision when combining GTCC and MFCC in Experiment 11 for spoken LID in Arabic was 95.84%, Chinese 96.82%, English 93.85%, French 92.64%, German 81.87%, Russian 96.79%, and Spanish 94.46%. Furthermore, our experiments indicate that the proposed system gives the best results for identifying Arabic and Russian, discriminates among similar languages, and is extensible to new languages. Additionally, this article provides a good reference for people interested in developing Arabic-related ASR systems.
In future research, we can extend the number of languages and compare our CRNN architecture with other variations of deep learning architectures. Many other speech corpora can also be used for evaluation. The immediate benefit of an effective multi-lingual spoken LID system is that further applications can be developed based on the system's output for different purposes, including multi-lingual automatic translation.

Author Contributions

Conceptualization, A.A.A., M.A.Q., A.H.M. and Y.A.A.; methodology, A.A.A. and M.A.Q.; software, A.A.A. and M.A.Q.; validation, A.A.A., M.A.Q., A.H.M. and Y.A.A.; formal analysis, A.A.A., M.A.Q., A.H.M. and Y.A.A.; investigation, A.A.A. and Y.A.A.; resources, A.A.A. and M.A.Q.; data curation, A.A.A.; writing—original draft preparation, A.A.A., M.A.Q. and A.H.M.; writing—review and editing, A.A.A. and Y.A.A.; visualization, A.A.A.; supervision, Y.A.A.; project administration, A.A.A.; funding acquisition, Y.A.A. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partially supported by the Researchers Supporting Project number (RSP-2021/322), King Saud University, Riyadh, Saudi Arabia.

Data Availability Statement

Not applicable.

Acknowledgments

The authors would like to thank Researchers Supporting Project at King Saud University.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Lounnas, K.; Satori, H.; Hamidi, M.; Teffahi, H.; Abbas, M.; Lichouri, M. CLIASR: A Combined Automatic Speech Recognition and Language Identification System. In Proceedings of the 2020 1st International Conference on Innovative Research in Applied Science, Engineering and Technology (IRASET), Meknes, Morocco, 16–19 April 2020; pp. 1–5. [Google Scholar] [CrossRef]
  2. Bartz, C.; Herold, T.; Yang, H.; Meinel, C. Language Identification Using Deep Convolutional Recurrent Neural Networks. arXiv 2017. [Google Scholar] [CrossRef]
  3. Fromkin, V.; Rodman, R.; Hyams, N.M. An Introduction to Language, 10th ed.; Wadsworth/Cengage Learning: Boston, MA, USA, 2014. [Google Scholar]
  4. The World’s Major Languages; Routledge Handbooks Online: London, UK, 2008. [CrossRef]
  5. Crystal, D. The Cambridge Encyclopedia of Language, 3rd ed.; Cambridge University Press: Cambridge, NY, USA, 2010. [Google Scholar]
  6. Shaalan, K.; Siddiqui, S.; Alkhatib, M.; Monem, A.A. Challenges in Arabic Natural Language Processing. In Systems Computational Linguistics, Speech and Image Processing for Arabic Language; World Scientific: Singapore, 2018; pp. 59–83. [Google Scholar] [CrossRef]
  7. Alotaibi, Y.A.; Muhammad, G. Study on pharyngeal and uvular consonants in foreign accented Arabic for ASR. Comput. Speech Lang. 2010, 24, 219–231. [Google Scholar] [CrossRef]
  8. Li, H.; Ma, B.; Lee, K.A. Spoken Language Recognition: From Fundamentals to Practice. Proc. IEEE 2013, 101, 1136–1159. Available online: https://ieeexplore.ieee.org/document/6451097 (accessed on 26 February 2022).
  9. Waibel, A.; Geutner, P.; Tomokiyo, L.; Schultz, T.; Woszczyna, M. Multilinguality in speech and spoken language systems. Proc. IEEE 2000, 88, 1297–1313. [Google Scholar] [CrossRef]
  10. Schultz, T.; Waibel, A. Language-independent and language-adaptive acoustic modeling for speech recognition. Speech Commun. 2001, 35, 31–51. [Google Scholar] [CrossRef]
  11. Kim, H.; Park, J.-S. Automatic Language Identification Using Speech Rhythm Features for Multi-Lingual Speech Recognition. Appl. Sci. 2020, 10, 2225. [Google Scholar] [CrossRef]
  12. Liu, D.; Xu, J.; Zhang, P.; Yan, Y. A unified system for multilingual speech recognition and language identification. Speech Commun. 2020, 127, 17–28. [Google Scholar] [CrossRef]
  13. Chelba, C.; Hazen, T.; Saraclar, M. Retrieval and browsing of spoken content. IEEE Signal Process. Mag. 2008, 25, 39–49. [Google Scholar] [CrossRef]
  14. Walker, K.; Strassel, S. The RATS radio traffic collection system. In Odyssey Speaker and Language Recognition Workshop; ISCA: Cape Town, South Africa, 2012. [Google Scholar]
  15. Shen, P.; Lu, X.; Li, S.; Kawai, H. Knowledge Distillation-Based Representation Learning for Short-Utterance Spoken Language Identification. IEEE/ACM Trans. Audio Speech Lang. Process. 2020, 28, 2674–2683. [Google Scholar] [CrossRef]
  16. Srinivas, N.S.S.; Sugan, N.; Kar, N.; Kumar, L.S.; Nath, M.K.; Kanhe, A. Recognition of Spoken Languages from Acoustic Speech Signals Using Fourier Parameters. Circuits Syst. Signal Process. 2019, 38, 5018–5067. [Google Scholar] [CrossRef]
  17. He, K.; Xu, W.; Yan, Y. Multi-Level Cross-Lingual Transfer Learning With Language Shared and Specific Knowledge for Spoken Language Understanding. IEEE Access 2020, 8, 29407–29416. [Google Scholar] [CrossRef]
  18. Padi, B.; Mohan, A.; Ganapathy, S. Towards Relevance and Sequence Modeling in Language Recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 2020, 28, 1223–1232. [Google Scholar] [CrossRef]
  19. Nofal, M.; Abdel-Reheem, E.; El Henawy, H. Arabic/English automatic spoken language identification. In Proceedings of the 1999 IEEE Pacific Rim Conference on Communications, Computers and Signal Processing (PACRIM 1999). Conference Proceedings (Cat. No.99CH36368), Victoria, BC, Canada, 22–24 August 1999; pp. 400–403. [Google Scholar] [CrossRef]
  20. Zazo, R.; Lozano-Diez, A.; Gonzalez-Dominguez, J.; Toledano, D.T.; Gonzalez-Rodriguez, J. Language Identification in Short Utterances Using Long Short-Term Memory (LSTM) Recurrent Neural Networks. PLoS ONE 2016, 11, e0146917. [Google Scholar] [CrossRef]
  21. Draghici, A.; Abeßer, J.; Lukashevich, H. A study on spoken language identification using deep neural networks. In Proceedings of the 15th International Conference on Audio Mostly, New York, NY, USA, 15–17 September 2020; Association for Computing Machinery: New York, NY, USA, 2020; pp. 253–256. Available online: https://doi.org/10.1145/3411109.3411123 (accessed on 17 June 2022).
  22. Guha, S.; Das, A.; Singh, P.K.; Ahmadian, A.; Senu, N.; Sarkar, R. Hybrid Feature Selection Method Based on Harmony Search and Naked Mole-Rat Algorithms for Spoken Language Identification From Audio Signals. IEEE Access 2020, 8, 182868–182887. [Google Scholar] [CrossRef]
  23. Sangwan, P.; Deshwal, D.; Dahiya, N. Performance of a language identification system using hybrid features and ANN learning algorithms. Appl. Acoust. 2021, 175, 107815. [Google Scholar] [CrossRef]
  24. Garain, A.; Singh, P.K.; Sarkar, R. FuzzyGCP: A deep learning architecture for automatic spoken language identification from speech signals. Expert Syst. Appl. 2021, 168, 114416. [Google Scholar] [CrossRef]
  25. Shen, P.; Lu, X.; Kawai, H. Transducer-based language embedding for spoken language identification. arXiv 2022, arXiv:2204.03888. [Google Scholar]
  26. Das, A.; Guha, S.; Singh, P.K.; Ahmadian, A.; Senu, N.; Sarkar, R. A Hybrid Meta-Heuristic Feature Selection Method for Identification of Indian Spoken Languages From Audio Signals. IEEE Access 2020, 8, 181432–181449. [Google Scholar] [CrossRef]
  27. Ma, Z.; Yu, H. Language Identification with Deep Bottleneck Features. arXiv 2020, arXiv:1809.08909. Available online: http://arxiv.org/abs/1809.08909 (accessed on 23 February 2022).
  28. Alshutayri, A.; Albarhamtoshy, H. Arabic Spoken Language Identification System (ASLIS): A Proposed System to Identifying Modern Standard Arabic (MSA) and Egyptian Dialect. In Proceedings of the Informatics Engineering and Information Science Conference, Kuala Lumpur, Malaysia, 12–14 November 2011; Springer: Berlin/Heidelberg, Germany; pp. 375–385. [Google Scholar] [CrossRef]
  29. Mohammed, E.M.; Sayed, M.S.; Moselhy, A.M.; Abdelnaiem, A.A. LPC and MFCC Performance Evaluation with Artificial Neural Network for Spoken Language Identification. Int. J. Signal Process. Image Process. Pattern Recognit. 2013, 6, 55. [Google Scholar]
  30. Pimentel, I. The Top 10 Languages in Higher Demand for Business. Available online: https://blog.acolad.com/the-top-10-languages-in-higher-demand-for-business (accessed on 21 August 2022).
  31. “10 Foreign Languages in Demand across the Globe”. Education World, 19 November 2018. Available online: https://www.educationworld.in/foreign-languages-in-demand-across-the-globe/ (accessed on 21 August 2022).
  32. Sisodia, D.S.; Nikhil, S.; Kiran, G.S.; Sathvik, P. Ensemble Learners for Identification of Spoken Languages using Mel Frequency Cepstral Coefficients. In Proceedings of the 2nd International Conference on Data, Engineering and Applications (IDEA), Bhopal, India, 28–29 February 2020; pp. 1–5. [Google Scholar] [CrossRef]
  33. Singh, G.; Sharma, S.; Kumar, V.; Kaur, M.; Baz, M.; Masud, M. Spoken Language Identification Using Deep Learning. Comput. Intell. Neurosci. 2021. [Google Scholar] [CrossRef]
  34. Alashban, A.A.; Alotaibi, Y.A. Speaker Gender Classification in Mono-Language and Cross-Language Using BLSTM Network. In Proceedings of the 2021 44th International Conference on Telecommunications and Signal Processing (TSP), Brno, Czech Republic, 26–28 July 2021; pp. 66–71. [Google Scholar] [CrossRef]
  35. Mozilla Common Voice. Available online: https://commonvoice.mozilla.org/ (accessed on 27 February 2022).
  36. Ardila, R.; Branson, M.; Davis, K.; Henretty, M.; Kohler, M.; Meyer, J.; Morais, R.; Saunders, L.; Tyers, F.M.; Weber, G. Common Voice: A Massively-Multilingual Speech Corpus. arXiv 2020, arXiv:1912.06670. Available online: http://arxiv.org/abs/1912.06670 (accessed on 27 December 2021).
  37. Automatic Speech Recognition: A Deep Learning Approach—PDF Drive. Available online: http://www.pdfdrive.com/automatic-speech-recognition-a-deep-learning-approach-e177783075.html (accessed on 30 March 2022).
  38. Alashban, A.A.; Alotaibi, Y.A. Language Effect on Speaker Gender Classification Using Deep Learning. In Proceedings of the 2022 2nd International Conference on Artificial Intelligence and Signal Processing (AISP), Vijayawada, India, 12–14 February 2022; pp. 1–6. [Google Scholar] [CrossRef]
  39. Detect Boundaries of Speech in Audio Signal—MATLAB detectSpeech—MathWorks Switzerland. Available online: https://ch.mathworks.com/help/audio/ref/detectspeech.html (accessed on 17 August 2022).
  40. Journal, I. Extracting Mfcc and Gtcc Features for Emotion Recognition from Audio Speech Signals. Available online: https://www.academia.edu/8088548/EXTRACTING_MFCC_AND_GTCC_FEATURES_FOR_EMOTION_RECOGNITION_FROM_AUDIO_SPEECH_SIGNALS (accessed on 31 March 2022).
  41. Kotsakis, R.; Matsiola, M.; Kalliris, G.; Dimoulas, C. Investigation of Spoken-Language Detection and Classification in Broadcasted Audio Content. Information 2020, 11, 211. [Google Scholar] [CrossRef]
  42. Dua, S.; Kumar, S.S.; Albagory, Y.; Ramalingam, R.; Dumka, A.; Singh, R.; Rashid, M.; Gehlot, A.; Alshamrani, S.S.; AlGhamdi, A.S. Developing a Speech Recognition System for Recognizing Tonal Speech Signals Using a Convolutional Neural Network. Appl. Sci. 2022, 12, 6223. [Google Scholar] [CrossRef]
  43. Nisar, S.; Shahzad, I.; Khan, M.A.; Tariq, M. Pashto spoken digits recognition using spectral and prosodic based feature extraction. In Proceedings of the 2017 Ninth International Conference on Advanced Computational Intelligence (ICACI), Doha, Qatar, 4–6 February 2017; pp. 74–78. [Google Scholar] [CrossRef]
  44. Liu, G.K. Evaluating Gammatone Frequency Cepstral Coefficients with Neural Networks for Emotion Recognition from Speech. arXiv 2018, arXiv:1806.09010. [Google Scholar]
  45. Liu, J.-M.; You, M.; Li, G.-Z.; Wang, Z.; Xu, X.; Qiu, Z.; Xie, W.; An, C.; Chen, S. Cough signal recognition with Gammatone Cepstral Coefficients. In Proceedings of the 2013 IEEE China Summit and International Conference on Signal and Information Processing, Beijing, China, 6–10 July 2013; pp. 160–164. [Google Scholar] [CrossRef]
  46. Alcaraz, J.C.; Moghaddamnia, S.; Peissig, J. Efficiency of deep neural networks for joint angle modeling in digital gait assessment. EURASIP J. Adv. Signal Process 2021, 2021, 10. [Google Scholar] [CrossRef]
  47. Sequence Folding Layer—MATLAB—MathWorks Switzerland. Available online: https://ch.mathworks.com/help/deeplearning/ref/nnet.cnn.layer.sequencefoldinglayer.html#mw_e600a552-2ab0-48a8-b1d9-ae672b821805 (accessed on 18 August 2022).
  48. Sequence Unfolding Layer—MATLAB—MathWorks Switzerland. Available online: https://ch.mathworks.com/help/deeplearning/ref/nnet.cnn.layer.sequenceunfoldinglayer.html?searchHighlight=unfolding%20layer&s_tid=srchtitle_unfolding%20layer_1 (accessed on 18 August 2022).
  49. Flatten Layer—MATLAB—MathWorks Switzerland. Available online: https://ch.mathworks.com/help/deeplearning/ref/nnet.cnn.layer.flattenlayer.html?searchHighlight=flatten%20layer&s_tid=srchtitle_flatten%20layer_1 (accessed on 18 August 2022).
  50. Time Series Forecasting Using Hybrid CNN—RNN. Available online: https://ch.mathworks.com/matlabcentral/fileexchange/91360-time-series-forecasting-using-hybrid-cnn-rnn (accessed on 30 March 2022).
  51. Qamhan, M.A.; Altaheri, H.; Meftah, A.H.; Muhammad, G.; Alotaibi, Y.A. Digital Audio Forensics: Microphone and Environment Classification Using Deep Learning. IEEE Access 2021, 9, 62719–62733. [Google Scholar] [CrossRef]
  52. Arrieta, A.B.; Díaz-Rodríguez, N.; Del Ser, J.; Bennetot, A.; Tabik, S.; Barbado, A.; Garcia, S.; Gil-Lopez, S.; Molina, D.; Benjamins, R.; et al. Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Inf. Fusion 2020, 58, 82–115. [Google Scholar] [CrossRef]
  53. Saeed, W.; Omlin, C. Explainable AI (XAI): A Systematic Meta-Survey of Current Challenges and Future Opportunities. arXiv 2021. [Google Scholar] [CrossRef]
  54. Božinović, N.; Perić, B. The role of typology and formal similarity in third language acquisition (German and Spanish). Stran-Jez. 2021, 50, 9–30. [Google Scholar] [CrossRef]
Figure 1. Spoken LID system.
Figure 2. Data preprocessing step.
Figure 3. CRNN architecture.
Figure 4. The average precision of each feature.
Figure 5. Language identification accuracies.
Table 1. Review of previous literature on spoken language identification with results.
Reference | Year | Features | Classifiers | Corpus | Classes | Languages | Results (%)
Zazo et al. [20] | 2016 | MFCC-SDC | LSTM-RNN | NIST LRE | 8 | Dari, English, French, Chinese, Pashto, Russian, Spanish, and Urdu | Accuracy = 70.90
Bartz et al. [2] | 2017 | Spectrogram | CRNN | European Parliament Statements and News Channels on YouTube | 6 | English, French, German, Chinese, Russian, and Spanish | Accuracy = 91.00
Ma and Yu [27] | 2018 | DNN-BN | LSTM | AP17-OLR | 10 | Tibetan, Japanese, Kazakh, Korean, Indonesian, Mandarin, Cantonese, Vietnamese, Uyghur, and Russian | ER = 50.00
Kumar [16] | 2019 | FP-MFCC | ANN | IITKGP-MLILSC and AP18-OLR | 10 | Russian, Vietnamese, Indonesian, Cantonese, Japanese, Kazakh, Korean, Tibetan, Uyghur, and Mandarin | Accuracy = 70.80
Kim and Park [11] | 2020 | Rhythm | R-Vector with I-Vector | SiTEC | 2 | English and Korean | ER = 4.73
Kim and Park [11] | 2020 | Rhythm | R-Vector with I-Vector | Mozilla | 3 | Chinese, English, and Spanish | ER = 47.38
Guha et al. [22] | 2020 | Mel spectrogram-RASTA-PLP | RF | CSS10 | 10 | French, Chinese, German, Dutch, Spanish, Greek, Finnish, Japanese, Russian, and Hungarian | Accuracy = 99.89
Guha et al. [22] | 2020 | Mel spectrogram-RASTA-PLP | RF | VoxForge | 6 | English, French, German, Italian, Russian, and Spanish | Accuracy = 98.22
Guha et al. [22] | 2020 | Mel spectrogram-RASTA-PLP | RF | IIT-Madras | 10 | Assamese, English, Bangla, Gujarati, Tamil, Hindi, Telugu, Kannada, Marathi, and Malayalam | Accuracy = 99.75
Draghici et al. [21] | 2020 | Mel spectrogram | CRNN | EU Repo | 6 | English, French, German, Greek, Italian, and Spanish | Accuracy = 83.00
Sisodia et al. [32] | 2020 | MFCC-DFCC | Extra Trees | VoxForge | 5 | Dutch, English, French, German, and Portuguese | Accuracy = 85.71
Garain et al. [24] | 2021 | MFCC-Spectral Bandwidth-Spectral Contrast-Spectral Roll-Off-Spectral Flatness-Spectral Centroid-Polynomial-Tonnetz | FuzzyGCP | MaSS | 8 | Basque, English, Finnish, French, Hungarian, Romanian, Russian, and Spanish | Accuracy = 98.00
Garain et al. [24] | 2021 | MFCC-Spectral Bandwidth-Spectral Contrast-Spectral Roll-Off-Spectral Flatness-Spectral Centroid-Polynomial-Tonnetz | FuzzyGCP | VoxForge | 5 | French, German, Italian, Portuguese, and Spanish | Accuracy = 67.00
Sangwan et al. [23] | 2021 | MFCC-RASTA-PLP | FFBPNN | New Corpus | 4 | English, Hindi, Malayalam, and Tamil | Accuracy = 95.30
Singh et al. [33] | 2021 | Log-Mel spectrograms | CNN | Mozilla | 4 | Estonian, Tamil, Turkish, and Mandarin | Accuracy = 80.21
Shen et al. [25] | 2022 | Acoustic-Linguistic | RNN-T | MLS | 8 | Dutch, English, French, German, Polish, Portuguese, Italian, and Spanish | ER = 5.44
Table 2. Mozilla corpus characteristics.
Parameter | Value
Sampling Rate | 48 kHz
Date | 19 January 2022
Validated Hours | 14,122
Recorded Hours | 18,243
Languages | 87
Table 3. Selected Mozilla corpus description.
LanguageNumber of Speech FilesTraining/Validation 80%, Testing 20%
Training (90%) from 80%Validation (10%) from 80%Testing (20%)
Arabic20001440Arabic2000
German20001440German2000
English20001440English2000
Spanish20001440Spanish2000
French20001440French2000
Russian20001440Russian2000
Chinese20001440Chinese2000
Total14,00010,080Total14,000
Table 4. Snapshot of selected experiment configurations and results.
# Experiment | # Convolution Layers | Parameters | # Hidden Units in LSTM Layer | Training Accuracy (%) | Validation Accuracy (%) | Testing Accuracy (%) | Time Taken
1 | 5 | Filter Size = 5, # Filters = 12 | 128 | 99.74 | 94.46 | 92.21 | 162 min 40 s
2 | 5 | Filter Size = 10, # Filters = 32 | 128 | 97.99 | 93.12 | 92.00 | 251 min 37 s
3 | 3 | Filter Size1 = 10, # Filters1 = 32; Filter Size2 = 5, # Filters2 = 12; Filter Size3 = 15, # Filters3 = 64 | 128 | 99.42 | 93.83 | 92.14 | 170 min 4 s
4 | 5 | Filter Size1 = 4, # Filters1 = 32; Filter Size2 = 6, # Filters2 = 12; Filter Size3 = 8, # Filters3 = 64; Filter Size4 = 10, # Filters4 = 32; Filter Size5 = 12, # Filters5 = 12 | 128 | 98.94 | 94.29 | 92.57 | 236 min 41 s
5 | 5 | Filter Size = 10, # Filters = 32 | 64 | 98.69 | 93.57 | 92.71 | 253 min 12 s
6 | 5 | Filter Size1 = 5, # Filters1 = 4; Filter Size2 = 6, # Filters2 = 8; Filter Size3 = 7, # Filters3 = 16; Filter Size4 = 8, # Filters4 = 32; Filter Size5 = 7, # Filters5 = 64 | 128 | 98.90 | 92.05 | 91.42 | 196 min 39 s
7 | 3 | Filter Size1 = 12, # Filters1 = 16; Filter Size2 = 8, # Filters2 = 24; Filter Size3 = 5, # Filters3 = 32 | 128 | 98.44 | 90.98 | 90.18 | 142 min 14 s
8 | 5 | Filter Size = 5, # Filters = 32 | 64 | 97.54 | 92.5 | 90.89 | 195 min 33 s
9 | 3 | Filter Size1 = 10, # Filters1 = 32; Filter Size2 = 10, # Filters2 = 64; Filter Size3 = 10, # Filters3 = 128 | 128 | 97.92 | 92.86 | 92.14 | 355 min 34 s
10 | 5 | Filter Size = 10, # Filters = 32 | 32 | 98.57 | 93.03 | 92.32 | 862 min 51 s
11 | 5 | Filter Size1 = 10, # Filters1 = 32; Filter Size2 = 5, # Filters2 = 12; Filter Size3 = 15, # Filters3 = 64; Filter Size4 = 5, # Filters4 = 12; Filter Size5 = 10, # Filters5 = 32 | 128 | 98.96 | 94.01 | 93.00 | 221 min 29 s
12 | 5 | Filter Size = 5, # Filters = 32 | 128 | 97.78 | 91.42 | 92.57 | 195 min 2 s
Note: the # symbol denotes "number of", and the abbreviations min and s denote minutes and seconds.
Table 5. Comparison of GTCC and MFCC for spoken LID in the proposed system.
Per Files (%) | GTCC: Testing Accuracy (%) | GTCC: Time Taken | MFCC: Testing Accuracy (%) | MFCC: Time Taken | GTCC-MFCC: Testing Accuracy (%) | GTCC-MFCC: Time Taken
Experiment 1
Run # 1 | 89.07 | 142 min 49 s | 89.21 | 137 min 48 s | 92.21 | 162 min 40 s
Run # 2 | 90.00 | 145 min 31 s | 89.93 | 139 min 2 s | 91.79 | 161 min 36 s
Run # 3 | 88.68 | 145 min 43 s | 90.11 | 139 min 52 s | 91.25 | 163 min 11 s
Run # 4 | 89.21 | 145 min 54 s | 90.82 | 140 min 52 s | 91.86 | 164 min 57 s
Run # 5 | 88.64 | 147 min 3 s | 90.57 | 141 min 15 s | 91.68 | 153 min 46 s
Average | 89.12 | 145 min 36 s | 90.13 | 138 min 34 s | 91.76 | 160 min 38 s
Table 6. Summary results for selected experiments.
Testing Accuracy (GTCC and MFCC) | Per Sequences (%): Experiment 1 | Per Sequences (%): Experiment 11 | Per Files (%): Experiment 1 | Per Files (%): Experiment 11
Run # 1 | 83.73 | 83.77 | 92.21 | 93.00
Run # 2 | 82.31 | 83.61 | 91.79 | 92.61
Run # 3 | 82.18 | 83.93 | 91.25 | 92.86
Run # 4 | 82.28 | 84.08 | 91.86 | 93.11
Run # 5 | 82.44 | 83.58 | 91.68 | 92.46
Average | 82.59 | 83.79 | 91.76 | 92.81
Standard Deviation | 0.64 | 0.21 | 0.35 | 0.27
The Best Accuracy | 83.73 | 84.08 | 92.21 | 93.11
The Worst Accuracy | 82.18 | 83.58 | 91.25 | 92.46
Table 7. The average five-run confusion matrices per file in Experiment 1.
Predicted
ArabicChineseEnglishFrenchGermanRussianSpanish
ActualArabic377534821
Chinese4.6345149.815.656
English44.2366.821616
French231364.821.235
German212.23385.824
Russian311.634385.42
Spanish168.67.8266.2344.4
P(%)95.7894.4792.3592.4980.9595.2593.49
R(%)94.2586.2591.7091.2096.4596.3586.10
F1(%)95.0190.1792.0291.8488.0295.8089.64
Accuracy(%)91.76
Table 8. The average five-run confusion matrices per file in Experiment 11.
Predicted
ArabicChineseEnglishFrenchGermanRussianSpanish
ActualArabic378.243.64712.2
Chinese4.4353.410.68.814.44.44
English3.83.236921615
French31.62367.821.81.22.6
German1.8012390.214
Russian3.4104.23385.43
Spanish01.878.224.24.2354.6
P(%)95.8496.8293.8592.6481.8796.7994.46
R(%)94.5588.3592.2591.9597.5596.3588.65
F1(%)95.1992.3993.0492.2989.0296.5791.46
Accuracy(%)92.81
Table 9. Comparison of proposed CRNN language identification architecture with other studies using different corpora with varying numbers of classes.
Study and Ref. | Features | Classifiers | Corpus | Classes | Languages | Results (%)
Bartz et al. [2] | Spectrogram | CRNN | European Parliament Statements and News Channels on YouTube | 6 | English, French, German, Chinese, Russian, and Spanish | Accuracy = 91.00
Kim and Park [11] | Rhythm | R-vector with SVM | SiTEC | 2 | English and Korean | ER = 2.26
Kim and Park [11] | Rhythm | R-vector with SVM | Mozilla | 3 | Chinese, English, and Spanish | ER = 53.27
Zazo et al. [20] | MFCC-SDC | LSTM-RNN | NIST LRE | 8 | Dari, English, French, Chinese, Pashto, Russian, Spanish, and Urdu | Accuracy = 70.90
Ma and Yu [27] | DNN-BN | LSTM | AP17-OLR | 10 | Tibetan, Japanese, Kazakh, Korean, Indonesian, Mandarin, Cantonese, Vietnamese, Uyghur, and Russian | ER = 50.00
Draghici et al. [21] | Mel-spectrograms | CRNN | EU Repo | 6 | English, French, German, Greek, Italian, and Spanish | Accuracy = 83.00
Singh et al. [33] | Log-Mel spectrograms | CNN | Mozilla | 4 | Estonian, Tamil, Turkish, and Mandarin | Accuracy = 80.21
Proposed System | GTCC-MFCC | CRNN | Mozilla | 7 | Arabic, German, English, Spanish, French, Russian, and Chinese | Accuracy = 92.81
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
