2. Related Work
Much of the information on the internet is multilingual and multi-dialectal speech or text, and deep learning models are now widely applied to process such multilingual media. Language identification (LID) research dates back to the 1970s, but adequate and robust datasets were lacking at that time, so progress was held back by scarce resources and inefficient models. It was not until the development of neural network techniques that multilingual processing and language identification research opened up new opportunities. LID technologies can be roughly divided into phoneme-based, acoustic feature-based, and deep learning-based approaches.
Phoneme-based identification exploits the phonotactic properties of speech: the speech signal is first converted into phoneme sequences, which are then analyzed statistically. However, phoneme-based LID models have a complex structure and are unsuitable for systems with strict real-time requirements.
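To make the phonotactic idea concrete, the minimal Python sketch below assumes a hypothetical external phone recognizer has already converted each utterance into a phone sequence; per-language bigram statistics with add-alpha smoothing then score a test sequence. All data and the bigram vocabulary size are illustrative, not taken from any cited system.

```python
# Phonotactic LID sketch (PRLM-style): score phone sequences with
# per-language bigram models. A real phone recognizer is assumed upstream.
from collections import Counter
from math import log

def bigram_counts(phone_seqs):
    """Accumulate phone-bigram counts over one language's training sequences."""
    counts = Counter()
    for seq in phone_seqs:
        counts.update(zip(seq, seq[1:]))
    return counts

def log_likelihood(seq, counts, vocab_size, alpha=1.0):
    """Add-alpha smoothed log-likelihood of a phone sequence under one language."""
    total = sum(counts.values())
    return sum(
        log((counts[bg] + alpha) / (total + alpha * vocab_size))
        for bg in zip(seq, seq[1:])
    )

# Toy phone sequences standing in for recognizer output (hypothetical data).
train = {
    "lang_a": [list("abadaka"), list("bakada")],
    "lang_b": [list("cicuci"), list("cucic")],
}
models = {lang: bigram_counts(seqs) for lang, seqs in train.items()}
test = list("abaka")
vocab = 26 * 26  # assumed number of distinct phone bigrams
print(max(models, key=lambda lang: log_likelihood(test, models[lang], vocab)))
```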
Language identification methods based on underlying acoustic features usually build models on MFCC (Mel-frequency cepstral coefficients) [2], LPCC (linear prediction cepstral coefficients) [3], PLP (perceptual linear prediction) coefficients [4], and SDC (shifted delta cepstra) [5]. These features are extracted in the frequency domain by applying an FFT (fast Fourier transform) to the original signal sequence. The main modeling approaches for LID based on these acoustic features include the Gaussian mixture model with a universal background model (GMM-UBM) [6,7], the SVM model [8], and i-vector and x-vector embeddings.
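As an illustration of this classic front end, the Python sketch below computes MFCCs with librosa and stacks shifted delta cepstra on top, using the common N-d-P-k = 7-1-3-7 configuration from the SDC literature. The random waveform stands in for a real utterance; this is a generic sketch, not the authors' pipeline.

```python
# MFCC + SDC front-end sketch for acoustic-feature LID.
import numpy as np
import librosa

def sdc(cepstra, d=1, P=3, k=7):
    """Shifted delta cepstra: stack k delta blocks spaced P frames apart.

    cepstra is (N, T), one column per frame; returns (N * k, T'), where
    edge frames lacking full temporal context are dropped."""
    N, T = cepstra.shape
    T_out = T - (k - 1) * P - 2 * d
    if T_out <= 0:
        raise ValueError("utterance too short for this SDC configuration")
    blocks = []
    for i in range(k):
        plus = cepstra[:, i * P + 2 * d : i * P + 2 * d + T_out]   # c(t + iP + d)
        minus = cepstra[:, i * P : i * P + T_out]                  # c(t + iP - d)
        blocks.append(plus - minus)
    return np.vstack(blocks)

sr = 16000
y = np.random.randn(3 * sr).astype(np.float32)     # stand-in for an utterance
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=7)  # N = 7 cepstral coefficients
feats = sdc(mfcc)                                  # 7-1-3-7 -> 49 dims per frame
print(mfcc.shape, feats.shape)                     # e.g. (7, 94) (49, 74)
```

Each SDC output frame concatenates k delta blocks spaced P frames apart, so an N-coefficient cepstrum yields N·k-dimensional features per frame.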
Deep neural network-based LID methods originated in 2009, when Montavon [9] used neural networks to extract features. In 2014, Lei et al. [10] applied convolutional neural networks (CNNs) to language identification, deep neural networks (DNNs) [11] were applied to short-duration speech, and long short-term memory (LSTM) networks [12] brought breakthroughs in identification performance. In 2016, Geng et al. [13] introduced an attention mechanism into a language identification system. In 2017, Bartz et al. [14] combined a convolutional network with a recurrent neural network (CRNN) for language identification. In 2018, Shon et al. [15] extracted three acoustic features, MFCC, FBANK, and spectrogram, for end-to-end dialect recognition with convolutional networks. In 2019, Xie and Qian [16] proposed a method combining the gated recurrent unit (GRU) and the hidden Markov model (HMM), studying the extraction of speech feature parameters to recognize Hunan dialect categories. In 2020, Arronte Alvarez et al. [17] proposed Res-BLSTM, an end-to-end network combining residual blocks with bidirectional long short-term memory (BLSTM), for Arabic dialect identification.
Our main contributions are as follows:
- (1) We introduce the idea of residual connections into the baseline model [18], which provides smooth contextual connections across layers.
- (2) We introduce Coordinate Attention [19] into the baseline model to highlight valuable features and suppress irrelevant ones; because this mechanism embeds positional information into the channel dimension, it adds little computational overhead.
- (3) We design a multi-scale residual network in which the relevant feature information in the input spectrogram is encoded and decoded from a global perspective.
- (4) We add multi-headed self-attention [20], which helps the network capture rich, relevant features (a combined sketch of components (1), (2), and (4) follows this list).
- (5) We conduct experiments on two datasets; the results show that the proposed model has good robustness.
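To make these components concrete, the PyTorch sketch below shows a residual block wrapped around a Coordinate Attention module [19], followed by a call to standard multi-headed self-attention [20]. It is a minimal illustration under assumed channel sizes and spectrogram-shaped inputs, not the authors' exact architecture; the module names and all dimensions here are hypothetical.

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Coordinate Attention: pools along the H and W axes separately so
    positional information is embedded into the channel attention maps."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        mid = max(8, channels // reduction)
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.Hardswish()
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x):
        b, c, h, w = x.shape
        x_h = x.mean(dim=3, keepdim=True)                       # (B, C, H, 1)
        x_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)   # (B, C, W, 1)
        y = self.act(self.bn(self.conv1(torch.cat([x_h, x_w], dim=2))))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.conv_h(y_h))                     # (B, C, H, 1)
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2))) # (B, C, 1, W)
        return x * a_h * a_w

class ResidualCABlock(nn.Module):
    """Conv block with an identity skip connection and Coordinate Attention."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.ca = CoordinateAttention(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # Identity skip keeps a smooth contextual connection across the block.
        return self.relu(x + self.ca(self.body(x)))

# Smoke test on a spectrogram-shaped tensor (batch, channels, freq, time).
x = torch.randn(2, 32, 40, 100)
print(ResidualCABlock(32)(x).shape)   # torch.Size([2, 32, 40, 100])

# Multi-headed self-attention over the time axis, as in contribution (4).
seq = torch.randn(2, 100, 64)         # (batch, time, features)
mhsa = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
out, _ = mhsa(seq, seq, seq)
print(out.shape)                      # torch.Size([2, 100, 64])
```

Because the Coordinate Attention maps are factorized along height and width, the extra cost is only two 1 × 1 convolutions plus direction-wise pooling, which is why it adds little computational overhead compared with full spatial attention.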
Author Contributions
Conceptualization, A.Z. and M.A.; methodology, A.Z.; software, A.Z.; validation, A.Z.; formal analysis, M.A.; investigation, A.Z. and M.A.; resources, A.Z., M.A., and A.H.; data curation, A.Z.; writing—original draft preparation, A.Z.; writing—review and editing, M.A. and A.H.; visualization, A.Z., M.A., and A.H. All authors have read and agreed to the published version of the manuscript.
Funding
This work was supported by the Strengthening Plan of the National Defense Science and Technology Foundation of China (2021-JCJQ-JJ-0059) and the Natural Science Foundation of China (U2003207).
Data Availability Statement
Dataset 1 is from Common Voice (https://commonvoice.mozilla.org/zh-CN, accessed on 19 November 2022), a large, publicly available speech dataset. Dataset 2 is the Oriental language dataset provided by the AP17-OLR Oriental Language Recognition Challenge.
Acknowledgments
The authors are very thankful to the editor and the reviewers for their valuable comments and suggestions for improving the paper.
Conflicts of Interest
The authors declare no conflicts of interest.
References
- Cao, H.B.; Zhao, J.M.; Qin, J. A Comparative Study of Multiple Methods for Oriental Language Identification. Available online: https://kns.cnki.net/kcms/detail/detail.aspx?dbcode=CPFD&dbname=CPFDLAST2018&filename=SEER201710001023&uniplatform=NZKPT&v=5Z10lhs1awqErVfB9k7dSEo5jDOKYebegcP8YjqKucFnRKP0s8c_7BHI6YaNf8tgq5EyMTbaW_w%3d (accessed on 19 November 2022).
- Dave, N. Feature extraction methods LPC, PLP and MFCC in speech recognition. Int. J. Adv. Res. Eng. Technol. 2013, 1, 1–4. [Google Scholar]
- Srivastava, S.; Nandi, P.; Sahoo, G.; Chandra, M. Formant based linear prediction coefficients for speaker identification. In Proceedings of the 2014 International Conference on Signal Processing and Integrated Networks (SPIN), Noida, India, 20–21 February 2014; pp. 685–688. [Google Scholar]
- Revathi, A.; Jeyalakshmi, C. Robust speech recognition in noisy environment using perceptual features and adaptive filters. In Proceedings of the 2017 2nd International Conference on Communication and Electronics Systems (ICCES), Coimbatore, India, 19–20 October 2017; pp. 692–696. [Google Scholar]
- Sangeetha, J.; Jothilakshmi, S. Automatic continuous speech recogniser for Dravidian languages using the auto associative neural network. Int. J. Comput. Vis. Robot. 2016, 6, 113–126. [Google Scholar] [CrossRef]
- Kohler, M.A.; Kennedy, M. Language identification using shifted delta cepstra. In Proceedings of the 2002 45th Midwest Symposium on Circuits and Systems, 2002. MWSCAS-2002, Tulsa, OK, USA, 4–7 August 2002; Volume 3, p. III–69. [Google Scholar]
- Torres-Carrasquillo, P.A.; Singer, E.; Kohler, M.A.; Greene, R.J.; Reynolds, D.A.; Deller, J.R., Jr. Approaches to language identification using Gaussian mixture models and shifted delta cepstral features. In Proceedings of the Interspeech, Denver, CO, USA, 16–20 September 2002. [Google Scholar]
- Campbell, W.M.; Singer, E.; Torres-Carrasquillo, P.A.; Reynolds, D.A. Language recognition with support vector machines. In Proceedings of the ODYSSEY04-The Speaker and Language Recognition Workshop, Toledo, Spain, 31 May–3 June 2004. [Google Scholar]
- Montavon, G. Deep Learning for Spoken Language Identification. Available online: https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=1b17f0926b373ef49245a28fdddd3c9e90006e60 (accessed on 19 December 2022).
- Lei, Y.; Ferrer, L.; Lawson, A.; McLaren, M.; Scheffer, N. Application of Convolutional Neural Networks to Language Identification in Noisy Conditions. In Proceedings of the Odyssey, Joensuu, Finland, 16–19 June 2014. [Google Scholar]
- Lopez-Moreno, I.; Gonzalez-Dominguez, J.; Plchot, O.; Martinez, D.; Gonzalez-Rodriguez, J.; Moreno, P. Automatic language identification using deep neural networks. In Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, 4–9 May 2014; pp. 5337–5341. [Google Scholar]
- Gelly, G.; Gauvain, J.L. Spoken Language Identification Using LSTM-Based Angular Proximity. In Proceedings of the Interspeech, Stockholm, Sweden, 20–24 August 2017; pp. 2566–2570. [Google Scholar]
- Geng, W.; Wang, W.; Zhao, Y.; Cai, X.; Xu, B. End-to-End Language Identification Using Attention-Based Recurrent Neural Networks. In Proceedings of the Interspeech, San Francisco, CA, USA, 8–12 September 2016; pp. 2944–2948. [Google Scholar]
- Bartz, C.; Herold, T.; Yang, H.; Meinel, C. Language identification using deep convolutional recurrent neural networks. In Proceedings of the International Conference on Neural Information Processing, Guangzhou, China, 14–18 November 2017; Springer: Berlin/Heidelberg, Germany, 2017; pp. 880–889. [Google Scholar]
- Shon, S.; Ali, A.; Glass, J. Convolutional neural networks and language embeddings for end-to-end dialect recognition. arXiv 2018, arXiv:1803.04567. [Google Scholar]
- Xie, K.; Hu, D.; Xiao, Z.; Chen, T.; Qian, S. Hunan dialect recognition based on GRU-HMM acoustic model. Comput. Digit. Eng. 2019, 47, 493–496. [Google Scholar]
- Alvarez, A.A.; Issa, E.S.A. Learning intonation pattern embeddings for arabic dialect identification. arXiv 2020, arXiv:2008.00667. [Google Scholar]
- XL, M.; Ablimit, M.; Hamdulla, A. Multiclass Language Identification Using CNN-BiGRU-Attention Model on Spectrogram of Audio Signals. In Proceedings of the 2021 IEEE 2nd International Conference on Pattern Recognition and Machine Learning (PRML), Chengdu, China, 16–18 July 2021; pp. 214–218. [Google Scholar]
- Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13713–13722. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 1–11. [Google Scholar]
- Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
- Woo, S.; Park, J.; Lee, J.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
- Wang, D.; Li, L.; Tang, D.; Chen, Q. AP16-OL7: A multilingual database for oriental languages and a language recognition baseline. In Proceedings of the 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), Jeju, Republic of Korea, 13–16 December 2016; pp. 1–5. [Google Scholar]
- Tang, Z.; Wang, D.; Chen, Y.; Chen, Q. AP17-OLR challenge: Data, plan, and baseline. In Proceedings of the 2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Kuala Lumpur, Malaysia, 12–15 December 2017; pp. 749–753. [Google Scholar]
- Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]