DFNet: Decoupled Fusion Network for Dialectal Speech Recognition
Abstract
1. Introduction
- Current studies of dialect speech recognition often overlook the inherent relationship between Mandarin and its dialects at the acoustic and semantic levels, which limits gains in recognition accuracy. To address this problem, we develop the Decoupled Fusion Network (DFNet), which leverages features shared with a large Mandarin corpus to improve the recognition of different dialects.
- A core component of DFNet is the feature decoupling module, which separates the acoustic features that Mandarin and a dialect share from those unique to each. This strengthens the model’s ability to capture dialect-specific phonemes while exposing the similarities between the languages. The heterogeneous information-weighted fusion module we design then combines the decoupled Mandarin-shared features with the dialect-specific features, deepening the model’s understanding of dialect characteristics (a minimal code sketch of this idea follows the abstract).
- Extensive experiments verify the strong performance of DFNet on dialect data. On multiple dialect datasets, the network substantially reduces the word error rate, especially when paired with resource-rich Mandarin data: this pairing improves the Henan and Guangdong recognition tasks by 2.64% and 2.68%, respectively. Ablation experiments further confirm the importance of the individual components of the DFNet architecture, demonstrating the broad applicability of our approach to multi-dialect speech recognition.
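To make the architecture concrete, here is a minimal PyTorch-style sketch of the shared/private split behind the feature decoupling module. The class name, layer choices, and dimensions are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: a shared/private encoder split in the spirit of
# the feature decoupling module described above. All names, sizes, and layer
# choices are assumptions, not the authors' implementation.
import torch
import torch.nn as nn

class FeatureDecoupler(nn.Module):
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        # One branch for features shared between Mandarin and the dialect,
        # one for dialect-specific (exclusive) features.
        self.shared_encoder = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU())
        self.private_encoder = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU())

    def forward(self, x: torch.Tensor):
        # x: (batch, time, feat_dim) acoustic frames from an upstream encoder.
        shared = self.shared_encoder(x)    # language-shared representation
        private = self.private_encoder(x)  # dialect-specific representation
        return shared, private

# Example: decouple a batch of 2 utterances with 100 frames each.
frames = torch.randn(2, 100, 256)
shared_feat, private_feat = FeatureDecoupler()(frames)
```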
2. Related Works
2.1. Dialect Speech Recognition
2.2. Decoupled and Fusion Learning
3. Method
3.1. Model Framework
3.2. Feature Decoupling
3.3. Weighted Fusion of Heterogeneous Information
3.4. Decoding Method
3.5. Loss Optimization Objective
4. Experimental Setup
4.1. Introduction to the Dataset
4.2. Dataset Setup
4.3. Model Configurations
5. Results
5.1. Comparative Experiment
5.2. Ablation Experiment
5.3. Experiments with Added Mandarin Data
5.4. Feature Visualization Experiment
5.5. Different Fusion Methods
5.6. Ways to Maximize Differences
5.7. Hyperparameter Analysis
6. Discussion
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Bhukya, S. Effect of gender on improving speech recognition system. Int. J. Comput. Appl. 2018, 179, 22–30. [Google Scholar] [CrossRef]
- Ramoji, S.; Ganapathy, S. Supervised I-vector modeling for language and accent recognition. Comput. Speech Lang. 2020, 60, 101030. [Google Scholar] [CrossRef]
- Singh, G.; Sharma, S.; Kumar, V.; Kaur, M.; Baz, M.; Masud, M. Spoken language identification using deep learning. Comput. Intell. Neurosci. 2021, 2021, 5123671. [Google Scholar] [CrossRef] [PubMed]
- Byrne, W.; Beyerlein, P.; Huerta, J.M.; Khudanpur, S.; Marthi, B.; Morgan, J.; Peterek, N.; Picone, J.; Vergyri, D.; Wang, T. Towards language independent acoustic modeling. In Proceedings of the 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No. 00CH37100), Istanbul, Turkey, 5–9 June 2000; Volume 2, pp. II1029–II1032. [Google Scholar]
- Kumar, A.; Verma, S.; Mangla, H. A survey of deep learning techniques in speech recognition. In Proceedings of the 2018 International Conference on Advances in Computing, Communication Control and Networking (ICACCCN), Greater Noida, India, 12–13 October 2018; pp. 179–185. [Google Scholar]
- Dong, L.; Zhou, S.; Chen, W.; Xu, B. Extending recurrent neural aligner for streaming end-to-end speech recognition in mandarin. arXiv 2018, arXiv:1806.06342. [Google Scholar]
- Kibria, S.; Rahman, M.S.; Selim, M.R.; Iqbal, M.Z. Acoustic analysis of the speakers’ variability for regional accent-affected pronunciation in Bangladeshi bangla: A study on Sylheti accent. IEEE Access 2020, 8, 35200–35221. [Google Scholar] [CrossRef]
- Deng, K.; Cao, S.; Ma, L. Improving accent identification and accented speech recognition under a framework of self-supervised learning. arXiv 2021, arXiv:2109.07349. [Google Scholar]
- Chen, J.; Wang, Y.; Wang, D. A feature study for classification-based speech separation at low signal-to-noise ratios. IEEE/ACM Trans. Audio Speech Lang. Process. 2014, 22, 1993–2002. [Google Scholar] [CrossRef]
- Shi, X.; Yu, F.; Lu, Y.; Liang, Y.; Feng, Q.; Wang, D.; Qian, Y.; Xie, L. The accented english speech recognition challenge 2020: Open datasets, tracks, baselines, results and methods. In Proceedings of the ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 6918–6922. [Google Scholar]
- Chandra, E.; Sunitha, C. A review on Speech and Speaker Authentication System using Voice Signal feature selection and extraction. In Proceedings of the 2009 IEEE International Advance Computing Conference, Patiala, India, 6–7 March 2009; pp. 1341–1346. [Google Scholar]
- Liu, Z.T.; Rehman, A.; Wu, M.; Cao, W.H.; Hao, M. Speech emotion recognition based on formant characteristics feature extraction and phoneme type convergence. Inf. Sci. 2021, 563, 309–325. [Google Scholar] [CrossRef]
- Zhu, C.; An, K.; Zheng, H.; Ou, Z. Multilingual and crosslingual speech recognition using phonological-vector based phone embeddings. In Proceedings of the 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Cartagena, Colombia, 13–17 December 2021; pp. 1034–1041. [Google Scholar]
- Chen, M.; Yang, Z.; Liang, J.; Li, Y.; Liu, W. Improving deep neural networks based multi-accent Mandarin speech recognition using i-vectors and accent-specific top layer. In Proceedings of the Interspeech, Dresden, Germany, 6–10 September 2015; pp. 3620–3624. [Google Scholar]
- Nallasamy, U.; Metze, F.; Schultz, T. Enhanced polyphone decision tree adaptation for accented speech recognition. In Proceedings of the Thirteenth Annual Conference of the International Speech Communication Association, Portland, OR, USA, 9–13 September 2012. [Google Scholar]
- Jain, A.; Singh, V.P.; Rath, S.P. A Multi-Accent Acoustic Model Using Mixture of Experts for Speech Recognition. In Proceedings of the Interspeech, Graz, Austria, 15–19 September 2019; pp. 779–783. [Google Scholar]
- Qian, Y.; Gong, X.; Huang, H. Layer-wise fast adaptation for end-to-end multi-accent speech recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 2022, 30, 2842–2853. [Google Scholar] [CrossRef]
- Seide, F.; Li, G.; Yu, D. Conversational speech transcription using context-dependent deep neural networks. In Proceedings of the Twelfth Annual Conference of the International Speech Communication Association, Florence, Italy, 27–31 August 2011. [Google Scholar]
- Nassif, A.B.; Shahin, I.; Attili, I.; Azzeh, M.; Shaalan, K. Speech recognition using deep neural networks: A systematic review. IEEE Access 2019, 7, 19143–19165. [Google Scholar] [CrossRef]
- Sim, K.C.; Narayanan, A.; Misra, A.; Tripathi, A.; Pundak, G.; Sainath, T.N.; Haghani, P.; Li, B.; Bacchiani, M. Domain Adaptation Using Factorized Hidden Layer for Robust Automatic Speech Recognition. In Proceedings of the Interspeech, Hyderabad, India, 2–6 September 2018; pp. 892–896. [Google Scholar]
- Seki, H.; Yamamoto, K.; Akiba, T.; Nakagawa, S. Rapid speaker adaptation of neural network based filterbank layer for automatic speech recognition. In Proceedings of the 2018 IEEE Spoken Language Technology Workshop (SLT), Athens, Greece, 18–21 December 2018; pp. 574–580. [Google Scholar]
- Chen, X.; Meng, Z.; Parthasarathy, S.; Li, J. Factorized neural transducer for efficient language model adaptation. In Proceedings of the ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22–27 May 2022; pp. 8132–8136. [Google Scholar]
- Bansal, S.; Kamper, H.; Livescu, K.; Lopez, A.; Goldwater, S. Pre-training on high-resource speech recognition improves low-resource speech-to-text translation. arXiv 2018, arXiv:1809.01431. [Google Scholar]
- Zuluaga-Gomez, J.; Ahmed, S.; Visockas, D.; Subakan, C. CommonAccent: Exploring Large Acoustic Pretrained Models for Accent Classification Based on Common Voice. arXiv 2023, arXiv:2305.18283. [Google Scholar]
- Shor, J.; Emanuel, D.; Lang, O.; Tuval, O.; Brenner, M.; Cattiau, J.; Vieira, F.; McNally, M.; Charbonneau, T.; Nollstadt, M.; et al. Personalizing ASR for dysarthric and accented speech with limited data. arXiv 2019, arXiv:1907.13511. [Google Scholar]
- Li, B.; Sainath, T.N.; Sim, K.C.; Bacchiani, M.; Weinstein, E.; Nguyen, P.; Chen, Z.; Wu, Y.; Rao, K. Multi-dialect speech recognition with a single sequence-to-sequence model. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 4749–4753. [Google Scholar]
- Na, H.J.; Park, J.S. Accented speech recognition based on end-to-end domain adversarial training of neural networks. Appl. Sci. 2021, 11, 8412. [Google Scholar] [CrossRef]
- Jain, A.; Upreti, M.; Jyothi, P. Improved Accented Speech Recognition Using Accent Embeddings and Multi-task Learning. In Proceedings of the Interspeech, Hyderabad, India, 2–6 September 2018; pp. 2454–2458. [Google Scholar]
- Li, R.; Jiao, Q.; Cao, W.; Wong, H.S.; Wu, S. Model adaptation: Unsupervised domain adaptation without source data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9641–9650. [Google Scholar]
- Guan, H.; Liu, M. Domain adaptation for medical image analysis: A survey. IEEE Trans. Biomed. Eng. 2021, 69, 1173–1185. [Google Scholar] [CrossRef] [PubMed]
- Gaman, M.; Hovy, D.; Ionescu, R.T.; Jauhiainen, H.; Jauhiainen, T.; Lindén, K.; Ljubešić, N.; Partanen, N.; Purschke, C.; Scherrer, Y.; et al. A report on the VarDial evaluation campaign 2020. In Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects, Barcelona, Spain, 13 December 2020; pp. 1–14. [Google Scholar]
- Zhang, W.; Li, X.; Deng, Y.; Bing, L.; Lam, W. A survey on aspect-based sentiment analysis: Tasks, methods, and challenges. IEEE Trans. Knowl. Data Eng. 2022, 35, 11019–11038. [Google Scholar] [CrossRef]
- Yang, K.; Yang, D.; Zhang, J.; Wang, H.; Sun, P.; Song, L. What2comm: Towards communication-efficient collaborative perception via feature decoupling. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 7686–7695. [Google Scholar]
- Sang, M.; Xia, W.; Hansen, J.H. Deaan: Disentangled embedding and adversarial adaptation network for robust speaker representation learning. In Proceedings of the ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 6169–6173. [Google Scholar]
- He, R.; Lee, W.S.; Ng, H.T.; Dahlmeier, D. An interactive multi-task learning network for end-to-end aspect-based sentiment analysis. arXiv 2019, arXiv:1906.06906. [Google Scholar]
- Pappagari, R.; Wang, T.; Villalba, J.; Chen, N.; Dehak, N. x-vectors meet emotions: A study on dependencies between emotion and speaker recognition. In Proceedings of the ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 7169–7173. [Google Scholar]
- Lin, Y.; Wang, L.; Dang, J.; Li, S.; Ding, C. Disordered speech recognition considering low resources and abnormal articulation. Speech Commun. 2023, 155, 103002. [Google Scholar] [CrossRef]
- Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
- Bu, H.; Du, J.; Na, X.; Wu, B.; Zheng, H. Aishell-1: An open-source mandarin speech corpus and a speech recognition baseline. In Proceedings of the 2017 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment (O-COCOSDA), Seoul, Republic of Korea, 1–3 November 2017; pp. 1–5. [Google Scholar]
- Park, D.S.; Chan, W.; Zhang, Y.; Chiu, C.C.; Zoph, B.; Cubuk, E.D.; Le, Q.V. Specaugment: A simple data augmentation method for automatic speech recognition. arXiv 2019, arXiv:1904.08779. [Google Scholar]
- Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826. [Google Scholar]
- Gulati, A.; Qin, J.; Chiu, C.C.; Parmar, N.; Zhang, Y.; Yu, J.; Han, W.; Wang, S.; Zhang, Z.; Wu, Y.; et al. Conformer: Convolution-augmented transformer for speech recognition. arXiv 2020, arXiv:2005.08100. [Google Scholar]
- Yao, Z.; Wu, D.; Wang, X.; Zhang, B.; Yu, F.; Yang, C.; Peng, Z.; Chen, X.; Xie, L.; Lei, X. Wenet: Production oriented streaming and non-streaming end-to-end speech recognition toolkit. arXiv 2021, arXiv:2102.01547. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. arXiv 2017, arXiv:1706.03762. [Google Scholar]
- Gao, Z.; Li, Z.; Wang, J.; Luo, H.; Shi, X.; Chen, M.; Li, Y.; Zuo, L.; Du, Z.; Xiao, Z.; et al. Funasr: A fundamental end-to-end speech recognition toolkit. arXiv 2023, arXiv:2305.11013. [Google Scholar]
- Amodei, D.; Ananthanarayanan, S.; Anubhai, R.; Bai, J.; Battenberg, E.; Case, C.; Casper, J.; Catanzaro, B.; Cheng, Q.; Chen, G.; et al. Deep speech 2: End-to-end speech recognition in english and mandarin. In Proceedings of the International Conference on Machine Learning, PMLR, New York, NY, USA, 20–22 June 2016; pp. 173–182. [Google Scholar]
- Zhu, X.; Zhang, F.; Gao, L.; Ren, X.; Hao, B. Research on Speech Recognition Based on Residual Network and Gated Convolution Network. Comput. Eng. Appl. 2022, 58, 185–191. [Google Scholar]
- Yang, Y.; Shen, F.; Du, C.; Ma, Z.; Yu, K.; Povey, D.; Chen, X. Towards universal speech discrete tokens: A case study for asr and tts. In Proceedings of the ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 10401–10405. [Google Scholar]
Dataset | Total Duration (Hours) | Sampling Rate (Hz) | Style
---|---|---|---
Aishell-1 | 178 | 16,000 | reading
Henan | 30 | 16,000 | reading
Guangdong | 30 | 16,000 | reading
The last four columns report test CER (%) ↓ under different decoding methods (× = component not used):

ID | Language | Decoupled | Fusion | Attention | Attention Rescoring | CTC Greedy Search | CTC Prefix Beam Search
---|---|---|---|---|---|---|---
C1 | Henan | × | × | 24.51 | 15.13 | 16.14 | 16.13
C2 | Henan + Mandarin | × | × | 25.10 | 15.71 | 16.77 | 16.79
C3 | Guangdong | × | × | 38.10 | 24.47 | 25.90 | 25.84
C4 | Guangdong + Mandarin | × | × | 41.01 | 35.27 | 36.23 | 36.16
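All result tables report the character error rate (CER): the character-level Levenshtein distance between hypothesis and reference, divided by the reference length. Below is a minimal reference implementation of this metric; it is the standard definition, not code taken from the paper.

```python
# Minimal CER computation: character-level edit distance divided by the
# reference length. Standard definition, not code from the paper.
def cer(reference: str, hypothesis: str) -> float:
    r, h = list(reference), list(hypothesis)
    # dp[i][j] = edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(r)][len(h)] / max(len(r), 1)

print(cer("今天天气不错", "今天天汽不错"))  # 1 substitution over 6 chars ≈ 0.167
```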
Comparison with prior systems on the Henan dialect test set:

Model | CER (%) ↓
---|---
WeNet [43] | 14.33 |
FunASR [45] | 17.19 |
DeepSpeech [46] | 20.98 |
ResNet-GCFN [47] | 22.36 |
Icefall [48] | 16.09 |
DFNet (ours) | 11.69 |
Comparison with prior systems on the Guangdong dialect test set:

Model | CER (%) ↓
---|---
WeNet [43] | 24.47 |
FunASR [45] | 26.19 |
DeepSpeech [46] | 32.16 |
ResNet-GCFN [47] | 35.44 |
Icefall [48] | 24.61 |
DFNet (ours) | 21.79 |
The last four columns report test CER (%) ↓ under different decoding methods (✓ = component used, × = not used):

ID | Language | Decoupled | Fusion | Attention | Attention Rescoring | CTC Greedy Search | CTC Prefix Beam Search
---|---|---|---|---|---|---|---
A1 | Henan + Mandarin | ✓ | ✓ | 16.59 | 11.69 | 12.64 | 12.53
A2 | Henan + Mandarin | ✓ | × | 20.42 | 13.92 | 14.99 | 14.88
A3 | Henan + Mandarin | × | ✓ | 17.29 | 12.52 | 13.83 | 13.72
A4 | Henan + Mandarin | × | × | 25.10 | 15.71 | 16.77 | 16.79
B1 | Guangdong + Mandarin | ✓ | ✓ | 25.57 | 21.79 | 22.93 | 22.89
B2 | Guangdong + Mandarin | ✓ | × | 27.57 | 23.26 | 24.73 | 24.68
B3 | Guangdong + Mandarin | × | ✓ | 25.31 | 22.75 | 23.84 | 23.82
B4 | Guangdong + Mandarin | × | × | 41.01 | 35.27 | 36.23 | 36.16
Fusion methods on the Henan dialect setup:

Fusion Method | CER (%) ↓
---|---
Add | 19.11 |
Concatenate | 19.43 |
Softmax Attention | 16.38 |
Linear Attention | 15.61 |
Heterogeneous Information-Weighted Fusion (ours) | 11.69 |
Fusion methods on the Guangdong dialect setup:

Fusion Method | CER (%) ↓
---|---
Add | 30.16 |
Concatenate | 28.84 |
Softmax Attention | 25.69 |
Linear Attention | 24.27 |
Heterogeneous Information-Weighted Fusion (ours) | 21.79 |
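The two tables above compare simple fusion baselines with the proposed weighted fusion. The sketch below contrasts the Add and Concatenate baselines with a learned gated fusion; the gate is only an illustrative stand-in for the paper's heterogeneous information-weighted fusion module, not its exact form.

```python
# Contrast of simple fusion baselines with a learned weighted (gated) fusion.
# The gate below is an illustrative stand-in for the paper's module.
import torch
import torch.nn as nn

def fuse_add(shared, private):
    return shared + private                      # "Add" baseline

def fuse_concat(shared, private):
    return torch.cat([shared, private], dim=-1)  # "Concatenate" baseline

class WeightedFusion(nn.Module):
    """Learn per-frame weights that balance shared and private features."""
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * feat_dim, feat_dim), nn.Sigmoid())

    def forward(self, shared, private):
        w = self.gate(torch.cat([shared, private], dim=-1))  # weights in (0, 1)
        return w * shared + (1.0 - w) * private

shared = torch.randn(2, 100, 256)
private = torch.randn(2, 100, 256)
fused = WeightedFusion()(shared, private)  # (2, 100, 256)
```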
Difference-maximization criteria on the Henan dialect setup:

Method | Similarity between Shared and Exclusive Features ↓
---|---
L1 Norm | 0.24
L2 Norm | 0.32
KL Divergence | 0.24
Cosine Similarity (ours) | 0.15 |
Difference-maximization criteria on the Guangdong dialect setup:

Method | Similarity between Shared and Exclusive Features ↓
---|---
L1 Norm | 0.32
L2 Norm | 0.26
KL Divergence | 0.29
Cosine Similarity (ours) | 0.19 |
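The tables above compare criteria for pushing the shared and exclusive features apart, with the cosine-similarity criterion yielding the lowest residual similarity. Below is a hedged sketch of such a decoupling penalty; the reduction and weighting are assumptions rather than the paper's exact loss.

```python
# Illustrative decoupling penalty: minimize the absolute cosine similarity
# between shared and exclusive features so they encode different information.
# The mean reduction and unit weighting are assumptions, not the paper's loss.
import torch
import torch.nn.functional as F

def decoupling_loss(shared: torch.Tensor, private: torch.Tensor) -> torch.Tensor:
    # shared, private: (batch, time, feat_dim)
    cos = F.cosine_similarity(shared, private, dim=-1)  # (batch, time)
    return cos.abs().mean()  # 0 when the two feature sets are orthogonal

loss = decoupling_loss(torch.randn(2, 100, 256), torch.randn(2, 100, 256))
```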