1. Introduction
Automatic speech recognition (ASR) is a critical technology within human–computer interaction, tasked with converting spoken language into written text. Rapid advancements in deep learning have significantly propelled ASR, particularly in high-resource languages such as English and Mandarin, driven by the availability of extensive speech datasets [1,2,3]. However, a notable challenge persists in the recognition of dialects such as Jiao-Liao Mandarin, a major variant of Mandarin predominantly spoken in the Jiaodong and Liaodong Peninsulas of China, with a speaker population exceeding 30 million. This dialect exhibits phonetic characteristics that diverge significantly from Standard Mandarin, complicating the development of ASR models tailored to its nuances. Presently, the sole open-source dataset available for Jiao-Liao Mandarin speech recognition is the KeSpeech dataset (https://github.com/KeSpeech/kespeech (accessed on 13 July 2024)) [4], which contains only a small fraction of relevant data. Furthermore, the textual corpus within this dataset predominantly features Standard Mandarin, with scant coverage of Jiao-Liao Mandarin's distinctive vocabulary and regional idiomatic expressions. This shortcoming hampers in-depth research into Jiao-Liao Mandarin speech recognition technology and underscores the pressing need for more comprehensive, dialect-specific datasets to drive future innovation.
To address the challenge of limited data resources, we constructed the JLMS25 dataset, comprising 25 h of speech data specifically designed for the Jiao-Liao Mandarin speech recognition task. Unlike other datasets that include Jiao-Liao Mandarin speech data, JLMS25 uniquely incorporates a wide range of idiomatic expressions characteristic of this dialect.
Furthermore, we utilized a substantial corpus of annotated data in Standard Mandarin and its related dialects to develop a speech recognition framework named MDKT, specifically designed to tackle the challenges associated with Jiao-Liao Mandarin speech recognition. Compared to the baseline, this framework significantly enhances recognition accuracy and addresses the difficulties in Jiao-Liao Mandarin speech recognition.
Therefore, this research strives to achieve the following objectives:
Dataset Compilation—To create a comprehensive 25-h speech recognition dataset tailored for Jiao-Liao Mandarin. This dataset will include a substantial number of idiomatic expressions, enriching the corpus with the distinctive linguistic nuances of the dialect.
Framework Development—To design and implement an advanced speech recognition framework that enhances the backbone model with the WFAdapter and AttAdapter modules. Combined with a three-phase training strategy, this framework addresses the unique challenges of Jiao-Liao Mandarin speech recognition and significantly improves the system's accuracy and robustness on this dialect.
The rest of this paper is organized as follows: Section 2 presents related work, Section 3 offers a comprehensive description of the dataset construction process and its associated statistics, Section 4 presents the proposed multi-dialect speech recognition framework, and Section 5 provides evaluation results for both the proposed framework and existing ASR systems using the new dataset. Finally, Section 6 concludes the paper.
2. Related Work
Our model is closely related to the fields of machine learning, ensemble learning, and intelligent computing [5,6,7], and the literature in these fields is reviewed in this section.
The conventional approach to adapting a pre-trained model to a specific target dataset typically involves full parameter fine-tuning [1,8]. Johnson et al. [9] integrated language categories into the output, while Abdel-Hamid et al. [10] incorporated speaker codes into the model's input. Li et al. [1] developed a technique for addressing multi-dialect speech recognition using a single sequence-to-sequence model. Each method relies on full fine-tuning. However, full fine-tuning presents significant limitations [11,12,13], including inefficiency and an increased risk of overfitting.
Various strategies have been developed to address this issue and to reduce the number of parameters involved in fine-tuning through parameter sharing. Huang et al. [14] proposed the shared-hidden-layer multilingual DNN (SHL-MDNN), which utilizes distinct output layers for individual languages. Houlsby et al. [15] proposed the integration of adapters within the model. Radhakrishnan et al. [16] explored the use of a small number of adapters for efficient fine-tuning in Arabic dialect recognition tasks. Gu et al. [17] used adapters to capture the target speaker's vocal characteristics for personalized recognition while preserving the generalization capability of the ASR model. The effectiveness of adapters was also confirmed in neural machine translation (NMT) tasks by Bapna et al. [12]. However, the number of task- or language-dependent parameters grows rapidly as the number of target tasks or languages increases [18,19].
Pham et al. [18,19] employed weight factorization, a multilingual algorithm, to enable distinct parameters for each language while mitigating the parameter growth associated with adding new target languages. This approach enhances the linear transformation functions of the neural network by decomposing each weight matrix into shared and language-dependent components. Let $x$ represent the input and $y$ the output of weight factorization. The formula for weight factorization is presented as follows:

$$y = \left( W_s \odot \left( r_m s_m^{\top} \right) + r_a s_a^{\top} \right) x$$

where $W_s$ represents the shared weights, $\odot$ denotes element-wise multiplication, and $r_m$, $s_m$, $r_a$, and $s_a$ denote the language-dependent weight vectors. Pham et al. [20] demonstrated the effectiveness of this approach in an ASR task.
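As a concrete illustration, a minimal PyTorch sketch of such a weight-factorized linear layer is given below. The class name, the bias handling, and the per-dialect parameter initialization are illustrative assumptions following the formulation above, not details taken from the implementations of [18,19].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FactorizedLinear(nn.Module):
    """Weight-factorized linear layer: a shared weight matrix is modulated by
    rank-1 multiplicative and additive factors for each language/dialect."""

    def __init__(self, in_dim: int, out_dim: int, num_dialects: int):
        super().__init__()
        self.shared = nn.Linear(in_dim, out_dim)  # shared weights W_s (and bias)
        # dialect-dependent vectors: r_m, s_m (multiplicative), r_a, s_a (additive)
        self.r_m = nn.Parameter(torch.ones(num_dialects, out_dim))
        self.s_m = nn.Parameter(torch.ones(num_dialects, in_dim))
        self.r_a = nn.Parameter(torch.zeros(num_dialects, out_dim))
        self.s_a = nn.Parameter(torch.zeros(num_dialects, in_dim))

    def forward(self, x: torch.Tensor, dialect: int) -> torch.Tensor:
        # W_d = W_s * (r_m s_m^T) + r_a s_a^T for the selected dialect d
        w = self.shared.weight * torch.outer(self.r_m[dialect], self.s_m[dialect])
        w = w + torch.outer(self.r_a[dialect], self.s_a[dialect])
        return F.linear(x, w, self.shared.bias)
```

Only the four rank-1 vectors grow with the number of languages, which keeps the per-language overhead small relative to the shared matrix.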
However, integrating this algorithm into existing pre-trained models presents challenges, and its performance under low-resource conditions remains limited. We propose MDKT, a speech recognition framework specifically designed to overcome the challenges of Jiao-Liao Mandarin. MDKT improves the backbone model by introducing two key modules, WFAdapter and AttAdapter, which use dialect-dependent parameters to enhance adaptability across dialects and facilitate effective multi-dialect knowledge transfer. WFAdapter reduces the number of fine-tuning parameters by decomposing weights into shared and dialect-specific components. Meanwhile, AttAdapter, positioned before the classifier, integrates phonetic and linguistic knowledge from multiple dialects, enabling robust knowledge transfer for the Jiao-Liao Mandarin speech recognition task. The MDKT framework employs a three-phase training strategy, beginning with a primary training phase that establishes the model's foundation, followed by two fine-tuning phases that adapt it to the target dialect. This structured approach ensures adaptability and effectiveness even under low-resource conditions.
3. JLMS25 Dataset
The JLMS25 dataset is the first monophonic speech corpus designed explicitly for the Jiao-Liao Mandarin speech recognition task. It encompasses a wide range of speech data related to idiomatic expressions in Jiao-Liao Mandarin. The dataset is currently available for free download online (https://github.com/Jiao-Liao-Mandarin/JLMS-dataset (accessed on 3 November 2024)).
As illustrated in Figure 1, the dataset construction process comprised the following steps, with strict measures in place to ensure the confidentiality of all volunteers' personal information at every stage:
(1) Corpus Construction. To ensure the richness of the corpus, the transcriptions in the JLMS25 dataset were sourced from two distinct origins: (a) a collection of commonly used expressions and folk proverbs from the Jiao-Liao Mandarin-speaking region, reflecting the unique linguistic features and acoustic variations of the area (examples of folk proverbs and their meanings are shown in Table 1); and (b) transcriptions from a subset of the AISHELL-1 [21] speech corpus, significantly broadening the thematic scope of the corpus to include topics such as ‘finance’, ‘technology’, ‘sports’, ‘entertainment’, and ‘news’.
(2) Volunteer Recruitment. Volunteers were extensively recruited from cities in the Jiaodong Peninsula and Liaodong region, including Qingdao, Yantai, Weihai, Dalian, Dandong, and Yingkou. The selected volunteers, who frequently converse in the local dialect and have no long-term history of residing outside the region, provided a comprehensive reflection of the phonetic characteristics of and acoustic variations in Jiao-Liao Mandarin across different areas.
(3) Speech Collection. The data collection process was carried out both offline and online. During the offline phase, volunteers were placed in a noise-free environment and asked to read each sentence sequentially from a provided script while an Audio-Technica AT2020 high-fidelity microphone recording device was activated. A brief pause of one second was observed after each sentence to facilitate later segmentation. The recorded results were then segmented and verified. Each recording was segmented sentence-by-sentence using Adobe Audition. In the online phase, volunteers recorded the data in quiet environments using smart devices.
(4) Post-processing and Correction. The audio files were processed into 16-bit depth, 16 kHz sampling rate WAV format. To ensure data quality, each audio entry was reviewed. Recordings containing errors, such as mispronunciations or omissions, were excluded from the dataset.
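For reference, the format normalization step can be sketched with torchaudio as follows; the function name and path arguments are illustrative placeholders rather than the script actually used for JLMS25.

```python
import torchaudio
import torchaudio.transforms as T

def normalize_recording(src_path: str, dst_path: str, target_sr: int = 16_000) -> None:
    """Convert one recording to the dataset format: mono, 16 kHz, 16-bit PCM WAV."""
    waveform, sr = torchaudio.load(src_path)
    if waveform.size(0) > 1:                      # down-mix to mono if needed
        waveform = waveform.mean(dim=0, keepdim=True)
    if sr != target_sr:                           # resample to 16 kHz
        waveform = T.Resample(orig_freq=sr, new_freq=target_sr)(waveform)
    torchaudio.save(dst_path, waveform, target_sr,
                    encoding="PCM_S", bits_per_sample=16)
```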
We conducted a statistical analysis of the JLMS25 dataset, which comprises 25 h of speech data from 46 volunteers with a male-to-female ratio of 10:13, likely due to a higher inclination among females to participate in volunteer activities. The age distribution of the volunteers, as shown in Figure 2a, reveals a balanced representation of young, middle-aged, and older adults, providing a diverse range of voices that may contribute to the model's robustness.
Figure 2b highlights a broad range of themes, with News and Idioms being the most prominent categories. This thematic diversity is expected to enhance the robustness of the speech recognition model, enabling it to effectively process a wide variety of topics and expressions in Jiao-Liao Mandarin.
4. Approach
The MDKT architecture primarily consists of an acoustic sequence feature extractor, a dialect feature extractor to capture the distinctive attributes of each dialect, WFAdapter modules, and an AttAdapter module. The structure of the model is illustrated in Figure 3. The source code is publicly accessible online (https://github.com/mixxs/Jiao-Liao_Speech_Recognition (accessed on 15 November 2024)).
4.1. Feature Extraction
The MDKT architecture employs a sequence feature extractor and a dialect feature extractor to capture the contextual and dialect-dependent features of speech signals. The sequence feature extractor is a module capable of extracting contextual information from acoustic features, such as speech waveforms or Mel Frequency Cepstral Coefficients (MFCCs). The dialect feature extractor, on the other hand, is a dialect classification network whose hidden-layer outputs serve as dialect features. The dialect features represent the unique acoustic characteristics of each specific dialect, encapsulating the variations among dialects. By enhancing the sequence features, the dialect features increase the distinctiveness of intermediate results corresponding to speech from various dialects. To further increase the separation among the features of different dialects, we employed both triplet loss [22] and cross-entropy loss to train the dialect feature extractor.
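As an illustration of this training objective, the following PyTorch sketch combines the two losses; the margin, the weighting factor alpha, and the sampling of anchor/positive/negative embeddings are assumptions not specified above.

```python
import torch.nn as nn

# Joint objective for the dialect feature extractor: cross-entropy on the dialect
# label plus a triplet loss that pulls same-dialect embeddings together and pushes
# different-dialect embeddings apart.
ce_loss = nn.CrossEntropyLoss()
triplet_loss = nn.TripletMarginLoss(margin=1.0)

def dialect_extractor_loss(logits, labels, anchor, positive, negative, alpha=1.0):
    # logits: dialect classifier outputs for the anchor utterances
    # anchor/positive/negative: hidden-layer dialect embeddings
    return ce_loss(logits, labels) + alpha * triplet_loss(anchor, positive, negative)
```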
4.2. Adapter Tuning
We employed WFAdapter and AttAdapter to enhance the model’s adaptability to multiple dialects, enabling the model to better extract knowledge from various dialects and thus improve its performance in Jiao-Liao Mandarin speech recognition.
Figure 4 illustrates the structure of WFAdapter and AttAdapter.
WFAdapter primarily consists of two linear layers that employ weight factorization. Inspired by Houlsby et al. [15], the first layer maps the input from $d_s$ dimensions to $d_i$ dimensions, while the second layer maps the features back to $d_s$ dimensions. Residual connections [23] merge the input and output to prevent degradation. Nonlinear transformations are performed using Exponential Linear Units [24], and normalization is conducted using layer normalization [25]. The weight factorization [18] algorithm decomposes weights into a shared weight matrix and four dialect-dependent weight vectors, further reducing the number of dialect-dependent parameters involved in fine-tuning WFAdapter.
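A minimal PyTorch sketch of such an adapter is given below, reusing the FactorizedLinear layer sketched in Section 2. The exact placement of layer normalization relative to the residual connection is an assumption, as it is not specified above.

```python
import torch
import torch.nn as nn

class WFAdapter(nn.Module):
    """Bottleneck adapter whose down- and up-projections use weight factorization
    (FactorizedLinear from the earlier sketch), with layer normalization, ELU,
    and a residual connection."""

    def __init__(self, d_s: int, d_i: int, num_dialects: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_s)
        self.down = FactorizedLinear(d_s, d_i, num_dialects)   # d_s -> d_i
        self.up = FactorizedLinear(d_i, d_s, num_dialects)     # d_i -> d_s
        self.act = nn.ELU()

    def forward(self, x: torch.Tensor, dialect: int) -> torch.Tensor:
        h = self.act(self.down(self.norm(x), dialect))          # bottleneck
        return x + self.up(h, dialect)                          # residual merge
```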
AttAdapter integrates dialect and sequence features. To enable the algorithm to consider the unique phonetic features of dialects, AttAdapter employs dialect features as the keys and values of a multi-head attention mechanism [26], with speech sequence features serving as the queries. A fully connected layer transforms the dialect features before multi-head attention to ensure that their dimensionality matches that of the sequence features. Parameters within AttAdapter are dialect-dependent. By incorporating dialect features into sequence features, AttAdapter increases the variability in the hidden-layer outputs across different dialects, thereby enhancing the model's adaptability to various dialects.
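The following PyTorch sketch illustrates this fusion; the number of attention heads and the residual merge of the attended output are assumptions not stated above.

```python
import torch
import torch.nn as nn

class AttAdapter(nn.Module):
    """Fuses dialect features into sequence features via multi-head attention:
    queries come from the speech sequence, keys/values from projected dialect features."""

    def __init__(self, seq_dim: int, dialect_dim: int, num_heads: int = 8):
        super().__init__()
        self.proj = nn.Linear(dialect_dim, seq_dim)   # match dialect dim to sequence dim
        self.attn = nn.MultiheadAttention(seq_dim, num_heads, batch_first=True)

    def forward(self, seq_feats: torch.Tensor, dialect_feats: torch.Tensor) -> torch.Tensor:
        # seq_feats: (batch, time, seq_dim); dialect_feats: (batch, k, dialect_dim),
        # where k may be 1 if a single dialect embedding is used per utterance
        kv = self.proj(dialect_feats)
        fused, _ = self.attn(query=seq_feats, key=kv, value=kv)
        return seq_feats + fused                      # inject dialect information
```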
4.3. Three-Phase Training Strategy
The training task was divided into three distinct phases. In the first phase, the framework employed multi-task learning to acquire knowledge from both the multi-dialect speech recognition and dialect identification tasks. This knowledge was then transferred to Jiao-Liao Mandarin through fine-tuning in the subsequent phases. However, because only a limited number of dialect-dependent parameters are available in the model, directly fine-tuning them could adversely affect MDKT's performance. To improve the model's fitting capability and minimize the risk of overfitting, the fine-tuning process was further split into two stages. In the second phase, due to the scarcity of Jiao-Liao Mandarin data, the parameters of the dialect feature extractor were frozen to prevent overfitting. The quantized outputs of the dialect feature extractor were then used as the dialect features:
$$\mathrm{QDF} = \frac{1}{n}\sum_{i=1}^{n} \mathrm{DF}_i$$

where QDF represents a quantified dialect feature, $\mathrm{DF}_i$ represents the output of the dialect feature extractor for the $i$-th entry, and $n$ denotes the number of data entries for each target dialect in the training set.
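Assuming the quantization amounts to averaging the extractor outputs over the n target-dialect training entries, as in the equation above, a minimal sketch is:

```python
import torch

def quantize_dialect_feature(extractor_outputs: torch.Tensor) -> torch.Tensor:
    """extractor_outputs: (n, feat_dim) dialect-feature-extractor outputs for the
    n target-dialect training entries; returns their mean as the fixed QDF."""
    return extractor_outputs.mean(dim=0)
```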
In the final stage, only the parameters related to the WFAdapter and AttAdapter modules remained active, while all other parameters were frozen. This approach effectively utilized the limited Jiao-Liao Mandarin data while improving the model’s stability and overall performance.
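The freezing schedule across the three phases can be sketched as follows; the attribute and parameter names are illustrative, not identifiers from the released code.

```python
def configure_phase(model, phase: int) -> None:
    """Set which parameters are trainable in each of the three training phases."""
    for p in model.parameters():
        p.requires_grad = True                              # phase 1: train everything
    if phase >= 2:
        for p in model.dialect_extractor.parameters():      # phase 2: freeze dialect extractor
            p.requires_grad = False
    if phase == 3:
        for name, p in model.named_parameters():            # phase 3: only adapters train
            p.requires_grad = ("wf_adapter" in name) or ("att_adapter" in name)
```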
5. Experiment
5.1. Datasets and Metrics
The datasets utilized in this study consist of the publicly available KeSpeech dataset [4] and the self-constructed JLMS25 dataset. In the first stage of the experiment, data from seven dialects within the KeSpeech dataset, excluding Jiao-Liao Mandarin, were employed. These dialects include Beijing Mandarin, Southwestern Mandarin, Zhongyuan Mandarin, Northeastern Mandarin, Lan-Yin Mandarin, Jiang-Huai Mandarin, and Ji-Lu Mandarin. To mitigate data imbalance in KeSpeech, 30,000 Mandarin speech samples were randomly selected from its training set and incorporated into the training process. During the fine-tuning stage for Jiao-Liao Mandarin, the second and third experimental stages were conducted independently on the Jiao-Liao Mandarin subset of the KeSpeech dataset and on the JLMS25 dataset. For the experiments based on the KeSpeech dataset, its training and test sets were used for model training and testing, respectively. For JLMS25, the data were divided into training and test sets at a 9:1 volume ratio, ensuring a balanced distribution for both model training and testing.
We used Character Error Rate (CER) and Word Error Rate (WER) as the primary evaluation metrics for assessing speech recognition performance. Since the model generates a sequence of Chinese characters, using CER as an evaluation metric is both straightforward and widely accepted. For WER, both the model’s output and the reference text were segmented into sequences of Chinese words. The CER and WER are calculated using the following formulas:
$$\mathrm{CER} = \frac{S_c + D_c + I_c}{N_c}, \qquad \mathrm{WER} = \frac{S_w + D_w + I_w}{N_w}$$

where $S_c$, $D_c$, and $I_c$ are the numbers of substituted, deleted, and inserted characters, respectively, and $N_c$ represents the total number of characters in the reference text; $S_w$, $D_w$, and $I_w$ are the numbers of substituted, deleted, and inserted words, respectively, and $N_w$ represents the total number of words in the reference text.
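For clarity, a self-contained sketch of these metrics based on edit distance is shown below; the function names are illustrative.

```python
def edit_distance(ref: list, hyp: list) -> int:
    """Levenshtein distance (substitutions + deletions + insertions)."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, start=1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,              # deletion
                        dp[j - 1] + 1,          # insertion
                        prev + (r != h))        # substitution (or match)
            prev = cur
    return dp[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate over Chinese character sequences."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)

def wer(ref_words: list, hyp_words: list) -> float:
    """Word Error Rate over pre-segmented word sequences."""
    return edit_distance(ref_words, hyp_words) / len(ref_words)
```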
5.2. Experimental Setup
Given the significant advantages of the transformer architecture [26] in extracting contextual information, our MDKT experiments utilized the pre-trained transformer-based wav2vec 2.0 (https://huggingface.co/wbbbbb/wav2vec2-large-chinese-zh-cn (accessed on 1 November 2024)) [27] as the backbone model for the speech sequence feature extractor. To capture dialect-specific features, we employed the ECAPA-TDNN [28], which enhances the Time Delay Neural Network (TDNN) [29] with the SE-Res2Block [28], improving its effectiveness. The experimental configuration included three WFAdapter modules and one AttAdapter module, with each WFAdapter module set to an internal dimension of 512.
During the training phase, data augmentation techniques were applied to all datasets with a 50% probability. These techniques included gain adjustment, pitch shifting, time stretching, and the addition of Gaussian noise, aimed at enhancing the diversity and robustness of the training data. Optimization was performed using AdamW [30] with a tri-phase learning rate schedule: an initial warm-up for the first 10% of updates, a stable phase for the subsequent 20%, and a linear decay for the remaining updates. The peak learning rate was set at $1 \times 10^{-4}$. Due to limited GPU memory, the initial training phase involved training the distinct components of the model separately: the ASR encoder and WFAdapter were trained using CTC loss [31], while AttAdapter was trained using MSE loss to ensure that its output aligns closely with the acoustic sequence features.
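A sketch of this learning rate schedule using PyTorch's LambdaLR is shown below; the function name and the placeholder total_steps and model are illustrative.

```python
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

def tri_phase_schedule(total_steps: int, warmup_frac: float = 0.1, hold_frac: float = 0.2):
    """Multiplier on the peak LR: linear warm-up over the first 10% of updates,
    constant for the next 20%, then linear decay to zero."""
    warm_end = int(total_steps * warmup_frac)
    hold_end = int(total_steps * (warmup_frac + hold_frac))

    def scale(step: int) -> float:
        if step < warm_end:
            return step / max(1, warm_end)
        if step < hold_end:
            return 1.0
        return max(0.0, (total_steps - step) / max(1, total_steps - hold_end))

    return scale

# Usage (peak LR 1e-4 as in the paper):
# optimizer = AdamW(model.parameters(), lr=1e-4)
# scheduler = LambdaLR(optimizer, lr_lambda=tri_phase_schedule(total_steps))
```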
5.3. Main Results
The experiments were conducted on the KeSpeech and JLMS25 datasets. As shown in Table 2, we first compared MDKT with models from the literature, including SHL-MDNN, JMAA [3], full fine-tuning of the backbone model, and adapters. JMAA exhibited underfitting due to its limited number of parameters. The performance of SHL-MDNN and full fine-tuning, both based on wav2vec 2.0, showed no significant differences, while MDKT demonstrated superior performance on both categories of Jiao-Liao Mandarin speech data.
We also conducted a comparison with the adapter module. Due to computational resource limitations, we could not pre-train the wav2vec 2.0 model on the diverse dialect speech data from the KeSpeech dataset, which prevents a direct comparison with the method proposed by Bapna et al. [12]. However, to evaluate the effectiveness of the WFAdapter and AttAdapter modules, we compared them with a standard adapter module under the same experimental conditions. Both approaches used the three-phase training strategy, with adapter modules added after the ASR encoder. With a similar number of language-dependent parameters, we compared the performance of a three-layer adapter against our method: the adapter had 6.8 M language-dependent parameters, while our method had 7.5 M, an increase of only 0.7 M.
Table 2 shows that this model architecture significantly enhanced the performance of the backbone model on the low-resource Jiao-Liao Mandarin speech data. Our method outperformed the backbone model trained using multi-dialect data. After fine-tuning on the Jiao-Liao Mandarin speech data, CER decreased by 5.4% and 7.7%, and WER decreased by 6.1% and 10.8%, on the KeSpeech Jiao-Liao Mandarin and JLMS25 datasets, respectively. The combination of WFAdapter and AttAdapter also outperformed the individual adapters, with CER lower by 0.4% and 5.4% and WER lower by 0.1% and 7.9% on the KeSpeech Jiao-Liao Mandarin and JLMS25 datasets, respectively. These results demonstrate that WFAdapter and AttAdapter effectively enhance the model's adaptability to various dialects while mitigating the risk of overfitting.
5.4. Ablation Study
To elucidate the contributions of individual components, we conducted ablation studies on MDKT. We evaluated the performance of several configurations: the original wav2vec 2.0 base model, the wav2vec 2.0 model augmented with three WFAdapter layers (wav2vec 2.0 + WFA), the base model enhanced with a single AttAdapter layer (wav2vec 2.0 + AttA), MDKT with all three training phases (MDKT-tri), and MDKT with only the first and third training phases, excluding the second (MDKT-bi). The findings are presented in Table 3.
As shown in Table 3, integrating WFAdapter into the wav2vec 2.0 model resulted in a 3.2% and 4.2% reduction in CER and a 4.2% and 7.1% reduction in WER on the KeSpeech Jiao-Liao Mandarin and JLMS25 datasets, respectively, compared to full fine-tuning. Additionally, incorporating AttAdapter into the wav2vec 2.0 model led to a 2.7% and 5.4% decrease in CER and a 4.6% and 7.5% decrease in WER on the two datasets, respectively. These findings indicate that integrating dialectal and sequence features through AttAdapter is beneficial, and that conveying multi-dialect knowledge via WFAdapter enhances model performance. A comparison between MDKT-tri and MDKT-bi demonstrated that the three-phase training strategy effectively improves model performance on the target dialect.
6. Conclusions
In this study, we constructed and introduced JLMS25, the first publicly available speech dataset specifically curated for dialect recognition in the Jiao-Liao region of China. JLMS25 comprehensively covers common daily topics and incorporates a substantial amount of specialized vocabulary and regional idioms, offering extensive coverage of Jiao-Liao Mandarin’s phonetic and tonal features, thereby contributing to the advancement of Jiao-Liao Mandarin speech recognition. To address the issue of insufficient Jiao-Liao Mandarin speech data, which limits model performance, we introduced the MDKT model architecture. The MDKT architecture incorporates WFAdapter and AttAdapter modules to enhance model performance. A three-phase training strategy is employed to transfer multi-dialect knowledge to low-resource dialects. In the first phase, a multi-dialect AID-ASR multi-task learning strategy is used for training. In the second phase, the dialect feature extractor is frozen, and training continues to fine-tune the model. In the third phase, all modules, except for the adapters, are frozen, and fine-tuning is conducted on the adapters. Experimental results demonstrate that this architecture significantly improves the performance of the ASR backbone model in data-scarce environments.
Although the model performance on low-resource datasets has improved, the method still faces challenges. Future work will focus on expanding the Jiao-Liao Mandarin dataset to cover more linguistic variations, simplifying the training framework, and improving model performance. These efforts aim to address current limitations, advance speech recognition technologies for resource-scarce dialects, and contribute to the preservation of Jiao-Liao Mandarin.