1. Introduction
Automatic speech recognition (ASR) is a critical technology within human–computer interaction, tasked with converting spoken language into written text. Rapid advancements in deep learning have significantly propelled ASR, particularly in high-resource languages such as English and Mandarin, driven by the availability of extensive speech datasets [1,2,3]. However, a notable challenge persists in the recognition of dialects such as Jiao-Liao Mandarin, a major variant of Mandarin predominantly spoken in the Jiaodong and Liaodong Peninsulas of China, with a speaker population exceeding 30 million. This dialect exhibits phonetic characteristics that diverge significantly from Standard Mandarin, complicating the development of ASR models tailored to its nuances. Presently, the sole open-source dataset available for Jiao-Liao Mandarin speech recognition is the KeSpeech dataset (https://github.com/KeSpeech/kespeech (accessed on 13 July 2024)) [4], which contains only a small fraction of relevant data. Furthermore, the textual corpus within this dataset predominantly features Standard Mandarin, with scant coverage of Jiao-Liao Mandarin's distinctive vocabulary and regional idiomatic expressions. This shortcoming hampers in-depth research into Jiao-Liao Mandarin speech recognition technology and underscores the pressing need for more comprehensive, dialect-specific datasets to drive future innovation.
To address the challenge of limited data resources, we constructed the JLMS25 dataset, comprising 25 h of speech data specifically designed for the Jiao-Liao Mandarin speech recognition task. Unlike other datasets that include Jiao-Liao Mandarin speech data, JLMS25 uniquely incorporates a wide range of idiomatic expressions characteristic of this dialect.
Furthermore, we utilized a substantial corpus of annotated data in Standard Mandarin and its related dialects to develop a speech recognition framework named MDKT, specifically designed to tackle the challenges associated with Jiao-Liao Mandarin speech recognition. Compared to the baseline, this framework significantly enhances recognition accuracy and addresses the difficulties in Jiao-Liao Mandarin speech recognition.
Therefore, this research strives to achieve the following objectives:
Dataset Compilation—To create a comprehensive 25-h speech recognition dataset tailored for Jiao-Liao Mandarin. This dataset will include a substantial number of idiomatic expressions, enriching the corpus with the distinctive linguistic nuances of the dialect.
Framework Development—To design and implement an advanced speech recognition framework that enhances the backbone model with the WFAdapter and AttAdapter modules. Combined with a three-phase training strategy, this framework addresses the unique challenges of Jiao-Liao Mandarin speech recognition and significantly improves the system's accuracy and robustness on this dialect.
The rest of this paper is organized as follows: Section 2 presents related work, Section 3 offers a comprehensive description of the dataset construction process and its associated statistics, Section 4 presents the proposed multi-dialect speech recognition framework, and Section 5 provides evaluation results for both the proposed framework and existing ASR systems using the new dataset. Finally, Section 6 concludes the paper.
2. Related Work
Our model is closely related to the fields of machine learning, ensemble learning, and intelligent computing [5,6,7], and the literature in these fields is reviewed in this section.
The conventional approach to adapting a pre-trained model to a specific target dataset typically involves full parameter fine-tuning [1,8]. Johnson et al. [9] integrated language categories into the output, while Abdel-Hamid et al. [10] incorporated speaker codes into the model's input. Li et al. [1] developed a technique for addressing multi-dialect speech recognition using a single sequence-to-sequence model. Each method relies on full fine-tuning. However, full fine-tuning presents significant limitations [11,12,13], including inefficiency and an increased risk of overfitting.
Various strategies have been developed to address this issue and to reduce the number of parameters involved in fine-tuning through parameter sharing. Huang et al. [14] proposed the shared-hidden-layer multilingual DNN (SHL-MDNN), which utilizes distinct output layers for individual languages. Houlsby et al. [15] proposed the integration of adapters within the model. Radhakrishnan et al. [16] explored the use of a small number of adapters for efficient fine-tuning in Arabic dialect recognition tasks. Gu et al. [17] used adapters to capture the target speaker's vocal characteristics for personalized recognition while preserving the generalization capability of the ASR model. The effectiveness of adapters was also confirmed in neural machine translation (NMT) tasks by Bapna et al. [12]. However, the number of task- or language-dependent parameters grows rapidly as the number of target tasks or languages increases [18,19].
Pham et al. [18,19] employed weight factorization, a multilingual algorithm, to enable distinct parameters for each language while mitigating the parameter growth associated with adding new target languages. This approach enhances the linear transformation functions of the neural network by decomposing each weight matrix into shared and language-dependent components. Let $x$ represent the input and $y$ the output of weight factorization. The formula for weight factorization is presented as follows:

$$y = \left( W_s \odot \left( r_m s_m^{\top} \right) + r_a s_a^{\top} \right) x$$

where $W_s$ represents the shared weights, $\odot$ denotes element-wise multiplication, and $r_m$, $s_m$, $r_a$, and $s_a$ denote the language-dependent weight vectors. Pham et al. [20] demonstrated the effectiveness of this approach in an ASR task.
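As a concrete illustration, a minimal PyTorch sketch of such a weight-factorized linear layer is given below. The class name, the bias handling, and the per-dialect parameter initialization are illustrative assumptions following the formulation above, not details taken from the implementations of [18,19].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FactorizedLinear(nn.Module):
    """Weight-factorized linear layer: a shared weight matrix is modulated by
    rank-1 multiplicative and additive factors for each language/dialect."""

    def __init__(self, in_dim: int, out_dim: int, num_dialects: int):
        super().__init__()
        self.shared = nn.Linear(in_dim, out_dim)  # shared weights W_s (and bias)
        # dialect-dependent vectors: r_m, s_m (multiplicative), r_a, s_a (additive)
        self.r_m = nn.Parameter(torch.ones(num_dialects, out_dim))
        self.s_m = nn.Parameter(torch.ones(num_dialects, in_dim))
        self.r_a = nn.Parameter(torch.zeros(num_dialects, out_dim))
        self.s_a = nn.Parameter(torch.zeros(num_dialects, in_dim))

    def forward(self, x: torch.Tensor, dialect: int) -> torch.Tensor:
        # W_d = W_s * (r_m s_m^T) + r_a s_a^T for the selected dialect d
        w = self.shared.weight * torch.outer(self.r_m[dialect], self.s_m[dialect])
        w = w + torch.outer(self.r_a[dialect], self.s_a[dialect])
        return F.linear(x, w, self.shared.bias)
```

Only the four rank-1 vectors grow with the number of languages, which keeps the per-language overhead small relative to the shared matrix.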
However, integrating this algorithm into existing pre-trained models presents challenges, and its performance under low-resource conditions remains limited. We propose MDKT, a speech recognition framework specifically designed to overcome the challenges of Jiao-Liao Mandarin. MDKT improves the backbone model by introducing two key modules, WFAdapter and AttAdapter, which use dialect-dependent parameters to enhance adaptability across dialects and facilitate effective multi-dialect knowledge transfer. WFAdapter reduces the number of fine-tuning parameters by decomposing weights into shared and dialect-specific components. Meanwhile, AttAdapter, positioned before the classifier, integrates phonetic and linguistic knowledge from multiple dialects, enabling robust knowledge transfer for the Jiao-Liao Mandarin speech recognition task. The MDKT framework employs a three-phase training strategy, beginning with a primary training phase that establishes the model's foundation, followed by two fine-tuning phases that adapt it to the target dialect. This structured approach ensures adaptability and effectiveness even under low-resource conditions.
3. JLMS25 Dataset
The JLMS25 dataset is the first monophonic speech corpus designed explicitly for the Jiao-Liao Mandarin speech recognition task. It encompasses a wide range of speech data related to idiomatic expressions in Jiao-Liao Mandarin. The dataset is currently available for free download online (https://github.com/Jiao-Liao-Mandarin/JLMS-dataset (accessed on 3 November 2024)).
As illustrated in Figure 1, the dataset construction process comprised the following steps, with strict measures in place to ensure the confidentiality of all volunteers' personal information at every stage:
(1) Corpus Construction. To ensure the richness of the corpus, the transcriptions in the JLMS25 dataset were sourced from two distinct origins: (a) a collection of commonly used expressions and folk proverbs from the Jiao-Liao Mandarin-speaking region, reflecting the unique linguistic features and acoustic variations of the area (examples of folk proverbs and their meanings are shown in Table 1); and (b) transcriptions from a subset of the AISHELL-1 [21] speech corpus, significantly broadening the thematic scope of the corpus to include topics such as ‘finance’, ‘technology’, ‘sports’, ‘entertainment’, and ‘news’.
(2) Volunteer Recruitment. Volunteers were extensively recruited from cities in the Jiaodong Peninsula and Liaodong region, including Qingdao, Yantai, Weihai, Dalian, Dandong, and Yingkou. The selected volunteers, who frequently converse in the local dialect and have no long-term history of residing outside the region, provided a comprehensive reflection of the phonetic characteristics of and acoustic variations in Jiao-Liao Mandarin across different areas.
(3) Speech Collection. The data collection process was carried out both offline and online. During the offline phase, volunteers were placed in a noise-free environment and asked to read each sentence sequentially from a provided script while an Audio-Technica AT2020 high-fidelity microphone recording device was activated. A brief pause of one second was observed after each sentence to facilitate later segmentation. The recorded results were then segmented and verified. Each recording was segmented sentence-by-sentence using Adobe Audition. In the online phase, volunteers recorded the data in quiet environments using smart devices.
(4) Post-processing and Correction. The audio files were processed into 16-bit depth, 16 kHz sampling rate WAV format. To ensure data quality, each audio entry was reviewed. Recordings containing errors, such as mispronunciations or omissions, were excluded from the dataset.
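For reference, the format normalization step can be sketched with torchaudio as follows; the function name and path arguments are illustrative placeholders rather than the script actually used for JLMS25.

```python
import torchaudio
import torchaudio.transforms as T

def normalize_recording(src_path: str, dst_path: str, target_sr: int = 16_000) -> None:
    """Convert one recording to the dataset format: mono, 16 kHz, 16-bit PCM WAV."""
    waveform, sr = torchaudio.load(src_path)
    if waveform.size(0) > 1:                      # down-mix to mono if needed
        waveform = waveform.mean(dim=0, keepdim=True)
    if sr != target_sr:                           # resample to 16 kHz
        waveform = T.Resample(orig_freq=sr, new_freq=target_sr)(waveform)
    torchaudio.save(dst_path, waveform, target_sr,
                    encoding="PCM_S", bits_per_sample=16)
```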
We conducted a statistical analysis of the JLMS25 dataset, which comprises 25 h of speech data from 46 volunteers with a male-to-female ratio of 10:13, likely due to a higher inclination among females to participate in volunteer activities. The age distribution of the volunteers, as shown in Figure 2a, reveals a balanced representation of young, middle-aged, and older adults, providing a diverse range of voices that may contribute to the model's robustness.
Figure 2b highlights a broad range of themes, with News and Idioms being the most prominent categories. This thematic diversity is expected to enhance the robustness of the speech recognition model, enabling it to effectively process a wide variety of topics and expressions in Jiao-Liao Mandarin.
4. Approach
The MDKT architecture primarily consists of an acoustic sequence feature extractor, a dialect feature extractor to capture the distinctive attributes of each dialect, WFAdapter modules, and an AttAdapter module. The structure of the model is illustrated in Figure 3. The source code is publicly accessible online (https://github.com/mixxs/Jiao-Liao_Speech_Recognition (accessed on 15 November 2024)).
4.1. Feature Extraction
The MDKT architecture employs a sequence feature extractor and a dialect feature extractor to capture the contextual and dialect-dependent features of speech signals. The sequence feature extractor is a module capable of extracting contextual information from acoustic features, such as speech waveforms or Mel Frequency Cepstral Coefficients (MFCCs). The dialect feature extractor, on the other hand, is a dialect classification network whose hidden-layer outputs serve as dialect features. The dialect features represent the unique acoustic characteristics of each specific dialect, encapsulating the variations among dialects. By enhancing the sequence features, the dialect features increase the distinctiveness of intermediate results corresponding to speech from various dialects. To further increase the separation among the features of different dialects, we employed both triplet loss [22] and cross-entropy loss to train the dialect feature extractor.
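As an illustration of this training objective, the following PyTorch sketch combines the two losses; the margin, the weighting factor alpha, and the sampling of anchor/positive/negative embeddings are assumptions not specified above.

```python
import torch.nn as nn

# Joint objective for the dialect feature extractor: cross-entropy on the dialect
# label plus a triplet loss that pulls same-dialect embeddings together and pushes
# different-dialect embeddings apart.
ce_loss = nn.CrossEntropyLoss()
triplet_loss = nn.TripletMarginLoss(margin=1.0)

def dialect_extractor_loss(logits, labels, anchor, positive, negative, alpha=1.0):
    # logits: dialect classifier outputs for the anchor utterances
    # anchor/positive/negative: hidden-layer dialect embeddings
    return ce_loss(logits, labels) + alpha * triplet_loss(anchor, positive, negative)
```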
4.2. Adapter Tuning
We employed WFAdapter and AttAdapter to enhance the model’s adaptability to multiple dialects, enabling the model to better extract knowledge from various dialects and thus improve its performance in Jiao-Liao Mandarin speech recognition.
Figure 4 illustrates the structure of WFAdapter and AttAdapter.
WFAdapter primarily consists of two linear layers that employ weight factorization. Inspired by Houlsby et al. [15], the first layer maps the input from $d_s$ dimensions to $d_i$ dimensions, while the second layer maps the features back to $d_s$ dimensions. Residual connections [23] merge the input and output to prevent degradation. Nonlinear transformations are performed using Exponential Linear Units [24], and normalization is conducted using layer normalization [25]. The weight factorization [18] algorithm decomposes weights into a shared weight matrix and four dialect-dependent weight vectors, further reducing the number of dialect-dependent parameters involved in fine-tuning WFAdapter.
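A minimal PyTorch sketch of such an adapter is given below, reusing the FactorizedLinear layer sketched in Section 2. The exact placement of layer normalization relative to the residual connection is an assumption, as it is not specified above.

```python
import torch
import torch.nn as nn

class WFAdapter(nn.Module):
    """Bottleneck adapter whose down- and up-projections use weight factorization
    (FactorizedLinear from the earlier sketch), with layer normalization, ELU,
    and a residual connection."""

    def __init__(self, d_s: int, d_i: int, num_dialects: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_s)
        self.down = FactorizedLinear(d_s, d_i, num_dialects)   # d_s -> d_i
        self.up = FactorizedLinear(d_i, d_s, num_dialects)     # d_i -> d_s
        self.act = nn.ELU()

    def forward(self, x: torch.Tensor, dialect: int) -> torch.Tensor:
        h = self.act(self.down(self.norm(x), dialect))          # bottleneck
        return x + self.up(h, dialect)                          # residual merge
```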
AttAdapter integrates dialect and sequence features. To enable the algorithm to consider the unique phonetic features of dialects, AttAdapter employs dialect features as the keys and values of a multi-head attention mechanism [26], with speech sequence features serving as the queries. A fully connected layer transforms the dialect features before multi-head attention to ensure that their dimensionality matches that of the sequence features. Parameters within AttAdapter are dialect-dependent. By incorporating dialect features into sequence features, AttAdapter increases the variability in the hidden-layer outputs across different dialects, thereby enhancing the model's adaptability to various dialects.
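The following PyTorch sketch illustrates this fusion; the number of attention heads and the residual merge of the attended output are assumptions not stated above.

```python
import torch
import torch.nn as nn

class AttAdapter(nn.Module):
    """Fuses dialect features into sequence features via multi-head attention:
    queries come from the speech sequence, keys/values from projected dialect features."""

    def __init__(self, seq_dim: int, dialect_dim: int, num_heads: int = 8):
        super().__init__()
        self.proj = nn.Linear(dialect_dim, seq_dim)   # match dialect dim to sequence dim
        self.attn = nn.MultiheadAttention(seq_dim, num_heads, batch_first=True)

    def forward(self, seq_feats: torch.Tensor, dialect_feats: torch.Tensor) -> torch.Tensor:
        # seq_feats: (batch, time, seq_dim); dialect_feats: (batch, k, dialect_dim),
        # where k may be 1 if a single dialect embedding is used per utterance
        kv = self.proj(dialect_feats)
        fused, _ = self.attn(query=seq_feats, key=kv, value=kv)
        return seq_feats + fused                      # inject dialect information
```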
4.3. Three-Phase Training Strategy
The training task was divided into three distinct phases. In the first phase, the framework employed multi-task learning to acquire knowledge from both the multi-dialect speech recognition and dialect identification tasks. This knowledge was then transferred to Jiao-Liao Mandarin through fine-tuning in the subsequent phases. However, because only a limited number of dialect-dependent parameters are available in the model, directly fine-tuning them could adversely affect MDKT's performance. To improve the model's fitting capability and minimize the risk of overfitting, the fine-tuning process was further split into two stages. In the second phase, due to the scarcity of Jiao-Liao Mandarin data, the parameters of the dialect feature extractor were frozen to prevent overfitting. The quantized outputs of the dialect feature extractor were then used as the dialect features:
$$\mathrm{QDF} = \frac{1}{n}\sum_{i=1}^{n} \mathrm{DF}_i$$

where QDF represents a quantified dialect feature, $\mathrm{DF}_i$ represents the output of the dialect feature extractor for the $i$-th entry, and $n$ denotes the number of data entries for each target dialect in the training set.
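Assuming the quantization amounts to averaging the extractor outputs over the n target-dialect training entries, as in the equation above, a minimal sketch is:

```python
import torch

def quantize_dialect_feature(extractor_outputs: torch.Tensor) -> torch.Tensor:
    """extractor_outputs: (n, feat_dim) dialect-feature-extractor outputs for the
    n target-dialect training entries; returns their mean as the fixed QDF."""
    return extractor_outputs.mean(dim=0)
```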
In the final stage, only the parameters related to the WFAdapter and AttAdapter modules remained active, while all other parameters were frozen. This approach effectively utilized the limited Jiao-Liao Mandarin data while improving the model’s stability and overall performance.
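The freezing schedule across the three phases can be sketched as follows; the attribute and parameter names are illustrative, not identifiers from the released code.

```python
def configure_phase(model, phase: int) -> None:
    """Set which parameters are trainable in each of the three training phases."""
    for p in model.parameters():
        p.requires_grad = True                              # phase 1: train everything
    if phase >= 2:
        for p in model.dialect_extractor.parameters():      # phase 2: freeze dialect extractor
            p.requires_grad = False
    if phase == 3:
        for name, p in model.named_parameters():            # phase 3: only adapters train
            p.requires_grad = ("wf_adapter" in name) or ("att_adapter" in name)
```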
5. Experiment
5.1. Datasets and Metrics
The datasets utilized in this study consist of the publicly available KeSpeech dataset [4] and the self-constructed JLMS25 dataset. In the first stage of the experiment, data from seven dialects within the KeSpeech dataset, excluding Jiao-Liao Mandarin, were employed. These dialects include Beijing Mandarin, Southwestern Mandarin, Zhongyuan Mandarin, Northeastern Mandarin, Lan-Yin Mandarin, Jiang-Huai Mandarin, and Ji-Lu Mandarin. To mitigate data imbalance in KeSpeech, 30,000 Mandarin speech samples were randomly selected from its training set and incorporated into the training process. During the fine-tuning stage for Jiao-Liao Mandarin, the second and third experimental stages were conducted independently on the Jiao-Liao Mandarin subset of the KeSpeech dataset and on the JLMS25 dataset. For the experiments based on the KeSpeech dataset, its training and test sets were used for model training and testing, respectively. For JLMS25, the data were divided into training and test sets at a 9:1 volume ratio, ensuring a balanced distribution for both model training and testing.
We used Character Error Rate (CER) and Word Error Rate (WER) as the primary evaluation metrics for assessing speech recognition performance. Since the model generates a sequence of Chinese characters, using CER as an evaluation metric is both straightforward and widely accepted. For WER, both the model’s output and the reference text were segmented into sequences of Chinese words. The CER and WER are calculated using the following formulas:
$$\mathrm{CER} = \frac{S_c + D_c + I_c}{N_c}, \qquad \mathrm{WER} = \frac{S_w + D_w + I_w}{N_w}$$

where $S_c$, $D_c$, and $I_c$ are the numbers of substituted, deleted, and inserted characters, respectively, and $N_c$ represents the total number of characters in the reference text; $S_w$, $D_w$, and $I_w$ are the numbers of substituted, deleted, and inserted words, respectively, and $N_w$ represents the total number of words in the reference text.
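For clarity, a self-contained sketch of these metrics based on edit distance is shown below; the function names are illustrative.

```python
def edit_distance(ref: list, hyp: list) -> int:
    """Levenshtein distance (substitutions + deletions + insertions)."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, start=1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,              # deletion
                        dp[j - 1] + 1,          # insertion
                        prev + (r != h))        # substitution (or match)
            prev = cur
    return dp[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate over Chinese character sequences."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)

def wer(ref_words: list, hyp_words: list) -> float:
    """Word Error Rate over pre-segmented word sequences."""
    return edit_distance(ref_words, hyp_words) / len(ref_words)
```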
5.2. Experimental Setup
Given the significant advantages of the transformer architecture [26] in extracting contextual information, our MDKT experiments utilized the pre-trained transformer-based wav2vec 2.0 (https://huggingface.co/wbbbbb/wav2vec2-large-chinese-zh-cn (accessed on 1 November 2024)) [27] as the backbone model for the speech sequence feature extractor. To capture dialect-specific features, we employed the ECAPA-TDNN [28], which enhances the Time Delay Neural Network (TDNN) [29] with the SE-Res2Block [28], improving its effectiveness. The experimental configuration included three WFAdapter modules and one AttAdapter module, with each WFAdapter module set to an internal dimension of 512.
During the training phase, data augmentation techniques were applied to all datasets with a 50% probability. These techniques included gain adjustment, pitch shifting, time stretching, and the addition of Gaussian noise, aimed at enhancing the diversity and robustness of the training data. Optimization was performed using AdamW [30] with a tri-phase learning rate schedule: an initial warm-up for the first 10% of updates, a stable phase for the subsequent 20%, and a linear decay for the remaining updates. The peak learning rate was set at $1 \times 10^{-4}$. Due to limited GPU memory, the initial training phase involved training the distinct components of the model separately: the ASR encoder and WFAdapter were trained using CTC loss [31], while AttAdapter was trained using MSE loss to ensure that its output aligns closely with the acoustic sequence features.
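A sketch of this learning rate schedule using PyTorch's LambdaLR is shown below; the function name and the placeholder total_steps and model are illustrative.

```python
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

def tri_phase_schedule(total_steps: int, warmup_frac: float = 0.1, hold_frac: float = 0.2):
    """Multiplier on the peak LR: linear warm-up over the first 10% of updates,
    constant for the next 20%, then linear decay to zero."""
    warm_end = int(total_steps * warmup_frac)
    hold_end = int(total_steps * (warmup_frac + hold_frac))

    def scale(step: int) -> float:
        if step < warm_end:
            return step / max(1, warm_end)
        if step < hold_end:
            return 1.0
        return max(0.0, (total_steps - step) / max(1, total_steps - hold_end))

    return scale

# Usage (peak LR 1e-4 as in the paper):
# optimizer = AdamW(model.parameters(), lr=1e-4)
# scheduler = LambdaLR(optimizer, lr_lambda=tri_phase_schedule(total_steps))
```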
5.3. Main Results
The experiments were conducted on the KeSpeech and JLMS25 datasets. As shown in Table 2, we first compared MDKT with models from the literature, including SHL-MDNN, JMAA [3], full fine-tuning of the backbone model, and adapters. JMAA exhibited underfitting due to its limited number of parameters. The performance of SHL-MDNN and full fine-tuning, both based on wav2vec 2.0, showed no significant differences, while MDKT demonstrated superior performance on both categories of Jiao-Liao Mandarin speech data.
We also conducted a comparison with the adapter module. Due to computational resource limitations, we could not pre-train the wav2vec 2.0 model on the diverse dialect speech data from the KeSpeech dataset, which prevents a direct comparison with the method proposed by Bapna et al. [12]. However, to evaluate the effectiveness of the WFAdapter and AttAdapter modules, we compared them with a standard adapter module under the same experimental conditions. Both approaches used the three-phase training strategy, with adapter modules added after the ASR encoder. With a similar number of language-dependent parameters, we compared the performance of a three-layer adapter against our method: the adapter had 6.8 M language-dependent parameters, while our method had 7.5 M, an increase of only 0.7 M.
Table 2 shows that this model architecture significantly enhanced the performance of the backbone model on the low-resource Jiao-Liao Mandarin speech data. Our method outperformed the backbone model trained using multi-dialect data. After fine-tuning on the Jiao-Liao Mandarin speech data, CER decreased by 5.4% and 7.7%, and WER decreased by 6.1% and 10.8%, on the KeSpeech Jiao-Liao Mandarin and JLMS25 datasets, respectively. The combination of WFAdapter and AttAdapter also outperformed the individual adapters, with CER lower by 0.4% and 5.4% and WER lower by 0.1% and 7.9% on the KeSpeech Jiao-Liao Mandarin and JLMS25 datasets, respectively. These results demonstrate that WFAdapter and AttAdapter effectively enhance the model's adaptability to various dialects while mitigating the risk of overfitting.
5.4. Ablation Study
To elucidate the contributions of individual components, we conducted ablation studies on MDKT. We evaluated the performance of several configurations: the original wav2vec 2.0 base model, the wav2vec 2.0 model augmented with three WFAdapter layers (wav2vec 2.0 + WFA), the base model enhanced with a single AttAdapter layer (wav2vec 2.0 + AttA), MDKT with all three training phases (MDKT-tri), and MDKT with only the first and third training phases, excluding the second (MDKT-bi). The findings are presented in Table 3.
As shown in Table 3, integrating WFAdapter into the wav2vec 2.0 model resulted in a 3.2% and 4.2% reduction in CER and a 4.2% and 7.1% reduction in WER on the KeSpeech Jiao-Liao Mandarin and JLMS25 datasets, respectively, compared to full fine-tuning. Additionally, incorporating AttAdapter into the wav2vec 2.0 model led to a 2.7% and 5.4% decrease in CER and a 4.6% and 7.5% decrease in WER on the two datasets, respectively. These findings indicate that integrating dialectal and sequence features through AttAdapter is beneficial, and that conveying multi-dialect knowledge via WFAdapter enhances model performance. A comparison between MDKT-tri and MDKT-bi demonstrated that the three-phase training strategy effectively improves model performance on the target dialect.
6. Conclusions
In this study, we constructed and introduced JLMS25, the first publicly available speech dataset specifically curated for dialect recognition in the Jiao-Liao region of China. JLMS25 comprehensively covers common daily topics and incorporates a substantial amount of specialized vocabulary and regional idioms, offering extensive coverage of Jiao-Liao Mandarin’s phonetic and tonal features, thereby contributing to the advancement of Jiao-Liao Mandarin speech recognition. To address the issue of insufficient Jiao-Liao Mandarin speech data, which limits model performance, we introduced the MDKT model architecture. The MDKT architecture incorporates WFAdapter and AttAdapter modules to enhance model performance. A three-phase training strategy is employed to transfer multi-dialect knowledge to low-resource dialects. In the first phase, a multi-dialect AID-ASR multi-task learning strategy is used for training. In the second phase, the dialect feature extractor is frozen, and training continues to fine-tune the model. In the third phase, all modules, except for the adapters, are frozen, and fine-tuning is conducted on the adapters. Experimental results demonstrate that this architecture significantly improves the performance of the ASR backbone model in data-scarce environments.
Although the model performance on low-resource datasets has improved, the method still faces challenges. Future work will focus on expanding the Jiao-Liao Mandarin dataset to cover more linguistic variations, simplifying the training framework, and improving model performance. These efforts aim to address current limitations, advance speech recognition technologies for resource-scarce dialects, and contribute to the preservation of Jiao-Liao Mandarin.