Article

End-to-End Speech Recognition with Deep Fusion: Leveraging External Language Models for Low-Resource Scenarios

1 School of Physics and Electronic Information, Yantai University, Yantai 264005, China
2 Shandong Data Open Innovation Application Laboratory of Smart Grid Advanced Technology, Yantai University, Yantai 264005, China
* Author to whom correspondence should be addressed.
Electronics 2025, 14(4), 802; https://doi.org/10.3390/electronics14040802
Submission received: 19 January 2025 / Revised: 9 February 2025 / Accepted: 10 February 2025 / Published: 19 February 2025

Abstract:
With the rapid development of Automatic Speech Recognition (ASR) technology, end-to-end speech recognition systems have gained significant attention due to their ability to directly convert raw speech signals into text. However, such systems heavily rely on large amounts of labeled speech data, which severely limits model training performance and generalization, especially in low-resource language environments. To address this issue, this paper proposes an end-to-end speech recognition approach based on deep fusion, which tightly integrates an external language model (LM) with the end-to-end model during the training phase, effectively compensating for the lack of linguistic prior knowledge. Unlike traditional shallow fusion methods, deep fusion enables the model and the external LM to share representations and jointly optimize during training, thereby enhancing recognition performance under low-resource conditions. Experiments conducted on the Common Voice dataset show that, in a 10 h extremely low-resource scenario, the deep fusion method reduces the character error rate (CER) from 51.1% to 17.65%. In a 100 h scenario, it reduces the CER by approximately 2.8 percentage points. Furthermore, ablation studies on model layers demonstrate that even when the number of encoder and decoder layers is reduced to decrease model complexity, deep fusion continues to effectively leverage external linguistic priors, significantly improving performance in low-resource speech recognition tasks.

1. Introduction

With the rapid advancement of information technology, Automatic Speech Recognition (ASR) [1] has become an indispensable component in human–computer interaction. In recent years, end-to-end speech recognition systems [2] have garnered significant attention due to their ability to directly convert raw speech signals into text, thereby streamlining the intricate processing procedures inherent in traditional speech recognition systems. However, such systems typically rely on extensively annotated speech datasets to achieve optimal performance [3,4,5]. For Low-resource Language Speech Recognition (LLSR), the scarcity of data severely constrains the training efficacy and generalization capabilities of models [6]. The foremost advantage of end-to-end speech recognition lies in its integration of multiple stages from traditional speech recognition, yet this simultaneously intensifies the demand for large-scale annotated data. In real-world scenarios, low-resource languages often encounter challenges such as diverse accents, significant variations in speech rate, and complex contextual factors. Without sufficiently large and diverse annotated datasets, the recognition performance of models deteriorates markedly.
To address the data scarcity issue in low-resource speech recognition, researchers have proposed a multitude of approaches. Traditional methods—such as data augmentation [7], transfer learning [8,9,10,11], and multi-task learning [12]—have been shown to enhance model robustness and generalization to some extent. Meanwhile, self-supervised learning [13,14,15] and semi-supervised learning [16,17] leverage unlabeled or partially labeled speech data to facilitate the extraction of acoustic features and the learning of language structures. In 2020, a research team at Facebook introduced self-supervised learning models, Wav2Vec 2.0 [18] and HuBERT [19], which were pretrained on unlabeled data from high-resource languages like English. These models effectively transferred the learned representations to low-resource languages, thereby significantly enhancing recognition performance in scenarios with limited annotated data. In 2021, the STraTA team proposed a method that combined task augmentation with self-training. By generating pseudo-label data and expanding the training dataset, they achieved substantial improvements in sample efficiency across 12 low-resource benchmark tasks [20]. In 2022, the Zhang WeiQiang team from Tsinghua University applied unsupervised pre-training models to a zero-resource speech recognition task. Remarkably, they achieved an average Word Error Rate (WER) of 33% without utilizing any speech data from the target language [21]. In 2023, Bartelds et al. enhanced performance on the Gronings language by employing self-training in conjunction with training data generated via Text-to-Speech (TTS) technology, resulting in an improvement of up to 25.5% in accuracy [22]. Most recently in 2024, Andrés Piñeiro-Martín and colleagues fine-tuned the Whisper multilingual ASR model using weighted cross-entropy and data augmentation techniques. Compared to a fine-tuned model that did not incorporate these strategies, this approach reduced the Word Error Rate (WER) in low-resource languages by 6.69% [23].
Although the aforementioned studies have achieved notable progress under conditions of data scarcity, they still fall short in fully exploring and leveraging linguistic prior knowledge. End-to-end systems tend to rely heavily on large-scale annotated datasets to learn sufficiently robust language models, particularly when confronted with complex and variable speech inputs (e.g., dialects, accents, or domain-specific vocabulary). In end-to-end speech recognition, integrating an external LM has long been an effective method for improving system performance. The most common approach is shallow fusion [24,25], where, during inference, the decoding distribution of the end-to-end model is weighted and summed with the probabilities from the external LM. However, this method only utilizes the LM during inference and permits only limited interaction between the acoustic network and the language model [26]. To address the insufficient utilization of language priors in low-resource conditions, this paper proposes an end-to-end speech recognition method based on deep fusion. This method tightly integrates the external language model with the decoding network (RNN-T) during the training phase and jointly optimizes them. It inherits the concept of neural network language model fusion initially applied in fields such as natural language processing, speech recognition, and machine translation. Unlike shallow fusion, which simply applies weighted sums during inference, this method deeply couples the LM's hidden state with the decoder's output through a gating mechanism and a joint network, effectively supplementing language-level prior knowledge when labeled data are very limited. Systematic experiments were conducted using the Common Voice public dataset. The results demonstrate that, in an extremely low-resource scenario of 10 h of data, the application of deep fusion reduced the character error rate (CER) significantly from 51.1% to 17.65%. In a scenario with 100 h of data, a reduction of approximately 2.8 percentage points in CER was observed. Furthermore, ablation studies on the number of model layers revealed that even when the number of encoder and decoder layers is reduced to lower model complexity, deep fusion continues to effectively exploit external linguistic priors, thereby holding promise for delivering significant performance improvements across various low-resource speech recognition tasks.

2. Basic Theory

Speech recognition aims to convert spoken signals into corresponding sequences of text. With the rise in deep learning, neural network-based Automatic Speech Recognition (ASR) systems have become the mainstream approach. In particular, transformer-based models [27] and end-to-end methods have demonstrated outstanding performance in speech recognition. However, under low-resource conditions, effectively leveraging available speech data remains a significant challenge. In this work, we first present an end-to-end speech recognition pipeline that integrates the Zipformer model [28] with the Recurrent Neural Network Transducer (RNN-T) model. Furthermore, recognizing the pivotal role of external language models under low-resource scenarios, we adopt a deep fusion strategy between an external LM and the RNN-T decoder. This is achieved by incorporating gating mechanisms and a joint network to effectively fuse the language priors learned from the external LM with the acoustic features.

2.1. Zipformer

Zipformer is an efficient encoder model tailored for Automatic Speech Recognition (ASR) tasks. It represents an enhancement of the Conformer architecture by integrating self-attention mechanisms with convolutional operations. A key innovation of Zipformer is its incorporation of a “sparse attention” mechanism, which significantly reduces the computational burden typically associated with processing long sequences in transformer-based models. This design choice allows the model to maintain its capacity for capturing global contextual information while achieving improved efficiency.
As shown in Figure 1, Zipformer adopts a U-Net-like multi-resolution architecture, learning temporal sequence representations at different frame rates through a series of Downsample and Upsample operations. The model first applies a convolutional embedding (Conv-Embed) to the original acoustic features sampled at 100 Hz, reducing the frame rate to 50 Hz while mapping the features to the initial embedding dimensions. The downsampled 50 Hz feature sequence is then passed through six encoder stacks that alternately downsample and upsample the data, operating at frame rates of 50 Hz, 25 Hz, 12.5 Hz, 6.25 Hz, 12.5 Hz, and 25 Hz, respectively. Each stack module integrates self-attention and convolutional operations, enabling the capture of rich contextual information across different time scales. The outputs of these stacked modules are appropriately truncated or zero-padded before being fused. Finally, a Downsample module unifies the frame rate to 25 Hz, producing the encoder's final output features, as illustrated in the sketch below.
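To make the frame-rate handling concrete, the following sketch shows one simple way such Downsample/Upsample pairs can be realized: halving the frame rate by averaging adjacent frames and restoring it by repeating frames. This is an illustrative approximation only; the actual Zipformer implementation uses learned weightings for these operations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Downsample(nn.Module):
    """Halve the frame rate by averaging adjacent frames (illustrative)."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, D); pad T to an even length, then average frame pairs
        if x.shape[1] % 2:
            x = F.pad(x, (0, 0, 0, 1))
        return 0.5 * (x[:, 0::2] + x[:, 1::2])


class Upsample(nn.Module):
    """Restore the frame rate by repeating each frame (illustrative)."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x.repeat_interleave(2, dim=1)
```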
Unlike standard multi-head self-attention, Zipformer decomposes the attention computation into two key steps: In the first step, the attention weights $A \in \mathbb{R}^{N \times D}$ are computed from the input sequence $H \in \mathbb{R}^{N \times D}$. This process is analogous to the traditional Softmax computation, but only the resulting weight matrix $A$ is retained. Here, $H$ represents the feature matrix of the input sequence, $N$ denotes the number of time steps, and $D$ represents the dimensionality (or the number of channels) of the feature vectors. In the second step, once the attention weight matrix $A$ is obtained, both the subsequent Self-attention (SA) module and the newly introduced Non-linear Attention (NLA) module can reuse the same $A$. This allows multiple sub-modules to perform various forms of feature transformations without the need to recompute the large-scale $QK^{\top}$. Specifically, the Self-attention operates similarly to conventional self-attention, as computed by Equation (1):
$\mathrm{SA}(H) = A \odot V,$
here, $\odot$ denotes either element-wise or matrix multiplication, and $V$ represents the value matrix derived from the input sequence. The Non-linear Attention (NLA) mechanism first projects the input into multiple branches (e.g., $A$, $B$, and $C$), then applies operations such as $\tanh(B)$ and the $\odot$ operation with $A$, and ultimately merges the results back to the original dimensionality, as illustrated in Equation (2):
$\mathrm{NLA}(H) = W\left(A \odot \mathrm{attention}\left(\tanh(B) \odot C\right)\right),$
here, $\tanh(\cdot)$ denotes the hyperbolic tangent activation function, which does not fully suppress negative values, thereby enhancing the model's ability to express non-linear relationships. The $\mathrm{attention}(\cdot)$ operation performs matrix multiplication or dot product based on the previously computed attention weights $A$, enabling the exploration of non-linear combinations of global information. Through this approach, Zipformer reuses the attention weights within a single Zipformer block, significantly reducing the quadratic computational complexity that typically arises with multi-head attention in long-sequence scenarios.
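As a concrete illustration of this weight-sharing idea, the following PyTorch sketch computes the attention matrix once and reuses it in both a plain self-attention branch and a non-linear-attention branch. The module name, single-head simplification, projection layout, and residual combination are assumptions for illustration, not the exact Zipformer implementation.

```python
import torch
import torch.nn as nn


class SharedAttentionBlock(nn.Module):
    """Minimal single-head sketch: compute attention weights A once,
    then reuse them in both the SA and NLA branches (Eqs. (1)-(2))."""

    def __init__(self, d_model: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)          # V for the SA branch
        self.abc_proj = nn.Linear(d_model, 3 * d_model)    # A, B, C for the NLA branch
        self.out_proj = nn.Linear(d_model, d_model)        # W in Eq. (2)
        self.scale = d_model ** -0.5

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (N, D) -- N time steps, D channels
        q, k = self.q_proj(h), self.k_proj(h)
        attn = torch.softmax(q @ k.transpose(0, 1) * self.scale, dim=-1)  # A, computed once

        # SA branch (Eq. (1)): attention weights applied to the value matrix
        sa_out = attn @ self.v_proj(h)

        # NLA branch (Eq. (2)): reuses the same attention weights A
        a, b, c = self.abc_proj(h).chunk(3, dim=-1)
        nla_out = self.out_proj(a * (attn @ (torch.tanh(b) * c)))

        return h + sa_out + nla_out  # residual combination (illustrative)
```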
By incorporating sparse attention during the encoding stage, Zipformer reduces computational complexity while still retaining the capability to capture global information. In addition, Zipformer replaces the conventional LayerNorm with BiasNorm to mitigate normalization degradation issues encountered during the initial stages of training, as illustrated in Equation (3):
$\mathrm{BiasNorm}(x) = \dfrac{x}{\mathrm{RMS}[x - b]} \cdot e^{\gamma},$
here, $\mathrm{RMS}[\cdot]$ denotes the root-mean-square operation, $b$ is a channel-level learnable bias, and $\gamma$ is a learnable scalar.
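A minimal BiasNorm implementation following Eq. (3) might look like the sketch below; the module and parameter names are illustrative assumptions rather than the exact Zipformer/icefall code.

```python
import torch
import torch.nn as nn


class BiasNorm(nn.Module):
    """Sketch of Eq. (3): x / RMS[x - b] * exp(gamma), normalizing over channels."""

    def __init__(self, num_channels: int, eps: float = 1e-8):
        super().__init__()
        self.bias = nn.Parameter(torch.zeros(num_channels))  # channel-wise learnable bias b
        self.log_scale = nn.Parameter(torch.zeros(()))        # learnable scalar gamma
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (..., num_channels)
        rms = (x - self.bias).pow(2).mean(dim=-1, keepdim=True).add(self.eps).sqrt()
        return x / rms * self.log_scale.exp()
```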
To address the issue of excessively small gradients in the negative region encountered by the Swish activation function, Zipformer introduces two novel activation functions, as illustrated in Equations (4) and (5):
$\mathrm{SwooshR}(x) = \log\left(1 + e^{x-1}\right) - 0.08x - 0.313261687,$
$\mathrm{SwooshL}(x) = \log\left(1 + e^{x-4}\right) - 0.08x - 0.035,$
specifically, the term $\log\left(1 + e^{x-c}\right)$ represents an offset transformation applied to the Swish activation function, defined as $\mathrm{Swish}(x) = x\,\sigma(x)$, where $\sigma(\cdot)$ denotes the sigmoid function. By introducing an appropriate constant offset $c$, this formulation maintains non-vanishing gradients for negative inputs while avoiding the instability often associated with exponential computations.
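The two activations in Eqs. (4) and (5) can be written directly as below (a sketch; the constants are taken from the equations above, and `torch.nn.functional.softplus` is used for the numerically stable log(1 + e^x) term).

```python
import torch
import torch.nn.functional as F


def swoosh_r(x: torch.Tensor) -> torch.Tensor:
    # Eq. (4): log(1 + e^(x - 1)) - 0.08*x - 0.313261687
    return F.softplus(x - 1.0) - 0.08 * x - 0.313261687


def swoosh_l(x: torch.Tensor) -> torch.Tensor:
    # Eq. (5): log(1 + e^(x - 4)) - 0.08*x - 0.035
    return F.softplus(x - 4.0) - 0.08 * x - 0.035
```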
To render the training process insensitive to parameter scaling, Zipformer introduces an improved variant of the Adam optimizer. The conventional Adam update is given in Equation (6):
$\Delta_t = -\alpha_t \cdot \dfrac{\sqrt{1-\beta_2^t}}{1-\beta_1^t} \cdot \dfrac{m_t}{\sqrt{v_t}+\varepsilon},$
building on the conventional formulation, they further incorporate a parameter scaling factor $r_{t-1}$, thereby yielding Equation (7):
$\Delta_t = -\alpha_t \cdot r_{t-1} \cdot \dfrac{\sqrt{1-\beta_2^t}}{1-\beta_1^t} \cdot \dfrac{m_t}{\sqrt{v_t}+\varepsilon},$
additionally, r itself is explicitly learned, thereby enabling faster convergence and more stable performance improvements.
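For intuition, the fragment below sketches how the scaled update in Eq. (7) differs from a plain Adam step for a single parameter tensor. Using the parameter's root-mean-square value as a stand-in for the scaling factor r_{t-1} is an assumption made here purely for illustration; in the actual method, as noted above, the scale is learned, so this is not the ScaledAdam implementation.

```python
import torch


def scaled_adam_step(param, grad, m, v, step, lr=0.045,
                     beta1=0.9, beta2=0.98, eps=1e-8):
    """One illustrative update following Eq. (7). m and v are the running
    first/second moment estimates; all tensors share param's shape."""
    m.mul_(beta1).add_(grad, alpha=1.0 - beta1)
    v.mul_(beta2).addcmul_(grad, grad, value=1.0 - beta2)

    bias_correction = (1.0 - beta2 ** step) ** 0.5 / (1.0 - beta1 ** step)
    r = param.norm() / (param.numel() ** 0.5)   # parameter RMS as a stand-in for r_{t-1}
    delta = -lr * r * bias_correction * m / (v.sqrt() + eps)
    param.add_(delta)
    return param, m, v
```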
Zipformer achieves more efficient modeling of long-duration sequences by employing hierarchical Downsample in the encoder, reusing attention weights, and incorporating improved normalization and activation functions. In contrast to traditional methods that model sequences at a single frame rate, Zipformer alternates between multiple frame rates when processing temporal sequences. This approach significantly alleviates the computational burden associated with large-scale attention mechanisms. Moreover, the lower frame rates facilitate a larger receptive field, while the higher frame rates preserve finer details. Collectively, these design choices endow Zipformer with enhanced capabilities for capturing both long-term dependencies and short-term dynamics.

2.2. RNN-T

In the context of Automatic Speech Recognition, a high-performance encoder alone is insufficient to complete the end-to-end recognition process; a decoder is also required to generate the corresponding text sequence from the acoustic representations. Therefore, this work integrates the Zipformer encoder with the RNN-T framework [29], enabling the model to capture long-term dependencies while simultaneously addressing the demands of real-time decoding, thereby facilitating comprehensive end-to-end training and inference.
The RNN-T model is essentially an extension and improvement of the CTC model [30], directly performing joint modeling between the acoustic features and the output label sequence, thereby significantly simplifying the system architecture. As shown in Figure 2, the RNN-T model mainly comprises three core components: the encoder, the prediction network, and the joint network.
In this framework, the encoder accepts a variable-length feature sequence $x = (x_1, x_2, \ldots, x_T)$, typically derived from the acoustic frontend—in this study, the output of the Zipformer encoder. The encoder transforms the input into a high-level temporal representation $h_t^{\mathrm{enc}} \in \mathbb{R}^{D}$ that is capable of capturing both temporal and spectral properties of the speech signal. The prediction network, operating as a conditional language model, takes as input the previously predicted output symbols $(y_1, y_2, \ldots, y_{u-1})$ and generates a hidden state representation $p_u$ through autoregressive prediction based solely on historical outputs when no new acoustic feature is provided. Subsequently, the joint network fuses the encoder output $h_t^{\mathrm{enc}}$ and the prediction network output $p_u$ to generate the combined representation $z_{t,u}$, which is then passed through a Softmax layer to yield the label probability distribution $p(y_{t,u} \mid z_{t,u})$ at each time step $(t, u)$. During the decoding process, RNN-T performs a dual search over both the time steps $t$ (in the acoustic domain) and the output steps $u$ (in the symbolic domain) to identify the most probable output sequence.
RNN-T jointly trains the encoder, prediction network, and joint network by maximizing the log likelihood of the complete label sequence, thereby eliminating the need for forced alignment—an essential step in traditional ASR systems. Since the search occurs simultaneously in both the time dimension $t$ and the output dimension $u$, RNN-T does not require explicit alignment information. Instead, the model learns to automatically decide which frame should produce an output symbol and which frame should output a blank, thus capturing the variable-length speech-to-text correspondences. The objective function of RNN-T is typically $\log p(y \mid x)$, where the maximum likelihood training is performed on the conditional probability of the entire sequence $y = (y_1, \ldots, y_u)$ given the input speech $x$. Similarly to CTC, a forward–backward algorithm is used within an implicit alignment lattice to sum over all possible alignments, thus providing end-to-end gradients without explicit alignment constraints.

2.3. LM and Decoding Strategies

An LM, typically trained independently on large-scale text corpora, is designed to capture statistical regularities and contextual dependencies within the language.
In this paper, a deep fusion strategy is primarily employed to seamlessly integrate an LM with the RNN-T decoder. During training, a joint optimization mechanism is used to combine the hidden representation from the LM with the output of the RNN-T decoder’s prediction network, forming a unified fusion representation for subsequent joint network computation. Specifically, the external LM is first pretrained on annotated text data using an LSTM architecture to learn the syntactic and semantic structures of the language. The pretrained LM parameters are then fixed to avoid performance degradation due to overfitting in low-resource settings. Thereafter, a Fusion Layer is introduced between the RNN-T decoder’s prediction network and the joint network. This layer receives the hidden state $p_u$ from the prediction network and the hidden state $p_{\mathrm{LM}}$ from the external LM, and, through a gating mechanism, linearly combines them to produce the fused representation $p_{\mathrm{fusion}}$, as illustrated in Equation (8):
$p_{\mathrm{fusion}} = \sigma\left(w_g [p_u; p_{\mathrm{LM}}] + b_g\right) \odot p_u + \left(1 - \sigma\left(w_g [p_u; p_{\mathrm{LM}}] + b_g\right)\right) \odot p_{\mathrm{LM}},$
here, $\sigma$ denotes the sigmoid function, $w_g$ and $b_g$ are learnable parameters, and $\odot$ represents element-wise multiplication. This fusion structure is illustrated in Figure 3.
After fusion, the representation $p_{\mathrm{fusion}}$ is fed into the joint network where it is combined with the acoustic features $h_t^{\mathrm{enc}}$ produced by the Zipformer encoder, resulting in the final output distribution, as illustrated in Equations (9) and (10):
$z(t, T) = f_{\mathrm{joint}}\left(h_t^{\mathrm{enc}}, p_{\mathrm{fusion}}(T)\right),$
$p(y_{t,T} \mid x) = \mathrm{softmax}\left(w_2\, z(t, T) + b_2\right),$
here, $t$ represents the acoustic time step, $T$ represents the decoder time step, and $w_2$ and $b_2$ are the parameters of the joint network.
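Putting Eqs. (8)–(10) together, a minimal fusion-plus-joint-network module could look like the following sketch. The dimensions, module names, projection of the LM state into the prediction-network space, and the additive joint combination are assumptions for illustration, not the exact implementation used in this work.

```python
import torch
import torch.nn as nn


class DeepFusionJoint(nn.Module):
    """Sketch of the gated fusion (Eq. (8)) followed by the joint network (Eqs. (9)-(10))."""

    def __init__(self, enc_dim: int, pred_dim: int, lm_dim: int,
                 joint_dim: int, vocab_size: int):
        super().__init__()
        self.lm_proj = nn.Linear(lm_dim, pred_dim)       # map p_LM into the prediction-net space
        self.gate = nn.Linear(2 * pred_dim, pred_dim)    # w_g, b_g in Eq. (8)
        self.enc_proj = nn.Linear(enc_dim, joint_dim)
        self.dec_proj = nn.Linear(pred_dim, joint_dim)
        self.out = nn.Linear(joint_dim, vocab_size)      # w_2, b_2 in Eq. (10)

    def forward(self, h_enc, p_u, p_lm):
        # h_enc: (B, T, enc_dim); p_u: (B, U, pred_dim); p_lm: (B, U, lm_dim)
        p_lm = self.lm_proj(p_lm)
        g = torch.sigmoid(self.gate(torch.cat([p_u, p_lm], dim=-1)))   # gate in Eq. (8)
        p_fusion = g * p_u + (1.0 - g) * p_lm

        # Joint network: broadcast-add encoder and fused decoder representations
        z = torch.tanh(self.enc_proj(h_enc).unsqueeze(2) +   # (B, T, 1, joint_dim)
                       self.dec_proj(p_fusion).unsqueeze(1)) # (B, 1, U, joint_dim)
        # Returns logits of shape (B, T, U, vocab); the softmax of Eq. (10) is applied
        # inside the transducer loss or during decoding.
        return self.out(z)
```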

3. Training Process

3.1. Dataset and Preprocessing

The primary training dataset used in this study is the Chinese Cantonese subset of the Common Voice dataset. Common Voice is a multilingual, multi-accent open-source speech dataset, with the Cantonese portion consisting of recordings from volunteers from different regions. The dataset features diverse accents and complex backgrounds, making it highly valuable for evaluating model performance in real-world scenarios. In this study, we selected subsets of 10 h and 100 h to simulate extremely low-resource and medium-scale training environments. During the data preprocessing stage, we cleaned the original text by removing punctuation marks and eliminated anomalous audio clips shorter than 1 s. The splits for the training, validation, and test sets were made according to the proportions recommended by the official guidelines.
To enhance the model’s adaptability to varying speaking speeds and pitch ranges, a speed perturbation technique was applied during the speech data processing phase. Two speed factors, 0.9 and 1.1, were used to stretch or compress the playback speed of the original audio, thus generating more diverse speech samples. Speed perturbation simulates natural fluctuations in human speaking speed, while also helping the model adapt to different tonal qualities and speech habits. This operation effectively expands the data size and diversity, providing a richer variety of speech samples for subsequent model training.
Additionally, to simulate the noisy environments encountered in the real world and improve the model’s robustness in noisy conditions, noise injection was introduced during the data preprocessing phase. Noise segments of a certain length were randomly selected from publicly available noise datasets and background noises recorded in real environments and mixed with the speech signals at various signal-to-noise ratios (SNRs). This allows the network to learn to differentiate between speech signals and background noise during training, improving the model’s robustness in real-world scenarios where noise is unpredictable and variable.
After the augmentation and noise mixing processes, all audio samples were standardized to a 16 kHz mono format to reduce inconsistencies in features caused by varying sampling rates and channel counts. In feature extraction, the Filter Bank (fbank) method was employed to extract 80-dimensional fbank features. Each speech frame lasted 25 ms with a 10 ms frame shift. Following extraction, additional steps such as windowing and energy normalization were applied. Compared to traditional features like MFCC, fbank features are better at retaining the low-frequency details of the acoustic signal and are more suited to the spectral information requirements of neural networks during the learning process. These feature vectors, together with their corresponding labels, formed the training dataset, laying the foundation for the subsequent construction and optimization of the model.
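As an illustration of this preprocessing pipeline, the sketch below applies speed perturbation, mixes in noise at a chosen signal-to-noise ratio, and extracts 80-dimensional fbank features with a 25 ms window and 10 ms shift using torchaudio. The file path, noise tensor, and default SNR are placeholders, and the noise clip is assumed to be at least as long as the utterance.

```python
import torch
import torchaudio


def load_and_augment(wav_path: str, noise: torch.Tensor, speed: float = 1.1,
                     snr_db: float = 15.0, target_sr: int = 16000):
    wav, sr = torchaudio.load(wav_path)                   # (channels, samples)
    wav = wav.mean(dim=0, keepdim=True)                   # force mono

    # Speed perturbation (0.9x / 1.1x) and resampling to 16 kHz via sox effects
    effects = [["speed", f"{speed}"], ["rate", f"{target_sr}"]]
    wav, sr = torchaudio.sox_effects.apply_effects_tensor(wav, sr, effects)

    # Noise injection at the requested signal-to-noise ratio
    noise = noise[:, : wav.shape[1]]                      # assumes noise >= utterance length
    speech_power = wav.pow(2).mean()
    noise_power = noise.pow(2).mean().clamp(min=1e-10)
    scale = (speech_power / (10 ** (snr_db / 10) * noise_power)).sqrt()
    wav = wav + scale * noise

    # 80-dim fbank features, 25 ms frames with a 10 ms shift
    feats = torchaudio.compliance.kaldi.fbank(
        wav, num_mel_bins=80, frame_length=25.0, frame_shift=10.0,
        sample_frequency=target_sr)
    return feats                                          # (num_frames, 80)
```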

3.2. Training Configuration

The choice of optimizer in training speech models significantly impacts both the convergence speed and stability of the model. In this study, the Adam optimizer was selected to update the model parameters. Adam combines the advantages of momentum and adaptive learning rates, allowing for relatively fast and stable convergence even with large-scale corpora and complex network structures. In this experiment, the initial learning rate was set to 0.045, which balances fast initial training against excessive oscillation. Training was performed on two V100 GPUs for a total of 40,000 steps over 50 epochs, with the batch size adjusted dynamically according to the amount of data in each batch.
Furthermore, during training, gradient clipping was applied to prevent the issue of gradient explosion. When training deep neural networks, excessively large gradients during error backpropagation can lead to overzealous parameter updates, destabilizing the model and potentially causing the loss function to return “Not-a-Number” (NaN) values. In this experiment, the gradients were clipped by their norm, constraining the maximum gradient norm within a predefined threshold. This ensures that the update steps remain stable even in the later stages of training, which helps improve both convergence speed and robustness on complex speech tasks.
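A minimal training step with Adam and norm-based gradient clipping, reflecting the configuration above, might look like the following sketch. The model, data loader, batch layout, and the clipping threshold of 5.0 are placeholders rather than the exact settings of this work.

```python
import torch


def train_one_epoch(model, optimizer, train_loader, max_grad_norm=5.0):
    """One illustrative epoch: Adam updates with norm-based gradient clipping."""
    for feats, feat_lens, targets, target_lens in train_loader:
        loss = model(feats, feat_lens, targets, target_lens)  # RNN-T loss (Section 3.4)
        optimizer.zero_grad()
        loss.backward()
        # Constrain the global gradient norm so late-stage updates stay stable
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_grad_norm)
        optimizer.step()


# optimizer = torch.optim.Adam(model.parameters(), lr=0.045)
```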

3.3. Deep Fusion

First, we trained an initial Zipformer model on 100 h of data to establish basic acoustic modeling capabilities and subsequently trained an LM separately. To further enhance the language priors and decoding performance of the low-resource speech recognition system, we trained and integrated an RNN language model based on LSTM (Long Short-term Memory) on top of the RNN-T model. This LM employs a 6-layer LSTM architecture, with each layer having a hidden state dimension of 800. Thanks to the gating mechanisms in LSTM (input gate, forget gate, and output gate), the model can effectively memorize and update critical information over long sequences, thereby achieving strong performance in the language modeling task.
As shown in Figure 4, when initializing the RNN-T decoder, this paper fuses the LM’s hidden output $p_{\mathrm{LM}}$ with the output of the RNN-T prediction network (i.e., the prediction network in the Transducer). The LM performs forward propagation on the input word sequence to generate a series of hidden state representations that reflect the contextual information of the words at the current time step. Since LSTM can effectively capture long-term dependencies, it provides rich language prior knowledge for subsequent decoding. Unlike conventional CTC decoders, the RNN-T decoder incorporates a prediction network whose input is the previously generated token sequence and which outputs a prediction vector at the current time step; these prediction vectors encode the model’s linguistic predictions conditioned on the previously emitted tokens. The fusion strategy involves introducing a Fusion Layer before the joint network, which performs a gated, weighted combination of the LM’s hidden output $p_{\mathrm{LM}}$ and the prediction network’s output $p_u$, thereby further enriching the contextual features for subsequent decoding. The goal of this fusion strategy is to leverage the grammatical and semantic information provided by the language model while preserving the acoustic model’s ability to capture speech signals, thus reducing spelling errors and semantic incoherence during decoding. Subsequently, the output of the Fusion Layer, $p_{\mathrm{fusion}}$, is combined with the output of the Zipformer encoder, $h_t^{\mathrm{enc}}$, and used as the input to the joint network. The joint network integrates these two types of information to compute the final emission distribution. After processing through the joint network, the model is able to make more precise predictions based on the fused contextual features. During the training of the fusion network, we employ the same objective function as that used for the original RNN-T model, ensuring that the model’s optimization direction remains consistent with the overall system objectives. Through joint training, the acoustic model’s decoder and the external language model can be optimized collaboratively, thereby fully leveraging their respective strengths in speech recognition tasks.
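The external LM described above (6 LSTM layers, hidden size 800) can be sketched as follows; the vocabulary size, embedding dimension, and class name are assumptions made for illustration.

```python
import torch
import torch.nn as nn


class LstmLanguageModel(nn.Module):
    """Sketch of the external LM: 6-layer LSTM with hidden size 800."""

    def __init__(self, vocab_size: int, embed_dim: int = 800,
                 hidden_dim: int = 800, num_layers: int = 6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=num_layers,
                            batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens, state=None):
        # tokens: (B, U) previously emitted symbols
        h, state = self.lstm(self.embed(tokens), state)
        return self.out(h), h, state   # logits, hidden states p_LM, recurrent state


# During deep fusion the pretrained LM is frozen; only its hidden states are used:
# for p in lm.parameters():
#     p.requires_grad_(False)
```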

3.4. Loss Function

The primary loss function used in this experiment is the RNN-T loss function, which is specifically designed for sequence-to-sequence tasks. Unlike traditional frame-level alignment-based loss functions, the RNN-T loss can learn the alignment between the input speech sequence and the target text sequence without relying on precise alignment information. This is achieved through dynamic programming. The RNN-T loss can be divided into two forms: Simple Loss and Pruned Loss. The RNN-T formula is as follows, as illustrated in Equation (11):
$\mathcal{L}_{\mathrm{RNN\text{-}T}} = -\log p(y \mid x) = -\log \displaystyle\sum_{z} p(y, z \mid x),$
where $x$ represents the features extracted from the input speech sequence, $y$ denotes the corresponding target output sequence, and $p(y \mid x)$ is defined by the RNN-T model. By minimizing this loss function, we can jointly optimize the encoder, prediction network, and joint network, enabling more accurate integration of acoustic and linguistic information during the output prediction process.
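For reference, the snippet below sketches how this loss can be computed on joint-network logits using torchaudio's transducer loss; the tensor shapes follow the (batch, T, U+1, vocab) convention, and the blank index of 0 and the random dimensions are assumptions for illustration.

```python
import torch
import torchaudio.functional as F_audio

B, T, U, V = 2, 50, 10, 5000          # batch, acoustic frames, target length, vocab size
logits = torch.randn(B, T, U + 1, V)  # joint-network outputs z_{t,u}
targets = torch.randint(1, V, (B, U), dtype=torch.int32)
logit_lengths = torch.full((B,), T, dtype=torch.int32)
target_lengths = torch.full((B,), U, dtype=torch.int32)

# Negative log-likelihood summed over all alignments, as in Eq. (11)
loss = F_audio.rnnt_loss(logits, targets, logit_lengths, target_lengths,
                         blank=0, reduction="mean")
```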

3.5. Evaluation Metrics

The most commonly used evaluation metrics in speech recognition tasks are error rates, which are further classified according to the modeling unit, such as phoneme error rate, character error rate (CER), word error rate (WER), and sentence error rate. Since written Cantonese is expressed in Chinese characters that correspond closely to syllables, and the recognition unit in this experiment is character-level output, CER is the most suitable metric.
CER is primarily measured by calculating the edit distance (including substitution, insertion, and deletion operations) between the recognition result and the reference text. The formula for CER is as follows:
$\mathrm{CER} = \dfrac{S + D + I}{N}$
where S denotes the number of substitutions, D is the number of deletions, I is the number of insertions, and N is the total number of characters in the reference text.
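A straightforward CER implementation via character-level edit distance, matching the formula above (a sketch; no normalization of punctuation or casing is applied, and the example strings are hypothetical).

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: (S + D + I) / N via Levenshtein distance."""
    ref, hyp = list(reference), list(hypothesis)
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)


print(cer("今天天氣真好", "今天天氣很好"))  # 1 substitution / 6 characters ≈ 0.167
```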

4. Results Comparison and Analysis

4.1. Main Results

Table 1 presents the ASR results on the 10 h and 100 h subsets of the Common Voice dataset, with CER evaluated on the standard test set.
From the table, it is evident that compared to training with only 10 h of data, the error rates for the 100 h dataset drop substantially. Furthermore, at the same data size, the CER is lower when deep fusion of the LM is used. Specifically, in the 10 h scenario, LM fusion reduced the error rate by 33.45 percentage points (from 51.1% to 17.65%), while in the 100 h scenario the reduction was 2.82 percentage points (from 3.89% to 1.07%).

4.2. Impact of Data Scale Differences

On the 10 h dataset, without LM decoding, the model faces severely limited training data. With both acoustic and linguistic information being sparse, the model is more prone to homophone confusion or multi-character errors when recognizing longer sentences or more complex colloquial expressions. When LM decoding is used, the external LM provides significant assistance in the 10 h scenario. In cases where the acoustic model struggles to accurately identify certain phonemes, the language context provided by the LM helps correct the errors. The results indicate that the LM is especially beneficial in scenarios with very limited speech data, demonstrating that linguistic priors can effectively compensate for the lack of acoustic information.
As the corpus size increases to 100 h, the model is able to learn more diverse phoneme variants and accent features. Without LM decoding, the error rates are significantly lower compared to the 10 h scenario. With a larger training dataset, the model’s adaptability to common words and variations in speech rate improves. After combining more comprehensive acoustic learning with LM fusion, the overall error rate further decreases. When processing colloquial expressions and sentences with strong contextual dependencies, the LM can correct outputs that the acoustic model finds difficult to distinguish, thus reducing segmentation errors or substitution errors.

4.3. Performance Improvement from LM Fusion

In the 10 h scenario, the LM effectively reduces recognition errors arising from homophones and heteronyms, resulting in a substantial overall improvement. This indicates that when data are scarce, the semantic and linguistic structural information provided by the LM serves as a critical complement to the model. In the 100 h scenario, although the magnitude of improvement is relatively smaller compared to the 10 h case, the LM still demonstrates a robust gain.

4.4. Ablation Study

To control the model complexity and reduce dependency on hardware resources, this study explores reducing the network depth of the Zipformer encoder and the RNN-T decoder. Specifically, the original configuration of six layers for the encoder and six layers for the decoder is gradually reduced to configurations such as 4-4 layers, with comparative experiments conducted on the same dataset. The results are presented in Table 2.
It can be observed that reducing the network depth from a 6-6 layer configuration to a 4-4 layer configuration causes the model’s CER to increase slightly from 3.89% to 4.02%, indicating that scaling down the model results in some performance degradation. However, in resource-constrained scenarios, appropriately reducing the network layers can significantly lower the computational cost during both training and inference, thereby positively impacting deployment cost and efficiency. Based on practical requirements, subsequent experiments in this study primarily employ a 4-4 layer configuration or even shallower networks to balance recognition accuracy with computational overhead.
With the encoder and decoder network sizes fixed, this study further assesses the impact of the fusion strategy on recognition performance by comparing shallow fusion and deep fusion approaches for incorporating the external language model. Table 3 lists the CER results on the test set for both fusion methods on the 10 h and 100 h training datasets.
From Table 3, it can be observed that the deep fusion method achieves CERs of 17.65% and 1.07% on the test set in the 10 h and 100 h scenarios, respectively, improving over shallow fusion’s 37.73% and 3.35%. This study suggests that deep fusion tightly integrates the external LM with the RNN-T decoder during the training phase, establishing a deeper collaborative relationship between acoustic and linguistic modeling. In contrast, shallow fusion simply adds the log probabilities of the LM during decoding, which, although cost-effective and requiring minimal changes to the existing model, offers a more limited degree of integration. Thus, deep fusion is especially valuable in low-resource or high-precision scenarios.

5. Conclusions

In summary, this study conducted a systematic speech recognition experiment targeting low-resource scenarios, comparing the effects of varying data sizes, external LM fusion methods, and network scales on recognition performance.
Data Size and Performance: When the training data increased from 10 h to 100 h, the model’s CER significantly decreased, indicating that corpus size plays a decisive role in the generalization ability and accuracy of the acoustic model in speech recognition tasks. With limited data (e.g., 10 h), the model struggles to learn sufficient phoneme variants and linguistic features, which limits its performance. However, when the data increased to 100 h, the model could better learn the diversity of different accents, speech rates, and contextual variations, achieving good performance even without an external LM in common scenarios.
Effect of LM Fusion: In the 10 h “scarcity of data” environment, the introduction of an external LM significantly improved recognition performance, particularly in distinguishing homophones and recognizing complex sentence structures. The LM’s linguistic priors effectively complemented the insufficient acoustic information, thus reducing the error rate. In the 100 h scenario, the LM’s assistance was still effective, further reducing the error rate, especially when handling colloquial expressions and sentences with strong cross-word boundary dependencies, where the LM helped to correct character confusions at the acoustic level, thus reducing segmentation or substitution errors.
Fusion Strategy and Performance Differences: Deep fusion consistently outperformed shallow fusion in both the 10 h and 100 h scenarios. This is because deep fusion tightly couples with the RNN-T decoder during training, allowing both acoustic modeling and language modeling to be co-optimized, which fundamentally improves the model’s adaptability to diverse speech. Although shallow fusion incurs fewer changes to the existing model and is cost-effective, it only adds weighted LM probabilities during decoding, offering limited fusion effects. In low-resource or critical applications that demand higher recognition accuracy, deep fusion is more valuable.
Model Scale and Efficiency Trade-off: When the encoder and decoder network depth was reduced from 6-6 layers to 4-4 layers, the CER slightly increased but the computational cost for training and inference decreased significantly, which meets the needs of resource-constrained environments. By adjusting the network depth, a balance can be struck between recognition accuracy and computational cost, offering feasible solutions for system deployment under different computational power conditions.
This study demonstrates the advantages of deep fusion in low-resource speech recognition systems. In particular, when data are scarce, deep fusion effectively compensates for the lack of acoustic information, significantly enhancing recognition performance.
Nevertheless, although this method yields significant improvements in accuracy, the additional computational overhead may pose challenges for real-time or embedded deployments, particularly in latency-sensitive scenarios. Future research will therefore focus on model compression strategies—such as knowledge distillation [31] or model pruning—to effectively reduce inference latency and computational burden while preserving recognition accuracy. Furthermore, extending this approach to other low-resource languages, especially those with linguistic structures markedly different from Cantonese, is essential for further evaluating and enhancing its generalizability and robustness. Lastly, while this work employs the Adam optimizer to balance convergence speed and training stability, hyperparameter tuning remains a non-trivial challenge. Methods such as evolutionary algorithms or other adaptive approaches could be explored in the future to further improve training efficiency and performance in low-resource contexts.

Author Contributions

Writing—original draft, L.Z.; Writing—review & editing, L.Z.; Supervision, S.W. and Z.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the following sources: Yantai City 2023 School-Land Integration Development Project Fund (Grant No. 2323013-2023XDRH001); Yantai City Science and Technology-Based SME Innovation Capability Enhancement Program (Grant No. 2023TSGC112); Chinese National Natural Science Foundation (Grant No. 62201491); and Natural Science Foundation of Shandong Province (Grant No. ZR2021QF097). The APC was funded by Grants 2323013-2023XDRH001, 2023TSGC112, 62201491, and ZR2021QF097.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Dhanjal, A.S.; Singh, W. A comprehensive survey on automatic speech recognition using neural networks. Multimed. Tools Appl. 2024, 83, 23367–23412. [Google Scholar] [CrossRef]
  2. Lakomkin, E.; Wu, C.; Fathullah, Y.; Kalinli, O.; Seltzer, M.L.; Fuegen, C. End-to-end speech recognition contextualization with large language models. In Proceedings of the ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 12406–12410. [Google Scholar]
  3. Ardila, R.; Branson, M.; Davis, K.; Henretty, M.; Kohler, M.; Meyer, J.; Morais, R.; Saunders, L.; Tyers, F.M.; Weber, G. Common voice: A massively-multilingual speech corpus. arXiv 2019, arXiv:1912.06670. [Google Scholar]
  4. Chen, G.; Chai, S.; Wang, G.; Du, J.; Zhang, W.-Q.; Weng, C.; Su, D.; Povey, D.; Trmal, J.; Zhang, J. Gigaspeech: An evolving, multi-domain asr corpus with 10,000 hours of transcribed audio. arXiv 2021, arXiv:2106.06909. [Google Scholar]
  5. Kang, W.; Yang, X.; Yao, Z.; Kuang, F.; Yang, Y.; Guo, L.; Lin, L.; Povey, D. Libriheavy: A 50,000 hours asr corpus with punctuation casing and context. In Proceedings of the ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 10991–10995. [Google Scholar]
  6. San, N.; Paraskevopoulos, G.; Arora, A.; He, X.; Kaur, P.; Adams, O.; Jurafsky, D. Predicting positive transfer for improved low-resource speech recognition using acoustic pseudo-tokens. arXiv 2024, arXiv:2402.02302. [Google Scholar]
  7. Ragni, A.; Knill, K.M.; Rath, S.P.; Gales, M.J. Data augmentation for low resource languages. In Proceedings of the INTERSPEECH 2014: 15th Annual Conference of the International Speech Communication Association, Singapore, 14–18 September 2014; pp. 810–814. [Google Scholar]
  8. Tu, T.; Chen, Y.-J.; Yeh, C.-c.; Lee, H.-Y. End-to-end text-to-speech for low-resource languages by cross-lingual transfer learning. arXiv 2019, arXiv:1904.06508. [Google Scholar]
  9. Byambadorj, Z.; Nishimura, R.; Ayush, A.; Ohta, K.; Kitaoka, N. Text-to-speech system for low-resource language using cross-lingual transfer learning and data augmentation. EURASIP J. Audio Speech Music. Process. 2021, 2021, 42. [Google Scholar] [CrossRef]
  10. Shi, X.; Liu, X.; Xu, C.; Huang, Y.; Chen, F.; Zhu, S. Cross-lingual offensive speech identification with transfer learning for low-resource languages. Comput. Electr. Eng. 2022, 101, 108005. [Google Scholar] [CrossRef]
  11. Zhou, R.; Koshikawa, T.; Ito, A.; Nose, T.; Chen, C.-P. Multilingual Meta-Transfer Learning for Low-Resource Speech Recognition. IEEE Access 2024, 12, 158493–158504. [Google Scholar] [CrossRef]
  12. Mamta; Ekbal, A.; Bhattacharyya, P. Exploring multi-lingual, multi-task, and adversarial learning for low-resource sentiment analysis. Trans. Asian Low-Resour. Lang. Inf. Process. 2022, 21, 1–19. [Google Scholar] [CrossRef]
  13. Berrebbi, D.; Shi, J.; Yan, B.; López-Francisco, O.; Amith, J.D.; Watanabe, S. Combining spectral and self-supervised features for low resource speech recognition and translation. arXiv 2022, arXiv:2204.02470. [Google Scholar]
  14. Singh, S.; Hou, F.; Wang, R. A novel self-training approach for low-resource speech recognition. arXiv 2023, arXiv:2308.05269. [Google Scholar]
  15. Dunbar, E.; Hamilakis, N.; Dupoux, E. Self-supervised language learning from raw audio: Lessons from the zero resource speech challenge. IEEE J. Sel. Top. Signal Process. 2022, 16, 1211–1226. [Google Scholar] [CrossRef]
  16. DeHaven, M.; Billa, J. Improving low-resource speech recognition with pretrained speech models: Continued pretraining vs. semi-supervised training. arXiv 2022, arXiv:2207.00659. [Google Scholar]
  17. Du, Y.-Q.; Zhang, J.; Fang, X.; Wu, M.-H.; Yang, Z.-W. A semi-supervised complementary joint training approach for low-resource speech recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 2023, 31, 3908–3921. [Google Scholar] [CrossRef]
  18. Baevski, A.; Zhou, Y.; Mohamed, A.; Auli, M. wav2vec 2.0: A framework for self-supervised learning of speech representations. Adv. Neural Inf. Process. Syst. 2020, 33, 12449–12460. [Google Scholar]
  19. Hsu, W.-N.; Bolte, B.; Tsai, Y.-H.H.; Lakhotia, K.; Salakhutdinov, R.; Mohamed, A. Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 3451–3460. [Google Scholar] [CrossRef]
  20. Vu, T.; Luong, M.-T.; Le, Q.V.; Simon, G.; Iyyer, M. STraTA: Self-training with task augmentation for better few-shot learning. arXiv 2021, arXiv:2109.06270. [Google Scholar]
  21. Wang, H.; Zhang, W.-Q.; Suo, H.; Wan, Y. Multilingual Zero Resource Speech Recognition Base on Self-Supervise Pre-Trained Acoustic Models. In Proceedings of the 2022—13th International Symposium on Chinese Spoken Language Processing (ISCSLP), Singapore, 11–14 December 2022; pp. 11–15. [Google Scholar]
  22. Bartelds, M.; San, N.; McDonnell, B.; Jurafsky, D.; Wieling, M. Making more of little data: Improving low-resource automatic speech recognition using data augmentation. arXiv 2023, arXiv:2305.10951. [Google Scholar]
  23. Piñeiro-Martín, A.; García-Mateo, C.; Docío-Fernández, L.; López-Pérez, M.d.C.; Rehm, G. Weighted Cross-entropy for Low-Resource Languages in Multilingual Speech Recognition. arXiv 2024, arXiv:2409.16954. [Google Scholar]
  24. Toshniwal, S.; Kannan, A.; Chiu, C.-C.; Wu, Y.; Sainath, T.N.; Livescu, K. A comparison of techniques for language model integration in encoder-decoder speech recognition. In Proceedings of the 2018 IEEE Spoken Language Technology Workshop (SLT), Athens, Greece, 18–21 December 2018; pp. 369–375. [Google Scholar]
  25. Gulcehre, C.; Firat, O.; Xu, K.; Cho, K.; Barrault, L.; Lin, H.-C.; Bougares, F.; Schwenk, H.; Bengio, Y. On using monolingual corpora in neural machine translation. arXiv 2015, arXiv:1503.03535. [Google Scholar]
  26. Chorowski, J.; Jaitly, N. Towards better decoding and language model integration in sequence to sequence models. arXiv 2016, arXiv:1612.02695. [Google Scholar]
  27. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar]
  28. Yao, Z.; Guo, L.; Yang, X.; Kang, W.; Kuang, F.; Yang, Y.; Jin, Z.; Lin, L.; Povey, D. Zipformer: A faster and better encoder for automatic speech recognition. In Proceedings of the Twelfth International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024. [Google Scholar]
  29. Graves, A. Sequence transduction with recurrent neural networks. arXiv 2012, arXiv:1211.3711. [Google Scholar]
  30. Graves, A.; Fernández, S.; Gomez, F.; Schmidhuber, J. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA, USA, 25–29 June 2006; pp. 369–376. [Google Scholar]
  31. Hinton, G.; Vinyals, O.; Dean, J. Distilling the Knowledge in a Neural Network. arXiv 2015, arXiv:1503.02531. [Google Scholar]
Figure 1. Zipformer architecture diagram.
Figure 2. RNN-T architecture diagram.
Figure 3. LM and RNN-T fusion architecture diagram.
Figure 4. Flowchart.
Table 1. Error rates with and without LM fusion using different training corpus sizes.

Data Size | LM    | CER (%)
10 h      | None  | 51.1
10 h      | RNN-T | 17.65
100 h     | None  | 3.89
100 h     | RNN-T | 1.07
Table 2. Impact of different network depths on CER.

Data Size | Enc-Dec | CER (%)
100 h     | 6-6     | 3.89
100 h     | 4-4     | 4.02
Table 3. Impact of different fusion methods on CER.

Data Size | Fusion Method  | CER (%)
10 h      | Shallow fusion | 37.73
10 h      | Deep fusion    | 17.65
100 h     | Shallow fusion | 3.35
100 h     | Deep fusion    | 1.07
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
