Article

Enhancing Far-Field Speech Recognition with Mixer: A Novel Data Augmentation Approach

Tong Niu, Yaqi Chen, Dan Qu and Hengbo Hu

1 School of Information Systems Engineering, Information Engineering University, Zhengzhou 450001, China
2 Laboratory for Advanced Computing and Intelligence Engineering, Wuxi 214000, China
3 Research and Development Department, Zhengzhou Xinda Institute of Advanced Technology, Zhengzhou 450001, China
* Authors to whom correspondence should be addressed.
Appl. Sci. 2025, 15(7), 4073; https://doi.org/10.3390/app15074073
Submission received: 4 March 2025 / Revised: 31 March 2025 / Accepted: 3 April 2025 / Published: 7 April 2025
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract: Recent advancements in end-to-end (E2E) modeling have notably improved automatic speech recognition (ASR) systems; however, far-field speech recognition (FSR) remains challenging due to signal degradation from factors such as low signal-to-noise ratio, reverberation, and interfering sounds. Addressing these factors requires richer training data and multi-channel speech enhancement. To this end, we introduce Mixer, a novel data augmentation technique designed to further enhance the performance of large-scale pre-trained models for FSR. Mixer interpolates and mixes feature representations of speech samples and their corresponding losses, extending the MixSpeech framework to intermediate layers of Whisper. Additionally, we propose Mixer-C, which further leverages multi-channel information by combining speech from different microphone channels using a channel selector. Experimental results demonstrate that Mixer significantly outperforms existing methods, including SpecAugment, achieving a relative word error rate (WER) reduction of 3.6% compared to the baseline. Furthermore, Mixer-C offers an additional WER improvement of 2.2%, showcasing its efficacy in improving FSR accuracy.

1. Introduction

In recent years, end-to-end (E2E) modeling techniques have significantly improved the accuracy of automatic speech recognition (ASR) systems [1,2,3]. However, challenges persist in far-field speech recognition (FSR) due to signal degradation. Firstly, speech signals attenuate as they propagate over distance, leading to a low signal-to-noise ratio (SNR). Secondly, in enclosed environments such as living rooms or conference rooms, the source signal undergoes multiple reflections from walls and indoor objects, resulting in multipath propagation and a temporal ambiguity phenomenon known as reverberation. Thirdly, microphones may capture unwanted interfering sounds, which are often diverse and non-stationary and therefore difficult to filter out. All these factors negatively impact the performance of FSR.
Given these signal degradation issues, the far-field processing pipeline diverges significantly from that of conventional close-talk ASR. Firstly, the model must be trained on data representative of the typical signal degradation encountered in far-field scenarios, necessitating richer training data. Secondly, microphone arrays are predominantly used instead of single microphones to capture sound, which enables multi-channel speech enhancement, a technique that has proven highly successful in noisy, reverberant environments.
Data augmentation (DA) strategies are crucial as they can generate virtually unlimited training samples [4]. Notably, SpecAugment (SA) [5] has demonstrated significant improvements in the accuracy and noise robustness of end-to-end (E2E) ASR systems. However, to our knowledge, relatively few studies have applied data augmentation techniques to multi-channel FSR. Existing attempts primarily include speed/volume perturbation [6], white noise injection [7], and channel dropout [8], indicating further potential for innovation in this area. In response to this gap, we propose Mixer, a method that effectively enhances the performance of large-scale pre-trained models for far-field speech recognition. Mixer achieves this by interpolating and mixing feature representations of speech samples together with their corresponding losses. MixSpeech [9] mixes only the input speech, whereas Mixer is a generalization of MixSpeech: it can mix not only the input speech but also the intermediate-layer features output by the Whisper encoder, thereby offering a more robust framework for FSR.
Furthermore, since far-field speech recognition employs multiple microphone arrays to capture sound, single-channel data may be incomplete due to factors such as speaker position and movement. To exploit multi-channel speech, we further propose Mixer-C, which selects and combines speech from different channels with the same label through a channel selector. This approach strengthens the multi-channel information available for FSR, enabling a more comprehensive understanding of the speech and thereby enhancing performance. Additionally, by mixing different channels of speech, Mixer-C addresses a significant limitation of relying on the same array configuration during both training and testing, improving the model's adaptation to various array configurations.
Although large-scale pre-trained models have demonstrated exceptional performance across various tasks, the immense parameter size of these models makes fine-tuning computationally expensive and prone to overfitting. Therefore, we employ the LoRA [10] algorithm to investigate the performance of FSR on Whisper as our baseline. Our main contributions are as follows: (a) We explore efficient fine-tuning methods based on Whisper for far-field speech recognition (FSR). (b) We introduce a data augmentation method, Mixer, specifically designed for FSR, which significantly enhances model performance. (c) We present Mixer-C, an extension of Mixer, to further leverage multi-channel information for improved accuracy. (d) Experimental results demonstrate that Mixer consistently outperforms existing methods such as SpecAugment across all versions of Whisper, achieving a relative word error rate (WER) reduction of 3.6% compared to the baseline. Moreover, Mixer-C further enhances the performance of Mixer, yielding an additional WER improvement of 2.2%.

2. Related Work

2.1. Far-Field ASR

Recently, large-scale pre-trained models have demonstrated notable success in FSR [11]. However, far-field data often suffer from signal degradation issues, prompting the development of specialized data augmentation strategies. Channel dropout [12] was introduced as a method for dropping frequency bands within a single-channel signal, similar to SpecAugment. To enhance multi-channel information, Shimada et al. [13] utilized audio channel dropping to augment data for training a speaker verification system that leverages beamforming as its frontend. Moreover, ChannelAugment [8] was proposed to address the limitation of both training and testing data being derived from a fixed array geometry. The distinguishing aspect of our research is the integration of different channels from multi-channel inputs during end-to-end (E2E) ASR training. To the best of our knowledge, this approach has not been explored in previous studies.

2.2. Mixing Methods

Data augmentation is a crucial component for successful deep learning. It helps mitigate sample size issues and prevents overfitting, effectively reducing the epistemic uncertainty of the model [14]. Mixup [15] is a simple yet powerful data augmentation technique that improves the performance and robustness of many state-of-the-art models. Given its effectiveness, Mixup has been extended into various derivatives. Manifold Mixup [16] combines input features within the network's hidden layers. Additionally, CutMix [17] proposes mixing patches to create more visually meaningful augmentations. PatchUp [18] extends this idea by mixing patches in the hidden space. Progress has also been made in applying these techniques to natural language processing. In their research, Guo et al. [19] investigated the application of Mixup for sequence classification tasks. Their findings suggest that blending words and sentences can produce advantageous outcomes. Along similar lines, SSMix [20] was proposed, featuring a method where less critical words are replaced based on saliency, thereby enhancing the overall performance. MixText [21] applied Mixup for semi-supervised text classification, whereas MixSpeech [9] adapted it for the speech domain. Moreover, MixLoss [22] generalized MixSpeech to sequence generation tasks.

3. Methods

3.1. Model Structure

Whisper [2] is a transformer-based encoder–decoder model that handles various speech tasks, including speech recognition, speech translation, and language identification. Assume the input log-Mel spectrogram feature vectors are denoted as $X = \{x_1, x_2, \dots, x_T\}$ and the target transcription is represented by a sequence of $N$ tokens $y = \{y_1, y_2, \dots, y_N\}$. The input $X$ is first processed by two 1-dimensional convolution layers and is then encoded into the hidden vector $H$ by the Whisper encoder. Next, the decoder consumes $H$, a prompt $S$, and the previously generated tokens $y_{<j}$ as inputs to predict the next token $y_j$. The whole process can be described as follows:
$$H = \mathrm{Encoder}(\mathrm{Conv}(X)), \qquad y_j = \mathrm{Decoder}(y_{<j}, H, S)$$
where the prompt $S$ can be denoted as $[\,|\mathrm{sop}|,\ \text{previous text},\ |\mathrm{sot}|,\ |\mathrm{language}|,\ |\mathrm{task}|,\ |\mathrm{notimestamps}|\,]$. The parts encapsulated within $|\cdot|$ are special tokens. To specify the FSR task, we set $|\mathrm{task}|$ to "transcribe" and $|\mathrm{language}|$ to "en". The "previous text" is an optional sequence of text tokens used for decoding control. However, directly adapting such large models to low-resource scenarios can be challenging due to overfitting issues and high computational costs. Therefore, parameter-efficient fine-tuning [10,23,24] has been proposed, which freezes the pre-trained model and trains only small additive modules, such as Adapter [24], LoRA [10], and reprogramming [25,26,27].
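As a concrete illustration of this encode-then-decode process, the sketch below runs a pre-trained Whisper checkpoint through the Hugging Face transformers API; the checkpoint name, the silent test input, and the helper calls are standard library usage assumed here for illustration, not the pipeline used in this paper.

```python
import numpy as np
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# X: log-Mel spectrogram features computed from 16 kHz audio (here: 1 s of silence).
audio = np.zeros(16000, dtype=np.float32)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

# H = Encoder(Conv(X)); the two 1-D convolution layers sit inside the encoder module.
H = model.model.encoder(inputs.input_features).last_hidden_state

# Prompt S built from the special tokens (|sot|, |language|, |task|, |notimestamps|).
prompt_ids = processor.get_decoder_prompt_ids(language="en", task="transcribe")

# Autoregressive decoding y_j = Decoder(y_<j, H, S); generate() re-runs the encoder internally.
pred_ids = model.generate(inputs.input_features, forced_decoder_ids=prompt_ids)
print(processor.batch_decode(pred_ids, skip_special_tokens=True))
```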
LoRA [10], introduced in NLP, is designed to adapt large language models (LLMs) to specific domains or tasks more efficiently. Researchers have noted that pre-trained LLM weights occupy a low-dimensional intrinsic space. Leveraging this insight, LoRA keeps the original weights unchanged while updating only low-rank incremental weight matrices. For instance, consider the $i$-th linear projection in the forward pass, $f_i(X) = W_i^T X + b_i$, where $W_i \in \mathbb{R}^{d_1 \times d_2}$ and $b_i \in \mathbb{R}^{d_2}$ are the frozen weight and bias. LoRA modifies the forward computation to include an update through only the low-rank matrix $\Delta W_i$, which can be expressed as follows:
$$f_i(X) = (W_i + \Delta W_i)^T X + b_i$$
In this equation, $\Delta W_i$ is calculated as the product of two rank-decomposed matrices, $B_i$ and $A_i$, such that $\Delta W_i = B_i A_i$. The matrices $B_i \in \mathbb{R}^{d_1 \times r}$ and $A_i \in \mathbb{R}^{r \times d_2}$ are the two trainable matrices, with rank $r \ll \min\{d_1, d_2\}$. When deployed in production, we can explicitly compute and store the combined weight matrix $W = W_i + B_i A_i$ and proceed with inference as normal. Consequently, LoRA introduces no additional latency during inference compared to a fully fine-tuned model. We therefore employ the LoRA algorithm on Whisper as our baseline, with a detailed illustration shown in Figure 1.
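A minimal PyTorch sketch of such a LoRA-augmented linear layer is shown below; the dimensions, scaling factor, and initialization are illustrative assumptions rather than the exact Whisper projection sizes.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer with a trainable low-rank update ΔW = B·A (as in the equation above)."""
    def __init__(self, d1: int, d2: int, r: int = 4, alpha: float = 1.0):
        super().__init__()
        self.base = nn.Linear(d1, d2)                     # frozen W_i and b_i
        self.base.weight.requires_grad_(False)
        self.base.bias.requires_grad_(False)
        self.B = nn.Parameter(torch.zeros(d1, r))         # B_i in R^{d1 x r}, zero-init so ΔW = 0 at start
        self.A = nn.Parameter(0.01 * torch.randn(r, d2))  # A_i in R^{r x d2}
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        delta_w = self.B @ self.A                         # ΔW_i = B_i A_i, shape (d1, d2)
        return self.base(x) + self.scale * (x @ delta_w)  # f_i(x) = (W_i + ΔW_i)^T x + b_i

layer = LoRALinear(d1=768, d2=768, r=4)
out = layer(torch.randn(2, 100, 768))                     # only A and B receive gradients
```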

3.2. Mixing Methods

Mixup [15] generates new augmented data through a process of linear interpolation, as described in the following equations:
$$\hat{x} = \lambda x_i + (1-\lambda) x_j, \qquad \hat{y} = \lambda y_i + (1-\lambda) y_j$$
where $(x_i, y_i)$ and $(x_j, y_j)$ represent two randomly sampled examples from the training data. The mixed input feature $\hat{x}$ and the corresponding mixed label $\hat{y}$ are generated by blending the $i$-th and $j$-th examples. The mixing weight $\lambda \sim \mathrm{Beta}(\alpha, \alpha)$ with $\alpha > 0$. When $\alpha = 1.0$, $\lambda$ is uniformly sampled from the interval $[0, 1]$, which also means $\lambda \sim U(0, 1)$. By leveraging the prior knowledge that linear interpolations of inputs yield linear interpolations of the corresponding labels, Mixup effectively expands the training distribution. Moreover, some studies [28,29] have theoretically demonstrated that Mixup constitutes a form of data-adaptive regularization, which proves to be highly effective in improving the model's generalization capabilities.
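For reference, a minimal sketch of this interpolation for a generic classification batch could look as follows, assuming one-hot labels and equal-shaped inputs:

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=1.0):
    lam = np.random.beta(alpha, alpha)        # λ ~ Beta(α, α)
    x_mix = lam * x1 + (1.0 - lam) * x2       # linear interpolation of the inputs
    y_mix = lam * y1 + (1.0 - lam) * y2       # and of the (one-hot) labels
    return x_mix, y_mix, lam
```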
Manifold Mixup [16] is an extension of Mixup [15] that allows linear interpolation of representations from any layer within a neural network. Consider a neural network $f(x) = f_k(g_k(x))$, where $g_k$ maps the input data to the hidden representation at the $k$-th layer and $f_k$ subsequently maps that hidden representation to the final output $f(x)$. For a randomly selected layer $k$, we process two distinct mini-batches of random data $(x_1, y_1)$ and $(x_2, y_2)$ as usual until we reach the $k$-th layer. This provides two intermediate mini-batches $(g_k(x_1), y_1)$ and $(g_k(x_2), y_2)$. Subsequently, we construct a mixed mini-batch as follows:
$$(\tilde{g}_k, \tilde{y}) := \big(\mathrm{Mix}_\lambda(g_k(x_1), g_k(x_2)),\ \mathrm{Mix}_\lambda(y_1, y_2)\big)$$
where $\mathrm{Mix}_\lambda(a, b) = \lambda a + (1-\lambda) b$ and the mixing coefficient $\lambda \sim \mathrm{Beta}(\alpha, \alpha)$ with $\alpha > 0$. This interpolation yields a novel training example based on the hidden representation of the model, effectively serving as a powerful data augmentation technique. Notably, Mixup becomes a special case of Manifold Mixup [16] when $k$ is set to 0.
Mixup is inherently suited for classification tasks but encounters challenges when applied to speech recognition due to several fundamental reasons: firstly, text sequences $(Y_i, Y_j)$ often exhibit varying lengths, making direct mixing impractical; secondly, the discrete nature of text sequences prohibits straightforward addition; thirdly, mixing two text sequences in the embedding space may lead the model to predict a mixture embedding, which could confuse the ASR model and potentially degrade performance, as the ultimate goal is to recognize a single text transcript from the speech input. Therefore, Meng et al. [9] introduced MixSpeech for speech recognition, which blends two input sequences and applies two separate loss functions in the following manner:
$$\hat{X} = \lambda X_i + (1-\lambda) X_j$$
$$L_i = L(\hat{X}, Y_i), \qquad L_j = L(\hat{X}, Y_j)$$
$$\hat{L} = \lambda L_i + (1-\lambda) L_j$$
where $X_i$ and $Y_i$ represent the input speech and corresponding label for the $i$-th sample, respectively. $\hat{X}$ denotes the mixed speech obtained by a frame-wise combination of two speech sequences with mixing factor $\lambda$. $L$ denotes the ASR loss, and $\hat{L}$ denotes the mixed loss. Following Mixup, $\lambda \sim \mathrm{Beta}(\alpha, \alpha)$ with $\alpha > 0$.
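A minimal sketch of this mixed-loss computation is given below; `model` and `asr_loss` are assumed placeholders for an E2E ASR model and its sequence loss, and the two inputs are trimmed to a common length so that frame-wise mixing is well defined.

```python
import numpy as np

def mixspeech_loss(model, asr_loss, X_i, Y_i, X_j, Y_j, alpha=0.5):
    lam = np.random.beta(alpha, alpha)                   # λ ~ Beta(α, α)
    T = min(len(X_i), len(X_j))                          # frame-wise mixing over a shared length
    X_hat = lam * X_i[:T] + (1.0 - lam) * X_j[:T]        # X̂ = λ X_i + (1-λ) X_j
    out = model(X_hat)
    L_i, L_j = asr_loss(out, Y_i), asr_loss(out, Y_j)    # one loss per (unmixed) transcript
    return lam * L_i + (1.0 - lam) * L_j                 # L̂ = λ L_i + (1-λ) L_j
```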

3.3. Mixer of FSR

In far-field speech recognition (FSR) scenarios, speech signals attenuate as they propagate over distance, resulting in a decreased signal-to-noise ratio (SNR). Additionally, these signals undergo reflections from surfaces and diffraction around obstacles, leading to reverberation. Moreover, microphones frequently capture unwanted environmental noise. These effects distort the speech signal in different manners, and thereby require much more training data to achieve high accuracy and generalization.
We propose a data augmentation method for far-field ASR, called Mixer, which can mix representations at any layer of an E2E ASR model. Consider input log-Mel spectrogram feature vectors $X = \{x_1, x_2, \dots, x_T\}$ and a target transcription represented by a sequence of $N$ tokens $Y = \{y_1, y_2, \dots, y_N\}$. For an arbitrary $K$-layer model such as Whisper, we denote by $g_k(\cdot)$ the function that processes data from the input to the $k$-th layer and by $f_k(\cdot)$ the function that maps data from the $k$-th layer to the output. As shown in Figure 2, Mixer is implemented in two stages. First, Mixer mixes any two feature representations $g_k(X_i)$ and $g_k(X_j)$ extracted from audio samples:
$$g_k(\hat{X}) = \lambda g_k(X_i) + (1-\lambda) g_k(X_j)$$
where $g_k(X_i)$ denotes the output of the $k$-th layer and $k$ is the layer of the model at which mixing is applied. Second, Mixer trains the model to optimize the mixed loss function, aligning the mixing weight with the corresponding loss. The process can be denoted as follows:
$$L_{\mathrm{mix}} = \lambda L\big(f_k(g_k(\hat{X})), Y_i\big) + (1-\lambda) L\big(f_k(g_k(\hat{X})), Y_j\big)$$
where $\lambda \sim \epsilon \cdot \mathrm{Beta}(\alpha, \alpha)$ with $\alpha > 0$, and $\epsilon$ is the maximum value of the mixing weight. By aligning the interpolation weight used to mix the features with that used to mix the losses, the method establishes a linear relationship between the input and output spaces of the model. When $k = 0$, the hidden representations are simply the log-Mel spectrograms of the input, and Mixer reduces to the MixSpeech [9] method that it naturally generalizes.
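The model-agnostic sketch below summarizes one Mixer training step under these definitions; the split into `g_k`/`f_k` and the `seq_loss` callable are assumptions standing in for the first $k$ Whisper encoder layers, the remaining network, and its token-level loss (Whisper's fixed 30 s input window keeps the feature shapes compatible for mixing).

```python
import numpy as np

def mixer_step(g_k, f_k, seq_loss, X_i, Y_i, X_j, Y_j, alpha=2.0, eps=1.0):
    lam = eps * np.random.beta(alpha, alpha)     # λ ~ ε·Beta(α, α), capped at ε
    h_i, h_j = g_k(X_i), g_k(X_j)                # representations at the chosen layer k
    h_mix = lam * h_i + (1.0 - lam) * h_j        # g_k(X̂)
    out = f_k(h_mix)                             # remaining encoder layers + decoder
    # L_mix = λ·L(f_k(g_k(X̂)), Y_i) + (1-λ)·L(f_k(g_k(X̂)), Y_j)
    return lam * seq_loss(out, Y_i) + (1.0 - lam) * seq_loss(out, Y_j)
```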

3.4. Mixer-C of FSR

We further propose a simpler mixing method tailored for far-field ASR, called Mixer-C. Mixer-C interpolates speech samples from different channels that share the same label, thereby augmenting the channel information. In far-field speech recognition, multi-channel data are collected by multiple microphones. This multi-channel information enables models to capture not only the spectral characteristics of the target and interference signals but also their spatial attributes as recorded by the different microphones, which is highly advantageous for improving far-field ASR. The multi-channel speech dataset can be denoted as $D = \{D_1, D_2, \dots, D_C\}$, where $C$ is the number of channels. Each speech sample $X_i^c$ from the $c$-th channel is associated with the same label $Y_i$ as its counterparts $X_i^{c'}$ in all other channels, where $c' \in \{1, \dots, c-1, c+1, \dots, C\}$. As shown in Figure 2, Mixer-C includes two steps: first, the channel selector selects a pair of speech samples $X_i^m$ and $X_i^n$ from different channels with the same label; second, the selected speech data are mixed at the $k$-th layer of the neural network, which can be formulated as
$$g_k(\hat{X}) = \lambda g_k(X_i^m) + (1-\lambda) g_k(X_i^n), \qquad m \neq n$$
where $X_i^m \in D_m$ and $X_i^n \in D_n$ correspond to speech samples from different channels that share the same label $Y_i$. Because the two samples share the same label, Mixer-C does not require mixing the loss. Mixer-C can enhance the model's understanding of information from different channels. Compared to Mixer, this approach does not require any additional computation of the loss function, thus improving computational efficiency.
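A minimal sketch of the Mixer-C step, assuming the multi-channel features are stored per utterance and per channel with a single shared label, could look as follows:

```python
import random
import numpy as np

def mixer_c_step(g_k, f_k, seq_loss, features, labels, utt_id, alpha=2.0, eps=1.0):
    m, n = random.sample(sorted(features[utt_id]), 2)    # channel selector: two distinct channels, m != n
    lam = eps * np.random.beta(alpha, alpha)
    h_mix = lam * g_k(features[utt_id][m]) + (1.0 - lam) * g_k(features[utt_id][n])
    return seq_loss(f_k(h_mix), labels[utt_id])          # shared label Y_i, so no loss mixing
```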

4. Experimental Setup

4.1. Datasets

The DiPCo [30] dataset is an open speech dataset designed to simulate a family dinner party scenario, addressing the challenges of speech separation and recognition in multi-speaker environments. The dataset was recorded by Amazon employee volunteers engaged in natural conversations within everyday household settings. Specifically, it includes distant speech captured by 5 far-field devices, each equipped with a 7-mic circular array, totaling 35 microphones. As one of the benchmark datasets in the CHiME challenge series, DiPCo is used to evaluate far-field speech recognition performance across multiple scenarios [11,31].
The dataset was initially preprocessed using the Weighted Prediction Error (WPE) [32] method to enhance the signals for each utterance. After WPE processing, the multi-channel audio from each array was transformed into single-channel audio using the array-based BeamformIt [33] algorithm. Consequently, the multi-channel audio from multiple array devices was consolidated into a single multi-channel audio stream, with the number of channels corresponding to the number of far-field devices.
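A hedged sketch of the dereverberation step with the nara_wpe package [32] is shown below; the STFT settings, tensor layout, and WPE hyperparameters are illustrative assumptions rather than the exact configuration used here, and the subsequent BeamformIt stage (a separate command-line tool) is omitted.

```python
import numpy as np
from scipy.signal import stft, istft
from nara_wpe.wpe import wpe

def dereverberate(multichannel_audio: np.ndarray, fs: int = 16000) -> np.ndarray:
    """multichannel_audio: array of shape (channels, samples) from one array device."""
    _, _, Y = stft(multichannel_audio, fs=fs, nperseg=512)           # Y: (channels, F, T)
    Z = wpe(Y.transpose(1, 0, 2), taps=10, delay=3, iterations=3)    # WPE expects (F, channels, T)
    _, x = istft(Z.transpose(1, 0, 2), fs=fs, nperseg=512)           # back to (channels, samples)
    return x
```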
We utilized the development set for training and evaluation, and the test set for testing purposes. Subsequently, we filtered the audio and label lengths, excluding any audio segments shorter than 3 s or longer than 30 s. For the text content, we filtered out transcripts with fewer than 3 words or more than 100 words.
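A minimal sketch of this filtering rule, assuming each segment exposes its duration in seconds and its reference transcript, is:

```python
def keep_segment(duration_s: float, transcript: str) -> bool:
    n_words = len(transcript.split())
    return 3.0 <= duration_s <= 30.0 and 3 <= n_words <= 100
```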

4.2. Experimental Details

We utilized the ESPnet [34] toolkit for data processing, resulting in enhanced channel audio. Our experiments were based on the Whisper model, including several different versions: Whisper-Small (244 M), Whisper-Medium (768 M), Whisper-Large-v2 (1550 M), and Whisper-Large-v3 (1550 M), all with English as the prompt language. Given the immense size of these models, comprehensively fine-tuning each one poses significant computational challenges. Hence, we adopted an efficient fine-tuning strategy, using the LoRA method to fine-tune a small set of parameters. For LoRA, we set r = 4, which resulted in 2.21 M training parameters for the large versions of the model. We fine-tuned each model for 5 epochs with a batch size of 8, using the Adam optimizer with a learning rate of $1 \times 10^{-3}$ and word error rate (WER) as the evaluation metric. The default $\alpha$ for the Beta distribution was set to 2, and the default maximum mixing value $\epsilon$ was set to 1. The proportion of each batch trained with Mixer is denoted as $\tau$ and was set to 15%. The default mixing layer $k$ was set to 0.
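A hedged sketch of this LoRA setup with the peft library is given below; the target module names and the LoRA scaling/dropout values are assumptions consistent with, but not necessarily identical to, the configuration described in the text.

```python
import torch
from transformers import WhisperForConditionalGeneration
from peft import LoraConfig, get_peft_model

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3")
lora_cfg = LoraConfig(r=4, lora_alpha=16, lora_dropout=0.05,
                      target_modules=["q_proj", "v_proj"])   # attention projections (assumed choice)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()   # the paper reports ~2.21 M trainable params for the large models

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3)
# The training loop (5 epochs, batch size 8) and WER scoring are omitted for brevity.
```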

5. Results

5.1. Main Results

We compare Mixer with four different configurations: (1) direct testing of the pre-trained model; (2) models fully fine-tuned on the DiPCo dataset; (3) models trained using LoRA without Mixer (Baseline); and (4) models trained with SpecAugment, an effective data augmentation technique. Given our limited computational resources, we did not fine-tune the entire Whisper-Large-v2 or Whisper-Large-v3 models. The experimental results are shown in Table 1. Firstly, the direct testing performance of Whisper is already commendable. Surprisingly, both Whisper-Medium and Whisper-Small achieved results comparable to Whisper-Large-v2, which is likely attributable to the presence of the DiPCo dataset in Whisper's training corpus. Moreover, despite its vast capacity, Whisper-Large-v2's direct performance on the DiPCo dataset actually declined, whereas Whisper-Large-v3 achieved the best results. This underscores that a combination of large model capacity and improved training still yields exceptional outcomes. Secondly, fine-tuning the full model actually reduced the performance of Whisper-Small and Whisper-Medium on the test set. This was primarily due to the substantial distribution discrepancy between the training and test data, which caused the models to forget the DiPCo data seen during pre-training while overfitting to the current training data. The limitations of fine-tuning the entire model thus become apparent.
Therefore, we employ the LoRA method for training, which for the large models requires fine-tuning only 2.21 M parameters. We first observed that, despite the Whisper-Small model's underwhelming performance, LoRA mitigates the overfitting issue to some degree; in contrast to full fine-tuning, none of the LoRA-trained models exhibited overt overfitting. This suggests that the remaining weakness of Whisper-Small stems from the limited capacity of its pre-trained initialization. For Whisper-Medium, the efficient fine-tuning strategy effectively curbs the overfitting that occurs with full-model fine-tuning, enabling effective learning. Furthermore, the Whisper-Large-v3 model surpasses the Whisper-Medium model, showing that it can acquire more knowledge and perform better, which demonstrates the superiority of Whisper-Large-v3.
Moreover, we compared the performance of the SpecAugment and Mixer data augmentation methods. Since the speech domain differs from images, there are few mixing-based data augmentation methods, such as CutMix [17], beyond MixSpeech. We therefore compared our method with SpecAugment, a classical data augmentation method in speech. We found that SpecAugment only achieved performance improvements on the Whisper-Medium and Whisper-Large-v2 versions and had no effect on Whisper-Small and Whisper-Large-v3. This indicates that SpecAugment struggles to be effective on far-field speech recognition data with high noise and complex channel conditions. In contrast, Mixer achieves performance improvements on all models and yields greater gains than SpecAugment. We attribute this mainly to the fact that SpecAugment augments the data by injecting noise through time-domain or frequency-domain masking, while the channel conditions of far-field speech are already very poor, so injecting additional noise interferes with the model's learning of the data. Mixer, by contrast, fits the data distribution in a high-dimensional space through input and output interpolation, thereby enhancing the model's learning.
For the Mixer-C method, it can be observed that it is effective on most versions of the models. It performs well on the Whisper-Large-v3 model, surpassing the performance of Mixer, but performs worse than Mixer on the other versions. We believe this is closely related to the fundamental capability of the baseline model: the enhancement of multi-channel information is more effective for models with strong baseline capability and larger capacity; otherwise, it may confuse the model's learning. Even so, Mixer-C remains superior to SpecAugment overall and does not overly confuse the model's learning.

5.2. Method Analysis

The Effect of Different Mixing Proportions. We analyzed the influence of varying the mixing proportion $\tau$ within Mixer, examining the shifts in model performance across mixing proportions of 15%, 30%, 50%, 80%, and 100%. The results are shown in Figure 3. Our findings indicate that, for the Whisper-Large-v2 model, a higher proportion of data needs to be mixed to achieve better performance. Although it has considerable capacity, the Whisper-Large-v2 model is not well versed in this specific data type, as evidenced by its poor direct testing performance; therefore, incorporating additional information about the data distribution can further enhance its performance. Conversely, for Whisper-Medium, its limited capacity coupled with an already strong baseline performance means that excessive augmentation of the data can inadvertently hinder the model's ability to learn from genuine data to some extent. In contrast, the Whisper-Large-v3 model exhibits minimal sensitivity to variations in the mixing ratio, attributable to its combination of high capacity and extensive knowledge of the data, which allows it to consistently deliver superior performance. Overall, the models are not very sensitive to the mixing proportion $\tau$, demonstrating the robustness of Mixer. Since the analysis in Table 1 shows that Mixer-C is more suitable for the Whisper-Large-v3 model, we only analyzed the impact of different mixing proportions on Whisper-Large-v3. For Mixer-C, mixing more data leads to better performance. This is understandable because, for models with strong baseline performance, enhancing channel information allows the model to gain a more comprehensive understanding of the speech signal, thereby improving performance.
The Effect of Different Sampling Distributions. Since the $\lambda$ value is sampled from the Beta distribution, we analyzed the coefficient $\alpha$ of the Beta distribution to explore the impact of different sampling distributions on Mixer and Mixer-C. Table 2 shows the performance of Mixer and Mixer-C on the Whisper-Large-v3 model under various $\alpha$ values. For Mixer, an $\alpha$ value of 2 achieves the best performance; neither excessively large nor excessively small values of $\alpha$ yield satisfactory results. For Mixer-C, however, both $\alpha = 2$ and $\alpha = 0.2$ yield good results. Overall, the choice of $\alpha$ has a relatively small impact on Mixer-C, indicating that Mixer-C is less sensitive to the hyperparameter $\alpha$.
The Effect of Maximum Mixing Values. Since we introduced ϵ to adjust the maximum value of λ sampled from the Beta distribution, we conducted an analysis on how the performance of the Mixer/Mixer-C varies under different maximum mixing values, as shown in Figure 4. Each data point in Figure 4 is accompanied by error bars, which are used to illustrate the uncertainty of the experimental results. The error bars come from experiments conducted on the same maximum mixing value at different mixing proportions (15%, 50%, 80%), reflecting the impact of different mixing ratios on the result.
For Mixer, when the maximum mixing value is 0.1, the model's performance is relatively good. As the maximum mixing value increases, Mixer's performance begins to decline, reaching its lowest point at 0.5. Beyond 0.5, the performance starts to recover, and the WER reaches its minimum when the maximum mixing value is 1. We analyzed the reasons behind this phenomenon. Since $\hat{X} = \lambda X_1 + (1-\lambda) X_2$, a decrease in $\lambda$ leads to an increase in $1-\lambda$, and $\hat{X}$ transitions from resembling $X_1$ to resembling $X_2$. Consequently, when the maximum mixing value approaches 0 or 1, the change in $\hat{X}$ relative to the input is actually small. However, when the maximum mixing value is close to 0.5, the change in $\hat{X}$ is significant, which may introduce excessive fluctuations and hurt performance. Finally, we observe similar trends for Mixer-C, but its performance is overall better than that of Mixer and remains superior up to a maximum mixing weight of 1. This indicates that enhancing channel information is very important for Whisper-Large-v3.
The Effect of Mixing Encoder Layer k. For large-scale pre-trained models, the entire encoder is devoted to feature extraction. Therefore, we experimented with mixing the features from different encoder layers $k$ for Mixer and Mixer-C; the results are shown in Figure 5 and Figure 6. When $k$ is 0, the input spectrum itself is mixed. It can be observed that the results of mixing intermediate layers are not as good as those of mixing at the input, possibly because the feature representations in intermediate layers focus on different aspects, which confuses the model's learning. However, mixing deeper-layer features can also achieve good results, because the later layers encode more acoustic information than the intermediate layers. Moreover, the deeper features are also closer to the model's decoding output, which strengthens the linear relationship enforced between the mixed representations and the mixed losses. Compared to Mixer, Mixer-C shows significant fluctuation in performance across different layers, indicating that the mixing of channel information is more influenced by the form of the feature representation: if a layer's features are not well suited to representing channel information, the performance of Mixer-C may be greatly affected.
Moreover, we also extend our methods to Manifold Mixer/Mixer-C, analogous to Manifold Mixup [16], denoted by the dashed lines. It is clear that although Manifold Mixer/Mixer-C is effective, its benefit is relatively limited and not as large as that of mixing directly at the input layer. Compared to Mixer, Mixer-C generalizes better to manifold mixing.

6. Conclusions

In this paper, we first propose a data augmentation algorithm called Mixer that mixes the feature representations and the corresponding losses of Whisper. Mixer compensates for the deficiency of prior information caused by signal degradation in far-field speech recognition. Secondly, we further introduce Mixer-C, which mixes feature representations from different channels with the same labels, enhancing the multi-channel information of far-field data. Extensive experiments demonstrate that both methods can effectively improve the performance of far-field speech recognition, and we analyze the characteristics and applicable scenarios of the two methods. The proposed methods are not only applicable to the Whisper model but can also be extended to other large-scale pre-trained models, providing researchers with a general framework for optimizing and improving existing speech recognition systems so that they perform better in various complex environments. The proposed methods have broad application prospects in fields such as smart homes and in-vehicle voice systems. In the future, we will study more effective multi-channel information enhancement techniques in front-end processing.

Author Contributions

T.N. conceived the experiment(s); T.N. and Y.C. conducted the experiment(s); T.N. and H.H. wrote the paper; D.Q. analyzed the results. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (No. 62171470), Natural Science Foundation of Henan Province (No. 232300421240), and Henan Zhongyuan Science and Technology Innovation Leading Talent Project (No. 234200510019).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets used and analyzed in the current study are available from https://s3.amazonaws.com/dipco/DiPCo.tgz.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Baevski, A.; Zhou, Y.; Mohamed, A.; Auli, M. wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. In Proceedings of the Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, Virtual, 6–12 December 2020. [Google Scholar]
  2. Radford, A.; Kim, J.W.; Xu, T.; Brockman, G.; McLeavey, C.; Sutskever, I. Robust Speech Recognition via Large-Scale Weak Supervision. In Proceedings of the International Conference on Machine Learning, ICML 2023, Honolulu, HI, USA, 23–29 July 2023; Proceedings of Machine Learning Research. Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., Scarlett, J., Eds.; PMLR: New York, NY, USA, 2023; Volume 202, pp. 28492–28518. [Google Scholar]
  3. Pratap, V.; Tjandra, A.; Shi, B.; Tomasello, P.; Babu, A.; Kundu, S.; Elkahky, A.M.; Ni, Z.; Vyas, A.; Fazel-Zarandi, M.; et al. Scaling Speech Technology to 1000+ Languages. arXiv 2023, arXiv:2305.13516. [Google Scholar]
  4. Saon, G.; Tüske, Z.; Audhkhasi, K.; Kingsbury, B. Sequence Noise Injected Training for End-to-end Speech Recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2019, Brighton, UK, 12–17 May 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 6261–6265. [Google Scholar] [CrossRef]
  5. Park, D.S.; Chan, W.; Zhang, Y.; Chiu, C.; Zoph, B.; Cubuk, E.D.; Le, Q.V. SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition. In Proceedings of the 20th Annual Conference of the International Speech Communication Association, Interspeech 2019, Graz, Austria, 15–19 September 2019; Kubin, G., Kacic, Z., Eds.; ISCA: Grenoble, France, 2019; pp. 2613–2617. [Google Scholar] [CrossRef]
  6. Schrank, T.; Pfeifenberger, L.; Zöhrer, M.; Stahl, J.; Mowlaee, P.; Pernkopf, F. Deep beamforming and data augmentation for robust speech recognition: Results of the 4th CHiME challenge. Proc. CHiME 2016, 18–20. [Google Scholar] [CrossRef]
  7. Yalta, N.; Watanabe, S.; Hori, T.; Nakadai, K.; Ogata, T. CNN-based multichannel end-to-end speech recognition for everyday home environments. In Proceedings of the 2019 27th European Signal Processing Conference (EUSIPCO), A Coruna, Spain, 2–6 September 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1–5. [Google Scholar]
  8. Gaudesi, M.; Weninger, F.; Sharma, D.; Zhan, P. ChannelAugment: Improving Generalization of Multi-Channel ASR by Training with Input Channel Randomization. In Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2021, Cartagena, Colombia, 13–17 December 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 824–829. [Google Scholar] [CrossRef]
  9. Meng, L.; Xu, J.; Tan, X.; Wang, J.; Qin, T.; Xu, B. MixSpeech: Data augmentation for low-resource automatic speech recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2021, Toronto, ON, Canada, 6–11 June 2021. [Google Scholar]
  10. Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-Rank Adaptation of Large Language Models. In Proceedings of the The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, 25–29 April 2022. [Google Scholar]
  11. Karafiát, M.; Veselỳ, K.; Szöke, I.; Mošner, L.; Beneš, K.; Witkowski, M.; Barchi, G.; Pepino, L. BUT CHiME-7 system description. arXiv 2023, arXiv:2310.11921. [Google Scholar]
  12. Kovács, G.; Tóth, L.; Compernolle, D.V.; Liwicki, M. Examining the Combination of Multi-Band Processing and Channel Dropout for Robust Speech Recognition. In Proceedings of the 20th Annual Conference of the International Speech Communication Association, Interspeech 2019, Graz, Austria, 15–19 September 2019; Kubin, G., Kacic, Z., Eds.; ISCA: Grenoble, France, 2019; pp. 421–425. [Google Scholar] [CrossRef]
  13. Shimada, K.; Takahashi, N.; Takahashi, S.; Mitsufuji, Y. Sound event localization and detection using activity-coupled cartesian DOA vector and RD3Net. arXiv 2020, arXiv:2006.12014. [Google Scholar]
  14. Kendall, A.; Gal, Y. What uncertainties do we need in bayesian deep learning for computer vision? Adv. Neural Inf. Process. Syst. 2017, 5580–5590. [Google Scholar]
  15. Zhang, H.; Cissé, M.; Dauphin, Y.N.; Lopez-Paz, D. mixup: Beyond Empirical Risk Minimization, 2018. In Proceedings of the 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
  16. Verma, V.; Lamb, A.; Beckham, C.; Najafi, A.; Mitliagkas, I.; Lopez-Paz, D.; Bengio, Y. Manifold mixup: Better representations by interpolating hidden states. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; PMLR: New York, NY, USA, 2019; pp. 6438–6447. [Google Scholar]
  17. Yun, S.; Han, D.; Oh, S.J.; Chun, S.; Choe, J.; Yoo, Y. Cutmix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
  18. Faramarzi, M.; Amini, M.; Badrinaaraayanan, A.; Verma, V.; Chandar, S. PatchUp: A Feature-Space Block-Level Regularization Technique for Convolutional Neural Networks. In Proceedings of the Thirty-Sixth AAAI Conference on Artificial Intelligence, AAAI 2022, Thirty-Fourth Conference on Innovative Applications of Artificial Intelligence, IAAI 2022, The Twelveth Symposium on Educational Advances in Artificial Intelligence, EAAI 2022, Virtual Event, 22 February–1 March 2022; AAAI Press: Washington, DC, USA, 2022; pp. 589–597. [Google Scholar] [CrossRef]
  19. Guo, H.; Mao, Y.; Zhang, R. Augmenting Data with Mixup for Sentence Classification: An Empirical Study. arXiv 2019, arXiv:1905.08941. [Google Scholar]
  20. Yoon, S.; Kim, G.; Park, K. SSMix: Saliency-Based Span Mixup for Text Classification. In Proceedings of the Association for Computational Linguistics: ACL/IJCNLP 2021, Online Event, 1–6 August 2021. [Google Scholar]
  21. Chen, J.; Yang, Z.; Yang, D. MixText: Linguistically-Informed Interpolation of Hidden Space for Semi-Supervised Text Classification. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, 5–10 July 2020. [Google Scholar]
  22. Chen, Y.; Jiao, X.; Zhang, H.; Yang, X.; Qu, D. MixLoss: A Data Augmentation Method for Sequence Generation Tasks. In Proceedings of the 2023 7th Asian Conference on Artificial Intelligence Technology (ACAIT), Jiaxing, China, 10–12 November 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 811–816. [Google Scholar]
  23. Houlsby, N.; Giurgiu, A.; Jastrzebski, S.; Morrone, B.; de Laroussilhe, Q.; Gesmundo, A.; Attariyan, M.; Gelly, S. Parameter-Efficient Transfer Learning for NLP. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, Long Beach, CA, USA, 9–15 June 2019; Proceedings of Machine Learning Research. Chaudhuri, K., Salakhutdinov, R., Eds.; PMLR: New York, NY, USA, 2019; Volume 97, pp. 2790–2799. [Google Scholar]
  24. Thomas, B.; Kessler, S.; Karout, S. Efficient Adapter Transfer of Self-Supervised Speech Models for Automatic Speech Recognition. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; pp. 7102–7106. [Google Scholar]
  25. Yang, C.H.; Li, B.; Zhang, Y.; Chen, N.; Prabhavalkar, R.; Sainath, T.N.; Strohman, T. From English to More Languages: Parameter-Efficient Model Reprogramming for Cross-Lingual Speech Recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing ICASSP 2023, Rhodes Island, Greece, 4–10 June 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1–5. [Google Scholar] [CrossRef]
  26. Gao, H.; Ni, J.; Qian, K.; Zhang, Y.; Chang, S.; Hasegawa-Johnson, M. WavPrompt: Towards Few-Shot Spoken Language Understanding with Frozen Language Models. In Proceedings of the 23rd Annual Conference of the International Speech Communication Association, Interspeech 2022, Incheon, Republic of Korea, 18–22 September 2022; Ko, H., Hansen, J.H.L., Eds.; ISCA: Grenoble, France, 2022; pp. 2738–2742. [Google Scholar] [CrossRef]
  27. Yang, C.H.; Tsai, Y.; Chen, P. Voice2Series: Reprogramming Acoustic Models for Time Series Classification. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, Virtual Event, 18–24 July 2021; Proceedings of Machine Learning Research. Meila, M., Zhang, T., Eds.; PMLR: New York, NY, USA, 2021; Volume 139, pp. 11808–11819. [Google Scholar]
  28. Carratino, L.; Cissé, M.; Jenatton, R.; Vert, J. On Mixup Regularization. arXiv 2020, arXiv:2006.06049. [Google Scholar]
  29. Zhang, L.; Deng, Z.; Kawaguchi, K.; Ghorbani, A.; Zou, J. How Does Mixup Help with Robustness and Generalization? In Proceedings of the 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, 3–7 May 2021.
  30. Van Segbroeck, M.; Zaid, A.; Kutsenko, K.; Huerta, C.; Nguyen, T.; Luo, X.; Hoffmeister, B.; Trmal, J.; Omologo, M.; Maas, R. DiPCo–Dinner Party Corpus. arXiv 2019, arXiv:1909.13447. [Google Scholar]
  31. Mu, B.; Guo, P.; Guo, D.; Zhou, P.; Chen, W.; Xie, L. Automatic channel selection and spatial feature integration for multi-channel speech recognition across various array topologies. arXiv 2023, arXiv:2312.09746. [Google Scholar]
  32. Drude, L.; Heymann, J.; Boeddeker, C.; Haeb-Umbach, R. NARA-WPE: A Python package for weighted prediction error dereverberation in Numpy and Tensorflow for online and offline processing. In Proceedings of the Speech Communication; 13th ITG-Symposium, Oldenburg, Germany, 10–12 October 2018; VDE: Offenbach, Germany, 2018; pp. 1–5. [Google Scholar]
  33. Anguera, X.; Wooters, C.; Hernando, J. Acoustic beamforming for speaker diarization of meetings. IEEE Trans. Audio Speech Lang. Process. 2007, 15, 2011–2022. [Google Scholar] [CrossRef]
  34. Watanabe, S.; Hori, T.; Karita, S.; Hayashi, T.; Nishitoba, J.; Unno, Y.; Enrique Yalta Soplin, N.; Heymann, J.; Wiesner, M.; Chen, N.; et al. ESPnet: End-to-End Speech Processing Toolkit. In Proceedings of the 19th Annual Conference of the International Speech Communication Association, Interspeech 2018, Hyderabad, India, 2–6 September 2018; ISCA: Grenoble, France, 2018; pp. 2207–2211. [Google Scholar]
Figure 1. An illustration of Mixer/Mixer-C and its integration with the Whisper model. We freeze the Whisper model and employ the LoRA algorithm to adapt the model to the far-field speech recognition task.
Figure 2. The pipeline of Mixer/Mixer-C. In Mixer, there are two stages for mixing: in the first stage, $g_k(\hat{X})$ is obtained by mixing any two feature representations $g_k(X_i)$ and $g_k(X_j)$ extracted from audio samples; in the second stage, the loss function is formulated as a weighted combination of two losses computed with the labels of the mixed audio samples. In Mixer-C, $g_k(\hat{X})$ is obtained by mixing two feature representations of audio samples belonging to different channels but sharing the same label.
Figure 3. WER (%) of three different model versions along varying proportion τ.
Figure 4. WER (%) of three different model versions along varying maximum mixing value.
Figure 5. WER (%) of Mixer under varying encoder layers.
Figure 6. WER (%) of Mixer-C under varying encoder layers.
Table 1. WER (%) of four different Whisper versions under different training settings. “Params” denotes the trainable parameters of LoRA.
Method            Whisper-Small   Whisper-Medium   Whisper-Large-v2   Whisper-Large-v3   Avg.
Params            0.4977 M        1.1327 M         2.2118 M           2.2118 M           /
Testing           46.98           42.76            46.11              42.04              44.47
Finetune          59.07           49.36            /                  /                  /
LoRA              52.65           39.70            42.24              37.41              43.00
+SpecAugment [5]  62.94           39.14            40.56              41.84              46.12
+Mixer            49.15           38.46            38.50              36.07              40.55
+Mixer-C          51.29           39.82            40.29              35.27              41.66
Table 2. WER (%) of Mixer/Mixer-C under different hyperparameter α .
α          0.2      0.5      1        2        5        8
Mixer      39.50    37.45    37.93    36.07    37.61    38.94
Mixer-C    35.72    36.77    36.20    35.27    37.43    36.82