1. Introduction
In recent years, research in artificial intelligence has gradually shifted its focus from large-scale models to small lightweight models (SLMs), aiming to reduce dependency on GPU computing resources. With their reduced parameter counts, compact sizes, and low computational overhead, SLMs offer viable solutions for deployment in resource-constrained environments while maintaining competitive performance. In the field of audio denoising, various lightweight model architectures have been proposed [1,2,3] to lower the number of parameters and the computational complexity. Although these methods exhibit certain limitations in denoising performance, they nevertheless provide valuable references for the design of efficient speech enhancement systems.
With continuous advancements in model design, several high-performance lightweight models have emerged. For instance, GTCRN [4] simplifies the DPCRN module and integrates complex mask estimation, significantly reducing parameter count and complexity while optimizing the Equivalent Rectangular Bandwidth (ERB) module to achieve a computational cost of only 33 MMACs. FSPEN [5] combines full-band and sub-band joint encoding, a dual-path enhancement (DPE) module with path extension, and a multi-scale decoder to achieve ultra-lightweight, efficient real-time speech enhancement. Similarly, the Lightweight SE Network (LiSenNet) [6] relies on lightweight sub-band processing and a dual-path structure, leveraging band decomposition and time–frequency modeling to realize efficient real-time speech enhancement with both a significant parameter reduction and competitive performance. However, most existing lightweight models rely on multi-frame inputs, resulting in higher inference latency, which limits their applicability in real-time hearing aid scenarios where power consumption is constrained and latency must be minimized. In contrast, single-frame inference imposes an extremely low computational burden but suffers from performance degradation due to the lack of contextual information. To address this limitation, we propose a novel low-complexity speech enhancement model based on single-frame feature concatenation and a lightweight convolution-attention mechanism (SDCA-BA). The proposed model incorporates an efficient single-frame feature extraction module and the SDCA-BA module to fuse multi-scale information effectively, achieving high-quality speech enhancement while significantly reducing computational resource consumption. The result is a low-latency, low-power, and high-performance audio denoising solution.
To validate the practicality of the proposed model, we integrated it into a prototype hearing aid device and conducted real-world evaluations. The model has only 112.3 K parameters and a computational complexity of 0.22 MMACs, enabling real-time enhancement. On the VoiceBank-DEMAND test set, it achieves higher speech quality under a much lower computational budget than GTCRN and LiSenNet (33 and 56 MMACs with PESQ scores of 2.87 and 2.95, respectively) and approaches the FSPEN model (approx. 89 MMACs with a PESQ of 2.97). The proposed system reaches a Perceptual Evaluation of Speech Quality (PESQ) score of 2.955 before post-processing, outperforming GTCRN and LiSenNet and approaching FSPEN. After post-processing, the PESQ is further improved to 3.1, outperforming all baseline models with an approximate performance gain of 4.9%.
The remainder of this paper is organized as follows:
Section 2 reviews related work;
Section 3 details the proposed single-frame feature extraction method and its integration with the core modules;
Section 4 describes the experimental setup and presents a comparative analysis against baseline models;
Section 5 discusses the current state and challenges of deploying lightweight models in hearing aid applications; and
Section 6 concludes the paper with future research directions.
2. Related Work
This section first provides a brief overview of the current status of digital signal processors (DSPs) used in hearing aids, followed by a systematic analysis of representative lightweight speech denoising models. The focus is placed on their design principles and optimization strategies. By comparing the similarities, differences, and advantages of various models in terms of structural compression, feature extraction, and multi-scale fusion techniques, this work aims to offer practical references and insights for the design of future lightweight speech denoising models.
2.1. Hearing Aid Signal Processor
With the increasing trend of deploying speech enhancement models on embedded platforms, various digital signal processor (DSP) chips [7,8] have been widely adopted in real-time speech processing devices such as hearing aids. Medium-power chips, including the Cadence HiFi 3z/4 [9,10], XMOS xCORE-200 [11], and ARM Cortex-M55 [12], typically offer computational capabilities ranging from 100 to 500 MMACs, enabling support for moderately sized deep learning models. However, the relatively high power consumption of these chips limits device battery life, constraining prolonged wearability. To improve energy efficiency, a number of low-power chips have emerged in recent years, such as the TMS320C5505 DSP [13] and SmartHeaP [14], which operate at power levels below 10 mW while delivering computational power up to 100 MMACs. These chips are specifically designed for edge neural network inference and have demonstrated promising applications in tasks including voice wake-up detection and noise suppression for hearing aids.
Table 1 shows the MMACs, power consumption, and estimated cost of the DSP chips.
2.2. Ultra-Lightweight Speech Enhancement Models
Conventional deep neural network (DNN)-based speech enhancement models, such as DPATD [15], MetricGAN [16], and CMGAN [17], generally exhibit high computational complexity and large parameter sizes, which hinder their deployment on low-power embedded chips. This limitation is especially critical in real-time applications where power consumption and latency are key constraints. To address these challenges, numerous lightweight speech enhancement models have been proposed, employing techniques including network pruning, parameter quantization, model compression, and multi-scale feature extraction. These approaches aim to significantly reduce computational demands while maintaining satisfactory speech quality. Consequently, lightweight models have become a major focus in low-power speech enhancement research.
Early representative lightweight models such as RNNoise [18] feature low parameter counts and computational loads but achieve a modest PESQ score of only 2.5 on the VoiceBank-DEMAND dataset, indicating limited noise reduction performance. The DCCRN model [19], an improvement based on CRN, leverages complex-domain processing to enhance recurrent neural network (RNN) and phase modeling capabilities, improving the PESQ to 2.93; however, this improvement comes at the cost of increased parameter size (3.7 M) and computational complexity (150 MMACs). The DeepFilterGAN model [20] introduces a full-band real-time speech enhancement system with a GAN-based stochastic regeneration framework, employing Equivalent Rectangular Bandwidth (ERB) domain features to enhance speech envelopes and reduce parameter count. Although its total parameter count remains relatively large at 3.58 M, the ERB-based [21] approach provides a valuable strategy for parameter reduction. With ongoing advancements in model architecture and lightweight techniques, recent speech enhancement models have progressively reduced both parameter size and computational burden. For instance, the GTCRN model [4] utilizes a grouped temporal convolution recurrent network architecture, incorporating ERB-based sub-band feature extraction and temporal recursive attention modules; it achieves a PESQ score of 2.87 on the VoiceBank-DEMAND dataset with only 23.7 K parameters and a computational cost of 33 MMACs. Samsung's FSPEN model [5] adopts a dual-channel structure combining full-band and sub-band features, enhanced by a dual-path enhancer with path extension (DPE) to further optimize the ERB-based approach. This model attains a PESQ of 2.97 with 87 K parameters and 89 MMACs of computation, though its limited phase estimation capacity results in suboptimal performance in multi-speaker scenarios. LiSenNet [6] proposes a lightweight CRN architecture by retaining only the convolutional encoder and recurrent units, significantly reducing the number of convolutional layers and the RNN scale to improve computational efficiency. The model contains merely 56 K parameters and achieves a PESQ of 2.95. Although its performance is slightly lower than that of larger models, LiSenNet offers a feasible direction for subsequent lightweight model design.
2.3. Multi-Frame Feature Modeling and Single-Frame Inference
Single-scale STFT features are insufficient to fully capture the detailed information of both high- and low-frequency components in audio signals. Reference [22] proposed a fusion approach that aggregates features from multiple scale branches using skip connections and feature concatenation, effectively enhancing the integration of multimodal information and significantly improving audio denoising performance. Although multi-scale STFT features help improve the model's ability to represent spectral details, as mentioned in Section 2.2, most existing models are trained using multi-frame inputs. Multi-frame inputs, however, increase computational demands during inference, limiting the applicability of these models in low-latency or resource-constrained scenarios. To reduce computational complexity, the FSPEN model [5] attempted a pseudo-single-frame strategy to alleviate the computational load. While this method partially reduces the computational burden, the lack of contextual information hinders the model's ability to accurately capture noise characteristics in complex acoustic environments, thereby limiting denoising performance. Although such single-frame models have inherent performance limitations, they provide valuable insights for research on ultra-low-complexity denoising models. Building upon this, the present work proposes a single-frame feature concatenation method that effectively compensates for the absence of contextual information, enriches the detailed representation of single-frame features, and further improves model performance.
2.4. Lightweight and Efficient Attention Mechanisms
The attention mechanism dynamically allocates channel-wise weights to achieve adaptive adjustment across channels, thereby enhancing the model's responsiveness to critical feature channels. The introduction of this mechanism significantly improves the discriminative power of feature representations and promotes overall model performance optimization. Reference [23] improved the conventional Squeeze-and-Excitation (SE) mechanism by integrating multi-scale convolutions for weighted channel fusion and introducing an adaptive channel attention mechanism to enhance focus on key features while effectively suppressing irrelevant information, offering a novel perspective for audio denoising tasks. Reference [24] extended the SE module into a three-dimensional framework (Time-Frequency-Channel SE), expanding the traditional two-dimensional SE module, which operates primarily along the channel dimension, to simultaneously address the time, frequency, and channel dimensions. This enables finer-grained capture of the spatiotemporal characteristics of audio events, significantly improving detection accuracy. Moreover, this mechanism dynamically adjusts weights across different positions, strengthening the representation of salient signals while reducing the impact of background noise. References [25,26,27] present various enhancements to the SE mechanism, achieving notable performance gains. Building upon these advances, the present work designs a lightweight SE mechanism that retains essential core components and incorporates reward and penalty strategies during training, thereby effectively improving training stability and model robustness.
3. Methodology of TSDCA-BA Model
In this section, we provide a detailed explanation of the proposed model.
Figure 1 shows the overall architecture of the proposed TSDCA-BA model, which aims to reconstruct clean speech Y from a noisy input signal X. The following subsections describe the structure and functionality of the model in detail.
3.1. Structural Design of the TSDCA-BA Model
3.1.1. Single-Frame Statistical Feature Extraction Module
Figure 2 shows the overall architecture of the proposed Extract stats module. To enrich the frame-level representation, this module computes statistical features for each time step individually by calculating the mean, standard deviation, maximum, and minimum across the feature dimension of each frame. Formally, given the input feature tensor $X \in \mathbb{R}^{B \times T \times F}$, where $B$ is the batch size, $T$ the number of time steps, and $F$ the number of features per frame, the statistics for the $b$-th sample at the $t$-th time step are computed as described in Equation (1):

$$
\mu_{b,t} = \frac{1}{F}\sum_{f=1}^{F} X_{b,t,f}, \quad
\sigma_{b,t} = \sqrt{\frac{1}{F}\sum_{f=1}^{F}\left(X_{b,t,f} - \mu_{b,t}\right)^{2}}, \quad
x^{\max}_{b,t} = \max_{1 \le f \le F} X_{b,t,f}, \quad
x^{\min}_{b,t} = \min_{1 \le f \le F} X_{b,t,f}
\tag{1}
$$
Then, these four statistics are concatenated along the feature dimension, resulting in a statistical feature tensor of shape $B \times T \times 4$. By concatenating the statistical features with the original features, the module effectively enriches the representation of each frame, enhancing the model's ability to perceive the distribution of features within each frame. This facilitates the subsequent network in more accurately capturing and understanding the variations and differences of frame-wise features.
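For concreteness, the statistics in Equation (1) and their concatenation can be sketched in NumPy as follows; the function name and tensor shapes are illustrative rather than taken from the released implementation.

```python
import numpy as np

def extract_stats(x: np.ndarray) -> np.ndarray:
    """Append per-frame statistics (Equation (1)) to a (B, T, F) feature tensor."""
    mean = x.mean(axis=-1, keepdims=True)   # (B, T, 1)
    std = x.std(axis=-1, keepdims=True)     # (B, T, 1)
    fmax = x.max(axis=-1, keepdims=True)    # (B, T, 1)
    fmin = x.min(axis=-1, keepdims=True)    # (B, T, 1)
    stats = np.concatenate([mean, std, fmax, fmin], axis=-1)  # (B, T, 4)
    return np.concatenate([x, stats], axis=-1)                # (B, T, F + 4)
```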
3.1.2. SDCA-BA Module
Figure 3 shows the proposed SDCA-BA (Sub-band-aware Dual-Channel Attention with Band-wise Attention) Module, which consists of two primary components: a sub-band-aware dual-channel attention (SDCA) block and a subsequent Band-wise Attention (BA) Module.
The SDCA module is based on the DSDDB module proposed in the literature [28,29] and has been further optimized for lightweight design. The input features first pass through two consecutive one-dimensional convolutional layers with kernel size 3, each followed by batch normalization and a nonlinear activation function. This process effectively captures local temporal information and projects the input into a higher-level feature space. To enhance feature discriminability, a channel attention mechanism is introduced using a Squeeze-and-Excitation (SE) module. Specifically, global average pooling is performed along the temporal dimension to aggregate global contextual information for each channel. The aggregated features are then passed through two fully connected layers with a bottleneck structure and a sigmoid activation function to generate channel-wise attention weights. Finally, these weights are used to adaptively recalibrate the feature channels, enhancing the representation of key informative channels while suppressing irrelevant ones.
Following the SDCA block, a Band-wise Attention Module is introduced to model frequency-band dependencies. Motivated by the observation that different frequency bands in speech signals carry varying levels of semantic and perceptual importance (e.g., low-frequency bands typically carry energy and voicing cues, whereas high-frequency bands often encode fine details), the BA module assigns dynamic importance to each sub-band. To achieve this, the channel dimension is partitioned into non-overlapping sub-bands, and the mean activation within each sub-band is computed across time. These values are then passed through a sigmoid activation function to produce sub-band attention weights. The attention weights are broadcast and applied to the corresponding sub-band features through element-wise multiplication. This process enables the model to suppress frequency bands heavily contaminated by noise or interference and to enhance those that contribute more significantly to speech intelligibility, especially in adverse conditions such as strong background noise or multi-speaker scenarios.
By combining temporal convolution, channel-wise recalibration, and frequency-aware modulation, the SDCA-BA module improves the network’s capability to focus on both informative time–frequency regions and robust feature representation, leading to enhanced speech enhancement performance.
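The computation described above can be summarized in the following TensorFlow sketch; the number of sub-bands, the SE reduction ratio, and the layer sizes are assumptions rather than the published configuration.

```python
import tensorflow as tf

class SDCABA(tf.keras.layers.Layer):
    """Sketch of the SDCA-BA module: two kernel-3 Conv1D layers, SE channel
    attention, then band-wise attention over channel sub-bands."""

    def __init__(self, channels, n_subbands=4, se_ratio=4, **kwargs):
        super().__init__(**kwargs)
        self.conv1 = tf.keras.layers.Conv1D(channels, 3, padding="same")
        self.conv2 = tf.keras.layers.Conv1D(channels, 3, padding="same")
        self.bn1 = tf.keras.layers.BatchNormalization()
        self.bn2 = tf.keras.layers.BatchNormalization()
        self.fc1 = tf.keras.layers.Dense(channels // se_ratio, activation="relu")
        self.fc2 = tf.keras.layers.Dense(channels, activation="sigmoid")
        self.n_subbands = n_subbands  # channels must be divisible by n_subbands

    def call(self, x, training=False):
        # SDCA: local temporal modeling with two convolutions
        h = tf.nn.relu(self.bn1(self.conv1(x), training=training))
        h = tf.nn.relu(self.bn2(self.conv2(h), training=training))
        # SE channel attention: squeeze over time, excite per channel
        s = tf.reduce_mean(h, axis=1)                  # (B, C)
        w = self.fc2(self.fc1(s))[:, tf.newaxis, :]    # (B, 1, C)
        h = h * w
        # Band-wise attention: one sigmoid gate per channel sub-band
        bands = tf.split(h, self.n_subbands, axis=-1)
        gates = [tf.sigmoid(tf.reduce_mean(b, axis=[1, 2], keepdims=True)) for b in bands]
        return tf.concat([b * g for b, g in zip(bands, gates)], axis=-1)
```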
3.1.3. Three-Layer BiGRU Convolutional Structure
Figure 4 shows the basic structure of the model, which is composed of bidirectional GRU layers and one-dimensional convolutional layers, with Dropout layers placed between them to reduce overfitting. The input first passes through a bidirectional GRU layer with L2 regularization, maintaining the full output sequence. Then, a Dropout layer randomly deactivates some neurons to improve the model's ability to generalize. Afterwards, the features are processed by a one-dimensional convolutional layer with L2 regularization to capture local patterns. The convolution output is normalized using batch normalization to stabilize training, followed by a ReLU activation to increase nonlinearity and output stability. As shown in Figure 1, the full architecture combines three bidirectional GRU layers with convolutional layers. Two bidirectional GRU layers with L2 regularization are applied first, each returning the full sequence and followed by Dropout layers to help prevent overfitting. This arrangement is designed to capture temporal dependencies and enhance generalization. Next, the input flows through two one-dimensional convolutional layers with kernel size 3 and L2 regularization. Each convolution is followed by batch normalization and ReLU activation to improve training stability and nonlinear modeling. After the second convolutional layer, a Squeeze-and-Excitation module is added to adaptively adjust channel weights and emphasize important features. A third bidirectional GRU layer with Dropout is then introduced to further refine temporal representations. Finally, a convolutional layer with kernel size 1 performs feature fusion, together with batch normalization and ReLU activation to promote convergence and stable outputs. Overall, this design effectively integrates temporal sequence modeling with local feature extraction, improving the model's ability to handle complex sequential data.
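A minimal Keras sketch of this backbone is given below; the layer widths, dropout rate, and input/output dimensions are assumptions, and the SDCA-BA and SE blocks described in Sections 3.1.2 and 3.1.4 would be inserted at the commented position.

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

def build_backbone(feat_dim=390, units=64, drop=0.2, l2=1e-4, out_bins=257):
    """Bi-GRU/Conv backbone sketch (Figure 1); feat_dim = 257 + 129 STFT bins
    plus 4 frame statistics, out_bins = main-scale mask size (both illustrative)."""
    reg = regularizers.l2(l2)
    x_in = layers.Input(shape=(None, feat_dim))
    x = layers.Bidirectional(layers.GRU(units, return_sequences=True,
                                        kernel_regularizer=reg))(x_in)
    x = layers.Dropout(drop)(x)
    x = layers.Bidirectional(layers.GRU(units, return_sequences=True,
                                        kernel_regularizer=reg))(x)
    x = layers.Dropout(drop)(x)
    x = layers.Conv1D(2 * units, 3, padding="same", kernel_regularizer=reg)(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    x = layers.Conv1D(2 * units, 3, padding="same", kernel_regularizer=reg)(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    # SDCA-BA and SE blocks (Sections 3.1.2 and 3.1.4) would be inserted here
    x = layers.Bidirectional(layers.GRU(units, return_sequences=True,
                                        kernel_regularizer=reg))(x)
    x = layers.Dropout(drop)(x)
    x = layers.Conv1D(out_bins, 1, padding="same", kernel_regularizer=reg)(x)
    x = layers.BatchNormalization()(x)
    mask = layers.ReLU()(x)  # feature-fusion head producing the output mask
    return tf.keras.Model(x_in, mask)
```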
3.1.4. Squeeze-and-Excitation Networks
The model incorporates a Squeeze-and-Excitation (SE) module to enable dynamic adjustment of feature channels.
Figure 5 shows the structure of the SE module. Specifically, the input features undergo global average pooling along the temporal dimension to obtain a global descriptor for each channel. This is followed by a bottleneck fully connected layer that reduces the dimensionality, employing a ReLU activation function to enhance nonlinear representation. Subsequently, another fully connected layer restores the channel dimension to its original size, with a Sigmoid activation function constraining the output values between 0 and 1, thereby generating channel-wise attention weights. Finally, these weights are multiplied channel-wise with the original features to achieve adaptive recalibration of each channel. This mechanism effectively emphasizes important channels while suppressing irrelevant or redundant information, thus improving the model’s representational capacity and overall performance.
3.2. Perceptual Metric-Based Model Evaluation and Selection-PESQ
In the process of selecting an appropriate loss function, it was found that although using the logarithmic mean squared error (log-MSE) as the training objective yielded continuously decreasing loss values, this metric does not fully correspond to the subjective perception of speech quality by human listeners. Notably, log-MSE struggles to capture improvements in speech enhancement that are perceptible at the auditory level. Although the PESQ metric is widely recognized as an objective measure of speech quality, its non-differentiable nature and high computational cost render it unsuitable for direct use as a training loss. To overcome this limitation, following approaches that treat PESQ as a performance evaluation tool [30], we integrated the PESQ calculation into a callback function. This function evaluates the model on the validation set at the end of each epoch, computing the average PESQ score as an indicator of the model's perceptual quality. Model checkpointing was then guided by selecting the weights corresponding to the peak PESQ score observed on the validation set. Comparative experiments indicate that this PESQ-based model selection strategy yields better speech enhancement results than the conventional early stopping approach.
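As an illustration, such a callback can be implemented as sketched below using the `pesq` package; the `enhance_fn` helper, the validation-pair format, and the checkpoint path are hypothetical.

```python
import numpy as np
import tensorflow as tf
from pesq import pesq  # pip install pesq

class PesqCheckpoint(tf.keras.callbacks.Callback):
    """After each epoch, enhance the validation clips, compute the average
    wideband PESQ, and keep the weights with the best score.
    enhance_fn(model, noisy_waveform) -> enhanced waveform (assumed helper)."""

    def __init__(self, val_pairs, enhance_fn, sr=16000, path="best_pesq.weights.h5"):
        super().__init__()
        self.val_pairs = val_pairs      # list of (noisy_waveform, clean_waveform)
        self.enhance_fn = enhance_fn
        self.sr = sr
        self.path = path
        self.best = -np.inf

    def on_epoch_end(self, epoch, logs=None):
        scores = [pesq(self.sr, clean, self.enhance_fn(self.model, noisy), "wb")
                  for noisy, clean in self.val_pairs]
        avg = float(np.mean(scores))
        print(f"epoch {epoch + 1}: validation PESQ = {avg:.3f}")
        if avg > self.best:             # checkpoint on peak validation PESQ
            self.best = avg
            self.model.save_weights(self.path)
```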
3.3. Lightweight Inference Procedure
Speech signals generally exhibit stronger energy in the low-frequency band and weaker energy in the high-frequency band. To balance the feature distribution and enhance the high-frequency components, a pre-emphasis process was applied to the input audio during both training and inference, as illustrated in Figure 1. Specifically, a first-order high-pass filter was used to boost the high-frequency parts. The discrete-time form of the first-order high-pass pre-emphasis filter is given by Equation (2):

$$
y[n] = x[n] - \alpha\, x[n-1]
\tag{2}
$$

where $x[n]$ denotes the original speech signal, $y[n]$ represents the pre-emphasized signal, and $\alpha$ is the pre-emphasis coefficient, typically ranging from 0.95 to 0.97.
To simultaneously capture the global and local spectral features of speech, multi-scale short-time Fourier transform (STFT) was employed for feature extraction. Specifically, two configurations with frame lengths of 512 and 256 were used to extract spectral features at different temporal resolutions. Each scale applied the Hann window function to facilitate the model in effectively acquiring both coarse-grained and fine-grained spectral information.
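A minimal sketch of this front-end, assuming librosa for the STFT and an illustrative hop size, is shown below; the exact concatenation layout used in the released model may differ.

```python
import numpy as np
import librosa

ALPHA = 0.97  # pre-emphasis coefficient (0.95-0.97)

def preprocess(audio: np.ndarray) -> np.ndarray:
    """Pre-emphasis (Equation (2)) followed by multi-scale STFT log-magnitude
    features with frame lengths 512 and 256 and Hann windows (Section 3.3)."""
    emphasized = np.append(audio[0], audio[1:] - ALPHA * audio[:-1])
    feats = []
    for n_fft in (512, 256):
        spec = librosa.stft(emphasized, n_fft=n_fft, hop_length=128, window="hann")
        feats.append(np.log(np.abs(spec) + 1e-8).T)          # (frames, bins)
    frames = min(f.shape[0] for f in feats)
    return np.concatenate([f[:frames] for f in feats], axis=-1)  # (frames, 257 + 129)
```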
In the inference stage, the log-magnitude spectrogram $S \in \mathbb{R}^{F \times T}$ is first normalized by its mean $\mu$ and standard deviation $\sigma$, as shown in Equation (3), where $F$ is the number of frequency bins and $T$ is the number of time frames:

$$
\hat{S} = \frac{S - \mu}{\sigma}
\tag{3}
$$

Here, $\hat{S}$ is the normalized log-magnitude spectrogram; $\mu$ and $\sigma$ are the mean and standard deviation of $S$.

The normalized feature $\hat{S}$ is fed into the pretrained model $f_{\theta}$ to produce a time–frequency mask $M$ for noise suppression, as described in Equation (4):

$$
M = f_{\theta}\bigl(\hat{S}\bigr)
\tag{4}
$$

where $M$ is the predicted mask applied element-wise to attenuate noise.

The mask $M$ is applied to the linear-scale magnitude spectrogram $|X|$, converting back from log scale as in Equation (5):

$$
|\hat{Y}| = M \odot |X|
\tag{5}
$$

Here, $|\hat{Y}|$ is the estimated clean magnitude spectrogram and $\odot$ denotes element-wise multiplication.

Using the phase $\phi$ extracted from the main-scale STFT, the complex spectrogram $\hat{Y}$ is reconstructed by Equation (6):

$$
\hat{Y} = |\hat{Y}| \, e^{j\phi}
\tag{6}
$$

where $j$ is the imaginary unit and $\phi$ the phase of the noisy input.

Finally, the enhanced time-domain signal $\hat{y}[n]$ is recovered by ISTFT and amplitude normalization, as shown in Equation (7):

$$
\hat{y}[n] = \frac{\operatorname{ISTFT}(\hat{Y})[n]}{\max_{n}\bigl|\operatorname{ISTFT}(\hat{Y})[n]\bigr| + \epsilon}
\tag{7}
$$

where $\epsilon$ is a small constant to avoid division by zero, and $n$ indexes the time-domain samples.
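The full inference chain in Equations (3)-(7) can be sketched as follows; for clarity only the main STFT scale is shown (the released model additionally concatenates a second scale as described above), and the model interface is assumed to be a callable returning a mask with the same shape as its input.

```python
import numpy as np
import librosa

def enhance(noisy: np.ndarray, model, n_fft=512, hop=128, eps=1e-8) -> np.ndarray:
    """Sketch of the inference steps in Equations (3)-(7); STFT parameters
    and the model interface are assumptions."""
    stft = librosa.stft(noisy, n_fft=n_fft, hop_length=hop, window="hann")
    mag, phase = np.abs(stft), np.angle(stft)
    log_mag = np.log(mag + eps)
    normed = (log_mag - log_mag.mean()) / (log_mag.std() + eps)   # Eq. (3)
    mask = model(normed[np.newaxis, ...])[0]                      # Eq. (4)
    est_mag = mask * mag                                          # Eq. (5)
    est_stft = est_mag * np.exp(1j * phase)                       # Eq. (6)
    wave = librosa.istft(est_stft, hop_length=hop, window="hann")
    return wave / (np.max(np.abs(wave)) + eps)                    # Eq. (7)
```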
3.4. Deployment and Optimization of a Simplified Hearing Aid Device
The fundamental working principle of low-power hearing aids involves capturing speech signals through a microphone, amplifying them via a power amplifier, and delivering the enhanced sound to the earphone. We reproduced a simple circuit design reported on an electronics enthusiast forum [31], but the practical listening experience was unsatisfactory, characterized by hollow sound and poor spatial localization. These issues mainly stemmed from the limited performance of the TDA2320A power amplifier and suboptimal capacitor values. To address these shortcomings, as illustrated in Figure 6, the 10 µF capacitor was replaced with a 100 µF component to improve interference rejection, the amplifier chip was upgraded to an AD8656 to enhance the signal-to-noise ratio and speech clarity, and a low-noise MIC5365 low-dropout (LDO) regulator was incorporated at the earphone stage to isolate power supply noise. These improvements reduced distortion and hollow sounds, greatly enhancing the listening experience and providing a solid basis for future real-world deployment.
4. Experiments and Performance Evaluation
In this section, we present the overall experimental procedure and results of the proposed model, including performance comparisons with mainstream lightweight baseline models, analysis of experimental outcomes, and evaluation of deployment on real devices.
4.1. Dataset
In this experiment, the VoiceBank+DEMAND dataset is employed for both training and evaluation of the proposed model. Constructed by Valentini et al., this dataset is widely used as a standard benchmark for speech enhancement, and most lightweight models reported in recent studies are also evaluated on this corpus. Therefore, using this dataset ensures a fair and consistent performance comparison. Furthermore, to simulate real-world noise conditions in hearing aids, we introduce custom noisy samples by mixing the clean VoiceBank utterances with actual background noise recorded from hearing aid environments. This extended dataset allows for a practical assessment of the model’s noise suppression capability in deployment scenarios.
4.2. Implementation Details
Due to the large scale of the training dataset and the limited memory capacity of the experimental hardware, a custom data generator was developed to preprocess data in batches and feed it incrementally into the model. This approach ensures a stable and efficient training process while preventing system memory overflow. The multi-scale STFT magnitude processing follows Section 3.3: two STFT settings with frame lengths of 512 and 256 samples, each using a Hann window, extract spectral features at different temporal resolutions so that both coarse-grained and fine-grained spectral information is captured. The resulting high- and low-frequency components are concatenated to form an enhanced magnitude spectrum, which serves as the model input. This input is fed into the proposed TSDCA-BA model for training with the log-MSE loss function. Additionally, the PESQ-aware callback strategy introduced in Section 3.2 is employed to compute the average PESQ score on the validation set after each epoch, and the final model is selected based on the highest PESQ score, thereby improving the subjective perceptual quality of the speech enhancement.
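A minimal sketch of such a generator is shown below, assuming a `tf.keras.utils.Sequence` subclass, fixed-length utterances, and the `preprocess()` front-end sketched in Section 3.3; the file lists, batch size, and training target are illustrative.

```python
import numpy as np
import tensorflow as tf
import soundfile as sf

class NoisyCleanGenerator(tf.keras.utils.Sequence):
    """Batch-wise data generator: loads noisy/clean pairs from disk and applies
    the same feature extraction as in Section 3.3."""

    def __init__(self, noisy_files, clean_files, batch_size=8):
        super().__init__()
        self.noisy_files = noisy_files
        self.clean_files = clean_files
        self.batch_size = batch_size

    def __len__(self):
        return int(np.ceil(len(self.noisy_files) / self.batch_size))

    def __getitem__(self, idx):
        sl = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
        x_batch, y_batch = [], []
        for nf, cf in zip(self.noisy_files[sl], self.clean_files[sl]):
            noisy, _ = sf.read(nf)
            clean, _ = sf.read(cf)
            x_batch.append(preprocess(noisy))   # input features (Section 3.3)
            y_batch.append(preprocess(clean))   # simplified log-MSE target
        return np.stack(x_batch), np.stack(y_batch)
```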
During inference, the noisy speech signal is first subjected to a pre-emphasis operation to enhance high-frequency components, consistent with the procedure used during training. Subsequently, multi-scale short-time Fourier transform (STFT) is applied to extract log-magnitude spectra at two different temporal resolutions. These spectra are temporally aligned and concatenated to form a unified spectral representation, which serves as the input to the neural network. To improve input stability, the concatenated features are normalized using their mean and standard deviation. The normalized features are then fed into the pre-trained model, which outputs a time–frequency mask designed to suppress noise while preserving speech components. The predicted mask is applied to the linear-scale magnitude spectrum to attenuate noise components. The phase information from the primary STFT scale is used to reconstruct the complex spectrogram. Finally, inverse STFT is performed to transform the enhanced spectrogram back into the time-domain waveform. A final normalization step is applied to the output audio to ensure appropriate amplitude scaling for subsequent analysis and evaluation.
Inspired by [32], we conducted additional fine-tuning of the initial model to improve its performance under low signal-to-noise ratio (SNR) conditions. Specifically, audio samples with SNR below 0 dB were selected from the original training set to form a fine-tuning dataset, and the learning rate was reduced to ensure training stability. Multiple rounds of fine-tuning were performed, and the model achieving the highest PESQ score on the validation set was selected. Final performance was evaluated on a blind test set to determine the optimal model. Subsequently, we performed an engineering optimization based on dynamic spectral subtraction, which was integrated as a post-processing step. First, a simple voice activity detection (VAD) mechanism was employed to determine whether the current frame contains speech. For non-speech frames, the noise estimate was dynamically updated to adapt to changing acoustic environments. Then, spectral subtraction was applied to the current spectrum using an adaptively adjusted suppression factor, and a spectral floor gating mechanism controlled by a threshold parameter was introduced to prevent quality degradation or residual noise caused by over-subtraction. With this post-processing strategy, the PESQ of the enhanced audio improved to 3.103, surpassing all baseline models. For a fair comparison, the same post-processing was applied to the GTCRN model; however, its PESQ performance slightly degraded. We attribute this to GTCRN's use of complex masking for simultaneous magnitude and phase estimation, which may already produce outputs near the performance ceiling. In contrast, our model reconstructs the enhanced waveform using the noisy phase during inference, leaving room for further noise suppression. We explored phase reconstruction using the MISI (Multiple Input Spectrogram Inversion) algorithm, but no notable improvement was observed. The Griffin–Lim algorithm yielded modest improvements, but at the cost of a significant increase in inference latency. These findings not only validate that our model achieves strong performance even under the noisy-phase assumption, but also highlight the potential for further improvement through more effective phase estimation strategies. The experimental code and pre-trained models have been released at https://github.com/ZujieFan/TSDCA-BA (accessed on 20 June 2025).
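A simplified sketch of this post-processing stage is given below; the energy-based VAD, the over-subtraction factor, the spectral floor, and all constants are illustrative stand-ins for the tuned values used in the experiments.

```python
import numpy as np

def spectral_subtraction_pp(mag, alpha=2.0, beta=0.02, vad_ratio=2.0, smooth=0.9):
    """Dynamic spectral-subtraction post-processing sketch: an energy-based VAD
    flags non-speech frames, the noise estimate is updated on those frames, and
    over-subtraction with a spectral floor is applied.
    mag: (freq_bins, frames) magnitude spectrogram."""
    noise = mag[:, :5].mean(axis=1)                 # initial noise estimate
    out = np.empty_like(mag)
    for t in range(mag.shape[1]):
        frame = mag[:, t]
        is_speech = frame.mean() > vad_ratio * noise.mean()   # crude energy VAD
        if not is_speech:
            noise = smooth * noise + (1.0 - smooth) * frame   # update noise estimate
        cleaned = frame - alpha * noise                       # over-subtraction
        out[:, t] = np.maximum(cleaned, beta * frame)         # spectral floor gate
    return out
```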
4.3. Results
4.3.1. Comparison with Baseline Models
Figure 7 shows a spectrogram comparison on the test set, highlighting the performance differences between the proposed TSDCA-BA model and the representative RNNoise model in terms of noise suppression. Subfigure (a) illustrates the spectrogram of the noisy input, (b) shows the spectrogram after denoising with RNNoise, (c) corresponds to the output of the proposed TSDCA-BA model, and (d) depicts the spectrogram of the clean reference signal. As observed, the TSDCA-BA model achieves more effective noise reduction while better preserving the spectral structure of speech compared to RNNoise.
Table 2 shows a comprehensive comparison of the proposed models, TSDCA-BA and TSDCA-BA-PP, with several state-of-the-art speech enhancement models under real deployment scenarios. The evaluation focuses on three key metrics: Perceptual Evaluation of Speech Quality (PESQ), Short-Time Objective Intelligibility (STOI), and computational complexity measured in MMACs. In terms of PESQ, TSDCA-BA-PP achieves the highest score of 3.1, outperforming all baseline models, including FSPEN (2.97), DCCRN-lite (2.9), and LiSenNet (2.95). These results indicate that, after the post-processing stage, the proposed model achieves a noticeable improvement in perceived speech quality compared to the other baseline methods. In the context of hearing aids, this means the model not only improves PESQ scores but also retains important environmental information, thereby improving the listening experience for users with hearing impairments. Even without post-processing, TSDCA-BA delivers a strong PESQ score of 2.955, outperforming models such as LiSenNet, GTCRN, and RNNoise.

Regarding speech intelligibility (STOI), TSDCA-BA achieves a score of 0.904, which is slightly lower than LiSenNet (0.95) and DCCRN-lite (0.94). However, given the need to preserve essential background cues, this result remains within a practical and acceptable range. Notably, the post-processing stage in TSDCA-BA-PP leads to a small drop in STOI to 0.898, likely due to the stronger enhancement applied to the final output.

A major advantage of the proposed models lies in their computational efficiency. Both TSDCA-BA and TSDCA-BA-PP operate at only 0.22 MMACs, which is more than two orders of magnitude lower than DCCRN-lite (150 MMACs) and significantly below GTCRN (33 MMACs) and LiSenNet (56 MMACs). Practical testing on low-power processors reveals that GTCRN cannot maintain real-time performance under such conditions, with the denoising module frequently becoming unresponsive. In contrast, the proposed model maintains a stable latency of around 300 milliseconds even on low-power devices. These results highlight the suitability of the proposed approach for deployment in resource-constrained or real-time applications, such as hearing aids and edge devices.
The results demonstrate that TSDCA-BA, particularly when integrated with post-processing, achieves an excellent balance between speech quality and computational cost. Although slight compromises in intelligibility are observed, these trade-offs are justified by the substantial improvements in PESQ and computational efficiency. Furthermore, the proposed approach offers a viable pathway for future research on single-frame inference models.
4.3.2. Ablation Study
To validate the effectiveness of the proposed modules, comprehensive ablation studies were conducted in a practical deployment setting.
Table 3 summarizes the ablation results of the TSDCA-BA model. The experiments individually assessed the contributions of key architectural components, including the TSDCA-BA module, Squeeze-and-Excitation (SE) block, and the single-frame statistical feature extraction mechanism. The baseline configuration, Bi-GRU-Conv×1, contains only 24K parameters and achieves a PESQ score of 2.44 and STOI of 0.8904. Increasing the model depth to Bi-GRU-Conv×2 significantly improves performance (PESQ 2.60, STOI 0.9026), demonstrating the advantage of deeper recurrent convolutional fusion for temporal context modeling. Introducing the SDCA module into the Bi-GRU-Conv×3 architecture further raises PESQ to 2.68 and STOI to 0.9030, confirming the module’s efficacy in enhancing temporal dynamics representation. Adding the SE module on top of SDCA (Bi-GRU-Conv×3 + SDCA + SE) increases PESQ to 2.79, while STOI slightly decreases to 0.8894, possibly due to overfitting towards perceptual quality metrics. Notably, incorporating single-frame statistical features alongside SDCA and SE (Bi-GRU-Conv×3 + SDCA + SE + statistical features) further improves PESQ to 2.85 but reduces STOI to 0.8745, indicating a trade-off between speech quality and intelligibility in certain configurations. To ensure fairness, we evaluated the changes in computational load while maintaining the model’s lightweight design. The results show that as modules are added to the model, the computational complexity (measured in MMACs) does not increase significantly, remaining approximately at 0.22 MMACs, which demonstrates the model’s excellent computational efficiency. In contrast, the combination of only SDCA and SE modules without statistical features achieves a PESQ of 2.62 and a higher STOI of 0.9116, suggesting that the single-frame statistical features may hinder STOI improvement. Ultimately, the full model (Bi-GRU-Conv×3 + TSDCA-BA + SE + statistical features) achieves the highest PESQ of 2.92 and STOI of 0.9045, with a modest parameter increase to 112.5 K. This configuration effectively leverages the synergy of global temporal dynamics, channel attention, and frame-level statistical features, resulting in substantial improvements in both perceptual quality and intelligibility while maintaining computational efficiency.
4.4. Real-Device Deployment and System Performance Analysis
In this experiment, the optimized hearing aid was connected to a Raspberry Pi 4B via a USB sound card to simulate a real-device operating environment. A real-time noise reduction system was implemented in Python (3.8.0) based on an ONNX model converted from the proposed TSDCA-BA architecture. The system combines neural network-based mask estimation with dynamic spectral subtraction to continuously enhance the microphone input signals.
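The deployed loop can be sketched as follows with `onnxruntime` and `sounddevice`; the model file name, block size, and the `preprocess()`/`reconstruct()` helpers (feature extraction per Section 3.3, mask application plus post-processing per Section 4.2) are assumptions rather than the released scripts.

```python
import numpy as np
import sounddevice as sd
import onnxruntime as ort

sess = ort.InferenceSession("tsdca_ba.onnx")   # exported model (file name assumed)
input_name = sess.get_inputs()[0].name
BLOCK = 512                                    # samples per processing block (illustrative)

def callback(indata, outdata, frames, time, status):
    feats = preprocess(indata[:, 0]).astype(np.float32)       # Section 3.3 features
    mask = sess.run(None, {input_name: feats[np.newaxis]})[0][0]
    outdata[:, 0] = reconstruct(indata[:, 0], mask)           # Eqs. (5)-(7) + post-processing

with sd.Stream(samplerate=16000, blocksize=BLOCK, channels=1, callback=callback):
    sd.sleep(60_000)   # run the real-time enhancement loop for one minute
```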
Table 4 presents the model’s performance under actual deployment. We tested the trained model using data collected from the hearing aid; the PESQ score improved from 1.2897 before noise reduction to 2.0964 after, representing an approximate performance increase of 62.55%. The system achieves an overall latency of less than 3 ms, with memory usage around 200 MB. Considering the relatively high memory consumption of the Python runtime, future plans include encapsulating the model in C++ and running it via TensorFlow Lite, where memory usage is expected to be reduced to approximately 5–10 MB. The results demonstrate that the proposed model performs effectively on low-power hearing aid devices, with its extremely low computational load providing a viable solution for achieving low-latency, real-time noise reduction in resource-constrained environments. Moreover, the model features a lightweight architecture with low training costs and can be trained on large-scale datasets using only a CPU. The hearing aid’s circuit design is simple, cost-efficient, and delivers satisfactory performance, making it suitable for mass production and wide-scale deployment.
5. Discussion
In this study, we conducted an in-depth analysis of lightweight speech denoising models with fewer than 100 K parameters and reproduced the performance of the GTCRN model. Experimental results indicate that, among lightweight models, GTCRN tends to focus more on vocal extraction than on comprehensive denoising. However, in real-world applications such as hearing aids, users typically wish to perceive the richness of the entire acoustic environment, not just isolated speech. Through practical evaluations in multilingual conversations, television and film scenarios, and live sports broadcasts, we identified a critical limitation: the model not only suppresses background noise but also removes many important auditory details. For instance, languages with sharp phonetic features are sometimes mistakenly treated as noise and eliminated. Similarly, background music, action sound effects (e.g., explosions, combat, shouting), and crowd responses (e.g., cheering, clapping, whistles) are often suppressed, resulting in monotonous speech output. For individuals with hearing impairments, this over-simplified approach significantly reduces the realism and richness of the auditory experience. To address this issue, our proposed model no longer pursues aggressive noise suppression. Instead, it adopts approaches similar to those described in [33], which focus solely on reducing background noise, aiming to preserve more contextually relevant acoustic information. This design improves overall auditory perception for hearing aid users. Although this method may slightly compromise denoising performance, it offers a more natural and immersive listening experience.
Additionally, we propose an adaptive framework for future development, in which models are fine-tuned according to specific environmental conditions. For example, three distinct models could be trained and optimized for indoor conversations, media consumption, and outdoor environments, respectively. These models can be stored in external SPI NOR Flash memory and switched manually or automatically via an environment-aware selection algorithm. Future work will further explore this concept to enhance the real-world usability and comfort of ultra-low-power hearing aid devices.
6. Conclusions
In this paper, we propose an ultra-low computational complexity speech denoising model that integrates single-frame feature concatenation with a lightweight SDCA-BA module. The model employs an efficient single-frame feature extraction module to derive four statistical descriptors (mean, standard deviation, maximum, and minimum) from single-frame spectra, which are then fused with the original features to enhance the representation of single-frame spectral features. Subsequently, the lightweight SDCA-BA module utilizes a residual structure, channel attention, and band-wise weighting strategies to partition frequency bands and apply differentiated weighting, further strengthening the representation of key frequency bands. Additionally, the SE channel attention mechanism adaptively enhances critical speech features, effectively suppressing background noise interference. This approach significantly improves the model's feature representation and overall denoising performance without increasing the parameter count. The proposed model substantially reduces computational resource consumption while maintaining high speech enhancement quality, offering a practical solution for low-latency, low-power, and high-performance speech processing on edge devices.

Experimental results demonstrate that, on the VoiceBank-DEMAND dataset, the proposed model achieves a PESQ improvement of approximately 4.9% over GTCRN and LiSenNet, approaching the performance level of the FSPEN model, while delivering roughly a 150-fold improvement in computational efficiency (MMACs) compared to existing baseline models. These results highlight the balance of performance and efficiency of the proposed approach, validating its potential for real-world applications in resource-constrained environments.

Despite these advantages, some limitations remain. The proposed model employs moderate background noise suppression, which introduces certain drawbacks: the processed audio volume may decrease, and denoising performance may degrade in highly complex noise environments. However, for hearing-impaired users, fully eliminating environmental noise is not always the optimal solution. Instead, selectively filtering high-frequency noise while preserving ambient sounds helps users better perceive their surroundings. Thus, the model balances noise reduction with the preservation of important acoustic details, making it better suited to the practical needs of this user group. Future work will focus on addressing these limitations to further enhance the model's robustness and adaptability under diverse environmental conditions.
Author Contributions
Conceptualization, Z.F.; methodology, Z.F.; software and optimization, Z.F.; validation, Z.G.; formal analysis, Z.F. and Z.G.; investigation, Z.F. and Z.G.; data curation, Z.F. and Y.L.; writing—original draft preparation, Z.F.; writing—review and editing, Z.F. and Y.L.; model architecture illustration, Z.F. and Y.L.; supervision, J.K.; project administration, J.K.; funding acquisition, J.K. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
The data presented in this study are available in the article.
Conflicts of Interest
The authors declare no conflicts of interest.
References
- Pandey, A.; Tan, K.; Xu, B. A simple RNN model for lightweight, low-compute and low-latency multichannel speech enhancement in the time domain. In Proceedings of the INTERSPEECH, Dublin, Ireland, 20–24 August 2023; pp. 2478–2482. [Google Scholar]
- Schroter, H.; Escalante-B, A.N.; Rosenkranz, T.; Maier, A. DeepFilterNet: A low complexity speech enhancement framework for full-band audio based on deep filtering. In Proceedings of the ICASSP 2022–IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 20–27 May 2022; pp. 7407–7411. [Google Scholar]
- Li, L.; Lu, Z.; Watzel, T.; Kürzinger, L.; Rigoll, G. Light-weight self-attention augmented generative adversarial networks for speech enhancement. Electronics 2021, 10, 1586. [Google Scholar] [CrossRef]
- Rong, X.; Sun, T.; Zhang, X.; Hu, Y.; Zhu, C.; Lu, J. GTCRN: A speech enhancement model requiring ultralow computational resources. In Proceedings of the ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 971–975. [Google Scholar]
- Yang, L.; Liu, W.; Meng, R.; Lee, G.; Baek, S.; Moon, H.G. FSPEN: An ultra-lightweight network for real time speech enhancement. In Proceedings of the ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 10671–10675. [Google Scholar]
- Yan, H.; Zhang, J.; Fan, C.; Zhou, Y.; Liu, P. LiSenNet: Lightweight sub-band and dual-path modeling for real-time speech enhancement. In Proceedings of the ICASSP 2025—2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, 6–11 April 2025; pp. 1–5. [Google Scholar]
- Ryan, J.; Coenen, I.; Brennan, R. The evolution of system on chip integrated circuits for hearing-aid signal processing. In Proceedings of the ICASSP 2025—2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, 6–11 April 2025; pp. 1–5. [Google Scholar]
- Gerlach, L.; Payá-Vayá, G.; Blume, H. A survey on application specific processor architectures for digital hearing aids. J. Signal Process. Syst. 2022, 94, 1293–1308. [Google Scholar] [CrossRef]
- Cadence. HiFi 3z DSP. 2025. Available online: https://www.cadence.com/en_US/home/tools/silicon-solutions/compute-ip/hifi-dsps/hifi-3z.html (accessed on 13 June 2025).
- Cadence. HiFi 4 DSP. 2025. Available online: https://www.cadence.com/en_US/home/tools/silicon-solutions/compute-ip/hifi-dsps/hifi-4.html (accessed on 13 June 2025).
- XMOS. xCORE-200 Multicore Microcontroller. Available online: https://www.xmos.com/xcore-200 (accessed on 13 June 2025).
- Skillman, A.; Edso, T. A technical overview of Cortex-M55 and Ethos-U55: Arm’s most capable processors for endpoint AI. In Proceedings of the 2020 IEEE Hot Chips 32 Symposium (HCS), Palo Alto, CA, USA, 16–18 August 2020; IEEE Computer Society: Los Alamitos, CA, USA, 2020; pp. 1–20. [Google Scholar]
- Swamy, K.A.; Alex, Z.C.; Ramachandran, P.; Mathew, T.L.; Sushma, C.; Padmaja, N. Real-time Implementation of Delay Efficient DCT Based Hearing Aid Algorithm Using TMS320C5505 DSP Processor. In Proceedings of the 2021 Innovations in Power and Advanced Computing Technologies (i-PACT), Kuala Lumpur, Malaysia, 27–29 November 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1–8. [Google Scholar]
- Karrenbauer, J.; Klein, S.; Schönewald, S.; Gerlach, L.; Blawat, M.; Benndorf, J.; Blume, H. Smartheap—A high-level programmable, low power, and mixed-signal hearing aid SoC in 22 nm FD-SOI. In Proceedings of the ESSCIRC 2022—IEEE 48th European Solid State Circuits Conference (ESSCIRC), Milan, Italy, 19–22 September 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 265–268. [Google Scholar]
- Li, J.; Wang, P.; Li, J.; Wang, X.; Zhang, Y. DPATD: Dual-Phase Audio Transformer for Denoising. In Proceedings of the 2023 Third International Conference on Digital Data Processing (DDP), Luton, UK, 27–29 November 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 36–41. [Google Scholar]
- Fu, S.W.; Liao, C.F.; Tsao, Y.; Lin, S.D. MetricGAN: Generative adversarial networks based black-box metric scores optimization for speech enhancement. In International Conference on Machine Learning; PMLR: Berkeley, CA, USA, 2019; pp. 2031–2041. [Google Scholar]
- Cao, R.; Abdulatif, S.; Yang, B. CMGAN: Conformer-based metric GAN for speech enhancement. arXiv 2022, arXiv:2203.15149. [Google Scholar]
- Valin, J.M. A hybrid DSP/deep learning approach to real-time full-band speech enhancement. In Proceedings of the 2018 IEEE 20th International Workshop on Multimedia Signal Processing (MMSP), Vancouver, BC, Canada, 29–31 August 2018; pp. 1–5. [Google Scholar]
- Hu, Y.; Liu, Y.; Lv, S.; Xing, M.; Zhang, S.; Fu, Y.; Wu, J.; Zhang, B.; Xie, L. DCCRN: Deep complex convolution recurrent network for phase-aware speech enhancement. arXiv 2020, arXiv:2008.00264. [Google Scholar] [CrossRef]
- Serbest, S.; Stojkovic, T.; Cernak, M.; Harper, A. DeepFilterGAN: A Full-band Real-time Speech Enhancement System with GAN-based Stochastic Regeneration. arXiv 2025, arXiv:2505.23515. [Google Scholar]
- Pandey, A.; Azcarreta, J. Ultra low-compute complex spectral masking for multichannel speech enhancement. In Proceedings of the ICASSP 2025—2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, 6–11 April 2025; pp. 1–5. [Google Scholar]
- Xu, H.; Wei, L.; Zhang, J.; Yang, J.; Wang, Y.; Gao, T.; Dai, L. A multi-scale feature aggregation based lightweight network for audio-visual speech enhancement. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar]
- Li, M.; Zheng, Y.; Li, D.; Wu, Y.; Wang, Y.; Fei, H. MS-SENet: Enhancing speech emotion recognition through multi-scale feature fusion with squeeze-and-excitation blocks. In Proceedings of the ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; pp. 12271–12275. [Google Scholar]
- Xia, W.; Koishida, K. Sound Event Detection in Multichannel Audio Using Convolutional Time-Frequency-Channel Squeeze and Excitation. In Proceedings of the Interspeech 2019, Graz, Austria, 15–19 September 2019; pp. 3629–3633. [Google Scholar]
- Zeng, C.; Zhao, Y.; Wang, Z.; Li, K.; Wan, X.; Liu, M. Squeeze-and-Excitation Self-Attention Mechanism Enhanced Digital Audio Source Recognition Based on Transfer Learning. Circuits, Syst. Signal Process. 2024, 44, 480–512. [Google Scholar] [CrossRef]
- Ma, X.; Zhai, K.; Luo, N.; Zhao, Y.; Wang, G. Gearbox Fault Diagnosis Under Noise and Variable Operating Conditions Using Multiscale Depthwise Separable Convolution and Bidirectional Gated Recurrent Unit with a Squeeze-and-Excitation Attention Mechanism. Sensors 2025, 25, 2978. [Google Scholar] [CrossRef] [PubMed]
- Zhang, Y.; Zou, H.; Zhu, J. Sub-PNWR: Speech Enhancement Based on Signal Sub-Band Splitting and Pseudo Noisy Waveform Reconstruction Loss. In Proceedings of the Interspeech, Kos Island, Greece, 1–5 September 2024; pp. 657–661. [Google Scholar]
- Lin, Z.; Wang, J.; Li, R.; Shen, F.; Xuan, X. PrimeK-Net: Multi-Scale Spectral Learning via Group Prime-Kernel Convolutional Neural Networks for Single Channel Speech Enhancement. In Proceedings of the ICASSP 2025—2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, 6–11 April 2025; IEEE: Seoul, Republic of Korea, 2025; pp. 1–5. [Google Scholar]
- Nikzad, M.; Nicolson, A.; Gao, Y.; Zhou, J.; Paliwal, K.K.; Shang, F. Deep Residual-Dense Lattice Network for Speech Enhancement. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 8552–8559. [Google Scholar]
- Martin-Donas, J.M.; Gomez, A.M.; Gonzalez, J.A.; Peinado, A.M. A Deep Learning Loss Function Based on the Perceptual Evaluation of the Speech Quality. IEEE Signal Process. Lett. 2018, 25, 1680–1684. [Google Scholar] [CrossRef]
- Weixuezhineng. Simple Hearing Aid. LCEDA. 12 May 2022. Available online: https://oshwhub.com/mozixi/jian-yi-zhu-ting-qi-kai-yuan (accessed on 13 June 2025).
- Guo, Z.; Kavuri, S.; Lee, J.; Lee, M. IDS-Extract: Downsizing Deep Learning Model For Question and Answering. In Proceedings of the 2023 International Conference on Electronics, Information, and Communication (ICEIC), Singapore, 5–8 February 2023; pp. 1–5. [Google Scholar]
- Nogales, A.; Caracuel-Cayuela, J.; García-Tejedor, Á.J. Analyzing the influence of diverse background noises on voice transmission: A deep learning approach to noise suppression. Appl. Sci. 2024, 14, 740. [Google Scholar] [CrossRef]