1. Introduction
Cantonese, as one of the major Chinese dialects, is widely spoken in Guangdong, Hong Kong, and Macau. It serves not only as the primary medium of daily communication in these regions but also as an essential carrier of cultural heritage. Owing to its unique phonological features and deep cultural background, Cantonese holds significant academic value in linguistic and speech technology research. In particular, in the field of automatic speech recognition (ASR) [
1], the nine lexical tones and rich phoneme inventory of Cantonese present diverse challenges and opportunities for speech model development.
However, achieving high accuracy in Cantonese speech recognition remains a significant challenge. Compared with Mandarin, the complexity of Cantonese syllables, phonemes, and tonal variations makes ASR systems perform less effectively, particularly under low-resource conditions. The scarcity of Cantonese speech data severely constrains the training of ASR systems; as a result, existing Cantonese ASR solutions often fail to obtain sufficient high-quality training data, leading to performance gaps compared with high-resource languages such as Mandarin. Although state-of-the-art models such as Whisper [
2] and Wav2Vec2 [
3] have achieved remarkable success on large-scale datasets in recent years, these models still struggle with accurate recognition of Cantonese syllables and tonal information.
Currently, research on Cantonese speech recognition primarily focuses on optimizing deep learning models, particularly the application of transfer learning to pretrained models such as Whisper and Wav2Vec2. By leveraging transfer learning, researchers can exploit the knowledge embedded in large-scale corpora to compensate for the scarcity of Cantonese data and improve recognition accuracy. However, despite the strong performance of these deep learning models on large-scale datasets [
4,
5], practical deployment—especially inference on edge devices—remains a considerable challenge. Edge devices typically have limited computational and storage resources, which makes deploying and running deep learning models, particularly ASR systems, extremely difficult. Traditional deep learning models often fail to achieve efficient, low-latency inference on edge platforms [
6,
7].
In addition, research on model quantization and deployment for edge devices remains relatively limited. Low-power edge computing devices, such as those in the Internet of Things (IoT), impose stringent constraints on computational and storage resources. How to deploy an efficient and accurate Cantonese ASR model in such resource-constrained environments remains an urgent challenge. Model quantization [
8,
9], as an effective technique for reducing computational complexity and storage requirements, has gradually become a research hotspot for optimizing edge deployment. Quantization can significantly lower the computational load and memory footprint of models, thereby enabling efficient inference for Cantonese ASR on resource-limited devices. However, due to the phonological complexity of Cantonese and the inherent variability of the data, achieving efficient quantization while preserving recognition accuracy remains a major open problem in current research.
In recent years, for ASR systems targeting low-resource languages, researchers have actively explored parameter-efficient tuning and quantization strategies based on pretrained models to improve deployment efficiency and recognition performance. Song et al. first proposed the LoRA-Whisper approach, which integrates low-rank adaptation (LoRA) modules into the Whisper model, effectively mitigating language interference and easing adaptation to new languages in multilingual recognition. Experiments covering eight languages reported relative improvements of 18.5% and 23.0% in multilingual recognition and language expansion tasks, respectively; however, this study did not specifically evaluate performance on low-resource languages such as Cantonese [
10]. On the other hand, Andreyev systematically assessed the quantization performance of Whisper under INT4/INT5/INT8 configurations. On the LibriSpeech dataset, INT8 quantization reduced inference latency by approximately 19% and model size by 45% while maintaining recognition accuracy. Although these findings provide useful references for general quantization strategies, their applicability to low-resource or Cantonese scenarios remains unverified [
11]. In addition, research on accent and speaker adaptation, such as MAS-LoRA and SAML, has introduced hybrid LoRA experts or speaker-adaptive mechanisms, achieving superior recognition performance on English accent and speaker datasets. However, these studies have primarily focused on English and have not addressed the challenges of low-resource languages like Cantonese [
12].
Overall, although Cantonese ASR has made notable progress driven by deep learning and transfer learning, particularly demonstrating strong performance on large-scale datasets, it still faces multiple challenges under low-resource conditions. Addressing the inference efficiency of Cantonese ASR models on edge devices and integrating techniques such as model quantization to enable efficient, low-power deployment remain critical and challenging issues in current research.
In summary, although the aforementioned studies have provided valuable technical insights into LoRA-based fine-tuning, multilingual adaptation, quantization optimization, and even mixture-of-experts architectures, no prior research has systematically validated INT8 quantization on LoRA-fine-tuned Whisper models for Cantonese, a representative low-resource and tone-rich language. Comprehensive evaluations of the relationships among tonal recognition accuracy, CER, and inference latency before and after quantization are still lacking, as are end-to-end experimental results under edge deployment conditions.
To address the aforementioned challenges, this paper proposes a cost-effective Cantonese ASR system based on LoRA-adapted Whisper, followed by INT8 quantization. The quantization process significantly reduces the model size and accelerates inference while demonstrating strong performance on edge devices, with improved energy efficiency and stable accuracy. CER evaluations confirm that quantization introduces minimal impact on tonal recognition, making the approach particularly suitable for tone-rich languages such as Cantonese.
2. Fundamental Principles
Currently, speech recognition has achieved remarkable progress on large-scale datasets; however, its deployment on edge devices still faces significant challenges. Edge devices typically have limited computational power and storage capacity, which restricts the direct application of complex models. Therefore, reducing model size and lowering computational requirements are key to enabling efficient edge deployment. By compressing model size, it is possible to not only substantially decrease storage demands but also accelerate inference, thereby improving the real-time performance and responsiveness of speech recognition systems on edge platforms.
2.1. Overview of the Whisper Model
Whisper is a multitask automatic speech recognition (ASR) system proposed by OpenAI, built on a Transformer-based architecture [
13]. The model is trained on a large-scale speech dataset and supports multilingual and cross-task learning, including speech recognition and translation. Its primary goal is to achieve highly robust speech recognition across languages and low-resource conditions through large-scale supervised multilingual training. Unlike traditional methods that rely heavily on language-specific labeled corpora, Whisper leverages 680,000 h of paired audio–text data in its pretraining phase, covering 96 languages and diverse noise environments. This extensive pretraining enables Whisper to exhibit strong generalization capability and robustness in real-world applications.
Whisper employs large-scale supervised training within an end-to-end architecture to directly map speech signals to text sequences, eliminating the separate design of acoustic models, language models, and pronunciation lexicons commonly used in traditional ASR systems, thereby simplifying the overall system structure. Unlike self-supervised approaches such as Wav2Vec2, which focus on learning audio representations [
14,
15], Whisper adopts a fully supervised paradigm that directly optimizes the mapping from speech to text. This enables the model to learn joint representations of linguistic semantics and acoustic features, resulting in stronger adaptability across multiple tasks.
The overall architecture of Whisper adopts a typical sequence-to-sequence (Seq2Seq) structure based on an encoder–decoder Transformer design [
16], which primarily consists of three components: a feature extraction module, a Transformer encoder, and a Transformer decoder. The feature extraction module first converts raw audio signals into 80-dimensional Mel spectrograms to capture spectral characteristics of speech. This representation is normalized along both time and frequency dimensions to enhance robustness under varying speech conditions. The encoder takes the Mel spectrogram as input and applies multiple layers of multi-head self-attention mechanisms and feed-forward networks to generate context-aware, high-dimensional speech feature representations:
$$H = \mathrm{Encoder}(X), \quad X \in \mathbb{R}^{T \times 80}, \; H \in \mathbb{R}^{T \times d}$$
where $X$ denotes the input Mel spectrogram, $T$ represents the number of frames, and $d$ indicates the hidden dimension. The decoder adopts an autoregressive generation mechanism, predicting the next subword token based on previously generated tokens and the encoder output through a cross-attention mechanism:
$$P(y_t \mid y_{<t}, H) = \mathrm{Decoder}(y_{<t}, H)$$
The final output is generated based on subword units, with special tokens introduced to control the task type (transcription/translation) and specify the target language. Whisper is trained entirely under a supervised learning paradigm using the cross-entropy loss function:
$$\mathcal{L}_{\mathrm{CE}} = -\sum_{t=1}^{L} \log P(y_t \mid y_{<t}, X)$$
where $y_t$ denotes the target token and $L$ represents the length of the output sequence. Unlike self-supervised approaches, Whisper does not rely on masked prediction; instead, it performs direct sequence-to-sequence learning on paired speech–text data to ensure task consistency. The training corpus covers multiple languages and multiple tasks (ASR and speech translation) and includes a wide range of noisy scenarios, enabling the model to maintain robustness and transferability across multilingual and multi-domain tasks.
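To make this pipeline concrete, the following minimal sketch transcribes a single 16 kHz clip with a pretrained Whisper checkpoint via the Hugging Face transformers library; the checkpoint name and the language/task settings are illustrative assumptions rather than the exact configuration used in this work.

```python
# Minimal sketch (illustrative, not this work's exact setup): one-utterance transcription
# with a pretrained Whisper checkpoint from Hugging Face transformers.
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-small")  # assumed checkpoint
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small").eval()

def transcribe(audio):
    """audio: 1-D float waveform sampled at 16 kHz (e.g., loaded with soundfile/librosa)."""
    # Feature extraction: waveform -> 80-bin log-Mel spectrogram
    inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        # Autoregressive decoding; Cantonese is requested here via the Chinese language tag,
        # which is how most public Whisper checkpoints cover it
        ids = model.generate(inputs.input_features, language="zh", task="transcribe")
    return processor.batch_decode(ids, skip_special_tokens=True)[0]
```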
2.2. LoRA Method
With the widespread adoption of Transformer architectures in speech recognition, the number of model parameters has continued to increase, leading to substantial computational and storage demands during training and inference. For low-resource environments or edge devices, full fine-tuning not only incurs high computational costs but also requires storing multiple copies of the model, making deployment in practical scenarios challenging. To address this issue, LoRA (Low-Rank Adaptation) offers a parameter-efficient fine-tuning (PEFT) strategy that introduces only a small number of trainable parameters while maintaining model performance, enabling task-specific adaptation [
17,
18].
The core idea of LoRA is to freeze the pretrained model weights while introducing a low-rank decomposition as a trainable update to the weight matrices within the Transformer layers, thereby reducing the number of trainable parameters. Taking the linear transformation in the Transformer’s self-attention mechanism as an example, assume that the weight matrix in the pretrained model is given by:
$$W \in \mathbb{R}^{d \times k}$$
In standard fine-tuning, the entire $W$ needs to be updated, requiring $d \times k$ parameters. LoRA, however, decomposes the update term $\Delta W$ into the product of two low-rank matrices:
$$\Delta W = BA, \quad B \in \mathbb{R}^{d \times r}, \; A \in \mathbb{R}^{r \times k}, \; r \ll \min(d, k)$$
Finally, the transformed weight can be expressed as:
$$W' = W + \Delta W = W + BA$$
where $W$ denotes the frozen pretrained weights, $A$ and $B$ represent the trainable low-rank matrices, and the rank $r$ is typically set to 1–5% of the original dimension.
LoRA is typically applied to the attention projection layers and the linear transformation layers of the feed-forward network (FFN) in Transformer architectures. Its implementation mechanism involves two steps: first, freezing the pretrained weights to keep the original model parameters unchanged, which helps prevent overfitting and reduces storage overhead; second, introducing a low-rank module by inserting trainable low-rank matrices $A$ and $B$ into the specified linear layers to enable learnable updates. Finally, to stabilize training, LoRA introduces a scaling factor $\alpha / r$ to control the magnitude of the low-rank update: for an input vector $x$, the linear transformation process in LoRA is formulated as:
$$h = Wx + \frac{\alpha}{r} BAx$$
where $Wx$ represents the frozen part, $\frac{\alpha}{r} BAx$ is the low-rank update, and the computational complexity of the update is reduced from $O(dk)$ to $O(r(d + k))$.
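The update rule above can be expressed compactly as a wrapper around a frozen linear layer. The following PyTorch sketch is a minimal illustration of $h = Wx + \frac{\alpha}{r}BAx$ and is not the exact implementation used in this work.

```python
# Minimal LoRA sketch: frozen base projection plus a trainable low-rank update.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():                 # freeze the pretrained weights W
            p.requires_grad = False
        d, k = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)  # A in R^{r x k}
        self.B = nn.Parameter(torch.zeros(d, r))         # B in R^{d x r}, zero-initialized
        self.scaling = alpha / r                         # scaling factor alpha / r

    def forward(self, x):
        # frozen path Wx plus the low-rank update (alpha / r) * BAx
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768))   # e.g., an attention projection of width 768
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # only A and B train
```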
2.3. Model Quantization
When deploying the LoRA-adapted Whisper model on edge devices, model size and computational complexity remain major bottlenecks. Although LoRA significantly reduces the number of trainable parameters, the entire set of pretrained model weights still needs to be loaded during inference, resulting in high storage overhead and inference latency. For low-power devices such as IoT terminals and mobile platforms, further reducing model size and improving inference speed are critical. Model quantization, as an effective model compression technique, addresses this issue by reducing the numerical precision of parameters, thereby lowering computational cost and memory usage with minimal impact on accuracy. This makes quantization a key enabler for efficient edge deployment.
Quantization reduces storage requirements and accelerates matrix multiplication by mapping high-precision floating-point weights (e.g., FP32/FP16) to low-precision integers (e.g., INT8). Specifically, given the original floating-point weight $W$, its quantized value $W_q$ is expressed as:
$$W_q = \mathrm{round}\!\left(\frac{W}{s}\right), \quad s = \frac{\max(|W|)}{q_{\max}}$$
where $s$ denotes the quantization scaling factor, and $q_{\max}$ represents the range of integer representation (e.g., 127 for INT8). Dequantization restores an approximate floating-point value as $\hat{W} = s \cdot W_q$. Through this scaling and reverse-scaling mechanism, quantization preserves the overall numerical distribution while reducing precision.
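As a concrete illustration of this mapping, the sketch below implements symmetric per-tensor INT8 quantization and dequantization in NumPy; production toolchains such as ONNX Runtime additionally handle per-channel scales, zero points, and overflow.

```python
# Symmetric per-tensor INT8 quantization sketch matching the equations above.
import numpy as np

def quantize_int8(w: np.ndarray):
    s = np.abs(w).max() / 127.0                             # s = max|W| / q_max
    w_q = np.clip(np.round(w / s), -127, 127).astype(np.int8)
    return w_q, s

def dequantize_int8(w_q: np.ndarray, s: float) -> np.ndarray:
    return w_q.astype(np.float32) * s                       # W_hat = s * W_q

w = np.random.randn(4, 4).astype(np.float32)
w_q, s = quantize_int8(w)
print(np.abs(w - dequantize_int8(w_q, s)).max())            # error bounded by ~s/2
```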
Quantization methods can be broadly categorized into three types. The first is static quantization, which calculates the activation distribution using a calibration dataset after training and generates quantization parameters offline. This approach is suitable for pre-deployment optimization. The second is dynamic quantization, which computes the activation scaling factors dynamically during inference, while weights are quantized at load time. This method requires no additional training or calibration and is ideal for rapid deployment of large models. The third is quantization-aware training (QAT), which simulates the effect of quantization during the training process. Although QAT provides the highest accuracy, it requires retraining and is therefore more costly.
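As an example of the second category, PyTorch exposes dynamic quantization as a one-call post-training transformation; the snippet below is illustrative only, since this work performs dynamic quantization through ONNX Runtime (Section 3.3).

```python
# Dynamic quantization example with PyTorch's built-in API: weights become INT8 at
# conversion time, activation scales are computed on the fly during inference.
import torch
import torch.nn as nn
from torch.ao.quantization import quantize_dynamic

model = nn.Sequential(nn.Linear(256, 512), nn.ReLU(), nn.Linear(512, 128)).eval()
quantized = quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8       # quantize only the Linear layers
)
x = torch.randn(1, 256)
print((model(x) - quantized(x)).abs().max())    # small deviation from the FP32 output
```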
The dynamic quantization mechanism applies separate strategies for weights and activations. For weight quantization, FP16 weights are converted to INT8 during model loading, and a fixed scaling factor $s_W$ is maintained. For activation quantization, scaling factors are computed dynamically during inference based on the input, which helps prevent quantization overflow.
For an input vector $x$ and its corresponding weight $W$, the linear transformation in quantized inference can be expressed as:
$$y \approx s_x \, s_W \, (x_q \cdot W_q)$$
where $x_q$ and $W_q$ are represented in INT8 format, and $s_x$ and $s_W$ denote the corresponding scaling factors.
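A minimal NumPy sketch of this quantized linear transform is given below; it uses int32 accumulation as typical INT8 kernels do, with the weight scale computed once (as it would be at load time) and the activation scale derived per input.

```python
# INT8 linear transform sketch: y ~= s_x * s_W * (x_q . W_q), with int32 accumulation.
import numpy as np

def _quant(t: np.ndarray):
    s = np.abs(t).max() / 127.0                  # symmetric per-tensor scale
    return np.clip(np.round(t / s), -127, 127).astype(np.int8), s

def int8_linear(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    x_q, s_x = _quant(x)                         # dynamic activation quantization
    w_q, s_w = _quant(w)                         # weight quantization (fixed at load time)
    acc = x_q.astype(np.int32) @ w_q.astype(np.int32).T   # integer matrix multiplication
    return s_x * s_w * acc.astype(np.float32)    # rescale back to floating point

x = np.random.randn(2, 8).astype(np.float32)
w = np.random.randn(4, 8).astype(np.float32)
print(np.abs(int8_linear(x, w) - x @ w.T).max()) # small approximation error
```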
2.4. Overall Model Architecture
In this study, we integrate Whisper, LoRA, and INT8 quantization to construct an efficient Cantonese ASR system. The overall architecture of the proposed model is illustrated in
Figure 1.
In this architecture, the Whisper model serves as the backbone, providing multilingual speech recognition capabilities and effectively handling syllables and tonal variations in Cantonese. LoRA fine-tuning is then applied to the Whisper model, significantly reducing the number of trainable parameters and thereby lowering both training and inference costs. Finally, the model is quantized to INT8, enabling efficient execution on edge devices while maintaining recognition performance.
3. Training Procedure
3.1. Dataset and Preprocessing
The primary training data in this study is sourced from the Cantonese subset of the Common Voice dataset and the MDCC corpus [
19]. Cantonese differs from Mandarin in terms of tonal and syllabic structure and contains a high degree of colloquial expressions in everyday communication. Therefore, selecting a Cantonese-specific corpus is essential for evaluating the robustness of the proposed method under real-world conditions characterized by multiple tones and variable pronunciations. All audio recordings were resampled to 16 kHz, and the text annotations were standardized in Traditional Chinese format.
To improve speech feature consistency and text normalization, the following processing steps were applied: (1) Deduplication and silence trimming: remove leading and trailing silence segments longer than 500 ms using the SoX silence function; (2) Mel-spectrogram extraction: generate an 80-bin Mel spectrogram with a window length of 25 ms and a frame shift of 10 ms, followed by dB normalization; (3) Text normalization: convert full-width characters to half-width, remove meaningless punctuation marks, and retain numbers and Cantonese particles.
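The sketch below illustrates these preprocessing steps; it uses librosa for resampling, trimming, and Mel extraction as an assumed stand-in for the SoX-based pipeline, and a simple regular-expression rule for text normalization.

```python
# Illustrative preprocessing sketch: 16 kHz resampling, silence trimming, 80-bin log-Mel
# features (25 ms window, 10 ms shift), and basic text normalization.
import re
import unicodedata
import librosa
import numpy as np

def preprocess_audio(path: str) -> np.ndarray:
    y, sr = librosa.load(path, sr=16000)                    # resample to 16 kHz
    y, _ = librosa.effects.trim(y, top_db=30)               # trim leading/trailing silence
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_mels=80,
        n_fft=int(0.025 * sr), hop_length=int(0.010 * sr),  # 25 ms window, 10 ms shift
    )
    return librosa.power_to_db(mel)                         # dB normalization

def normalize_text(text: str) -> str:
    text = unicodedata.normalize("NFKC", text)              # full-width -> half-width
    return re.sub(r"[^\w\s]", "", text)                     # drop punctuation; keep Han characters and digits
```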
3.2. Training Configuration
In the training process of the speech model, the choice of optimizer directly affects the convergence speed and stability of the model. Based on this, this study adopts the AdamW optimizer to update model parameters [
20]. AdamW combines the advantages of momentum and adaptive learning rate from the traditional Adam optimizer and introduces a weight decay mechanism, which enables more stable and efficient convergence on large-scale corpora and deep network structures. For hyperparameter configuration, the initial learning rate is set to $1 \times 10^{-3}$ to ensure a fast initial training speed while avoiding excessive oscillation.
For the $t$-th iteration, the parameter update rule of AdamW is as follows:
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \quad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$
$$\theta_t = \theta_{t-1} - \eta \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \theta_{t-1} \right), \quad \hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \; \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$
where $g_t$ denotes the gradient at step $t$, $m_t$ and $v_t$ are the first- and second-moment estimates, $\eta$ is the learning rate, and $\lambda$ is the weight-decay coefficient.
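A sketch of the corresponding optimizer setup is shown below; only the learning rate of $1 \times 10^{-3}$ comes from the configuration above, while the moment coefficients and weight-decay value are illustrative defaults.

```python
# AdamW setup sketch; `model` stands in for the LoRA-adapted Whisper model, so only
# parameters with requires_grad=True (the LoRA matrices) are passed to the optimizer.
import torch
import torch.nn as nn

model = nn.Linear(10, 10)                        # placeholder for the LoRA-adapted model
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad),
    lr=1e-3,                                     # initial learning rate from the text
    betas=(0.9, 0.999),                          # beta_1, beta_2 (assumed defaults)
    weight_decay=0.01,                           # decoupled weight decay lambda (assumed)
)
# Per iteration: loss.backward(); optimizer.step(); optimizer.zero_grad()
```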
In parameter-efficient fine-tuning, LoRA modules are inserted into the projection layers of the Transformer attention mechanism and the linear mapping layers of the feed-forward network to achieve low-rank parameter updates. In this study, LoRA is applied to the Query projection (q_proj), Value projection (v_proj), and the first linear transformation (fc1) in the feed-forward sublayer. The q_proj and v_proj layers directly affect the computation of attention weights, which is critical for capturing contextual relationships, while the fc1 layer determines the transformation capability of the feed-forward network, significantly impacting the expressive power of speech feature modeling.
The LoRA hyperparameters are configured as follows: rank r = 8, scaling factor α = 32, and dropout probability = 0.05. The rank r defines the dimension of the low-rank update matrices; r = 8 is chosen to balance performance and parameter size. The scaling factor α controls the magnitude of the low-rank update and is set to 32 to prevent overfitting and accelerate convergence. A dropout rate of 0.05 introduces stochastic regularization, further improving the generalization ability of the model.
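Using the Hugging Face peft library (assumed here for illustration), this configuration can be expressed as follows; the checkpoint name is a placeholder.

```python
# LoRA configuration sketch matching the stated hyperparameters: r = 8, alpha = 32,
# dropout = 0.05, applied to q_proj, v_proj, and fc1.
from transformers import WhisperForConditionalGeneration
from peft import LoraConfig, get_peft_model

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
lora_cfg = LoraConfig(
    r=8,                                          # rank of the low-rank update matrices
    lora_alpha=32,                                # scaling factor alpha
    lora_dropout=0.05,                            # dropout on the LoRA path
    target_modules=["q_proj", "v_proj", "fc1"],   # attention projections + first FFN layer
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()                # reports the small trainable fraction
```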
3.3. Dynamic Quantization Process
To further compress the model and reduce inference latency, this study employs INT8 dynamic quantization using ONNX Runtime without requiring retraining. First, the LoRA-fine-tuned Whisper model is exported to an ONNX graph while maintaining FP16 precision to balance numerical stability and quantization accuracy. Next, under the ONNX Runtime framework, both weights and activations are dynamically quantized by mapping floating-point representations (FP16) to low-precision integers (INT8). This approach adjusts scaling factors dynamically based on the input distribution, ensuring that quantization errors remain within a controllable range and maintaining model performance during inference. Finally, before quantization, LoRA weights are merged into the main weight matrices to avoid introducing additional operations during inference. Through this strategy, the final quantized model retains the same structure as the original model, enabling efficient loading and execution.
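In ONNX Runtime, this step reduces to a single call to the dynamic-quantization API, as sketched below; the file names are placeholders, and the LoRA weights are assumed to have been merged before export.

```python
# INT8 dynamic quantization of the exported (LoRA-merged) Whisper ONNX graph.
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input="whisper_lora_merged.onnx",   # exported graph with LoRA already merged
    model_output="whisper_lora_int8.onnx",    # quantized model for edge deployment
    weight_type=QuantType.QInt8,              # map weights to signed 8-bit integers
)
```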
For a comprehensive comparison with existing quantization schemes,
Table 1 provides a summary of representative methods with respect to training requirements, accuracy variation, latency optimization, and deployment complexity, while also highlighting the advantages and limitations of the dynamic quantization strategy employed in this study. The results indicate that dynamic quantization enables substantial latency reduction and model compression without necessitating retraining, thereby offering a practical and efficient solution for rapid edge deployment scenarios.
Supplementary Notes:
PTQ: Advantages—Fast deployment, simple implementation, and no need for retraining. Limitations—Sensitive to data distribution; may incur significant accuracy loss in tasks with high precision requirements.
QAT: Advantages—Provides the best post-quantization accuracy, often close to the original model. Limitations—Requires retraining, which is costly and complex to implement.
Dynamic Quantization: Advantages—Does not require retraining, offers good cross-platform compatibility, and achieves notable latency reduction. Limitations—Slightly less effective than QAT for activation quantization, and may be unstable under extremely high precision demands.
Dynamic Fixed-Point Quantization: Advantages—Hardware-friendly and effective in latency reduction. Limitations—Strongly hardware-dependent and requires adaptation to the target platform.
Activation Quantization: Advantages—Reduces memory usage with minimal impact on accuracy. Limitations—Limited effectiveness in latency optimization.
Weight Sharing Quantization: Advantages—Provides extremely high compression rates. Limitations—Causes substantial accuracy degradation, making it suitable only for tasks that are less sensitive to precision.
Learned Quantization: Advantages—Allows end-to-end optimization of quantization parameters, balancing accuracy and efficiency. Limitations—Involves a complex training process with high computational cost.
In this study, ONNX Runtime INT8 dynamic quantization was ultimately adopted, as it requires no retraining, maintains accuracy within a controllable range, and reduces the model size to 60 MB. These characteristics make it particularly suitable for scenarios that demand rapid deployment and efficient inference on resource-constrained edge devices.
3.4. Evaluation Metrics
The most commonly used evaluation metric in speech recognition tasks is error rate, which can be further categorized into phoneme error rate, character error rate (CER), word error rate (WER), and sentence error rate depending on the modeling unit. Considering that Cantonese exhibits a rich representation at the character level and the recognition units in this study are closer to character-based outputs, CER is chosen as the primary metric.
CER is calculated by comparing the recognized output with the reference text using edit distance, which includes substitution, insertion, and deletion operations. Its formula is:
$$\mathrm{CER} = \frac{S + D + I}{N} \times 100\%$$
where $S$, $D$, and $I$ denote the number of substitutions, deletions, and insertions, respectively, and $N$ represents the total number of characters in the reference text.
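A minimal character-level implementation of this metric is sketched below; in practice an established library such as jiwer can be used, and the example strings are purely illustrative.

```python
# CER sketch: Levenshtein distance over characters divided by the reference length.
def cer(reference: str, hypothesis: str) -> float:
    r, h = list(reference), list(hypothesis)
    # dp[i][j] = edit distance between the first i reference and first j hypothesis characters
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i                               # deletions
    for j in range(len(h) + 1):
        dp[0][j] = j                               # insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = dp[i - 1][j - 1] + (r[i - 1] != h[j - 1])     # substitution (or match)
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(r)][len(h)] / max(len(r), 1)

print(cer("廣東話語音識別", "廣東話語音辨識"))       # 2 edits over 7 reference characters
```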
5. Future Work
To further advance the real-world deployment of Cantonese ASR, future work will focus on four dimensions: data, model, deployment, and evaluation. First, we plan to expand the training corpus by crawling and aligning Cantonese subtitles from movies and TV series, Hong Kong stock market announcements, and Cantonese podcasts to build a more comprehensive dataset. The current LoRA configuration, including rank and insertion layers, was manually set and may not be optimal for different models or tasks; therefore, we will explore adaptive approaches such as AdaLoRA or Auto-LoRA, leveraging Fisher information or SVD to automatically allocate rank across layers. We also intend to investigate the applicability of GPTQ-INT4 and SmoothQuant in Transformer-based ASR scenarios. On the deployment side, we will migrate the model to TFLite with NNAPI and EdgeTPU support to validate real-time performance, while introducing continuous power sampling using powermetrics, Intel RAPL, and ARM PMU. The evaluation will report energy–accuracy trade-offs, including mJ/char–accuracy curves and per-energy-unit character correctness.
Through a progressive upgrade incorporating large-scale self-supervised learning, tone-aware optimization, adaptive LoRA, extreme quantization, and streaming inference, combined with comprehensive energy and latency evaluation, future versions aim to achieve CER ≤ 8% under real-world noisy Cantonese speech, while delivering RTF < 0.1 and mJ/char < 1 on mobile SoCs and IoT NPUs, enabling low-power real-time Cantonese ASR for applications such as smart homes, in-car voice assistants, and wearable devices.