1. Introduction
Cultural diversity and participation fairness are critical for achieving a democratic society. Minorities and marginalized populations often experience ethnic assimilation, racism, discrimination, and bullying [1]. This situation can be attributed to communication barriers, in that mainstream populations cannot understand the native languages of minorities. International and national crises often highlight labor market inequalities that disproportionately affect marginalized individuals [2]. These failures of minority social participation are caused by acts of linguistic microaggression against linguistically marginalized populations [3]. Thus, the first critical issue for achieving cultural diversity and participation fairness in society is that minority languages need to be understood. In this regard, effective AI-based machine translation (MT) can assist communication with ethnic groups who speak different languages.
Millions of people use mobile applications and online translation services to communicate across languages. With the development of artificial intelligence technology, machine translation plays an increasingly crucial role in global communication. Machine translation is becoming increasingly important for minority languages to enter mainstream society, and a growing body of research is focused on this subject [4,5,6,7]. Thus, an effective AI-based methodology is necessary to develop a high-quality translation system for low-resource minority languages to ensure a culturally and ethnically diverse society.
Hybrid methodologies that combine rule-based approaches with neural machine translation have demonstrated significant potential in improving translation quality for under-resourced language pairs [8]. This research underscores the critical role of integrating artificial intelligence with human intelligence to enhance translation efficiency, particularly for minority languages. Proactive learning frameworks have been identified as effective tools for constructing MT systems tailored to minority languages with limited resources, e.g., a lack of time and budget to collect enough lexicons to cover the entire range of target translations and their corresponding grammar rules [9]. Effective machine translation for minority languages is critical; for effective communication and the eradication of language barriers on a global scale, all languages, including more than 200 minority languages, should be understood and not be left in the hands of ineffective translation [10]. Despite the advances in human-centered machine translation for low-resource languages—e.g., effective datasets and computational models for AI-based MT [10]—considerable challenges persist in safeguarding linguistic diversity and meeting the distinctive needs of minority language speakers [11,12]. Recent innovations, such as integrating neural translation engines with rule-based systems, have yielded encouraging results in enhancing translation accuracy for endangered languages, including Lemko [13]. These developments in MT technologies hold substantial promise for supporting language revitalization initiatives and empowering minority language communities [14]. This study introduces a hybrid AI-driven machine translation system that combines phrase-based and neural machine translation techniques to enhance the quality of Hakka-to-Chinese translation, where Hakka is a minority language and Chinese is the mainstream language spoken in Taiwan.
The proposed system addresses the limitations of purely statistical or neural methods by integrating phrase-based MT with neural approaches, particularly in low-resource settings. The key to improving translation quality is adopting a recursive translation approach, dynamically generating parallel corpora and leveraging them for continuous improvement of the deep learning model. This methodology enhances immediate translation services for Hakka speakers and contributes to the broader goal of preserving and revitalizing this language through AI-driven solutions.
This hybrid machine translation system primarily aims to establish a robust and scalable platform for Hakka language translation, addressing the challenges posed by its low-resource nature. The system is designed to facilitate effective communication in Hakka while preserving its linguistic and cultural heritage. To achieve this, the translation model incorporates Hakka culture-specific items (CSIs), including idiomatic expressions, proverbs, master words, ancient dialects, and emerging linguistic trends.
There has been an increase in research addressing the challenge of producing applicable translation models in the absence of translated training data [15]. Machine translation of dialects mainly faces the problem of scarce parallel data for language pairs. Another issue is that existing data often contain substantial noise or come from highly specialized domains.
Previous research has implemented machine translation systems using convolutional neural networks (CNNs) with attention mechanisms to translate Mandarin into Sixian-accented Hakka [16]. These studies have addressed dialectal variations by separately training exclusive models for Northern and Southern Sixian accents and analyzing corpus differences. Given the limited availability of Hakka corpora, previous systems have faced challenges with unseen words frequently occurring during real-world translation. To mitigate this, past research has employed forced segmentation for Hakka idioms and common Mandarin word substitutions to improve translation intelligibility, leading to promising results even with small datasets. These systems have been proposed for applications in Hakka language education and as front-end processors for Mandarin–Hakka code-switching speech synthesis [16]. In contrast to previous research using CNNs with attention mechanisms, this study introduces a hybrid AI-driven machine translation system that combines phrase-based and neural machine translation techniques to enhance the quality of Hakka-to-Chinese translation.
Traditional Transformer-based neural machine translation does not employ recurrent neural networks (RNNs) for natural language processing (NLP) tasks such as machine translation; it relies mainly on self-attention mechanisms, and the Transformer comes very close to human-level learning performance. However, it does not address the critical issue of translating minority languages by machine, where the language being translated is a low-resource language with sparse parallel corpora, limited linguistic annotations, and a lack of robust language models for neural machine translation. Thus, this study proposes a hybrid machine translation approach integrating RNNs to resolve these shortcomings, including the lack of parallel data for language pairs, significant noise in the data or highly specialized domains, unseen words frequently occurring during real-world translation, and the neglect of RNNs as a means of enhancing translation quality.
The remainder of this paper is structured as follows.
Section 2 is a literature review of low-resource languages, machine translation systems, and the Hakka corpus used for language translation.
Section 3 introduces hybrid machine translation, including the objectives of the system, preprocessing, the system’s development process and architecture, neural machine translation (NMT), phrase-based machine translation (PBMT), and hybrid AI-driven translation system development.
Section 4 is the system evaluation, reporting the performance of phrase-based machine translation and describing the hybrid artificial intelligence model.
Section 5 is the conclusion, addressing theoretical implications, practical implications, research limitations, and directions for future research.
3. Hybrid Machine Translation
3.1. Objectives of the Hybrid Machine Translation System
This system supports multilingual translations, enabling seamless conversion between Hakka, Chinese, English, and other Asian languages commonly spoken by Hakka diaspora communities, such as Japanese, Thai, Indonesian, and Malay. A prototype system was developed to evaluate the proposed hybrid approach. This system primarily focuses on literal Chinese-to-Hakka translation, ensuring fidelity in linguistic representation. The system dynamically adapts to different contextual interpretations, allowing for nuanced translations that accurately reflect Hakka cultural elements.
To enhance cultural representation in translations, the system integrates phrase-based machine translation, which incorporates an expanded translation lexicon featuring Hakka culture-specific terms. This approach ensures that key cultural expressions are explicitly retained in translations rather than omitted or misrepresented by conventional deep learning models [34]. Unlike traditional “black-box” deep learning-based translation models, which often lack transparency in handling culture-specific vocabulary, this system adopts a “white-box” approach, enabling direct control over translation references. By doing so, the system preserves the authenticity of Hakka language expressions while improving overall translation quality.
This hybrid AI model provides a structured methodology for improving machine translation in low-resource languages, balancing linguistic accuracy, cultural relevance, and computational efficiency. Future iterations of this system will continue refining translation accuracy through recursive learning, ensuring that the platform evolves alongside the expansion of Hakka language digital resources.
3.2. Preprocessing
Preprocessing is crucial in optimizing training datasets for neural machine translation, particularly in handling low-resource languages like Hakka. Raw textual data often contains long sentences, extraneous noise, and inconsistent formatting, negatively impacting model performance. To enhance training efficiency and maintain translation accuracy, this study establishes a detailed preprocessing pipeline to standardize, clean, and refine the dataset before input into the machine translation system. The preprocessing process was implemented using Python 3.10.12 within a Jupyter Notebook 3.6.3 environment. Text segmentation was performed using the Jieba tokenizer, followed by manual adjustment to accommodate Hakka-specific linguistic features. Regular expressions from Python’s re library were used to remove punctuation, annotations, and extraneous symbols. A maximum sentence length of 40 Chinese characters was imposed to reduce sentence complexity and improve alignment accuracy. Sentences exceeding this threshold were excluded. Additional cleaning rules included the removal of parentheses and brackets containing synonym explanations; filtering of ambiguous separators such as slashes (“/” or “//”); and elimination of blank lines, non-Chinese characters, and excessive whitespace. All data was normalized using UTF-8 encoding to ensure script consistency and prevent character corruption during processing. This standardized preprocessing pipeline improves the quality and consistency of the dataset, thereby enhancing the overall performance and reliability of both the phrase-based and neural machine translation components. To improve data quality, the preprocessing phase consists of the following five key steps:
Sentence Segmentation: Splitting training sentences based on punctuation marks to enhance structural clarity.
Punctuation Removal: Eliminating unnecessary punctuation that may introduce inconsistencies in translation.
Synonym Annotation Removal: Removing comments enclosed within “()” or “[]”, which often contain redundant synonym explanations.
Ambiguous Symbol Cleanup: Discarding “/” or “//” symbols used as separators for alternative expressions, reducing lexical redundancy.
Whitespace and Non-Chinese Character Filtering: Stripping blank lines, excessive spaces, and non-Chinese characters that do not contribute to model learning.
These preprocessing steps for raw textual data standardize sentence structures, eliminate noise, and enhance dataset consistency, thereby creating the parallel corpus and ultimately improving the accuracy and efficiency of the NMT model. Experimental results confirm that segmentation based on punctuation and the systematic removal of redundant symbols significantly enhance translation quality. The refined dataset ensures that both phrase-based and neural machine translation models can generate more accurate and culturally appropriate translations for Hakka.
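To make the pipeline concrete, the following minimal Python sketch illustrates how such cleaning rules might be scripted with the Jieba tokenizer and the re library mentioned above; the variable names, the exact regular expressions, and the sample sentence are illustrative assumptions rather than the project’s actual code.

```python
import re
import jieba  # tokenizer used in this study; Hakka-specific entries would be added separately

MAX_LEN = 40  # maximum sentence length in Chinese characters

def clean_sentence(line: str) -> str | None:
    """Apply the cleaning rules described above to one raw segment."""
    line = line.strip()
    # Remove synonym annotations enclosed in () or [] (full- and half-width forms).
    line = re.sub(r"[\(（\[【][^\)）\]】]*[\)）\]】]", "", line)
    # Discard "/" or "//" separators used for alternative expressions.
    line = re.sub(r"/+", "", line)
    # Remove punctuation, whitespace, and any other non-Chinese characters.
    line = re.sub(r"[^\u4e00-\u9fff]", "", line)
    # Drop blank lines and sentences exceeding the length threshold.
    if not line or len(line) > MAX_LEN:
        return None
    return line

def preprocess(raw_lines):
    """Yield cleaned, tokenized sentences ready for corpus construction."""
    for raw in raw_lines:
        # Split on sentence-final punctuation before cleaning each segment.
        for segment in re.split(r"[。！？；]", raw):
            cleaned = clean_sentence(segment)
            if cleaned:
                yield " ".join(jieba.cut(cleaned))

if __name__ == "__main__":
    sample = ["這是一個示範句子（同義詞說明）/另一種寫法。第二句！"]
    for sent in preprocess(sample):
        print(sent)
```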
Example of Preprocessing:
This example illustrates how synonyms, extraneous annotations (e.g., brackets, parentheses, and slashes), non-standard punctuation, and redundant symbols are removed in order to create a cleaner and more uniform dataset. Such standardization plays a crucial role in ensuring data quality and boosting the performance of the translation model during training.
3.3. System Development Process and Architecture
Figure 1 illustrates the proposed hybrid machine translation system’s architecture, highlighting the integration of PBMT, NMT, and recursive learning cycles for text translation. The following subsections introduce the functionality of PBMT and NMT in the proposed system and then present the hybrid AI-driven translation system framework.
3.4. Phrase-Based Machine Translation (PBMT)
Phrase-based machine translation is a good alternative when designing a dialect machine translation system with limited parallel corpus resources. The Hakka language is one of the eight major dialects of Chinese, all of which are written in Chinese characters. To fully present the cultural characteristics of Hakka, cultural feature words are manually sorted and added to the phrase lexicon. The translated text at this stage can better reflect Hakka cultural characteristics and improve translation quality. Thus, phrase-based machine translation with Hakka culture-specific items can achieve good translation quality.
In this approach, the original Hakka text is first divided into single words or phrases; statistics and a limited vocabulary and phrase comparison table are then used to select the most common translations for these words or phrases, which are recomposed into sentences according to Hakka grammar. The accuracy of this translation design is quite high, but several problems remain. This study adapted phrase-based machine translation to construct a Hakka–Chinese parallel corpus.
The problem with phrase-based machine translation is that it generates the most likely translations based on statistical principles, which are not always accurate; for example, polysemous words can be problematic. In addition, phrase-based machine translation mainly translates single words and phrases, and its ability to translate whole sentences is limited: when sentences are long, complex, ambiguous, or grammatically exceptional, it is prone to mistranslation. When the parallel corpus is small, however, the translation accuracy of phrase-based machine translation is acceptable compared with that of neural machine translation. Thus, when parallel corpus resources are limited, phrase-based machine translation can be used, although the translation quality is limited and high-quality results cannot be achieved. Conversely, when a rich parallel corpus is available, neural machine translation can produce high-quality translation results. Thus, this study also integrates NMT into the proposed system to enhance translation quality.
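As a simplified illustration of this dictionary-driven step, the sketch below performs greedy longest-match segmentation against a small phrase table and substitutes the most common target phrase for each match; the toy phrase table and the fall-back behavior for unmatched characters are hypothetical stand-ins for the actual Hakka–Chinese lexicon.

```python
# Minimal sketch of dictionary-driven, phrase-based substitution.
# The phrase table below is a hypothetical toy example, not the actual lexicon.
PHRASE_TABLE = {
    "天光日": "明天",   # Hakka "tomorrow" -> Mandarin
    "食飯": "吃飯",     # "to eat (a meal)"
    "恁仔細": "謝謝",   # "thank you"
}

def translate_pbmt(sentence: str, table: dict[str, str]) -> str:
    """Greedy longest-match segmentation followed by phrase substitution.

    Unmatched characters are passed through unchanged, which is workable here
    because Hakka and Mandarin share the Chinese script.
    """
    max_len = max(len(k) for k in table)
    output, i = [], 0
    while i < len(sentence):
        match = None
        # Try the longest possible phrase first.
        for length in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + length]
            if candidate in table:
                match = candidate
                break
        if match:
            output.append(table[match])
            i += len(match)
        else:
            output.append(sentence[i])  # fall back to copying the character
            i += 1
    return "".join(output)

print(translate_pbmt("天光日來食飯", PHRASE_TABLE))  # -> "明天來吃飯"
```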
3.5. Neural Machine Translation (NMT)
Neural machine translation usually relies on parallel corpora in two languages (the source language and the target language) to train models. For example, Chinese-to-English models use millions of parallel sentence pairs as training data. However, low-resource languages, such as minority languages and most dialects, lack sufficiently large parallel corpora for training. As open sources of parallel data have been exhausted, one avenue for improving low-resource NMT is to obtain more parallel data through web-crawling. However, even web-crawling may not yield enough text to train high-quality MT models. Nonetheless, monolingual texts will almost always be more plentiful than parallel texts, and leveraging monolingual data has been one of the most important and successful research areas in low-resource MT. A recent study showed how carefully targeted data gathering can lead to clear MT improvements in low-resource language pairs [35]. For successful translation models, data is the most important factor, and the curation and creation of datasets are key elements for future success [15].
3.6. Hybrid AI-Driven Translation System Development
By integrating PBMT and NMT, the proposed machine translation system can enhance translation accuracy for low-resource languages like Hakka. The system follows a structured three-stage development process, which is depicted in Figure 2 and outlined below.
Stage 1: Extracting the Parallel Corpus. We designed the five preprocessing procedures for raw textual data, including (1) Sentence Segmentation, (2) Punctuation Removal, (3) Synonym Annotation Removal, (4) Ambiguous Symbol Cleanup, and (5) Whitespace and Non-Chinese Character Filtering, as mentioned above, to assist in creating the parallel corpus and optimize the training datasets. A phrase-based machine translation (PBMT) approach is implemented using a limited Chinese–Hakka dictionary. This dictionary-driven translation converts a large volume of Hakka monolingual corpora into Chinese, thereby constructing a parallel corpus, which serves as a training dataset. In this stage, the PBMT procedure utilizes a curated Hakka–Mandarin dictionary comprising 9000 culture-specific lexical items, including idiomatic expressions, proverbs, and traditional vocabulary, primarily derived from the Hakka Proficiency Test word list (Sixian dialect), the Ministry of Education’s Hakka–Chinese dictionary, and a manually annotated set of culture-specific terms. These lexical items were used to translate 72,000 Hakka monolingual sentences into Mandarin through rule-based alignment, resulting in an initial parallel corpus. An additional 9000 sentence pairs were manually compiled and validated by native Hakka speakers with linguistic expertise, bringing the total corpus to 81,000 Hakka–Mandarin sentence pairs. Sentences exceeding 40 Chinese characters or containing ambiguous or redundant entries were excluded. The dataset was randomly split into 80% for training and 20% for evaluation. This process ensures the preservation of linguistic structures and culturally embedded expressions and establishes a reproducible foundation for model training and testing.
Table 1 summarizes the extracted items for the establishment of the parallel corpus.
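The short sketch below shows how the Stage 1 filtering and splitting might be scripted, assuming a plain-text file of tab-separated Hakka–Mandarin sentence pairs produced by the dictionary-driven step; the file names and field layout are illustrative assumptions, while the 40-character limit and the 80/20 random split follow the description above.

```python
import random

MAX_LEN = 40
random.seed(42)  # fixed seed for a reproducible split

pairs = []
with open("hakka_mandarin_pairs.tsv", encoding="utf-8") as f:  # hypothetical file
    for line in f:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2:
            continue  # skip malformed or ambiguous entries
        hakka, mandarin = fields
        # Exclude sentences exceeding 40 Chinese characters on either side.
        if len(hakka) > MAX_LEN or len(mandarin) > MAX_LEN:
            continue
        pairs.append((hakka, mandarin))

random.shuffle(pairs)
split = int(len(pairs) * 0.8)  # 80% training, 20% evaluation
train_pairs, eval_pairs = pairs[:split], pairs[split:]

for name, subset in (("train", train_pairs), ("eval", eval_pairs)):
    with open(f"src-{name}.txt", "w", encoding="utf-8") as src, \
         open(f"tgt-{name}.txt", "w", encoding="utf-8") as tgt:
        for hakka, mandarin in subset:
            src.write(hakka + "\n")
            tgt.write(mandarin + "\n")
```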
Stage 2: Training the NMT Model. In this stage, the parallel corpus obtained from Stage 1 is utilized to train a neural machine translation (NMT) model, integrating the phrase-based and deep learning approaches to improve translation accuracy. This allows the system to enhance contextual understanding and sentence-level coherence. By leveraging both methods, the proposed hybrid system overcomes the limitations of purely statistical or neural-based models, particularly for low-resource languages.
Stage 3: Recursive Translation. The final stage involves recursive translation. New parallel data is iteratively added to retrain the NMT model, continuously refining translation accuracy by expanding the dataset and improving its ability to process Hakka language patterns, idiomatic expressions, and syntactic structures.
The above hybrid approach effectively combines the rule-based advantages of PBMT with the deep learning capabilities of NMT. Its recursive learning mechanism, designed for the translation of low-resource languages, ensures the ongoing enhancement of translation quality. This methodology enhances immediate translation services for Hakka speakers while contributing to the broader goal of preserving and revitalizing their language through AI-driven solutions.
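The three-stage loop can be summarized in the following high-level Python sketch, which shows how PBMT-generated pairs could seed NMT training and how newly translated monolingual text could be folded back into the corpus at each cycle; the four callables passed in are hypothetical placeholders for the PBMT module, the NMT training run, batch inference, and BLEU scoring, not actual APIs of this system.

```python
def recursive_training(monolingual_hakka, dictionary,
                       build_pbmt_pairs, train_nmt, translate, evaluate_bleu,
                       num_cycles=3):
    """Iteratively grow the parallel corpus and retrain the NMT model.

    build_pbmt_pairs, train_nmt, translate, and evaluate_bleu are hypothetical
    placeholders standing in for the PBMT module, the NMT training run,
    batch inference with the current model, and BLEU scoring on a held-out set.
    """
    # Stage 1: dictionary-driven PBMT builds the initial parallel corpus.
    parallel = build_pbmt_pairs(monolingual_hakka, dictionary)
    best_bleu, model = 0.0, None

    for _ in range(num_cycles):
        # Stage 2: train (or retrain) the NMT model on the current corpus.
        model = train_nmt(parallel)

        # Stage 3: translate monolingual text with the new model and fold
        # the resulting synthetic pairs back into the training corpus.
        parallel = parallel + [(s, translate(model, s)) for s in monolingual_hakka]

        bleu = evaluate_bleu(model)
        if bleu <= best_bleu:
            break  # stop once the held-out BLEU score no longer improves
        best_bleu = bleu

    return model
```

Passing the four components in as callables keeps the loop itself independent of any particular PBMT or NMT implementation, which mirrors the modular design of the proposed system.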
4. System Evaluation
4.1. Experimental Setup
The Hakka language in Taiwan has six different accents, namely, the Sixian, South Sixian (Nansixian), Hailu, Dapu, Raoping, and Zhao’an dialects. This study collects a parallel corpus of the Hakka language in the form of sentences. The vocabulary from the Certificate of Hakka Proficiency Test is taken only from the Sixian accent, with about 5000 entries; the Ministry of Education dictionary, by contrast, has about 20,000 entries with more thoroughly curated examples. There are about 32,000 entries in the parallel corpus in this experiment.
The external Chinese–Hakka parallel corpus includes the Hakka dictionary, the Hakka entries on the website of the Ministry of Education, and the culture-specific items compiled in this study. These parallel corpora are used in our phrase-based machine translation system, which utilizes statistics and the collected vocabulary of both languages to determine the most common translations for the words and phrases involved. These are then reorganized into sentences according to the appropriate grammar, with the Chinese words translated into Hakka words and sentences.
To measure the effectiveness of our machine translation, we adopted the BLEU (Bilingual Evaluation Understudy) metric as the primary evaluation standard. BLEU is one of the most widely accepted and reproducible automated metrics in machine translation research. It can assess the closeness of a machine-generated translation to a professional human translation by comparing overlapping n-grams, making it particularly suitable for large-scale, sentence-level translation quality assessment. Quality is measured by the correspondence between a machine’s output and that of a human: “the closer a machine translation is to a professional human translation, the better it is” [36]. This metric’s computational efficiency and objectivity also allow for consistent evaluation across experimental conditions, especially when applied to low-resource language scenarios where extensive human evaluations may be infeasible in early-stage system development. Given the exploratory nature of this study and its aim to benchmark translation quality across different system configurations (PBMT vs. hybrid models), BLEU provides a sufficient and reliable basis for performance comparison. In the experimental setup, 500 out-of-sample entries were extracted from the Ministry of Education’s Hakka dictionary as the evaluation corpus. These entries were deliberately excluded from the training data to ensure objective performance measurement and generalization.
The BLEU score is computed using various parameters, including the Brevity Penalty (BP), which penalizes translations shorter than the reference translations to prevent artificially high scores; in our experiments, BP = 1.000, indicating that no brevity penalty was applied. The ratio represents the proportion of the hypothesis (machine-translated) text length to the reference (human-translated) text length, which is 1.062 in our case, meaning machine-generated translations were slightly longer. Hypothesis length (hyp_len) and reference length (ref_len) refer to the total number of words in the machine-translated and human-translated texts, which were 9174 and 8641, respectively. The slight difference in length suggests that the machine translation system tends to generate longer output sentences. The BLEU score for the phrase-based machine translation system was 47.52, demonstrating competitive translation quality despite the limited corpus. BP is 1.0 when the candidate translation is at least as long as the closest reference translation, where the closest reference sentence length is the best match length [36].
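For reference, the quantities reported above (score, brevity penalty, length ratio, hypothesis length, and reference length) can be reproduced with an off-the-shelf scorer; the short sketch below uses the sacreBLEU library as one possible choice, with the file names for the 500-sentence evaluation set given as illustrative assumptions.

```python
import sacrebleu  # pip install sacrebleu

# Hypothetical file names for the held-out evaluation set.
with open("eval.hyp.txt", encoding="utf-8") as f:
    hypotheses = [line.strip() for line in f]
with open("eval.ref.txt", encoding="utf-8") as f:
    references = [line.strip() for line in f]

# corpus_bleu expects a list of hypotheses and a list of reference streams;
# the "zh" tokenizer handles Chinese text at the character level.
result = sacrebleu.corpus_bleu(hypotheses, [references], tokenize="zh")

print(f"BLEU    = {result.score:.2f}")
print(f"BP      = {result.bp:.3f}")                       # brevity penalty
print(f"ratio   = {result.sys_len / result.ref_len:.3f}")  # hyp_len / ref_len
print(f"hyp_len = {result.sys_len}")
print(f"ref_len = {result.ref_len}")
```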
While BLEU is a widely used evaluation metric, it has limitations, particularly in capturing nuances and contextual meaning in low-resource language translation. Therefore, alternative metrics such as METEOR (Metric for Evaluation of Translation with Explicit ORdering) and TER (Translation Edit Rate) were also considered. METEOR accounts for synonymy, stemming, and paraphrasing, making it a comprehensive evaluation method [37]. In contrast, TER evaluates the number of edits required to convert machine-generated translations into human translations, providing insight into translation fluency and coherence [38]. Future studies may integrate human evaluation methods alongside automated metrics to obtain a holistic translation quality assessment.
4.2. Experiment
Translating dialects and minority languages has always been a significant problem in machine translation. There are several approaches to mitigating the issue of low-resource machine translation, including transfer learning, semi-supervised learning, and unsupervised learning techniques [39]. The current study adopted a hybrid artificial intelligence model, combining phrase-based and neural machine translation to translate Hakka language text into Chinese.
Table 2 lists the parameters of the model specification.
The machine translation models were trained for 40,000 epochs with a batch size of 64, using the Adam optimizer and an initial learning rate of 0.002. Early stopping was applied if the BLEU score on the validation set did not improve for five consecutive epochs. A validation split of 10% was used to monitor model performance during training. The training was conducted using an NVIDIA RTX 4090 GPU (24 GB VRAM) with 128 GB of RAM (NVIDIA, Santa Clara, CA, USA) on an Ubuntu 20.04 LTS system. The models were implemented using TensorFlow 2.0 and trained via the OpenNMT-tf framework. This configuration was chosen to balance computational efficiency with effective convergence, ensuring stable optimization for low-resource Hakka translation tasks.
The hybrid translation model for the experiments constructed in this study consists of three stages, as depicted and described in Figure 2, including (1) phrase-based machine translation (extracting the parallel corpus), (2) neural machine translation (training the NMT model), and (3) recursive translation. Once the Hakka text corpus is collected, the first stage of translation work is carried out through phrase-based machine translation to generate a parallel corpus of 81,000 sentence pairs. The generated parallel corpus is utilized as the training dataset in Stage 2 to complete the Transformer machine translation training. Further, recursive translation with an expanding parallel corpus improves the translation quality in Stage 3.
4.3. Discussion of Results
The changes observed in Figure 3 are primarily due to the impact of increasing training data size and refining the hybrid AI-driven translation system. Initially, PBMT demonstrated better performance when the dataset was small, effectively utilizing statistical alignments. However, as the dataset expanded and the NMT model was trained on a more comprehensive parallel corpus, its ability to generalize improved significantly. This resulted in higher BLEU scores in later stages. Additionally, improvements in encoder parameters, including deeper networks, increased hidden units, and additional attention heads, contributed to better contextual learning and enhanced translation accuracy. The recursive refinement of the hybrid model, where PBMT-generated translations were incorporated into NMT training, further reinforced learning, making translations progressively more accurate. The improvements in Figure 3 validate the effectiveness of the hybrid approach, demonstrating that while PBMT is advantageous for low-resource settings, its integration with deep learning models yields superior long-term translation performance.
As Table 2 notes, we implemented an encoder–decoder Transformer model consisting of six layers, with key hyperparameters set as follows: the model dimension (d_model) was configured to 256, the feedforward network dimension (d_ff) was set to 1024, and the number of attention heads (num_heads) was fixed at 2. Additionally, the model employed learned positional embeddings to effectively capture word order dependencies and incorporated weight sharing between the token embedding and output layers to improve efficiency and reduce the number of trainable parameters. The initial training results yielded a best BLEU score of 46.18.
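A model definition matching these hyperparameters can be written against the OpenNMT-tf Python API roughly as follows; this is a minimal sketch based on the library’s custom-model convention, with the embedding sizes and the handling of learned positional embeddings and weight sharing stated as assumptions rather than the exact configuration used here.

```python
import opennmt

# Sketch of a Transformer matching the hyperparameters in Table 2:
# 6 encoder/decoder layers, d_model = 256, d_ff = 1024, 2 attention heads.
# The learned positional embeddings and embedding/output weight sharing reported
# above are configured through additional constructor arguments in OpenNMT-tf
# and are omitted here for brevity.
model = opennmt.models.Transformer(
    source_inputter=opennmt.inputters.WordEmbedder(embedding_size=256),
    target_inputter=opennmt.inputters.WordEmbedder(embedding_size=256),
    num_layers=6,
    num_units=256,
    num_heads=2,
    ffn_inner_dim=1024,
    dropout=0.1,            # assumed regularization settings
    attention_dropout=0.1,
    ffn_dropout=0.1,
)

# Training would then be driven by an opennmt.Runner with a configuration
# specifying the data files, the Adam optimizer with a 0.002 learning rate,
# a batch size of 64, and BLEU-based early stopping, as described in Section 4.2.
```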
To address the challenge posed by the limited availability of parallel corpora for Hakka, we utilized a monolingual Hakka text corpus comprising 751,960 words provided by the Hakka Affairs Council. We employed a PBMT system to perform reverse translation to expand the available training data, converting the monolingual Hakka corpus into Mandarin. This process resulted in a Chinese–Hakka parallel corpus, which served as the training dataset for the final NMT model. We conducted additional training cycles to refine the Transformer model using this expanded parallel corpus. The final model achieved an optimal BLEU score of 51.63, reflecting a substantial improvement in translation quality.
The evaluation metrics for this result were as follows: BP = 1.000, indicating that there was no significant length penalty applied; ratio = 1.015, signifying that the machine-translated output was approximately 1.5% longer than the human reference translations; hypothesis length (hyp_len) = 8774, representing the total number of words in the model-generated translations; and reference length (ref_len) = 8641, corresponding to the length of the human-translated reference corpus. The increase in BLEU score from 46.18 to 51.63 suggests that the quality of Hakka translation was significantly enhanced through reverse translation, data augmentation, and iterative training refinements. These results demonstrate the effectiveness of integrating PBMT with NMT, allowing for the creation of a robust hybrid translation system. This approach improves translation accuracy and provides a viable solution for data scarcity in low-resource languages like Hakka.
5. Conclusions
5.1. Theoretical Implications
This study contributes to the theoretical development of low-resource language processing by proposing a hybrid artificial intelligence translation framework that combines PBMT with NMT within a recursive learning mechanism. This model advances current translation theory by demonstrating that rule-based and neural approaches are not mutually exclusive but can be synergistically integrated to overcome the limitations of sparse parallel corpora and dialectal diversity. Moreover, the explicit incorporation of culture-specific items into the translation pipeline highlights the necessity of culturally grounded modeling in multilingual NLP systems, offering a novel perspective on how linguistic form and cultural meaning can be preserved through computational means. The recursive expansion of training data through iterative translation–refinement further reinforces that machine learning architecture can be adapted to low-resource environments by leveraging linguistically aware strategies.
5.2. Practical Implications
From a practical standpoint, the hybrid framework developed in this research offers a replicable and modular solution for minority language translation, particularly for Hakka, which suffers from fragmented digital resources and dialectal complexity. The system’s architecture, including preprocessing protocols and recursive learning loops, provides a blueprint for institutions aiming to digitize, translate, or teach under-resourced languages. Its adaptability to multilingual inputs and capacity to retain cultural expressions make it highly suitable for applications in education, community media, cultural preservation, and government-supported language revitalization initiatives. This study thus demonstrates how AI-driven translation systems can serve as technological tools and as instruments of sociolinguistic inclusion, language equity, and digital citizenship for marginalized language communities.
5.3. Research Limitations
While the proposed hybrid machine translation system demonstrates promising performance, several limitations must be acknowledged. The training corpus used in this study focuses exclusively on Hakka’s Sixian dialect. As a result, the model’s applicability to other major Hakka dialects—such as Hailu, Dapu, and Raoping—remains untested, limiting its generalizability across dialectal variations. Furthermore, the current system is designed to operate at the sentence level and does not incorporate mechanisms for modeling discourse-level coherence, contextual dependencies, or semantic flow across multiple sentences. In terms of evaluation, this study relies primarily on automated metrics such as BLEU. While BLEU provides a valuable benchmark for translation quality, it does not fully capture linguistic nuances such as idiomatic expression, syntactic fluidity, or cultural appropriateness. The absence of human evaluations also restricts the interpretability of the system’s performance in practical settings, particularly in language learning or real-world communication scenarios involving minority language users. These limitations highlight important areas for future development and refinement.
5.4. Directions for Future Research
To build on the present findings, future studies should pursue several directions. Expanding the corpus to include diverse dialectal variants would enhance the system’s linguistic coverage and robustness, while incorporating ASR and TTS technologies would enable the development of multimodal, voice-based Hakka translation tools. Furthermore, integrating the translation system into practical applications—such as AI-powered chatbots, virtual tour guides, or language-learning platforms—would allow for the empirical testing of user experience, effectiveness, and usability. Finally, human-centered evaluation methods should be implemented to assess translation quality in terms of semantic fidelity, cultural relevance, and communicative function, including expert reviews and end-user feedback. Teaching the minority language to minority ethnic groups positively impacts their chances of long-term survival and fluency [40]. These research trajectories would not only extend the current model but also broaden its relevance to other endangered or low-resource languages in need of digital preservation and revitalization.