Technical Note

Research on the Training and Application Methods of a Lightweight Agricultural Domain-Specific Large Language Model Supporting Mandarin Chinese and Uyghur

1 College of Information Engineering, Tarim University, Alaer 843300, China
2 Tarim Oasis Agriculture Key Laboratory of the Ministry of Education, Xinjiang Uyghur Autonomous Region, Alaer 843300, China
* Authors to whom correspondence should be addressed.
Appl. Sci. 2024, 14(13), 5764; https://doi.org/10.3390/app14135764
Submission received: 23 May 2024 / Revised: 23 June 2024 / Accepted: 24 June 2024 / Published: 1 July 2024

Abstract

In the field of Natural Language Processing (NLP), the lack of support for minority languages, especially Uyghur, the scarcity of Uyghur-language corpora in the agricultural domain, and the difficulty of making large language models lightweight remain prominent issues. This study proposes a method for constructing a bilingual (Uyghur and Chinese) lightweight specialized large language model for the agricultural domain. Using a mixed Uyghur-Chinese training approach, we extracted Chinese corpus text from agricultural-themed books in PDF format using OCR (Optical Character Recognition) technology, converted the Chinese text corpus into a Uyghur corpus using a rapid translation API, and constructed a bilingual mixed vocabulary. We applied the parameterized Transformer model algorithm to train the model for the agricultural domain in both Chinese and Uyghur. Furthermore, we introduced a context detection and fail-safe mechanism for the generated text. The constructed model supports bilingual reasoning in Uyghur and Chinese in the agricultural domain, with higher accuracy and a smaller size that requires less hardware. Our work addresses issues such as the scarcity of Uyghur corpora in the agricultural domain, mixed word segmentation and word vector modeling for widely used agricultural language in Uyghur, model lightweighting and deployment, and the fragmentation of non-relevant texts during knowledge extraction from small-scale corpora. The lightweight design of the model reduces hardware requirements, facilitating deployment in resource-constrained environments. This advancement promotes agricultural intelligence, aids the development of domain-specific and minority-language applications (such as agriculture and Uyghur), and contributes to rural revitalization.

1. Introduction

In the field of information acquisition and knowledge transfer, the rapid development of modern information technology has greatly promoted the popularization and updating of knowledge. The release of generative large language models such as ChatGPT [1], ChatGLM, and Llama3 is undoubtedly a significant breakthrough in this field. With their powerful natural language processing capabilities, they bring a brand-new experience of information acquisition and knowledge learning to people [2,3].
Generative large language models like ChatGPT, ChatGLM, Llama3, Wenxin Yiyan, and iFLYTEK Spark, with their exceptional natural language understanding and generation capabilities, are changing the way we interact with information [4]. Compared to traditional search engines, these models interact with users in natural language, providing more accurate and personalized information recommendations, making the information acquisition process more natural and efficient. However, training and running these models is not an easy task. They have extremely high requirements in terms of hardware, costs, and data. Not only do they require high-performance GPU servers and large amounts of memory to handle complex model structures and enormous parameter quantities, but they also require a significant amount of time and money to collect, clean, annotate, and store data. In specific fields such as agriculture, as well as for niche or specific language groups such as Uyghur, the difficulty of data acquisition due to professional and regional limitations further increases costs and limits the application effect of these models in these fields.
We propose a bilingual hybrid training model in the agricultural domain that does not rely on generative large language models such as ChatGPT, ChatGLM, and Llama3 as base models. This model uses Uyghur and Chinese knowledge in twelve agricultural areas, including agricultural development history, crop cultivation, animal husbandry, forestry, aquaculture, agricultural product processing, agricultural machinery, agricultural economy, agricultural environment, agricultural technology, agricultural policy, and agricultural and rural development, as the corpus. After adopting the Transformer architecture, optimizing the parameters with a mixed Uyghur-Chinese vocabulary of 240,000 entries, and training on a single NVIDIA GeForce RTX 4090 for 4 h, the agricultural domain-specific model achieved a training loss of less than 0.07. It has 78.52 million parameters and an overall model size of 1.5 GB. It has also reached a new technical level in knowledge extraction, especially in Uyghur knowledge extraction. In terms of hardware, it can be deployed and run normally on a graphics card with four cores and 4 GB of video memory, as shown in Figure 1, which depicts an example of the model’s application.
As shown in Table 1, these models generally possess strong cross-language capabilities, covering multiple languages from English and Chinese to French and German. However, they exhibit various limitations in terms of professional adaptability, hardware requirements, and model scale. Firstly, as general-purpose large language models, they may have limitations in deep understanding and generation in specific professional fields, failing to fully meet the precision requirements of professional scenarios such as agriculture, law, healthcare, and scientific research. This is due to their pursuit of broad applicability rather than deep specialization. Secondly, the high hardware threshold poses a significant challenge, especially for models like ChatGPT and Wenxin Yiyan. Their operation relies on a high-performance distributed GPU integrated computing environment, which not only requires a huge initial investment but also complicates operation and maintenance, limiting the possibility of widespread application. Furthermore [5,6,7,8], although models like Llama3 and ChatGLM can reduce hardware requirements through quantization, they still require relatively high configurations, which remain a significant burden for ordinary users or small businesses. In addition, the enormous scale of these models, such as ChatGPT with approximately 175 billion parameters and Wenxin Yiyan and iFLYTEK Spark with about 100 billion parameters, brings powerful language processing capabilities but also exacerbates issues related to resource consumption, model efficiency, sustainability, and environmental impact.
As shown in Table 1, the above models excel in supporting multiple languages, covering a wide range of linguistic groups. However, their common shortcoming lies in the lack of direct support for ethnic minority languages such as Uyghur, as illustrated in Figure 2. This deficiency not only restricts the communication and information access of users of these languages but also reflects the current lack of inclusiveness of large language models in terms of global linguistic diversity. The absence of support for minority languages like Uyghur not only hinders the goal of inclusive technology but also ignores the value of cultural diversity. This underscores the importance of strengthening the construction of datasets for minority languages and making adjustments to improve model adaptability in research.
In Section 1, we introduced the research object, including its background and application domains, highlighting the significance of modern information technology in knowledge dissemination and updating, and analyzed the limitations of existing large-scale language models in terms of support for specific domains and languages. In Section 2, we detailed the process of data collection and processing, encompassing the construction of a corpus through data collection, textual conversion, cleaning, traditional to simplified Chinese character conversion, and the development and accuracy assessment of a Uyghur language translation tool. In Section 3, we examined word segmentation and vocabulary construction, selecting SentencePiece as the core segmentation technology, and described the selection and implementation of segmentation methods, as well as the division and encoding of datasets. In Section 4, we proposed a model and introduced its architectural design, presenting a Transformer-based generative pre-training model and discussing the impact of model parameter quantization on hardware resource requirements. In Section 5, we studied model training, showcasing the training process, data processing and training flow, model evaluation, and generated examples, while also exploring the performance of our model compared to other models such as LSTM and TKAN. In Section 6, we investigated the application and deployment of the model, exploring the detection of context in generated text and a fuse mechanism, and proposing solutions for user interaction to address issues in model deployment. In Section 7, we compared our model with other methods and drew conclusions, emphasizing its advantages in terms of performance, lightweight design, bilingual integration, data processing and training innovations, as well as text generation with context detection and fuse mechanisms. This research not only addresses the issue of agricultural knowledge dissemination, but also promotes technological fairness, cultural diversity, and popularization, providing technical support for agricultural intelligence in ethnic languages and the implementation of rural revitalization strategies.

2. Corpus Collection and Preprocessing

2.1. Data Collection

The PDF book information was mainly downloaded through the book service purchased by the Tarim University Library and was divided into twelve categories (Agricultural Development History, Crop Cultivation, Animal Husbandry, Forestry, Aquaculture, Agricultural Product Processing, Agricultural Machinery, Agricultural Economics, Agricultural Environment, Agricultural Science and Technology, Agricultural Policy, and Agricultural and Rural Development), with a total of 102 books, as shown in Figure 3.

2.2. Textual Conversion and Data Cleaning

We use Python’s PIL library (Pillow, a fork of PIL, for image processing) to automatically extract images from PDF files and, in conjunction with OCR (Optical Character Recognition) technology, recognize and convert the text content within the images into editable text formats. This automates the processing and enhances the efficiency and accuracy of information extraction.

2.2.1. Loading PDF Documents and Page Extraction

Using the fitz library to open a PDF file and iterate through each page, you have the option to either export the page directly as an image or extract text. If the goal is to perform image recognition, you can convert the page to an image format such as PNG or JPEG. This process can be accomplished with the Page.getPixmap() method of fitz, followed by saving the image or passing it directly to the next step of processing.
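A minimal sketch of this step is shown below, assuming PyMuPDF (imported as fitz) and an illustrative input file name book.pdf; newer PyMuPDF releases name the rendering method get_pixmap().

```python
import fitz  # PyMuPDF

# Export every page of a PDF as a PNG image for later OCR.
# "book.pdf" and the output naming scheme are illustrative placeholders.
doc = fitz.open("book.pdf")
for page_number, page in enumerate(doc, start=1):
    pixmap = page.getPixmap()                 # render the page as a raster image
    pixmap.save(f"page_{page_number:03d}.png")  # or pass the pixmap on directly
doc.close()
```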

2.2.2. Image Preprocessing

Using PIL (Python Imaging Library, also known as Pillow) to preprocess the extracted images can help improve the accuracy of OCR recognition. This preprocessing includes resizing, converting to grayscale, and binarizing the image. For instance, you can use Image.open() to open the image, convert() function to convert it to grayscale, and apply thresholding for binarization to reduce noise interference.
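As an illustration of this preprocessing step, the sketch below converts a page image to grayscale and binarizes it with a fixed threshold using Pillow; the function name and the threshold value of 180 are our own choices.

```python
from PIL import Image

def preprocess_for_ocr(image_path: str, threshold: int = 180) -> Image.Image:
    """Grayscale conversion and simple threshold binarization before OCR."""
    img = Image.open(image_path)
    gray = img.convert("L")  # convert to grayscale
    # Map pixels above the threshold to white and the rest to black.
    binary = gray.point(lambda p: 255 if p > threshold else 0)
    return binary
```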

2.2.3. Text Recognition

Use pytesseract for text recognition in images. By passing the preprocessed image data to the pytesseract.image_to_string() function, you can obtain the recognized text content. To improve recognition accuracy, you can specify language parameters or customize recognition configurations.
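A minimal sketch of this recognition step with pytesseract is given below; the lang="chi_sim" parameter assumes the simplified-Chinese Tesseract language pack is installed.

```python
import pytesseract
from PIL import Image

def ocr_page(image_path: str) -> str:
    """Recognize text in a (preprocessed) page image and return it as a string."""
    image = Image.open(image_path)
    return pytesseract.image_to_string(image, lang="chi_sim")
```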

2.2.4. Result Handling and Output

The recognized text may require further post-processing, such as removing irrelevant characters and segmenting paragraphs, to improve readability and usability. Finally, the text is saved to a .txt document named after the book’s title.
To evaluate the accuracy of OCR (Optical Character Recognition) using the cnocr library for Chinese text in images, the process begins by cleaning the reference text through the clean_text function to ensure accurate comparison. Next, the cnocr library is utilized to perform OCR on each image and extract the recognized text. Then, the SequenceMatcher from the difflib library is used to calculate the similarity between the OCR-recognized text and the reference text, resulting in a loss rate that reflects the degree of difference between the recognition result and the true text. Subsequently, the loss rate for each page is stored in a list and printed out for analysis. Finally, the average loss rate across all pages is computed, and a chart depicting the change in loss rate with respect to page numbers is drawn using the matplotlib library, visually exhibiting the trends in OCR recognition performance.
Loss Calculation Formula:
$\text{Loss Rate} = \left(1 - \frac{\text{Length of LCS}}{\max(\mathrm{len}(A),\ \mathrm{len}(B))}\right) \times \text{Adjustment Factor}$
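A sketch of the page-level loss-rate computation is shown below. It uses difflib.SequenceMatcher, whose matching blocks serve as a proxy for the longest-common-subsequence length in the formula above; the function name and the default adjustment factor of 1.0 are our own assumptions.

```python
from difflib import SequenceMatcher

def page_loss_rate(ocr_text: str, reference: str, adjustment: float = 1.0) -> float:
    """Loss rate between OCR output and the manually corrected reference text."""
    matcher = SequenceMatcher(None, ocr_text, reference)
    # Total size of matching blocks, used as a proxy for the LCS length.
    lcs_length = sum(block.size for block in matcher.get_matching_blocks())
    denominator = max(len(ocr_text), len(reference)) or 1
    return (1.0 - lcs_length / denominator) * adjustment
```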
We have extracted 77 pages of images from the book titled “The Historic Transition from Traditional Agriculture to Modern Agriculture” using OCR technology, converted them into text, and compared the OCR-recognized text with the manually corrected data. We have calculated the loss rate for each page, and the data are shown in Figure 4. The average loss rate for this book is 0.0396. Using the same method, we have calculated the average loss rate for each of the 102 books, and the average loss rate for each book is shown in Figure 5. The average loss rate for all 102 books is 0.0304. With such a low recognition loss, using OCR technology to extract and recognize text from each page of a PDF, and manually checking and correcting the recognition errors, can greatly save time and cost in corpus processing.

2.2.5. Data Cleansing

The text converted from books is mostly reliable in terms of data quality. However, for some PDFs, garbled characters may appear during conversion, for example in headers, footers, citations, and annotations. Since such artifacts are difficult to clean reliably through code, we adopt manual inspection, proofreading, and deletion of the garbled parts. Additionally, we correct any obvious errors in the text.

2.3. Conversion of Traditional to Simplified Chinese Characters

Since some books are presented in traditional and simplified Chinese, we convert the traditional characters to simplified characters. Our Chinese training set uniformly uses simplified Chinese, and we utilize the opencc library in Python for batch conversion.
Construction of a Tool for Converting Traditional Chinese Characters to Simplified Chinese (a code sketch follows the steps below):
  • Define the Conversion Function: A function named traditional_to_simplified is defined, which takes a parameter text representing the Traditional Chinese text that needs to be converted.
  • Initialization of Converter: Internally within the function, an instance of the OpenCC converter is created via cc = OpenCC(‘t2s’). The ‘t2s’ parameter specifies the conversion direction, indicating transformation from Traditional Chinese to Simplified Chinese.
  • Execution of Conversion: The conversion is carried out by invoking the method .convert(text) on the converter instance cc, where text is the input Traditional Chinese text being transformed into Simplified Chinese.
  • Return of Results: Upon completion of the conversion, the function returns the converted Simplified Chinese text via return simplified_text.
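Putting the four steps together, a minimal sketch of the conversion function, following the description above and the documented opencc Python usage, is:

```python
from opencc import OpenCC

def traditional_to_simplified(text: str) -> str:
    cc = OpenCC('t2s')                    # 't2s': Traditional -> Simplified
    simplified_text = cc.convert(text)    # perform the conversion
    return simplified_text
```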

2.4. Data Organization and Annotation

The data collation stage in this work differs significantly from that of other models. The proposed data merging method adopts a knowledge graph structure as its basis and integrates book materials and article materials for the elements of the knowledge graph. For example, the content of the knowledge point “agriculture” from multiple books is merged to preserve the coherence of the knowledge extraction context and the proximity of related context weights. Paragraphs are concatenated without blank lines. Taking the two books “Agricultural Informatization” and “Q&A on Agricultural Informatization in New Rural Areas” as examples, we merge the knowledge points “What is agricultural informatization?” and “What does agricultural informatization refer to?”, and we likewise merge the knowledge statements for “What is agricultural information technology?” The start of each merged block is marked with the <CLASSIFY-START> tag and the end with the <CLASSIFY-END> tag. The twelve book categories are merged separately, and the merging of each small knowledge point starts with the <CLASSIFY-START> tag and ends with the <CLASSIFY-END> tag. These tags support the subsequent detection of text-context coherence and the output fuse mechanism.

2.5. Construction of Uyghur Translation Tool

Training a large language model for Uyghur necessitates a vast amount of Uyghur language corpus. Given that the Xinjiang Uyghur Autonomous Region [9,10,11,12], where the Uyghur population is concentrated, primarily revolves around agriculture and related industries, and official publications are primarily in Chinese, high-quality Uyghur agricultural literature is scarce. Consequently, we developed a Uyghur translation tool to translate the organized and merged Chinese texts.
Quick Translate is a platform supporting translations between multiple languages, including Uyghur (also written as Uighur [13,14,15], with the language code ug), and provides an API at “api-b2b.backenster.com” (accessed on 12 April 2024). We constructed a class named Translator encapsulating all necessary logic for interacting with the online translation API. Designed according to object-oriented programming principles, this class ensures code that is easy to understand, maintain, and extend.

2.5.1. Architecture of the Uyghur Translation Tool System

The core of this translation tool is a class named Translator, which is designed based on object-oriented programming principles and encapsulates the entire logic for interacting with the Swiftline Translation API. During class initialization, Translator sets up the basic configuration for translation requests, including the target API URL and authentication headers, using a preset Bearer Token for identity verification. Its core method, translate, takes in parameters such as the source language, target language, text to be translated, and an optional platform type. It performs the translation operation by constructing request data, sending a POST request, and processing the response. The translation result is returned in JSON format, facilitating further processing or display. In practical applications, cross-language translation can be conveniently carried out by instantiating the Translator class and calling its translate method.
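A sketch of such a Translator class is given below. The endpoint path, payload field names, and default platform value are illustrative assumptions; only the host api-b2b.backenster.com and the Bearer-Token authentication are taken from the description above.

```python
import requests

class Translator:
    """Thin wrapper around the Quick Translate HTTP API (illustrative sketch)."""

    API_URL = "https://api-b2b.backenster.com/b1/api/v3/translate"  # assumed path

    def __init__(self, api_key: str):
        self.headers = {
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        }

    def translate(self, source_lang: str, target_lang: str, text: str,
                  platform: str = "api") -> dict:
        # Build the request payload, send a POST request, and return the JSON result.
        payload = {"from": source_lang, "to": target_lang,
                   "data": text, "platform": platform}
        response = requests.post(self.API_URL, json=payload, headers=self.headers)
        response.raise_for_status()
        return response.json()
```

In practice, an instance such as Translator("YOUR_TOKEN") can then be called with the source and target language codes (e.g., ug for Uyghur) and the text to be translated.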

2.5.2. Accuracy Evaluation of Uyghur Translation

In evaluating the performance of a rapid translation API in translating Chinese to Uyghur text, we have adopted a quantitative strategy aimed at measuring the consistency between the translation output and the manually corrected Uyghur text, thereby judging the accuracy of the translation. The core of this process lies in calculating the translation accuracy rate, which measures the precision of the translation by comparing the word-level match between the translation result and the reference text—that is, the Uyghur text confirmed as accurate by expert proofreading. The specific process of calculating the accuracy rate comprises five key steps: First, we select a segment of manually corrected Uyghur text as the reference standard. Then, we count the total number of words in this reference text, which serves as the denominator for subsequent calculations. Next, we compare the output of the rapid translation API with the reference text word by word, identifying the words that match exactly between the two, which are considered as correctly translated parts. Based on this, we sum up the number of all correctly translated words, serving as the numerator for calculating the accuracy rate. Finally, we divide the number of correctly translated words by the total number of words in the reference text, multiply by 100%, and thus obtain the percentage value of the translation accuracy rate. Following these steps, we can obtain quantitative analysis results for the translation accuracy rate, which provide us with intuitive feedback on the performance of the rapid translation API in converting Chinese to Uyghur text. Let the number of correctly translated words be denoted as A and the total number of words in the reference translation as B. Accuracy Calculation Formula:
$\text{Accuracy} = \frac{A}{B} \times 100\%$
We used the Quick Translate platform to translate them into Uyghur. The corrected Chinese text after manual correction of the originally recognized text was batch translated into Uyghur through the Quick Translate API, resulting in Uyghur translation texts. Further, manual inspection and correction were performed on the translated Uyghur texts. The accuracy was calculated by comparing the translated Uyghur texts with the manually corrected Uyghur texts, as shown in Figure 6. The average accuracy rate for the randomly sampled 50 sentences is 0.9909.
Using the same approach, we grouped the Uyghur translations of the 102 books into 51 sets of two books each and calculated the average accuracy rate for each set, as depicted in Figure 7. The final overall average accuracy rate for the 51 sets of data was 0.9903. By utilizing the Quick Translate platform to translate the corrected Chinese texts into Uyghur, we significantly reduced the processing time required for translation. Moreover, the remarkably high translation accuracy of Quick Translate improved the efficiency of manual inspection and grammatical correction, effectively ensuring that the translated corpus does not impact the output of subsequent models.

2.5.3. Batch Translation of Chinese Text to Uyghur and Corpus Integration

We segmented the Chinese agricultural knowledge corpus based on the <CLASSIFY-START> and <CLASSIFY-END> tags, divided the content into sentences to suit the translation API, and translated it into Uyghur using the Translator class while excluding the tag pairs. We then iterated through the sentences for translation using two nested loops and added the corresponding tag pairs back to the translated text. Finally, we merged the Chinese and Uyghur texts into an 87 MB fully annotated Chinese/Uyghur bilingual corpus named nongye_books_data_cn.txt.

3. Word Segmentation and Vocabulary Construction

In the field of natural language processing [16], text preprocessing is the foundation for building efficient models, and the word segmentation step is particularly important, as it directly affects the model’s understanding and processing efficiency of the text [17]. Currently, some mainstream word segmentation tools include jieba, SentencePiece, and HanLP. As shown in Table 2, each word segmenter has its unique design and applicable scenarios:
  • Jieba, as a classic Chinese word segmentation library, is renowned for its maturity, stability, and rich customization capabilities. It is particularly adept at handling word segmentation in Chinese text, but it may have limitations in dealing with unknown words (new words or specialized terminology) and long sentence structures.
  • HanLP, as a comprehensive natural language processing toolkit [18], offers a wide range of Chinese processing functions, including word segmentation, part-of-speech tagging [19], etc. Its advantages lie in its high configurability and deeply optimized models. However, for large-scale data processing and task-specific optimization needs, more customization work may be required.
  • In contrast, SentencePiece stands out as our preferred word segmentation and vocabulary construction tool in this study due to its unique design philosophy and practical advantages:
  • Flexibility and universality: SentencePiece is not limited to a specific language. Its statistical-based subword unit generation method enables it to excel in multi-language processing, especially for mixed-language or low-resource language text data.
  • Adaptability: By learning statistical patterns in the data [20], SentencePiece can automatically generate a subword dictionary that best fits the current corpus. This gives it significant advantages in discovering new words and handling rare words, which is particularly important for rapidly changing internet language and specialized literature.
  • Efficiency and practicability: It supports multiple model types and adopts the unigram model by default, balancing vocabulary coverage and model complexity. At the same time, the generated integer encoding effectively reduces storage space requirements and accelerates subsequent model training processes.
Table 2. Function Characteristics Table of Mainstream Word Segmenters [21].

| Tokenizer Tool | Advantages | Disadvantages | Uyghur | Chinese | English | Japanese |
|---|---|---|---|---|---|---|
| jieba | 1. Supports three tokenization modes: accurate mode, full mode, and search engine mode [22]; 2. Supports custom dictionaries; 3. Supports multiple languages, such as Chinese [23], English, etc. | 1. Ineffective recognition of proper nouns and rare words; 2. Does not support minority languages like Uyghur | × | ✓ | ✓ | × |
| SentencePiece | 1. Supports multiple languages [23], such as Chinese, English, Japanese, etc.; 2. Supports character-level tokenization; 3. Allows training of custom models; 4. Supports custom retained dictionaries | 1. Requires a large amount of text data for model training; 2. Ineffective for minority languages like Uyghur | × | ✓ | ✓ | ✓ |
| HanLP | 1. Supports multiple languages, such as Chinese [23], English, Japanese, etc.; 2. Supports functions like part-of-speech tagging, named entity recognition, etc.; 3. Supports custom dictionaries | 1. Ineffective recognition of proper nouns and rare words; 2. Does not support minority languages like Uyghur | × | ✓ | ✓ | ✓ |
After comparing various tokenizers such as jieba, SentencePiece, and HanLP, we have ultimately decided to adopt SentencePiece as the core tokenization technology for this research. It not only satisfies the need for efficiently processing large-scale text data, but also provides a solid foundation for subsequent natural language model training through its flexible subword strategy, particularly demonstrating unique advantages in balancing model performance and generalization capabilities. Tokenization is a crucial step in the preprocessing phase, aiming to effectively convert raw text into serialized numerical representations that the model can understand. To achieve this goal, we have employed the SentencePiece model, a highly efficient subword tokenization tool widely used in the field of natural language processing [23].

3.1. Selection and Implementation of Tokenization Method

To enable SentencePiece to effectively support Uyghur language tokenization without excessive or inadequate segmentation, we first applied it to a Chinese dataset for tokenization. The resulting Chinese tokens were then translated into a Uyghur dictionary list using the pre-constructed Uyghur language data translation tool through iterative translation.
As a tool for tokenization and vocabulary construction, SentencePiece is primarily valued for its flexibility and efficiency. It automatically generates a subword unit dictionary by learning statistical information from the training data, supporting multiple model types (such as unigram, bpe, etc.). In this experiment, although no specific model type was directly specified, the default unigram model was adopted, which performs exceptionally well in handling rare words and maintaining the semantic integrity of subwords. By controlling the vocab_size parameter to 240,000, we ensured a balance between vocabulary coverage and computational efficiency.
To preserve the original structural information of the text, such as newline and tab characters, we included user-defined symbols (user_defined_symbols) like ‘\n’ and ‘\t’ when training the tokenization model. This helps restore the text’s format or structure in subsequent processing. Furthermore, to prevent SentencePiece from over- or under-segmenting the text, we added the translated Uyghur dictionary list to the user_defined_symbols, enhancing its ability to adapt to Uyghur text segmentation.
After these steps, we obtained a bilingual mixed vocabulary named “nongye_books_data.vocab” with 240,000 words and corresponding word IDs for both Chinese and Uyghur.
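A training call consistent with the above description might look as follows; the input file name and the placeholder Uyghur term list are assumptions, while the vocabulary size, model prefix, default unigram model, and user-defined symbols follow the text.

```python
import sentencepiece as spm

# Placeholder: the Uyghur dictionary list obtained by translating the Chinese tokens.
uyghur_terms = []

spm.SentencePieceTrainer.train(
    input="nongye_books_data_cn_and_ug.txt",   # merged bilingual corpus (assumed name)
    model_prefix="nongye_books_data",          # yields nongye_books_data.model/.vocab
    vocab_size=240000,
    model_type="unigram",                      # SentencePiece default
    user_defined_symbols=["\n", "\t"] + uyghur_terms,
)
```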

3.2. Dataset Partitioning and Encoding

To evaluate model performance and simulate training and testing environments in practical applications, we divided the original text dataset “nongye_books_data_cn_and_ug.txt” into training and test sets based on a preset ratio (9:1 in this case). By improving the data loading function, we ensured that newline characters were preserved during text segmentation, which is crucial for certain text generation and sequence alignment analyses.
Using the trained SentencePiece model, we converted the segmented text into integer sequences (token IDs) and stored them as binary files (.dat format) in numpy arrays. This preprocessing significantly reduces memory usage and accelerates data reading speed during subsequent model training.
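A minimal sketch of this encoding step is shown below; the output file names train.dat and test.dat follow Section 5.2, while the uint32 dtype and the split position computed on the raw text are our own assumptions.

```python
import numpy as np
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="nongye_books_data.model")

def encode_to_dat(text: str, out_path: str) -> None:
    """Convert text to token IDs and store them as a flat binary file for np.memmap."""
    ids = sp.encode(text, out_type=int)
    np.array(ids, dtype=np.uint32).tofile(out_path)

with open("nongye_books_data_cn_and_ug.txt", encoding="utf-8") as f:
    raw_text = f.read()

split = int(len(raw_text) * 0.9)          # 9:1 train/test split
encode_to_dat(raw_text[:split], "train.dat")
encode_to_dat(raw_text[split:], "test.dat")
```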

3.3. Role of Vocabulary Construction and Dataset Partitioning

The tokenization strategy and its implementation not only optimize text processing efficiency but also lay a solid foundation for subsequent language model training. By tokenizing at the subword level, the model can better understand and generate text, especially when dealing with domain-specific texts containing a large number of long-tail words or rare expressions. Additionally, in-depth analysis of the generated tokenization results, such as subword frequency distribution and coverage evaluation, can provide valuable insights into the model’s internal mechanisms [24].

4. Model Architecture Design

In this section, we design and implement a generative pre-trained model (TarimGPT) based on the Transformer architecture to explore the potential of autoregressive language modeling for mixed Chinese and Uyghur training tasks as well as for lightweight models. Regarding the use of monolingual and multilingual tools, especially for languages with limited resources, linear bag-of-words classifiers remain quite a strong baseline across settings and languages; however, transformer-based neural language models typically outperform them [25]. The core of this model lies in the combination of key technologies such as Multi-Head Attention, Positional Encoding, Feed-Forward Networks, and Layer Normalization to improve the accuracy and fluency of language generation [26].
Earlier dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include encoders and decoders, with the best-performing models also connecting encoders and decoders through attention mechanisms. The Transformer, by contrast, is a simple network architecture based solely on the attention mechanism, completely dispensing with recurrence and convolutions. Experiments on text generation tasks have shown that such models are superior in quality, offer higher parallelism, and require significantly less training time.

4.1. Model Architecture Parameter Configuration

Quantization of Transformer model parameters refers to converting the weights and activations in the model from floating-point numbers to a lower-precision data type [27]. This process aims to reduce the model’s memory usage and computational requirements, thus improving the model’s efficiency and deployment feasibility on specific hardware. Quantization of model parameters has several significant implications for hardware requirements:
  • Reduced memory usage: By reducing the bit-width of the model’s weights and activations, the required storage space for the model is significantly reduced [28,29,30]. This is especially crucial for devices with limited memory resources, such as mobile devices, embedded systems, or edge computing devices, enabling the deployment of large Transformer models on these devices.
  • Accelerated inference speed: Many modern hardware (such as GPUs, TPUs, and specific AI accelerators) are optimized for low-precision computations, providing faster computational speed at lower precision. This means that quantized models can achieve higher throughput and lower latency when performing inference on these hardware [31,32,33,34,35].
  • Energy saving: Due to the reduction in computations and memory accesses, quantized models consume less energy during runtime, significantly reducing energy costs and environmental impact for battery-powered devices or large-scale data centers.
  • Widened application scope: Parameter quantization enables the application of complex Transformer models that were previously difficult to deploy due to resource constraints in more scenarios, such as real-time natural language processing tasks on IoT devices or advanced decision support systems in autonomous vehicles.
  • Enhanced user experience: Faster response times and lower energy consumption mean that users can enjoy smoother and more enduring application experiences, especially on mobile devices.
The application of artificial intelligence (AI) in the field of knowledge question answering is becoming increasingly widespread. Generative Pre-trained Transformer (GPT) has attracted particular attention. We have developed a new GPT-based, manually modified method to construct the model, comprehensively considering and adjusting parameters such as hidden dimensions, the number of attention heads, and dropout ratios to reduce the training volume of the model, thereby reducing the usage of video memory during matrix operations. We have also proposed a new theoretical framework by adding context detection and text fusing mechanisms at the OUTPUT end, as shown in Figure 8.
The specific configuration is as follows (a code sketch summarizing these settings is given after the list):
  • Vocabulary Size: 240,000 Uyghur and Chinese word items, covering a rich range of linguistic expressions.
  • Sequence Length: The maximum input length is set to 256 to handle complex sentence structures and long text requirements.
  • Hidden Dimension (d_model): Set to 256, serving as the dimension for the internal representation of the model, balancing computational efficiency and expressive power.
  • Number of Layers (n_layer): A total of 8 Transformer blocks, increasing the depth of the model, which is conducive to learning more complex linguistic structures.
  • Number of Heads (n_head): Each layer uses 8 attention heads, enhancing the model’s parallel processing capabilities and ability to capture dependencies between different positions [35,36].
  • Bias: Bias terms are used in the linear layers to enhance the model’s expressive ability.
  • Dropout Rate: Set to 0.1, maintaining the integrity of information transmission and facilitating the understanding of the model’s behavior during training.
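The configuration above can be summarized in a small dataclass; the field names below are illustrative and follow common minimal-GPT conventions rather than the exact attribute names of our GPTConfig class.

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    vocab_size: int = 240000   # Uyghur + Chinese word items
    seq_len: int = 256         # maximum input length
    d_model: int = 256         # hidden dimension
    n_layer: int = 8           # number of Transformer blocks
    n_head: int = 8            # attention heads per layer
    bias: bool = True          # bias terms in linear layers
    dropout: float = 0.1
    batch_size: int = 64       # training batch size (Section 5.1)
```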

4.1.1. Calculation of Memory Requirements

  • Vocabulary Size and Embedding Matrix: With a vocabulary size of 240,000 words and an embedding dimension of 256, the embedding matrix contains 240,000 × 256 = 61,440,000 floating-point numbers. Using float32 precision, this requires approximately 245.8 MB of GPU memory. However, this is typically not the main source of memory consumption, as the embedding layer can utilize techniques like Embedding Bag or optimized lookup tables to reduce memory usage.
  • Memory Usage of Transformer Blocks: Each Transformer block primarily consumes memory from the QKV calculations, i.e., (seq_len × d_model) × 3 (for Q, K, V). For 8 layers, the total is 256 × 256 × 3 × 8 = 1,572,864 floating-point numbers, approximately 6.29 MB (float32). This is just the basic intermediate state; with residual connections, layer normalization, the FFN, etc., the actual consumption will be greater.
  • Full Model Memory Estimation: Considering all layers, residual connections, intermediate outputs, and backpropagation, along with possible batch processing, a safe estimate is several times the above calculations, with additional overhead. For large-scale training, especially with large batch sizes, tens of GB of GPU memory may be required.

4.1.2. Computational Power Requirements

To estimate the FLOPs (floating-point operations) required for a Transformer model, the focus is on multiplication and addition operations. Each layer of self-attention and the FFN generates a significant number of multiplications and additions. For self-attention, the complexity of each head is approximately O(seq_len² × d_k), where d_k = d_model/n_head [37]; summed over the 8 heads, each layer is roughly seq_len² × d_model, and over 8 layers approximately 8 × seq_len² × d_model. The complexity of each FFN is approximately 4 × d_model² per layer. The total complexity is therefore on the order of 8 × (seq_len² × d_model + 4 × d_model²). With this approximate FLOPs requirement, a suitable GPU can be chosen based on its FLOPs performance. To calculate the required GPU memory size more precisely, note that actual memory requirements are influenced by various factors, including framework implementation details, mixed-precision training optimization techniques, batch size, etc. We provide estimates for the model.
Basic Calculations
(1)
Embedding Layer: Approximately 245.8 MB (240,000 × 256 float32 values), though the actual usage can be reduced through optimization techniques.
Transformer Block Forward Pass:
  • QKV Calculation: 256 × 256 × 3 = 196,608 floating-point numbers per layer; for 8 layers, 196,608 × 8 = 1,572,864.
  • FFN: Each layer is approximately 4 × 256² = 262,144 floating-point numbers; for 8 layers, 262,144 × 8 = 2,097,152.
  • Total: 1,572,864 + 2,097,152 = 3,670,016 floating-point numbers, which is approximately 14.68 MB (float32), not considering the small occupancy of residual connections and Layer Normalization (LN).
(2)
Batch Size: The batch size is set to 16.
(3)
Backward Propagation: Theoretically, it’s twice the forward pass, but considering optimizations and reuse, it may be slightly less than twice. Here, we simply estimate it as twice.
Comprehensive Estimation
  • Forward Pass: Approximately 14.68 MB, multiplied by the batch size of 16, resulting in approximately 234.88 MB.
  • Backward Propagation: Estimated as twice the forward pass, approximately 469.76 MB.
  • Additional Overhead: Including gradients, optimizer states, etc., assuming additional occupancy is similar to the model parameters, but without specific parameter quantities, it generally accounts for a small portion.
Taking these estimates together, the activation memory requirement (forward + backward) is on the order of 700 MB, or roughly 0.7 GB. In reality, especially considering the additional overhead of deep learning frameworks, batch normalization, storage of model parameters, optimizer states, and dynamic allocation during actual training, the required memory is often much larger than the directly calculated value. Therefore, we consider at least 8 GB of GPU memory necessary to ensure stability and flexibility during training, while for larger scales or batch sizes, a GPU with 12 GB or more memory would be more suitable.
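The estimates above can be reproduced with a few lines of arithmetic (a rough sketch; framework overhead, model parameters, and optimizer states are not included):

```python
# Rough activation-memory estimate at float32 (4 bytes per value).
seq_len, d_model, n_layer, batch_size = 256, 256, 8, 16

qkv_floats = seq_len * d_model * 3 * n_layer      # 1,572,864
ffn_floats = 4 * d_model * d_model * n_layer      # 2,097,152
forward_mb = (qkv_floats + ffn_floats) * 4 / 1e6  # ~14.68 MB per sample
forward_batch_mb = forward_mb * batch_size        # ~234.9 MB
backward_batch_mb = 2 * forward_batch_mb          # ~469.8 MB
print(forward_batch_mb + backward_batch_mb)       # ~704.6 MB before framework overhead
```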
Based on the above assessment of hardware requirements, we adopted a single NVIDIA GeForce RTX 4090 graphics card, which has 16,384 CUDA cores and 24 GB of memory, providing the parallel computing power required for our experiments. Therefore, we chose it as the core computing power hardware for model training.

4.2. Key Components

4.2.1. Sinusoidal Positional Encoding

Word embeddings are the foundation for a model to understand text, mapping each word in the vocabulary to a high-dimensional vector space such that words with similar semantics are close in the vector space [38]. In the GPT model, the tok_embed_table is implemented using an Embedding layer, which takes a word index as input and outputs the corresponding word embedding vector. This way, text is transformed into a numerical form that the model can understand.
Positional encoding, on the other hand, provides the model with information about the order of words in a sequence [39]. The SinusoidPE class achieves this by generating values that vary with the position using sine and cosine functions, ensuring that the model can distinguish different positions in the sequence. This step is crucial for generating text with temporal dependencies.
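A sketch of such a sinusoidal positional-encoding module, using the standard sine/cosine formulation (the exact implementation details of our SinusoidPE class may differ), is:

```python
import math
import torch
import torch.nn as nn

class SinusoidPE(nn.Module):
    """Fixed sinusoidal positional encoding added to the token embeddings."""

    def __init__(self, d_model: int = 256, max_len: int = 256):
        super().__init__()
        position = torch.arange(max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
        self.register_buffer("pe", pe)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); add encodings for the first seq_len positions
        return x + self.pe[: x.size(1)]
```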

4.2.2. Multi-Head Self-Attention

As shown in Figure 9, the self-attention mechanism is the core of the Transformer architecture, allowing the model to consider contextual information from all positions while processing the input sequence [40]. The SelfAttention class implements this mechanism by linearly transforming the input into query, key, and value components, and then calculating attention weights. Multi-head attention further extends this process to multiple parallel attention heads, increasing the model’s parallelism and its ability to attend to different positional relationships. Specifically, the introduction of masking operations ensures that only historical information is utilized to predict the future during the autoregressive process, maintaining the causality of the sequence.
We designed a multi-head self-attention module that implements parallelized attention computation, enhancing the model’s ability to handle long-distance dependencies [40]. By dividing the input into multiple “heads” and performing attention computation independently, the model can capture various association patterns at different positions. Meanwhile, the masking mechanism ensures the autoregressive property, meaning that predictions rely only on historical information.
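The sketch below illustrates this mechanism as a minimal causal multi-head attention module; our SelfAttention class may differ in implementation detail.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    """Masked multi-head self-attention with an explicit causal mask."""

    def __init__(self, d_model: int = 256, n_head: int = 8,
                 seq_len: int = 256, dropout: float = 0.1):
        super().__init__()
        self.n_head, self.d_k = n_head, d_model // n_head
        self.qkv = nn.Linear(d_model, 3 * d_model)   # joint Q, K, V projection
        self.proj = nn.Linear(d_model, d_model)
        self.drop = nn.Dropout(dropout)
        mask = torch.tril(torch.ones(seq_len, seq_len)).view(1, 1, seq_len, seq_len)
        self.register_buffer("mask", mask)           # causal mask: no peeking ahead

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=2)
        q = q.view(B, T, self.n_head, self.d_k).transpose(1, 2)
        k = k.view(B, T, self.n_head, self.d_k).transpose(1, 2)
        v = v.view(B, T, self.n_head, self.d_k).transpose(1, 2)
        att = (q @ k.transpose(-2, -1)) / (self.d_k ** 0.5)
        att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float("-inf"))
        att = self.drop(F.softmax(att, dim=-1))
        out = (att @ v).transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(out)
```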

4.2.3. Feed-Forward Network, FFN

After each self-attention block, the FeedForward class implements a two-layer fully connected neural network, typically including an activation function (such as GELU) and a Dropout layer, to enhance the model’s non-linear representation ability [41]. This step transforms the original input information deeply, helping the model learn complex patterns.
A two-layer feed-forward neural network is connected after each Transformer block [42], utilizing the GELU activation function and Dropout regularization, enhancing the model’s non-linear representation ability and assisting it in learning more complex linguistic features.
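A minimal sketch of this FeedForward sub-layer, using the conventional 4× expansion that also underlies the 4 × d_model² complexity estimate in Section 4.1.2:

```python
import torch.nn as nn

class FeedForward(nn.Module):
    """Two-layer position-wise FFN with GELU activation and Dropout."""

    def __init__(self, d_model: int = 256, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),  # expansion by 4
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)
```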

4.2.4. Layer Normalization

We integrated the above components to form the basic building block of the Transformer. Each block consists of a self-attention layer and a feed-forward network wrapped by Layer Normalization layers [43,44]. This design helps stabilize training and promote gradient flow. By stacking multiple such blocks, the model can capture increasingly complex linguistic structures and long-distance dependencies.
The application of Layer Normalization after each sub-layer ensures that the features input to the next layer have a stable distribution, accelerating the training process and improving model stability.
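Combining the SelfAttention and FeedForward sketches above, one Transformer block with Layer Normalization applied after each sub-layer (post-norm, as described here) could look as follows:

```python
import torch.nn as nn

class Block(nn.Module):
    """Transformer block: attention and FFN sub-layers, each followed by LayerNorm,
    with residual connections to preserve gradient flow."""

    def __init__(self, d_model: int = 256, n_head: int = 8,
                 seq_len: int = 256, dropout: float = 0.1):
        super().__init__()
        self.attn = SelfAttention(d_model, n_head, seq_len, dropout)
        self.ffn = FeedForward(d_model, dropout)
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        x = self.ln1(x + self.attn(x))   # residual + LayerNorm after attention
        x = self.ln2(x + self.ffn(x))    # residual + LayerNorm after FFN
        return x
```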

4.2.5. Initialization and Regularization

The model initialization strategy, such as the normal distribution initialization applied in the _init_weights method, is crucial for the stable training and rapid convergence of the model. Additionally, the Dropout layer used in the model as a regularization technique helps reduce the risk of overfitting and improve the model’s generalization ability [45].
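A sketch of such an initialization routine is shown below; the standard deviation of 0.02 is an assumed value commonly used for GPT-style models, not a figure taken from our implementation.

```python
import torch.nn as nn

def _init_weights(module: nn.Module) -> None:
    # Normal-distribution initialization for linear and embedding weights,
    # with zero bias; typically applied via model.apply(_init_weights).
    if isinstance(module, (nn.Linear, nn.Embedding)):
        nn.init.normal_(module.weight, mean=0.0, std=0.02)
        if isinstance(module, nn.Linear) and module.bias is not None:
            nn.init.zeros_(module.bias)
```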

5. Model Training

We define the training and testing processes for our text generation system to demonstrate the model’s learning ability and generation quality when handling text data in a specific domain. This research not only involves the configuration and optimization of the model architecture, but also encompasses data processing, training strategies, and the evaluation of generated instances, comprehensively reflecting the entire process from model development to application. Here, we adopt the commonly used loss function for general large language models to measure the quality of the model. The goal of model training is to adjust the model parameters through optimization algorithms (such as gradient descent) to minimize the average loss (or empirical risk) on the training dataset. This process iterates continuously until the loss function value no longer decreases significantly or until predetermined stopping criteria are met. A smaller loss function value indicates that the model’s predictions are closer to the true label values [45], signifying better model performance. In ideal circumstances, if the model could make perfect predictions, the predicted values would be identical to the true values, and the loss function should reach its minimum possible value, typically zero.

5.1. Model Architecture and Configuration

Our model is based on the GPT structure and customized through the GPTConfig class, including setting the batch size to 64 to balance computational resources and learning efficiency, as well as introducing a dropout layer with a rate of 0.1 to enhance the model’s generalization ability. Model training utilizes CUDA acceleration when available to ensure efficient operation.

5.2. Data Processing and Training Workflow

In the data preprocessing stage, we utilize np.memmap to efficiently read integer sequences from large-scale training sets (train.dat) and test sets (test.dat), providing a continuous and efficient data flow for model training. The get_batch function randomly extracts sequence data segments as model inputs and target outputs, simulating a sequence-to-sequence prediction task, which strengthens the model’s learning of long-sequence dependencies.
During the training process, we employ the AdamW optimizer with a learning rate of 1 × 10⁻³ and continuously optimize the model parameters over 30,000 iterations. The current loss value is output every 100 iterations, facilitating the monitoring of training dynamics. Through backpropagation and gradient descent, the model gradually learns patterns and regularities in the data, ultimately converging to a lower loss level, indicating that the model can effectively capture and generate the characteristics of data sequences.
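A condensed sketch of this workflow is given below; it assumes a model object that returns (logits, loss) and token IDs stored as uint32 in train.dat, matching the encoding sketch in Section 3.2.

```python
import numpy as np
import torch

train_data = np.memmap("train.dat", dtype=np.uint32, mode="r")
device = "cuda" if torch.cuda.is_available() else "cpu"
# model: the GPT instance from Section 4, assumed constructed and moved to `device`.

def get_batch(data, batch_size: int = 64, seq_len: int = 256):
    """Randomly sample contiguous windows; targets are inputs shifted by one token."""
    ix = np.random.randint(0, len(data) - seq_len - 1, size=batch_size)
    x = torch.stack([torch.from_numpy(data[i:i + seq_len].astype(np.int64)) for i in ix])
    y = torch.stack([torch.from_numpy(data[i + 1:i + 1 + seq_len].astype(np.int64)) for i in ix])
    return x.to(device), y.to(device)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
for step in range(30000):
    xb, yb = get_batch(train_data)
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    if step % 100 == 0:
        print(f"step {step}: loss {loss.item():.4f}")
```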

5.3. Model Evaluation and Generation Instances

This study has trained an autoregressive language model based on the Transformer architecture, with the model weights saved as “model.pth” and a size of 1.5 GB. It can efficiently run on a GPU with 4 GB of video memory. Validated by a pre-trained SentencePiece tokenizer, the model demonstrates an end-to-end capability from text input to generation. In evaluating the model, we adopted several key metrics including accuracy, recall, F1 score, and loss function value to measure its performance. Accuracy refers to the proportion of positive samples that are correctly predicted as positive by the model, calculated as the true positives (TP) divided by the sum of true positives and false positives (FP). Recall measures the proportion of actual positive samples that are predicted as positive by the model, calculated as the true positives (TP) divided by the sum of true positives and false negatives (FN). The F1 score, as the harmonic mean of precision and recall, comprehensively considers the accuracy and completeness of the model. During the training process, the loss function value is used to measure the discrepancy between the model’s predictions and the true values, where a lower value generally indicates better model performance. Our model exhibited excellent performance in all evaluation metrics, with a low loss value of 0.07, indicating a small prediction error and strong generalization ability. These results not only validate the effectiveness of the model but also provide strong support for its application in the agricultural field and other related areas.
Here are the equations for accuracy, recall, F1 score, and loss function:
  • Accuracy/Precision: $\mathrm{Precision} = \frac{TP}{TP + FP}$
  • Recall: $\mathrm{Recall} = \frac{TP}{TP + FN}$
  • F1 Score: $F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$
  • Loss: $\mathrm{Loss} = -\sum_{i} \log P(y_i \mid y_1, \ldots, y_{i-1})$
We have successfully trained an autoregressive language model based on Transformer, with its weight file “model.pth” sized at 1.5 GB, which can run smoothly on a GPU with at least 4 GB of video memory. By loading the pre-trained SentencePiece tokenizer (bird_shooter.model), the model has successfully completed the entire process from encoding text data to decoding, demonstrating its powerful generation capability.
To comprehensively evaluate the model’s performance, we have adopted multiple key indicators, including accuracy, recall, F1 score, and loss function value. These indicators collectively assess the model’s correctness, precision, comprehensiveness, and learning effectiveness. In particular, the model has exhibited good convergence characteristics during the training process, with the final loss function value reduced to 0.07, indicating a high degree of consistency between the model’s predictions and actual results.
Based on Figure 10, which shows the loss function curve and accuracy curve, it can be observed that in the initial stage, the rapid decline of the loss function indicates that the model is converging quickly. In the later stages, the flattening of the loss function curve suggests that the model is approaching convergence. Initially, the accuracy rate is relatively low, but it gradually improves as the number of iterations increases. In the mid-to-late stages, although the growth rate slows down, the accuracy continues to increase overall, indicating that the model performs well in identifying correct samples.
As shown in Table 3, after 3.9 h of training, the model achieved an accuracy, recall, and F1 score of 0.984, with a smooth loss rate of 0.067. This indicates that the model performs excellently in prediction tasks, demonstrating high generalization ability and reliability. Furthermore, we utilized the torchinfo.summary method to visualize the model structure, providing assistance in understanding and debugging the model architecture. This experiment not only showcases the learning capability of customized models on textual data in the agricultural field, but also provides valuable references for research and applications in related fields.

5.4. Performance of the Dataset When Applied to Other Models

Following the methods and processes described in the Corpus Collection and Preprocessing and Word Segmentation and Vocabulary Construction sections, the data were collected, organized, recognized, translated, manually checked for errors, and segmented, and the dataset was partitioned. Using the same dataset as for the Transformer model described above, we analyzed its performance on two currently popular models, LSTM and TKAN, by training each of them on it.

5.4.1. Performance of the Dataset on LSTM

Artificial neural networks have become state-of-the-art technology in language modeling tasks on small corpora. Although feedforward networks can only consider a fixed context length to predict the next word, recurrent neural networks (RNNs) can utilize all previous words. Due to the difficulty of training RNNs, the Long Short-Term Memory (LSTM) neural network architecture can be employed [45].
By introducing long short-term memory (LSTM) into the cell structure, it can handle long-term dependencies well. Since its introduction, LSTM has achieved almost all exciting results based on RNNs. LSTM has become a focus of deep learning. We reviewed LSTM units and their variants to explore the learning capabilities of LSTM units [45].
To construct, train, evaluate, and generate a text processing model based on LSTM (long short-term memory), several steps are involved. Firstly, a Config class defines the basic parameters of the model, such as the vocabulary, sequence length, model dimensions, etc. Then, SinusoidPE provides positional information of sequence elements to capture the sequential nature of the sequence. The LSTMBlock is a crucial part for processing sequential data, capturing long-distance dependencies. The LSTMModel integrates word embeddings, positional encodings, LSTM layers, layer normalization, and an output layer, and defines methods for model initialization, forward propagation, text generation, and embedding retrieval. The main function (main) sets up the model parameters, creates an instance of the model, and defines the training loop, including data loading, forward propagation, loss calculation, and parameter updates. After training, the model’s performance is measured through evaluation functions, such as accuracy, precision, etc., and training curves are plotted. Additionally, the model has a text generation capability and includes a series of helper functions to support tasks like data loading, performance evaluation, and score calculation. The entire system constructs a comprehensive LSTM language model framework, encompassing model definition, training, evaluation, and application, achieving a range of text processing functionalities.
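A minimal sketch of such an LSTM language model is shown below; it keeps only the embedding, LSTM layers, layer normalization, and output head, omits the positional-encoding component for brevity, and uses an assumed layer count.

```python
import torch.nn as nn

class LSTMModel(nn.Module):
    """Minimal LSTM language model mirroring the components listed above."""

    def __init__(self, vocab_size: int = 240000, d_model: int = 256,
                 n_layer: int = 2, dropout: float = 0.1):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.lstm = nn.LSTM(d_model, d_model, num_layers=n_layer,
                            batch_first=True, dropout=dropout)
        self.ln = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, idx, targets=None):
        x, _ = self.lstm(self.embed(idx))     # (batch, seq_len, d_model)
        logits = self.head(self.ln(x))
        loss = None
        if targets is not None:
            loss = nn.functional.cross_entropy(
                logits.view(-1, logits.size(-1)), targets.view(-1))
        return logits, loss
```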
Based on Figure 11, we can further analyze the performance of the LSTM language model during the actual training process. The data presented indicate that the model achieved a loss value of 0.087 during training, suggesting that the model is able to fit the patterns in the training data effectively. However, the model also exhibited an accuracy of 0.687, indicating that its prediction performance in inference tasks is not ideal.
As shown in Table 4, after 5.2 h of training, the model achieved a recall rate of 0.831, indicating that its performance in identifying positive examples is inferior to that of the Transformer model. The model’s F1 Score also reached 0.831, suggesting that the model’s balance between precision and recall is also not as good as the Transformer model.

5.4.2. Performance of the Dataset on TKAN

Kolmogorov–Arnold Networks represent a promising alternative to Multi-Layer Perceptrons (MLPs), while TKAN, a novel neural network architecture inspired by both KAN and LSTM, stands for Temporal Kolmogorov–Arnold Networks. TKANs combine the strengths of both networks and consist of Recurrent Kolmogorov–Arnold Networks (RKANs) layers with embedded memory management. This innovation enables multi-step time series forecasting with enhanced accuracy and efficiency. By addressing the limitations of traditional models in handling complex sequential patterns, the TKAN architecture offers significant potential for advancement in fields requiring one-step-ahead forecasting [45].
We have implemented a language model based on the KAN architecture, specifically utilizing a variant of the Kolmogorov–Arnold Network (KAN) known as the Temporal Kolmogorov–Arnold Network (TKAN). TKAN integrates the advantages of both KAN and LSTM, managing memory through embedded recurrent KAN layers (RKANs) to enhance the accuracy and efficiency of multi-step time series predictions. This architecture is particularly suitable for domains requiring advance predictions, as it overcomes the limitations of traditional models when dealing with complex sequential patterns.
The Config class, which defines the model’s configuration, includes parameters such as vocabulary size, sequence length, and model dimensions. Key components of the model include SinusoidPE for sinusoidal positional encoding, SelfAttention representing a multi-head self-attention mechanism, and FeedForward, a feed-forward network. Notably, the linear layer within the FeedForward network employs KANLinear, a substitute for the traditional nn.Linear, to leverage the advantages of the TKAN architecture. The main body of the model, KAN-Model, integrates these components, utilizing KANLinear layers in place of conventional linear layers for the complex task of multi-step time series predictions. Additionally, it facilitates text generation and embedding representation retrieval.
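The substitution described above can be sketched as follows; KANLinear is assumed to be provided by an external Kolmogorov–Arnold Network implementation, and both the import path and the constructor signature shown here are hypothetical.

```python
import torch.nn as nn
# Hypothetical import: replace with the KAN implementation actually used;
# KANLinear(in_features, out_features) is an assumed signature.
from kan_layers import KANLinear

class KANFeedForward(nn.Module):
    """FeedForward variant in which nn.Linear layers are replaced by KANLinear."""

    def __init__(self, d_model: int = 256, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            KANLinear(d_model, 4 * d_model),
            KANLinear(4 * d_model, d_model),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)
```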
In the training and evaluation section, the code provides functions to calculate accuracy, precision, recall, and F1 scores, as well as functions to plot the changes in loss and accuracy during the training process. The training function, named train, utilizes the AdamW optimizer and performs forward propagation, loss calculation, backpropagation, and parameter updates in each iteration. Following the completion of training, the model is evaluated on the entire training set.
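A rough sketch of such a training and evaluation routine is given below, assuming a PyTorch model, a data loader yielding (input, target) token batches, and scikit-learn for the metrics; the hyperparameters and helper names are placeholders rather than the exact values used in the experiments.

```python
import torch
import torch.nn.functional as F
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def train(model, loader, epochs=10, lr=3e-4, device="cuda"):
    """Forward pass, loss calculation, backpropagation, and parameter update per batch."""
    model.to(device).train()
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    losses = []
    for _ in range(epochs):
        for x, y in loader:                        # x, y: (batch, seq) token ids
            x, y = x.to(device), y.to(device)
            logits = model(x)                      # (batch, seq, vocab)
            loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), y.reshape(-1))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            losses.append(loss.item())
    return losses

@torch.no_grad()
def evaluate(model, loader, device="cuda"):
    """Token-level accuracy, precision, recall, and F1 over the evaluation set."""
    model.eval()
    preds, targets = [], []
    for x, y in loader:
        logits = model(x.to(device))
        preds.extend(logits.argmax(dim=-1).reshape(-1).cpu().tolist())
        targets.extend(y.reshape(-1).tolist())
    return {
        "accuracy": accuracy_score(targets, preds),
        "precision": precision_score(targets, preds, average="macro", zero_division=0),
        "recall": recall_score(targets, preds, average="macro", zero_division=0),
        "f1": f1_score(targets, preds, average="macro", zero_division=0),
    }
```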
Based on Figure 12, we can further analyze the performance of the TKAN model during training. The model converged to a loss value of 0.346, and its accuracy reached only 0.653. As shown in Table 5, the evaluation metrics obtained after 7.1 h of training are not ideal, indicating that the model’s prediction performance in inference tasks is unsatisfactory.

5.5. Performance of the Dataset When Applied to Other Models

To verify that this corpus processing method and the Transformer model generalize to other under-resourced languages, the Chinese corpus was translated into Tibetan, another language with scarce resources, followed by manual error correction, word segmentation, dataset division, and training with the Transformer model.
First, the recognized and manually corrected Chinese corpus was translated into Tibetan through the API of the Swift Translation Platform, and 8 book samples were manually reviewed and corrected. The machine-translated Tibetan corpus was then compared with the manually corrected Tibetan corpus to calculate the translation accuracy. Fifty paragraphs were randomly selected and the accuracy of each was calculated; as shown in Figure 13, the average accuracy over the 50 paragraphs was 0.9859. Applying the same procedure to each of the 8 translated book samples, as shown in Figure 14, the average accuracy over all 8 books was 0.9873.
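The exact paragraph-level accuracy measure is not reproduced here; as a rough stand-in, the following sketch scores each machine-translated paragraph against its manually corrected counterpart with a character-level similarity ratio and averages over a random sample of 50 paragraphs. The metric and data handling are assumptions for illustration only.

```python
import random
from difflib import SequenceMatcher

def paragraph_accuracy(machine: str, corrected: str) -> float:
    # Character-level similarity ratio, used here as a proxy for translation accuracy.
    return SequenceMatcher(None, machine, corrected).ratio()

def average_accuracy(machine_paras, corrected_paras, sample_size=50, seed=0):
    """Average accuracy over a random sample of aligned paragraph pairs."""
    random.seed(seed)
    indices = random.sample(range(len(machine_paras)), k=min(sample_size, len(machine_paras)))
    scores = [paragraph_accuracy(machine_paras[i], corrected_paras[i]) for i in indices]
    return sum(scores) / len(scores)
```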
Translating the Chinese corpus into Tibetan in this way greatly reduces the time needed for manual review and improves the efficiency of corpus processing. Building on this method, we applied the Transformer model to Tibetan, conducting mixed training on Chinese and Tibetan and computing the results.
As can be seen from the loss function curve and accuracy curve in Figure 15, the loss decreases rapidly in the initial stage, indicating fast convergence, and flattens out in the later stage, suggesting that the model is close to convergence. This demonstrates that the Transformer model also performs well when applied to Tibetan, a language with scarce corpus resources. After 4.1 h of training, as shown in Table 6, the loss is 0.066 and the accuracy, recall, and F1 score all reach 0.986, indicating good overall performance.

6. Model Application and Deployment

In the testing and deployment phase of our model, particular emphasis was placed on implementing text context inspection and a text circuit breaking mechanism to enhance the quality and novelty of generated texts, thereby ensuring output relevance without redundancy. This approach not only validates the efficacy of the constructed model but also showcases its adaptability and controllability in practical application scenarios.

6.1. Testing Model Performance and Optimization Strategies

The cosine similarity coefficient is a measure of the angle between two vectors, particularly useful in the field of text analysis for assessing the similarity between different texts. It reflects the consistency of their directions in high-dimensional space by calculating the cosine of the angle between the two vectors. This measure is particularly effective in text corpora, as text can often be represented as high-dimensional vectors, with each dimension corresponding to the frequency of occurrence of a word or phrase.
To evaluate the generative effectiveness and refine output quality, a series of targeted testing strategies were employed. Central to this effort was the development of a function named generate_text_with_circuit_breaker, which integrates an intelligent circuit-breaking mechanism to govern the coherence and diversity of the generated text. The function operates through the following steps:
  • Sentence Integrity Check: Ensures each candidate output concludes with punctuation such as periods, question marks, or exclamation points, preserving natural sentence integrity.
  • Cosine Similarity Evaluation: Evaluate the similarity between newly generated sentences and previous text using TF-IDF vectorization and cosine similarity calculation. When the similarity is too high (exceeding a preset threshold, such as 0.8), trigger a fuse mechanism to terminate further text generation, thus preventing content disassociation and redundancy.
  • Filtering Out Sentences Ending with Question Marks: Exclude sentences ending with question marks from the final output to better suit the needs of certain scenarios, such as declarative answers.
The generate_text_with_circuit_breaker function embodies a circuit-breaking strategy designed to excise segments from the generated text that closely resemble the input text, thus guaranteeing freshness and variety in the output. By assessing paragraph-to-paragraph cosine similarities and retaining segments only when their similarity falls below a set threshold and concludes with specific characters (like periods or exclamation marks), the function averts the repetition of information, enhancing response quality.
Initially, the calculate_similarity function is defined, utilizing TfidfVectorizer and cosine_similarity to quantify the cosine similarity between two pieces of text. This step translates text into TF-IDF vector representations, subsequently measuring their semantic similarity through cosine similarity. Lower similarity scores indicate greater content disparity between the texts.
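A minimal version of calculate_similarity, assuming scikit-learn is available and that Chinese (or Uyghur) input has already been segmented into space-separated tokens, can be written as follows.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def calculate_similarity(text_a: str, text_b: str) -> float:
    """TF-IDF cosine similarity between two pieces of text, in the range [0, 1]."""
    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform([text_a, text_b])     # 2 x vocabulary sparse matrix
    return float(cosine_similarity(tfidf[0], tfidf[1])[0, 0])
```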
In practical experiments, the application of cosine similarity is significant. The text generated by the model’s inference is a continuous sequence whose length is controlled by parameters such as d_model and max_new_tokens. d_model is the word-vector dimension and, to a certain extent, determines the demand for computational resources; max_new_tokens is a threshold on the number of consecutive tokens produced during inference, yielding a series of coherent, contextually related tokens. In practice, however, our training parameters are relatively small, and generating longer texts can be understood as extracting the corresponding text blocks from the corpus, so the integrity of the output cannot be guaranteed if word-level context fusion is interrupted.
Therefore, in the experiment, the generated text is segmented into paragraphs and subjected to a similarity check, which avoids incoherence in the knowledge passages extracted during inference. A threshold is set on the similarity between paragraphs to decide whether a fusion interruption (circuit break) is needed. The procedure used in the experiment is as follows (a sketch of these steps is given after the list):
Step 1: Initialization: A vectorizer is created to build a vocabulary and calculate TF-IDF weights.
Step 2: Vectorization: The generated text is divided into paragraphs, and each paragraph is converted into a TF-IDF vector and stored in tfidf_matrix.
Step 3: Similarity Calculation: The similarity between the first and second paragraphs is calculated by obtaining their TF-IDF vectors, calling cosine_similarity, and storing the result in the similarities list.
Step 4: Repeat Calculation: The above steps are repeated for each pair of adjacent paragraphs in the document until the last paragraph.
Step 5: Collect Results: The similarities list now contains the similarity scores for each pair of adjacent paragraphs in the document. This allows for real-time monitoring of the similarity between paragraphs within the generated text, which is beneficial for understanding the document structure and improving the quality of the final output text.
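These five steps can be condensed into the short routine below; the paragraph-splitting rule is an assumption for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def adjacent_paragraph_similarities(generated_text: str, sep: str = "\n"):
    """Steps 1-5: vectorize each paragraph and score every adjacent pair."""
    paragraphs = [p for p in generated_text.split(sep) if p.strip()]
    vectorizer = TfidfVectorizer()                       # Step 1: vocabulary and TF-IDF weights
    tfidf_matrix = vectorizer.fit_transform(paragraphs)  # Step 2: one TF-IDF vector per paragraph
    similarities = []
    for i in range(len(paragraphs) - 1):                 # Steps 3-4: each adjacent pair
        sim = cosine_similarity(tfidf_matrix[i], tfidf_matrix[i + 1])[0, 0]
        similarities.append(float(sim))
    return similarities                                  # Step 5: collected similarity scores
```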
Cosine similarity is a metric used to measure the similarity in direction between two non-zero vectors, and its value ranges from −1 to 1. Specifically, the formula for calculating cosine similarity is:
$$\text{Cosine Similarity} = \frac{A \cdot B}{\lVert A \rVert \,\lVert B \rVert} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}}\;\sqrt{\sum_{i=1}^{n} B_i^{2}}}$$
The generate_text_with_circuit_breaker function implements the core logic of the circuit breaker mechanism. It takes input text (input_text), an originally generated response text (output_text), and an optional similarity threshold (defaulting to 0.0). The function first splits the response text into multiple paragraphs and then calculates the similarity between each paragraph and the input text. For each paragraph, if its similarity to the input text is below the set threshold and the paragraph ends with specific ending characters (such as a period, exclamation mark, etc.), the paragraph is considered novel and complete, thus preserved. Finally, the preserved paragraphs are re-assembled into the final output text, which removes redundant information highly similar to the input and ensures the independence and creativity of the output content.
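Under the same assumptions, the circuit-breaking filter can be sketched as follows. The description above mentions a default threshold of 0.0 and an example threshold of 0.8; the sketch uses 0.8 for illustration, and the calculate_similarity helper from above is repeated so the snippet runs on its own.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def calculate_similarity(text_a: str, text_b: str) -> float:
    tfidf = TfidfVectorizer().fit_transform([text_a, text_b])
    return float(cosine_similarity(tfidf[0], tfidf[1])[0, 0])

def generate_text_with_circuit_breaker(input_text: str, output_text: str,
                                       similarity_threshold: float = 0.8) -> str:
    """Keep paragraphs that differ sufficiently from the input and end with a closing mark;
    stop (circuit break) once a highly similar paragraph is encountered."""
    endings = ("。", "！", ".", "!")          # question marks are deliberately excluded
    kept = []
    for paragraph in output_text.split("\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        similarity = calculate_similarity(input_text, paragraph)
        if similarity >= similarity_threshold:
            break                             # circuit breaker: redundancy detected
        if paragraph.endswith(endings):
            kept.append(paragraph)
    return "\n".join(kept)
```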
The split_and_output_text function handles the generated text according to specific rules (such as \t\t separators or sentence segmentation) to ensure that the output format meets the requirements, such as limiting the number of output sentences.
By using a series of carefully selected input examples related to the agricultural field, such as “expanding agricultural product sales channels” or “integrating farmers into the entire agricultural industry chain”, the model is able to generate continuous text fragments closely related to the input topics. These generated results not only verify the model’s understanding and generation capabilities in specific domain knowledge but also highlight its potential applications, such as generating policy recommendations or supplementing industry reports.

6.2. Model Deployment and User Interaction Design

We have developed a straightforward command-line interface that enables users to input queries and receive immediate responses generated by the model, as illustrated in Figure 16. The process encompasses the following pivotal steps (a simplified sketch is given after the list):
  • Loading the Model and Tokenizer: Ensuring the pretrained GPT model and its corresponding tokenizer are efficiently loaded in the deployment environment, facilitating the encoding and decoding of text data.
  • Interactive Loop Design: Implementing an infinite loop that waits for user input, where each iteration involves transforming the user’s query into the input format required by the model. The model is then invoked to generate an answer via the generate function, which produces a response of a predetermined length based on the user’s question and instantly presents it to the user.
  • Simplified Text Generation Function: For rapid demonstration purposes, we provide the generate_text function. This function directly generates text without incorporating circuit-breaking logic, making it suitable for preliminary model performance validation and swift iterative testing.
  • Text Inspection and Circuit-Breaking Handling: The generate_text_with_circuit_breaker function subjects the text generated by the generation function to quality inspection and applies circuit-breaking measures to obtain a final response that adheres to the intent of the input query, ensuring relevance and coherence.
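A simplified version of this command-line loop is sketched below; it assumes the tokenizer exposes encode/decode, the model exposes the generate function described above, and generate_text_with_circuit_breaker from Section 6.1 is available. All names are illustrative.

```python
import torch

def interactive_loop(model, tokenizer, max_new_tokens=128, device="cuda"):
    """Minimal command-line Q&A loop around the trained model."""
    model.to(device).eval()
    while True:
        question = input("Question (empty line to quit): ").strip()
        if not question:
            break
        input_ids = torch.tensor([tokenizer.encode(question)], device=device)
        with torch.no_grad():
            output_ids = model.generate(input_ids, max_new_tokens)
        raw_answer = tokenizer.decode(output_ids[0].tolist())
        # Text inspection and circuit breaking before showing the final answer.
        print(generate_text_with_circuit_breaker(question, raw_answer))
```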

7. Discussion

This study addresses the urgent need for agricultural technology promotion in the context of the rural revitalization strategy. Using PDF electronic books provided by online libraries such as Chaoxing as the original corpus, OCR technology is employed to convert image-based PDF books into text-based book corpora. Data cleaning, sorting, and tagging are conducted through regular expressions and manual error correction. The text corpora are segmented into sentences, and the Quick Translation Platform’s Chinese-Uyghur translation API is used to batch translate the Chinese text corpora into Uyghur corpora. The Chinese and Uyghur corpora are then merged.
Based on the Chinese corpora, a Chinese vocabulary is constructed, and through the Xunjie Translation Platform’s Chinese-Uyghur translation API, Uyghur words are translated to obtain a Uyghur vocabulary. SentencePiece is utilized with custom settings, including the Uyghur vocabulary, to tokenize the mixed Chinese and Uyghur corpora text, build a vocabulary, divide the dataset, and construct a word model.
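As a concrete illustration, training a SentencePiece unigram model over the mixed corpus might look as follows; the file paths are placeholders, the vocabulary size follows the 240,000-word figure reported in the conclusions, and forcing dictionary entries in through user_defined_symbols is one possible way to inject the Uyghur vocabulary.

```python
import sentencepiece as spm

# Placeholder: selected Uyghur dictionary entries that must appear as whole tokens.
uyghur_terms = open("uyghur_dictionary.txt", encoding="utf-8").read().split()

spm.SentencePieceTrainer.train(
    input="mixed_zh_ug_corpus.txt",       # one sentence per line, Chinese and Uyghur mixed
    model_prefix="zh_ug_unigram",
    model_type="unigram",                 # unigram model balances coverage and efficiency
    vocab_size=240000,                    # reported size of the bilingual vocabulary
    character_coverage=0.9995,            # retain rare Chinese characters and the Uyghur script
    user_defined_symbols=uyghur_terms,    # force dictionary entries into the vocabulary
)

sp = spm.SentencePieceProcessor(model_file="zh_ug_unigram.model")
print(sp.encode("农业技术推广", out_type=str))   # tokenize a Chinese phrase into subword pieces
```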
Utilizing the principles of the Transformer architecture, components such as Encoding, Self-Attention, Feed-Forward Network, and Layer Normalization are customized and combined to construct the model architecture. Parameters are quantitatively set, and the model is trained accordingly. The performance of the model is evaluated using metrics such as Loss, Accuracy, Recall, and F1 Score through the construction of evaluation functions [46].
In the model’s output section, a text context detection and text circuit breaker mechanism are implemented to filter and mask non-relevant text generated. The gradio library is used to create a user interaction interface, defining an input box to receive user questions or prompts and an output box to display the model’s generated responses. The interface is titled “Lightweight Professional Agricultural Q&A Robot” and includes descriptive text to guide user input. This successfully constructs an agricultural Q&A robot application supporting Chinese and Uyghur languages, based on a lightweight large language model specializing in the agricultural domain.

7.1. Comparison with Existing Language Models and Training Methods

Compared with existing language models, this paper proposes a lightweight large language model specifically designed for the agricultural domain that supports both Chinese and Uyghur. The model is based on the Transformer architecture and optimized for the specific challenges involved. As shown in Table 7, its loss (0.067) is significantly lower than that of the LSTM (0.140) and TKAN (0.346) models, its accuracy and F1 score (both 0.984) are significantly higher (LSTM: 0.687 and 0.831; TKAN: 0.653 and 0.653), and its training time under the same hardware configuration (3.9 h) is shorter than that of the other two models (5.2 h and 7.1 h). The experimental results indicate that the Transformer-based language model trained on the mixed Chinese and Uyghur corpus performs significantly better than the other two models.
The corpus processing method, which combines OCR text recognition, batch translation via the Swift Translation Platform API, and manual review and correction, was extended to Tibetan, a language with similarly scarce corpus resources, using the same Transformer architecture-based model. The results in Table 7 show a loss of 0.066 and an accuracy of 0.986. These values differ little from the training results for Chinese and Uyghur, demonstrating good stability, scalability, and overall comprehensive performance.

7.2. Comparison of Hardware Resource Requirements with Existing Mainstream Models

The proposed method supports bilingual capabilities in Mandarin and Uyghur, which is relatively rare among large language models; in particular, support for Uyghur in the agricultural field is uncommon in other studies. ChatGPT and ChatGLM support multiple languages such as English and Chinese [46], but their support for minority languages is limited, and neither supports Uyghur.
The proposed method is optimized for professional knowledge in the agricultural field, including agricultural development history, crop cultivation, animal husbandry, and other agricultural sub-domains. This targeted optimization makes the model more accurate and efficient in handling agriculture-related issues, whereas general large language models may have limitations in deep understanding and generation within specific fields. Specialized models, by contrast, focus on highly relevant sustainable development goals, which emphasizes the importance of careful model selection in light of task requirements, cost, complexity, and transparency [47].
When running OPT-175B on a single 16 GB GPU, FlexGen achieves significantly higher throughput than the most advanced offloading systems, reaching a generation throughput of 1 token/s for the first time with an effective batch size of 144. On the HELM benchmark, FlexGen can benchmark a 30B model with a 16 GB GPU on 7 representative sub-scenarios within 21 h [47].
The existing method can be normally deployed and run on a graphics card configured with 4 cores and 4 GB of video memory, significantly reducing hardware requirements and making the model more accessible and applicable. As shown in Table 8, compared to ChatGPT, Llama-2, ChatGLM, and Wenxin Yiyan, which have higher hardware requirements and require a high-performance distributed GPU integrated computing environment, the existing method has lower hardware demands.

7.3. Context Correlation Detection and Fuse Mechanism

This paper proposes a lightweight large language model specifically designed for the agricultural field, supporting both Chinese and Uyghur. The model is based on the Transformer architecture and optimized for specific challenges, including the collection and processing of agricultural Uyghur corpus, Chinese-Uyghur mixed word segmentation, lightweight modeling in the agricultural field, context correlation detection, and text circuit interruption mechanism. With a size of only 1.5 GB, the model is suitable for running on low VRAM GPUs, aiming to improve semantic coherence while reducing memory usage and computational requirements.
The existing method effectively addresses the issue of non-relevant text fusion during small-scale corpus training through deep learning techniques, context detection mechanisms, and text fuse mechanisms. This approach enhances the model’s robustness and potential advantages in handling complex problems.
The proposed method demonstrates significant advantages in terms of model size, hardware requirements, domain-specific optimization, context correlation detection, text circuit interruption mechanisms, model transparency, and environmental impact. However, there are also limitations: data in certain agricultural sub-fields may not be comprehensive, affecting the model’s generalization ability; computational resource limitations can become a bottleneck when handling large-scale or complex tasks; and the model’s adaptability to multilingual and cross-cultural scenarios still needs further exploration and optimization. These limitations and the corresponding future work are discussed in Section 7.4.

7.4. Limitations and Future Prospects

Although this study has made breakthroughs in the construction of a large Chinese-Uyghur bilingual language model in the agricultural field, there are several limitations. Despite covering two major languages, the dataset may have insufficient coverage in certain agricultural sub-fields, affecting the model’s generalization. Although the model is lightweight, computational resource limitations can still be a bottleneck when handling large-scale or complex tasks. Despite the model’s excellent performance in bilingual text processing, further exploration and optimization are needed for adaptability in multilingual and cross-cultural scenarios. Additionally, while the model’s context detection and text fuse mechanisms are innovative, more precise adjustments are required to meet diversified needs. The model’s interpretability needs to be enhanced to increase user trust, especially in scenarios with high real-time feedback requirements, where the model’s reasoning speed and response time need to be optimized. Future work will focus on expanding datasets, optimizing model structures, enhancing multilingual support, improving context detection mechanisms, increasing interpretability, and optimizing real-time performance. In particular, the model’s deficiency in multimodal processing capabilities, such as combining image, sound, and other data types for comprehensive analysis and understanding, will be a key focus of future research to achieve richer interactions and broader application scenarios. Simultaneously, assessing the social impact of the model is essential to ensure parallel technological development and social value.

7.5. Conclusions

(1) Verification of Model Performance Superiority: Experimental results show that, through the carefully designed Transformer architecture and customized parameter settings, the constructed model achieves an accuracy of over 90% in agricultural terminology recognition, significantly outperforming general-purpose models in the agricultural field. This indicates that targeted model optimization can effectively improve the accuracy and practicality of domain-specific knowledge-based question answering, reducing misunderstandings and deviations and enhancing the precision of agricultural technology dissemination.
(2) Breakthrough in Lightweight Design: Based on the assessment of hardware requirements for model training, a GPU with 12 GB of video memory is sufficient to complete training at larger scales or batch sizes while maintaining stability and flexibility during the training process. The resulting model is only 1.5 GB in size and, according to the evaluation of the computing power required for loading and applying it, runs smoothly in a low-configuration environment with 4 GB of video memory, lowering the technical threshold for agricultural intelligent services. This demonstrates that lightweight models can maintain high performance while adapting to resource-constrained environments, facilitating widespread deployment and popularization, especially for remote areas and grassroots agricultural workers, and represents a significant technological advancement.
(3) Knowledge Fusion in a Bilingual Environment: This study addresses the absence of minority languages in large language models by constructing a high-quality bilingual corpus in Chinese and Uyghur, promoting technological fairness and the inclusion of linguistic diversity. The development of translation tools and data integration strategies not only enriches the agricultural knowledge base but also builds a bridge for cross-lingual knowledge sharing. This is the first time Uyghur has been proposed and experimentally studied as a corpus for constructing a large language model, enhancing Uyghur farmers’ ability to access information and reflecting the humanistic care and social value of the technology.
(4) Innovation in Data Processing and Model Training: During the data preprocessing stage, the study innovatively combines OCR technology and text cleaning methods to effectively convert a large number of professional agricultural books into electronic texts, providing rich training data for the model. Meanwhile, the simplification of traditional characters and custom text tagging strategies ensure the quality of knowledge graph construction and enhance the model’s ability to understand context. This methodological innovation provides valuable experience for developing specific language models in other fields.
(5) Utilizing SentencePiece to improve Uyghur word segmentation accuracy through bilingual data, including:
1. Using Chinese data to train the model and translating it into Uyghur to create a targeted dictionary;
2. Integrating translation tools to construct a bilingual dictionary, enhancing multilingual processing capabilities;
3. Employing the unigram model to balance coverage and efficiency;
4. Preserving text format and structure, suitable for complex document processing;
5. Precisely controlling word segmentation to maintain textual naturalness;
6. Generating a bilingual vocabulary containing 240,000 words, promoting Chinese–Uyghur bilingual information processing technology.
(6) Innovatively applying a context detection and circuit breaker mechanism for text generation, significantly optimizing the quality and novelty of generated text. Features include:
1. Using the generate_text_with_circuit_breaker function to integrate intelligent circuit breakers to ensure textual coherence and diversity;
2. Implementing sentence integrity detection and cosine similarity evaluation to automatically screen and prevent redundant content generation;
3. Customizing similarity thresholds and end-symbol filtering to enhance content independence;
4. Accurately calculating text similarity through TF-IDF vectorization, enhancing the depth of semantic understanding.
The model has been verified in agricultural scenarios, demonstrating its ability to generate high-quality and relevant content in specific domains, showcasing its broad potential in policy formulation, industry analysis, and other application scenarios.

Author Contributions

Conceptualization, K.P. and X.Z.; Methodology, K.P.; Software, L.C.; Validation, K.P., X.Z. and L.C.; Formal Analysis, K.P.; Investigation, X.Z.; Resources, X.Z.; Data Curation, K.P.; Writing—Original Draft Preparation, K.P. and X.Z.; Writing—Review and Editing, K.P. and X.Z.; Visualization, X.Z.; Supervision, L.C.; Project Administration, L.C.; Funding Acquisition, X.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Xinjiang Production and Construction Corps Project for the Development and Application Demonstration of Intelligent Robot Equipment for Cotton topping (Project No. 2023AB040) and the President’s Fund for Innovative Research Teams at Tarim University (Project No. TDZKCX202308).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The experimental data supporting the findings of this study are available from the corresponding author upon reasonable request.

Acknowledgments

The authors express sincere gratitude to the technicians who contributed to this research.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Rathi, A.K.A. Pursuing the distilled good practices to improve the quality of Environmental Impact Assessment Reports and hence enhance the EIA effectiveness and help address the concerns of project proponents: An Indian Context. Macro Manag. Public Policies 2023, 5, 26–43. [Google Scholar] [CrossRef]
  2. Zhu, A.; Dugan, L.; Hwang, A.; Callison-Burch, C. Kani: A Lightweight and Highly Hackable Framework for Building Language Model Applications. arXiv 2023, arXiv:2309.05542. [Google Scholar] [CrossRef]
  3. Zhang, X.; Zhang, X.; Yu, Y. ChatGLM-6B Fine-Tuning for Cultural and Creative Products Advertising Words. In Proceedings of the 2023 International Conference on Culture-Oriented Science and Technology (CoST), Xi’an, China, 11–14 October 2023; pp. 291–295. [Google Scholar]
  4. Xia, Z.; Gao, B.; Yu, C.; Han, H.; Zhang, H.; Wang, S. A Hybrid Parallel Strategy for Isogeometric Topology Optimization via CPU/GPU Heterogeneous Computing. Comput. Model. Eng. Sci. 2024, 138, 1103–1137. [Google Scholar] [CrossRef]
  5. Akilandeswari, K.; Sivakumar, N.R.; Alkahtani, H.K.; Basheer, S.; Ghorashi, S.A. Smart Healthcare Activity Recognition Using Statistical Regression and Intelligent Learning. Comput. Mater. Contin. 2024, 78, 1189–1205. [Google Scholar] [CrossRef]
  6. Zhong, S.; Yan, Z.; Wei, C.; Feng, L.; Chun, Z. Missing Value Imputation for Radar-Derived Time-Series Tracks of Aerial Targets Based on Improved Self-Attention-Based Network. Comput. Mater. Contin. 2024, 78, 3349–3376. [Google Scholar]
  7. Mazharul, H.Q.; Arif, F.; Aurangzeb, K.; Khan, J.A.; Rubab, S.; Anwar, M.S. Identification of Software Bugs by Analyzing Natural Language-Based Requirements Using Optimized Deep Learning Features. Comput. Mater. Contin. 2024, 78, 4379–4397. [Google Scholar]
  8. Cui, X.; Song, C.; Li, D.; Qu, X.; Long, J.; Yang, Y.; Zhang, H. RoBGP: A Chinese Nested Biomedical Named Entity Recognition Model Based on RoBERTa and Global Pointer. Comput. Mater. Contin. 2024, 78, 3603–3618. [Google Scholar] [CrossRef]
  9. Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; et al. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Punta Cana, Dominican Republic, October 2020; pp. 38–45. Available online: https://aclanthology.org/2020.emnlp-demos.6/ (accessed on 20 May 2024).
  10. Boser, B.E.; Guyon, I.M.; Vapnik, V.N. A Training Algorithm for Optimal Margin Classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory (COLT ’92), Pittsburgh, PA, USA, 27–29 July 1992; pp. 144–152. [Google Scholar] [CrossRef]
  11. Bashar, M.A.; Nayak, R. ALGAN: Time Series Anomaly Detection with Adjusted-LSTM GAN. Res. Sq. Prepr. 2023. [Google Scholar] [CrossRef]
  12. López Luna, M.; Taboada-Ortega, M.A.; Alvarez-Amparán, M.A.; Cedeño-Caero, L. Effect of iron incorporation on W based catalysts for oxidative desulfurization of dibenzothiophene compounds. Catal. Today 2022, 394, 336–347. [Google Scholar] [CrossRef]
  13. Han, K.; Xiao, A.; Wu, E.; Guo, J.; Xu, C.; Wang, Y. Transformer in transformer. Adv. Neural Inf. Process. Syst. 2021, 34, 15908–15919. [Google Scholar]
  14. Raparthi, M.; Dodda, S.B.; Reddy, S.R.B.; Thunki, P.; Maruthi, S.; Ravichandran, P. Advancements in Natural Language Processing—A Comprehensive Review of AI Techniques. J. Bioinform. Artif. Intell. 2021, 1, 1–10. [Google Scholar]
  15. Zhao, G.; Wang, Z.; Huang, Y.; Zhang, H.; Ma, X. Transformer-Based Maneuvering Target Tracking. Sensors 2022, 22, 8482. [Google Scholar] [CrossRef] [PubMed]
  16. Wu, J.; Bai, T.; Li, X. Inverting Chlorophyll Content in Jujube Leaves Using a Back-Propagation Neural Network–Random Forest–Ridge Regression Algorithm with Combined Hyperspectral Data and Image Color Channels. Agronomy 2024, 14, 140. [Google Scholar] [CrossRef]
  17. Zhang, Y.; Hu, Y.; Chen, X. Context and Multi-Features-Based Vulnerability Detection: A Vulnerability Detection Frame Based on Context Slicing and Multi-Features. Sensors 2024, 24, 1351. [Google Scholar] [CrossRef] [PubMed]
  18. Ruan, S.; Cang, H.; Chen, H.; Yan, T.; Tan, F.; Zhang, Y.; Duan, L.; Xing, P.; Guo, L.; Gao, P.; et al. Hyperspectral Classification of Frost Damage Stress in Tomato Plants Based on Few-Shot Learning. Agronomy 2023, 13, 2348. [Google Scholar] [CrossRef]
  19. Bin, W.; Lin, W. Forecasting Grain Yield in China Using Attention-based ADE-Bi-IndRNN Model. Oper. Res. Manag. Sci. 2024, 33, 102–119. [Google Scholar]
  20. Zheng, Z.; Huang, S.; Weng, R.; Dai, X.-Y.; Chen, J. Improving self-attention networks with sequential relations. IEEE/ACM Trans. Audio Speech Lang. Process. 2020, 28, 1707–1716. [Google Scholar] [CrossRef]
  21. Xu, G.; Liu, L.; Dong, J. Vulnerability Detection of Ethereum Smart Contract Based on SolBERT-BiGRU-Attention Hybrid Neural Model. Comput. Model. Eng. Sci. 2023, 137, 903–922. [Google Scholar] [CrossRef]
  22. Zhou, W.; Jiang, X.; Qin, C. C-CORE: Clustering by Code Representation to Prioritize Test Cases in Compiler Testing. Comput. Model. Eng. Sci. 2024, 139, 2069–2093. [Google Scholar] [CrossRef]
  23. Gillioz, A.; Casas, J.; Mugellini, E.; Abou Khaled, O. Overview of the Transformer-Based Models for NLP Tasks; IEEE: New York, NY, USA, 2020; pp. 179–183. [Google Scholar]
  24. Kaixu, Z.; Maosong, S. Unified Framework of Performing Chinese Word Segmentation and Part-of-Speech Tagging. China Commun. 2012, 1, 1–9. [Google Scholar]
  25. Kostić, M.; Batanović, V.; Nikolić, B. Monolingual, multilingual and cross-lingual code comment classification. Eng. Appl. Artif. Intell. 2023, 124, 106485. [Google Scholar] [CrossRef]
  26. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2017. [Google Scholar]
  27. Dongmei, Z.; Mengzhen, S. Integrated Development of Tea and Tourism in Taishan Mountain Tea Valley in the Context of Rural Revitalization. Asian Agric. Res. 2024, 16, 1–9. [Google Scholar]
  28. Li, L.; Li, J.; Wang, H.; Nie, J. Application of the transformer model algorithm in chinese word sense disambiguation: A case study in chinese language. Sci. Rep. 2024, 14, 6320. [Google Scholar] [CrossRef] [PubMed]
  29. Pressel, D.; Liu, W.; Johnston, M.; Chen, M. Lightweight transformers for conversational ai. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Industry Track, Seattle, WA, USA, 10–15 July 2022; pp. 221–229. [Google Scholar]
  30. Zhou, J.; Lin, Q.; Feng, X.; Ren, D.; Teng, J.; Wu, X.; Wu, D.; Zhang, X.; Yuan, X.; Chen, Z.; et al. Evaluating the performance of genomic selection on purebred population by incorporating crossbred data in pigs. J. Integr. Agric. 2024, 23, 639–648. [Google Scholar] [CrossRef]
  31. Liu, A.; Han, X.; Wang, Y.; Tsvetkov, Y.; Choi, Y.; Smith, N.A. Tuning Language Models by Proxy. arXiv 2024, arXiv:2401.08565. [Google Scholar] [CrossRef]
  32. Liu, Z.; Yao, W.; Zhang, J.; Yang, L.; Liu, Z.; Tan, J.; Choubey, P.K.; Lan, T.; Wu, J.; Wang, H.; et al. AgentLite: A Lightweight Library for Building and Advancing Task-Oriented LLM Agent System. arXiv 2024, arXiv:2402.15538. [Google Scholar]
  33. Thawakar, O.; Vayani, A.; Khan, S.; Cholakal, H.; Anwer, R.M.; Felsberg, M.; Baldwin, T.; Xing, E.P.; Khan, F.S. MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT. arXiv 2024, arXiv:2402.16840. [Google Scholar]
  34. He, C.; Luo, R.; Hu, S.; Zhao, Y.; Zhou, J.; Wu, H.; Zhang, J.; Han, X.; Liu, Z.; Sun, M. UltraEval: A Lightweight Platform for Flexible and Comprehensive Evaluation for LLMs. arXiv 2024, arXiv:2404.07584. [Google Scholar]
  35. Shi, Z.; Xu, X.; Liu, X.; Chen, J.; Yang, M.H. Video frame interpolation transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 17482–17491. [Google Scholar]
  36. Scoones, I. The Politics of Global Assessments: The Case of the International Assessment of Agricultural Knowledge, Science and Technology for Development (IAASTD). J. Peasant. Stud. 2009, 36, 547–571. [Google Scholar] [CrossRef]
  37. He, S.; Xin, J.; Peng, H.; Zhang, E. Research on Malicious URL Detection Based on Feature Contribution Tendency. In Proceedings of the 2021 IEEE 6th International Conference on Cloud Computing and Big Data Analytics (ICCCBDA), Chengdu, China, 24–26 April 2021; pp. 576–581. [Google Scholar]
  38. Chiang, S.-Y.; Lin, T.-Y. Low-Brightness Object Recognition Based on Deep Learning. Comput. Mater. Contin. 2024, 79, 1757–1773. [Google Scholar] [CrossRef]
  39. Soutner, D.; Müller, L. Application of LSTM neural networks in language modelling. In Proceedings of the 16th International Conference on Text, Speech, and Dialogue (TSD 2013), Pilsen, Czech Republic, 1–5 September 2013; Springer: Berlin/Heidelberg, Germany, 2013; pp. 105–112. [Google Scholar]
  40. Yu, Y.; Si, X.; Hu, C.; Zhang, J. A Review of Recurrent Neural Networks: LSTM Cells and Network Architectures. Neural Comput. 2019, 31, 1235–1270. [Google Scholar] [CrossRef]
  41. Genet, R.; Inzirillo, H. Tkan: Temporal Kolmogorov-Arnold Networks. arXiv 2024, arXiv:2405.07344. [Google Scholar] [CrossRef]
  42. Ansari, A.S. A Review on the Recent Trends of Image Steganography for VANET Applications. CMC-Comput. Mater. Contin. 2024, 78, 2865–2892. [Google Scholar] [CrossRef]
  43. Xu, M.; Shen, C.; Zhang, J.; Wang, Z.; Ruan, Z.; Poslad, S.; Xu, P. Improved HardNet and Stricter Outlier Filtering to Guide Reliable Matching. Comput. Mater. Contin. 2023, 75, 4785–4803. [Google Scholar] [CrossRef]
  44. Hajikhani, A.; Cole, C. A Critical Review of Large Language Models: Sensitivity, Bias, and the Path Toward Specialized AI. Quant. Sci. Stud. 2024, 1–22. [Google Scholar] [CrossRef]
  45. Hsu, H.H.; Huang, N.F. Xiao-Shih: A Self-Enriched Question Answering Bot with Machine Learning on Chinese-Based MOOCs. IEEE Trans. Learn. Technol. 2022, 15, 223–237. [Google Scholar] [CrossRef]
  46. Roy, P.K.; Saumya, S.; Singh, J.P.; Banerjee, S.; Gutub, A. Analysis of Community Question-Answering Issues via Machine Learning and Deep Learning: State-of-the-Art Review. CAAI Trans. Intell. Technol. 2023, 8, 95–117. [Google Scholar] [CrossRef]
  47. Sheng, Y.; Zheng, L.; Yuan, B.; Li, Z.; Ryabinin, M.; Chen, B.; Liang, P.; Re, C.; Stoica, I.; Zhang, C. FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU. In Proceedings of the 40th International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; Volume 202, pp. 31094–31116. [Google Scholar] [CrossRef]
Figure 1. The image depicts an example of a question-and-answer interaction in Uyghur, which is then translated into English to demonstrate the model’s application.
Figure 2. Screenshots (a,b) showcase the translated results when posing questions in Uyghur. Screenshot (a) (Wenxin Yiyan Model Feedback): “Apologies, this feature is not yet available. Please feel free to ask me other questions in Chinese or English, and I will try my best to provide assistance”. Screenshot (b) (iFlytek Spark Model Feedback): “Sorry, your input language is not supported by iFlytek Spark. Please pose your questions in Chinese or English”.
Figure 3. Corpus collection and classification directory structure diagram.
Figure 4. The loss incurred from using OCR technology to recognize the text in the book sample titled “The Historic Transition from Traditional Agriculture to Modern Agriculture”.
Figure 5. The average loss rate incurred from using OCR technology to recognize the text in 102 books.
Figure 6. The accuracy rate of translating 50 randomly selected Chinese paragraph samples into Uyghur.
Figure 7. The average accuracy rate of translating every two Chinese books into Uyghur among 102 books.
Figure 8. Transformer structure before and after modifications comparison.
Figure 9. Model training results.
Figure 10. Loss function curve and accuracy curve graph.
Figure 11. Loss function curve and accuracy curve graph for the LSTM model.
Figure 12. Loss function curve and accuracy curve graph for the TKAN model.
Figure 13. The accuracy rate of translating 50 randomly selected Chinese paragraph samples into Tibetan.
Figure 14. The average accuracy rate of translating every two Chinese books into Tibetan among 8 books.
Figure 15. Loss function curve and accuracy curve graph for a Transformer model trained with a mixture of Chinese and Tibetan languages.
Figure 16. TaLiMu GPT operation structure.
Table 1. Key performance characteristics summary of major general-purpose language models.

Model | Supported Languages | Deployment Environment | Model Type | Parameter Volume | Key Features
ChatGPT v3.5 | English, Chinese, French, German, etc. | High-performance distributed GPU integrated computing | General-purpose large language model | Approximately 175 billion | Strong multi-language processing capability; supports broad-domain conversations; resource-intensive, requiring a high-performance computing environment for deployment.
ChatGLM v6B | Chinese, English | Quantized model with minimum support of 6 cores and 6 GB VRAM | General-purpose large language model | Approximately 6 billion | Lower hardware threshold; suitable for a wider range of deployment environments while maintaining good language processing capabilities.
Llama3 v70B | English, Chinese, etc. | Quantized model with minimum support of 8 cores and 12 GB VRAM | General-purpose large language model | Approximately 70 billion | Larger parameter volume brings stronger expressive ability; hardware requirements are reduced through quantization.
Wenxin Yiyan v3.5 | Chinese, English, etc. | High-performance distributed GPU integrated computing | General-purpose large language model | Approximately 100 billion | Optimized for Chinese, with powerful language understanding and generation abilities; requires higher hardware specifications.
iFlytek Spark v4.0 | Chinese, English, etc. | High-performance distributed GPU integrated computing | General-purpose large language model | Approximately 100 billion | Highly optimized for speech recognition and processing, especially in Chinese scenarios; high hardware requirements.
Table 3. Model evaluation metrics results.

Metric Names | Loss | Accuracy | Recall | F1 Score
Result | 0.067 | 0.984 | 0.984 | 0.984
Table 4. LSTM model evaluation metrics results.

Metric Names | Loss | Accuracy | Recall | F1 Score
Result | 0.140 | 0.687 | 0.831 | 0.831
Table 5. TKAN model evaluation metrics results.

Metric Names | Loss | Accuracy | Recall | F1 Score
Result | 0.346 | 0.653 | 0.653 | 0.653
Table 6. Evaluation metrics results for a Transformer model trained with a mixture of Chinese and Tibetan languages.

Metric Names | Loss | Accuracy | Recall | F1 Score
Result | 0.066 | 0.986 | 0.986 | 0.986
Table 7. Training result data of different models and corpora of different language categories.

Model Name | Loss | Accuracy | Recall | F1 Score | Train Time (h) | Data Type
Transformer | 0.067 | 0.984 | 0.984 | 0.984 | 3.9 | Chinese and Uyghur
LSTM | 0.140 | 0.687 | 0.831 | 0.831 | 5.2 | Chinese and Uyghur
TKAN | 0.346 | 0.653 | 0.653 | 0.653 | 7.1 | Chinese and Uyghur
Transformer | 0.066 | 0.986 | 0.986 | 0.986 | 4.1 | Chinese and Tibetan
Table 8. Comparison of hardware requirements and parameter statistics with existing mainstream large language models.

Model Name | Deployment Environment | Model Type | Number of Parameters | Model Size
ChatGPT v3.5 | High-performance distributed GPU integrated computing | General-purpose large language model | Approximately 175 billion | >20 GB
ChatGLM v6B | Quantized model with minimum support of 6 cores and 6 GB VRAM | General-purpose large language model | Approximately 6 billion | 12.4 GB
Llama3 v70B | Quantized model with minimum support of 8 cores and 12 GB VRAM | General-purpose large language model | Approximately 70 billion | >14 GB
Wenxin Yiyan v3.5 | High-performance distributed GPU integrated computing | General-purpose large language model | Approximately 100 billion | >20 GB
iFlytek Spark v4.0 | High-performance distributed GPU integrated computing | General-purpose large language model | Approximately 100 billion | >20 GB
TaLiMu GPT v1.0 | Supports a minimum of 4 cores and 4 GB VRAM | Domain-specific large language model | Approximately 78.52 million | 1.5 GB