Research on the Training and Application Methods of a Lightweight Agricultural Domain-Specific Large Language Model Supporting Mandarin Chinese and Uyghur
Abstract
1. Introduction
2. Corpus Collection and Preprocessing
2.1. Data Collection
2.2. Textual Conversion and Data Cleaning
2.2.1. Loading PDF Documents and Page Extraction
2.2.2. Text Recognition
2.2.3. Image Preprocessing
2.2.4. Result Handling and Output
2.2.5. Data Cleansing
2.3. Conversion of Traditional to Simplified Chinese Characters
- Define the Conversion Function: A function named traditional_to_simplified is defined, taking a parameter text that holds the Traditional Chinese text to be converted (a minimal sketch follows this list).
- Initialize the Converter: Inside the function, an OpenCC converter instance is created via cc = OpenCC('t2s'); the 't2s' argument specifies the conversion direction, from Traditional Chinese to Simplified Chinese.
- Execute the Conversion: The conversion is performed by calling .convert(text) on the converter instance cc, transforming the input Traditional Chinese text into Simplified Chinese.
- Return the Result: Upon completion, the function returns the converted Simplified Chinese text via return simplified_text.
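The steps above correspond to a short routine; the following is a minimal sketch of it using the opencc Python package (the example string is illustrative):

```python
from opencc import OpenCC  # e.g., the opencc-python-reimplemented package

def traditional_to_simplified(text):
    """Convert Traditional Chinese text to Simplified Chinese."""
    cc = OpenCC('t2s')                  # 't2s': Traditional-to-Simplified conversion profile
    simplified_text = cc.convert(text)  # perform the conversion
    return simplified_text

# Illustrative usage
print(traditional_to_simplified('農業機械與植物保護'))  # -> '农业机械与植物保护'
```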
2.4. Data Organization and Annotation
2.5. Construction of Uyghur Translation Tool
2.5.1. Architecture of the Uyghur Translation Tool System
2.5.2. Accuracy Evaluation of Uyghur Translation
2.5.3. Batch Translation of Chinese Text to Uyghur and Corpus Integration
3. Word Segmentation and Vocabulary Construction
- Jieba, as a classic Chinese word segmentation library, is renowned for its maturity, stability, and rich customization capabilities. It is particularly adept at handling word segmentation in Chinese text, but it may have limitations in dealing with unknown words (new words or specialized terminology) and long sentence structures.
- HanLP, as a comprehensive natural language processing toolkit [18], offers a wide range of Chinese processing functions, including word segmentation, part-of-speech tagging [19], etc. Its advantages lie in its high configurability and deeply optimized models. However, for large-scale data processing and task-specific optimization needs, more customization work may be required.
- In contrast, SentencePiece stands out as our preferred word segmentation and vocabulary construction tool in this study due to its unique design philosophy and practical advantages:
- Flexibility and universality: SentencePiece is not limited to a specific language. Its statistical-based subword unit generation method enables it to excel in multi-language processing, especially for mixed-language or low-resource language text data.
- Adaptability: By learning statistical patterns in the data [20], SentencePiece can automatically generate a subword dictionary that best fits the current corpus. This gives it significant advantages in discovering new words and handling rare words, which is particularly important for rapidly changing internet language and specialized literature.
- Efficiency and practicability: It supports multiple model types and adopts the unigram model by default, balancing vocabulary coverage and model complexity. At the same time, the generated integer encoding reduces storage requirements and accelerates subsequent model training (a training sketch follows this list).
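As an illustration of this workflow, the snippet below trains a unigram SentencePiece model on a mixed Chinese-Uyghur corpus and encodes a sentence into integer IDs. The file names and the character_coverage value are assumptions; the 240,000-entry vocabulary and the unigram model type follow the settings described in this study.

```python
import sentencepiece as spm

# Train a unigram subword model on the bilingual corpus (one sentence per line).
spm.SentencePieceTrainer.train(
    input='agri_corpus_zh_ug.txt',      # illustrative file name
    model_prefix='agri_bilingual',      # writes agri_bilingual.model / agri_bilingual.vocab
    vocab_size=240000,                  # bilingual vocabulary size used in this study
    model_type='unigram',               # default type: balances coverage and complexity
    character_coverage=0.9995,          # high coverage for Chinese characters and Uyghur script
)

sp = spm.SentencePieceProcessor(model_file='agri_bilingual.model')
ids = sp.encode('小麦锈病的防治方法', out_type=int)  # integer encoding used downstream
print(ids)
print(sp.decode(ids))
```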
Tokenizer Tool | Advantages | Disadvantages | Uyghur Language | Chinese Language | English Language | Japanese Language |
---|---|---|---|---|---|---|
jieba | 1. Supports three tokenization modes: accurate mode, full mode, and search engine mode [22]; 2. Supports custom dictionaries; 3. Supports multiple languages, such as Chinese [23], English, etc. | 1. Ineffective recognition of proper nouns and rare words; 2. Does not support minority languages like Uyghur | × | √ | √ | √ |
SentencePiece | 1. Supports multiple languages [23], such as Chinese, English, Japanese, etc.; 2. Supports character-level tokenization; 3. Allows training of custom models; 4. Supports custom retained dictionaries | 1. Requires a large amount of text data for model training; 2. Ineffective for minority languages like Uyghur | × | √ | √ | √ |
HanLP | 1. Supports multiple languages, such as Chinese [23], English, Japanese, etc.; 2. Supports functions like part-of-speech tagging and named entity recognition; 3. Supports custom dictionaries | 1. Ineffective recognition of proper nouns and rare words; 2. Does not support minority languages like Uyghur | × | √ | √ | √ |
3.1. Selection and Implementation of Tokenization Method
3.2. Dataset Partitioning and Encoding
3.3. Role of Vocabulary Construction and Dataset Partitioning
4. Model Architecture Design
4.1. Model Architecture Parameter Configuration
- Reduced memory usage: By reducing the bit-width of the model’s weights and activations, the required storage space for the model is significantly reduced [28,29,30]. This is especially crucial for devices with limited memory resources, such as mobile devices, embedded systems, or edge computing devices, enabling the deployment of large Transformer models on these devices.
- Accelerated inference speed: Much modern hardware (such as GPUs, TPUs, and dedicated AI accelerators) is optimized for low-precision computation, providing faster arithmetic at lower precision. Quantized models can therefore achieve higher throughput and lower latency when performing inference on such hardware [31,32,33,34,35].
- Energy saving: Due to the reduction in computations and memory accesses, quantized models consume less energy during runtime, significantly reducing energy costs and environmental impact for battery-powered devices or large-scale data centers.
- Widened application scope: Parameter quantization allows complex Transformer models that were previously difficult to deploy due to resource constraints to be applied in more scenarios, such as real-time natural language processing tasks on IoT devices or advanced decision support systems in autonomous vehicles.
- Enhanced user experience: Faster response times and lower energy consumption give users smoother and longer-lasting application experiences, especially on mobile devices (an illustrative quantization snippet follows this list).
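Parameter quantization itself is a standard technique. As a small illustration (not the deployment code used in this study), the PyTorch snippet below applies post-training dynamic quantization to the linear layers of a Transformer-style feed-forward block, storing their weights as int8:

```python
import torch
import torch.nn as nn

# A Transformer-style feed-forward block (d_model = 256, inner dimension 4 * d_model).
ffn = nn.Sequential(
    nn.Linear(256, 1024),
    nn.GELU(),
    nn.Linear(1024, 256),
)

# Dynamic quantization: weights of nn.Linear layers are stored as int8 and
# dequantized on the fly, roughly quartering their memory footprint versus float32.
quantized_ffn = torch.quantization.quantize_dynamic(ffn, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(16, 256, 256)   # (batch, seq_len, d_model)
with torch.no_grad():
    y = quantized_ffn(x)
print(y.shape)                   # torch.Size([16, 256, 256])
```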
- Vocabulary Size: 240,000 Uyghur and Chinese word items, covering a rich range of linguistic expressions.
- Sequence Length: The maximum input length is set to 256 to handle complex sentence structures and long text requirements.
- Hidden Dimension (d_model): Set to 256, serving as the dimension for the internal representation of the model, balancing computational efficiency and expressive power.
- Number of Layers (n_layer): A total of 8 Transformer blocks, increasing the depth of the model, which is conducive to learning more complex linguistic structures.
- Bias: Bias terms are used in the linear layers to enhance the model’s expressive ability.
- Dropout Rate: Set to 0.1, providing light regularization while preserving most of the information flow during training (the settings above are collected in the configuration sketch below).
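For reference, these hyperparameters can be gathered into a single configuration object. The sketch below uses common GPT-style field names as assumptions; the number of attention heads is likewise assumed, since it is not listed above.

```python
from dataclasses import dataclass

@dataclass
class ModelConfig:
    vocab_size: int = 240_000  # Chinese-Uyghur bilingual vocabulary
    block_size: int = 256      # maximum input sequence length
    d_model: int = 256         # hidden dimension of the internal representation
    n_layer: int = 8           # number of Transformer blocks
    n_head: int = 8            # attention heads (assumed; must divide d_model)
    dropout: float = 0.1       # dropout rate
    bias: bool = True          # use bias terms in the linear layers

config = ModelConfig()
print(config)
```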
4.1.1. Calculation of Memory Requirements
- Vocabulary Size and Embedding Matrix: With a vocabulary size of 2,400,000 words and an assumed embedding dimension of 256, the embedding matrix size is 2,400,000 × 256 = 614,400,000 floating-point numbers. Using float32 precision, this would require approximately 2.46 GB of GPU memory. However, this is typically not the main source of memory consumption, as the embedding layer can utilize techniques like Embedding Bag or optimized lookup tables to reduce memory usage.
- Memory Usage of Transformer Blocks: Each Transformer block primarily consumes memory for the Q, K, and V computations, i.e., (seq_len × d_model) × 3 values per layer. For 8 layers the total is 256 × 256 × 3 × 8 = 1,572,864 floating-point numbers, approximately 6.29 MB (float32). This covers only the basic intermediate states; with residual connections, layer normalization, the FFN, etc., the actual consumption is greater.
- Full Model Memory Estimation: Considering all layers, residual connections, intermediate outputs, and backpropagation, along with possible batch processing, a safe estimate is several times the above calculations, with additional overhead. For large-scale training, especially with large batch sizes, tens of GB of GPU memory may be required.
4.1.2. Computational Power Requirements
- (1) Embedding Layer: approximately 2.46 MB; actual usage can be reduced through optimization techniques. Transformer Block Forward Pass:
  - QKV Calculation: 256 × 256 × 3 = 196,608 floating-point numbers per layer; for 8 layers, 196,608 × 8 = 1,572,864.
  - FFN: approximately 4 × 256 × 256 = 262,144 floating-point numbers per layer; for 8 layers, 262,144 × 8 = 2,097,152.
  - Total: 1,572,864 + 2,097,152 = 3,670,016 floating-point numbers, approximately 14.68 MB (float32), not counting the small footprint of residual connections and Layer Normalization (LN).
- (2) Batch Size: the batch size is set to 16.
- (3) Backward Propagation: theoretically twice the forward pass; with optimizations and reuse it may be slightly less, but here it is simply estimated as twice.
Combining these (reproduced in the short calculation after this list):
- Forward Pass: approximately 14.68 MB, multiplied by the batch size of 16, giving approximately 234.88 MB.
- Backward Propagation: estimated as twice the forward pass, approximately 469.76 MB.
- Additional Overhead: gradients, optimizer states, etc.; assumed comparable in size to the model parameters, which generally account for a small portion.
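The estimate above can be reproduced with a few lines of arithmetic (float32 = 4 bytes, 1 MB = 10^6 bytes); note that it covers only the simplified activation counts and excludes the embedding layer, gradients, and optimizer states:

```python
seq_len, d_model, n_layer, batch_size = 256, 256, 8, 16
BYTES_PER_FLOAT32 = 4

qkv = seq_len * d_model * 3 * n_layer        # 1,572,864 values for Q, K, V over 8 layers
ffn = 4 * d_model * d_model * n_layer        # 2,097,152 values for the FFN over 8 layers

forward_mb = (qkv + ffn) * BYTES_PER_FLOAT32 / 1e6   # ~14.68 MB per example
batch_forward_mb = forward_mb * batch_size           # ~234.88 MB for a batch of 16
backward_mb = 2 * batch_forward_mb                   # ~469.76 MB, estimated as twice the forward pass

print(round(forward_mb, 2), round(batch_forward_mb, 2), round(backward_mb, 2))
```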
4.2. Key Components
4.2.1. Sinusoidal Positional Encoding
4.2.2. Multi-Head Self-Attention
4.2.3. Feed-Forward Network (FFN)
4.2.4. Layer Normalization
4.2.5. Initialization and Regularization
5. Model Training
5.1. Model Architecture and Configuration
5.2. Data Processing and Training Workflow
5.3. Model Evaluation and Generation Instances
- Accuracy/Precision: Precision = TP / (TP + FP), the proportion of predicted positive items that are correct.
- Recall: Recall = TP / (TP + FN), the proportion of true positive items that are correctly retrieved.
- F1 Score: F1 = 2 × Precision × Recall / (Precision + Recall), the harmonic mean of precision and recall.
- Loss: the value of the training objective (cross-entropy between the predicted and target token distributions); lower values indicate a better fit (a computation sketch follows this list).
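For reference, the snippet below shows one way such metrics could be computed with scikit-learn; the label arrays are placeholders rather than the study's evaluation data, and the weighted averaging scheme is an assumption:

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = [2, 0, 1, 1, 2, 0]   # placeholder reference labels (e.g., token or class ids)
y_pred = [2, 0, 1, 2, 2, 0]   # placeholder model predictions

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average='weighted', zero_division=0
)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```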
5.4. Performance of the Dataset When Applied to Other Models
5.4.1. Performance of the Dataset on LSTM
5.4.2. Performance of the Dataset on TKAN
5.5. Performance of the Dataset When Applied to Other Models
6. Model Application and Deployment
6.1. Testing Model Performance and Optimization Strategies
- Sentence Integrity Check: Ensures each candidate output concludes with terminal punctuation such as a period, question mark, or exclamation point, preserving natural sentence integrity.
- Cosine Similarity Evaluation: Evaluates the similarity between each newly generated sentence and the preceding text using TF-IDF vectorization and cosine similarity. When the similarity exceeds a preset threshold (e.g., 0.8), a circuit-breaker mechanism terminates further generation, preventing the output from drifting off topic or becoming redundant (a minimal sketch of these checks follows this list).
- Filtering Out Sentences Ending with Question Marks: Excludes sentences ending with question marks from the final output, better suiting scenarios that require declarative answers.
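The following is a minimal sketch of these checks under assumed helper names (is_complete_sentence, should_break, accept_sentence); it is not the authors' exact generate_text_with_circuit_breaker implementation:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

SIMILARITY_THRESHOLD = 0.8                    # preset circuit-breaker threshold
SENTENCE_ENDINGS = ('。', '！', '？', '.', '!', '?')

def is_complete_sentence(sentence: str) -> bool:
    """Sentence-integrity check: the candidate must end with terminal punctuation."""
    return sentence.strip().endswith(SENTENCE_ENDINGS)

def should_break(previous_text: str, new_sentence: str) -> bool:
    """Trigger the circuit breaker when the new sentence repeats earlier content."""
    # A character-level analyzer is used because Chinese and Uyghur text is not space-delimited.
    tfidf = TfidfVectorizer(analyzer='char').fit_transform([previous_text, new_sentence])
    similarity = cosine_similarity(tfidf[0], tfidf[1])[0, 0]
    return similarity > SIMILARITY_THRESHOLD

def accept_sentence(previous_text: str, new_sentence: str) -> bool:
    """Keep a candidate only if it is complete, declarative, and not redundant."""
    s = new_sentence.strip()
    return (is_complete_sentence(s)
            and not s.endswith(('？', '?'))            # filter question-mark endings
            and not should_break(previous_text, s))
```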
6.2. Model Deployment and User Interaction Design
- Loading the Model and Tokenizer: Ensuring the pretrained GPT model and its corresponding tokenizer are loaded efficiently in the deployment environment, so that text data can be encoded and decoded.
- Interactive Loop Design: Implementing an infinite loop that waits for user input. In each iteration, the user's query is transformed into the input format required by the model, the generate function is invoked to produce a response of a predetermined length, and the answer is presented to the user immediately (see the loop sketch after this list).
- Simplified Text Generation Function: For rapid demonstration purposes, the generate_text function generates text directly, without circuit-breaking logic, making it suitable for preliminary performance validation and quick iterative testing.
- Text Inspection and Circuit-Breaking Handling: The generate_text_with_circuit_breaker function subjects the generated text to quality inspection and applies circuit-breaking measures, yielding a final response that adheres to the intent of the input query and remains relevant and coherent.
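A hedged sketch of such an interactive loop is shown below; the model.generate interface and the tokenizer methods are assumed GPT-style APIs standing in for the components described above:

```python
import torch

def chat_loop(model, tokenizer, max_new_tokens=128):
    """Simple console loop: encode the question, generate an answer, print it."""
    model.eval()
    while True:
        question = input("User > ").strip()
        if not question:
            continue
        input_ids = torch.tensor([tokenizer.encode(question)], dtype=torch.long)
        with torch.no_grad():
            output_ids = model.generate(input_ids, max_new_tokens=max_new_tokens)
        answer = tokenizer.decode(output_ids[0].tolist())
        # In deployment, the raw answer would additionally pass through
        # generate_text_with_circuit_breaker-style checks before being shown.
        print(answer)
```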
7. Discussion
7.1. Comparison with Existing Language Models and Training Methods
7.2. Comparison of Hardware Resource Requirements with Existing Mainstream Models
7.3. Context Correlation Detection and Fuse Mechanism
7.4. Limitations and Future Prospects
7.5. Conclusions
- (1) Verification of Model Performance Superiority: Experimental results show that, through the carefully designed Transformer architecture and customized parameter settings, the constructed model achieves an accuracy of over 90% in agricultural terminology recognition, significantly outperforming general-purpose models in the agricultural field. This indicates that targeted model optimization can effectively improve the accuracy and practicality of domain-specific knowledge-based question answering, reducing misunderstandings and deviations and enhancing the precision of agricultural technology dissemination.
- (2) Breakthrough in Lightweight Design: According to the assessment of hardware requirements for model training, a GPU with 12 GB of video memory can complete the training task, even at larger scales or batch sizes, while keeping the training process stable and flexible. The resulting model is only 1.5 GB, and the evaluation of the computing power needed to load and apply it shows that it runs smoothly on low-configuration hardware with 4 GB of video memory, lowering the technical threshold for intelligent agricultural services. This demonstrates that lightweight models can maintain high performance while adapting to resource-constrained environments, facilitating widespread deployment and popularization, especially for remote areas and grassroots agricultural workers, and represents a significant technological advancement.
- (3) Knowledge Fusion in a Bilingual Environment: This study addresses the absence of minority languages in large language models by constructing a high-quality bilingual corpus in Chinese and Uyghur, promoting technological fairness and inclusiveness of linguistic diversity. The development of translation tools and data integration strategies not only enriches the agricultural knowledge base but also builds a bridge for cross-lingual knowledge sharing. This is the first time Uyghur has been proposed and experimentally studied as a corpus for constructing a large language model, enhancing Uyghur farmers' access to information and reflecting the humanistic care and social value of the technology.
- (4) Innovation in Data Processing and Model Training: During the data preprocessing stage, the study innovatively combines OCR technology and text cleaning methods to convert a large number of professional agricultural books into electronic text, providing rich training data for the model. Meanwhile, the simplification of traditional characters and custom text tagging strategies ensure the quality of knowledge graph construction and enhance the model's ability to understand context. This methodological innovation provides valuable experience for developing language models for other specific domains.
- (5) Utilizing SentencePiece to Improve Uyghur Word Segmentation Accuracy Through Bilingual Data, Including:
  1. Using Chinese data to train the model and translating it into Uyghur to create a targeted dictionary;
  2. Integrating translation tools to construct a bilingual dictionary, enhancing multilingual processing capabilities;
  3. Employing the unigram model to balance coverage and efficiency;
  4. Preserving text format and structure, suitable for complex document processing;
  5. Precisely controlling word segmentation to maintain textual naturalness;
  6. Generating a bilingual vocabulary containing 240,000 words, promoting Chinese-Uyghur bilingual information processing technology.
- (6) Innovatively Applying a Context Detection and Circuit-Breaker Mechanism for Text Generation, significantly optimizing the quality and novelty of generated text. Features include:
  1. Using the generate_text_with_circuit_breaker function to integrate intelligent circuit breaking, ensuring textual coherence and diversity;
  2. Implementing sentence-integrity detection and cosine-similarity evaluation to automatically screen out and prevent redundant content;
  3. Customizing similarity thresholds and end-symbol filtering to enhance content independence;
  4. Calculating text similarity via TF-IDF vectorization to deepen semantic understanding.
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Rathi, A.K.A. Pursuing the distilled good practices to improve the quality of Environmental Impact Assessment Reports and hence enhance the EIA effectiveness and help address the concerns of project proponents: An Indian Context. Macro Manag. Public Policies 2023, 5, 26–43. [Google Scholar] [CrossRef]
- Zhu, A.; Dugan, L.; Hwang, A.; Callison-Burch, C. Kani: A Lightweight and Highly Hackable Framework for Building Language Model Applications. arXiv 2023, arXiv:2309.05542. [Google Scholar] [CrossRef]
- Zhang, X.; Zhang, X.; Yu, Y. ChatGLM-6B Fine-Tuning for Cultural and Creative Products Advertising Words. In Proceedings of the 2023 International Conference on Culture-Oriented Science and Technology (CoST), Xi’an, China, 11–14 October 2023; pp. 291–295. [Google Scholar]
- Xia, Z.; Gao, B.; Yu, C.; Han, H.; Zhang, H.; Wang, S. A Hybrid Parallel Strategy for Isogeometric Topology Optimization via CPU/GPU Heterogeneous Computing. Comput. Model. Eng. Sci. 2024, 138, 1103–1137. [Google Scholar] [CrossRef]
- Akilandeswari, K.; Sivakumar, N.R.; Alkahtani, H.K.; Basheer, S.; Ghorashi, S.A. Smart Healthcare Activity Recognition Using Statistical Regression and Intelligent Learning. Comput. Mater. Contin. 2024, 78, 1189–1205. [Google Scholar] [CrossRef]
- Zhong, S.; Yan, Z.; Wei, C.; Feng, L.; Chun, Z. Missing Value Imputation for Radar-Derived Time-Series Tracks of Aerial Targets Based on Improved Self-Attention-Based Network. Comput. Mater. Contin. 2024, 78, 3349–3376. [Google Scholar]
- Mazharul, H.Q.; Arif, F.; Aurangzeb, K.; Khan, J.A.; Rubab, S.; Anwar, M.S. Identification of Software Bugs by Analyzing Natural Language-Based Requirements Using Optimized Deep Learning Features. Comput. Mater. Contin. 2024, 78, 4379–4397. [Google Scholar]
- Cui, X.; Song, C.; Li, D.; Qu, X.; Long, J.; Yang, Y.; Zhang, H. RoBGP: A Chinese Nested Biomedical Named Entity Recognition Model Based on RoBERTa and Global Pointer. Comput. Mater. Contin. 2024, 78, 3603–3618. [Google Scholar] [CrossRef]
- Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; et al. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Punta Cana, Dominican Republic, October 2020; pp. 38–45. Available online: https://aclanthology.org/2020.emnlp-demos.6/ (accessed on 20 May 2024).
- Boser, B.E.; Guyon, I.M.; Vapnik, V.N. A Training Algorithm for Optimal Margin Classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory (COLT ’92), Pittsburgh, PA, USA, 27–29 July 1992; pp. 144–152. [Google Scholar] [CrossRef]
- Bashar, M.A.; Nayak, R. ALGAN: Time Series Anomaly Detection with Adjusted-LSTM GAN. Res. Sq. Prepr. 2023. [Google Scholar] [CrossRef]
- López Luna, M.; Taboada-Ortega, M.A.; Alvarez-Amparán, M.A.; Cedeño-Caero, L. Effect of iron incorporation on W based catalysts for oxidative desulfurization of dibenzothiophene compounds. Catal. Today 2022, 394, 336–347. [Google Scholar] [CrossRef]
- Han, K.; Xiao, A.; Wu, E.; Guo, J.; Xu, C.; Wang, Y. Transformer in transformer. Adv. Neural Inf. Process. Syst. 2021, 34, 15908–15919. [Google Scholar]
- Raparthi, M.; Dodda, S.B.; Reddy SR, B.; Thunki, P.; Maruthi, S.; Ravichandran, P. Advancements in Natural Language Processing-A Comprehensive Review of AI Techniques. J. Bioinform. Artif. Intell. 2021, 1, 1–10. [Google Scholar]
- Zhao, G.; Wang, Z.; Huang, Y.; Zhang, H.; Ma, X. Transformer-Based Maneuvering Target Tracking. Sensors 2022, 22, 8482. [Google Scholar] [CrossRef] [PubMed]
- Wu, J.; Bai, T.; Li, X. Inverting Chlorophyll Content in Jujube Leaves Using a Back-Propagation Neural Network–Random Forest–Ridge Regression Algorithm with Combined Hyperspectral Data and Image Color Channels. Agronomy 2024, 14, 140. [Google Scholar] [CrossRef]
- Zhang, Y.; Hu, Y.; Chen, X. Context and Multi-Features-Based Vulnerability Detection: A Vulnerability Detection Frame Based on Context Slicing and Multi-Features. Sensors 2024, 24, 1351. [Google Scholar] [CrossRef] [PubMed]
- Ruan, S.; Cang, H.; Chen, H.; Yan, T.; Tan, F.; Zhang, Y.; Duan, L.; Xing, P.; Guo, L.; Gao, P.; et al. Hyperspectral Classification of Frost Damage Stress in Tomato Plants Based on Few-Shot Learning. Agronomy 2023, 13, 2348. [Google Scholar] [CrossRef]
- Bin, W.; Lin, W. Forecasting Grain Yield in China Using Attention-based ADE-Bi-IndRNN Model. Oper. Res. Manag. Sci. 2024, 33, 102–119. [Google Scholar]
- Zheng, Z.; Huang, S.; Weng, R.; Dai, X.-Y.; Chen, J. Improving self-attention networks with sequential relations. IEEE/ACM Trans. Audio Speech Lang. Process. 2020, 28, 1707–1716. [Google Scholar] [CrossRef]
- Xu, G.; Liu, L.; Dong, J. Vulnerability Detection of Ethereum Smart Contract Based on SolBERT-BiGRU-Attention Hybrid Neural Model. Comput. Model. Eng. Sci. 2023, 137, 903–922. [Google Scholar] [CrossRef]
- Zhou, W.; Jiang, X.; Qin, C. C-CORE: Clustering by Code Representation to Prioritize Test Cases in Compiler Testing. Comput. Model. Eng. Sci. 2024, 139, 2069–2093. [Google Scholar] [CrossRef]
- Gillioz, A.; Casas, J.; Mugellini, E.; Abou Khaled, O. Overview of the Transformer-Based Models for NLP Tasks; IEEE: New York, NY, USA, 2020; pp. 179–183. [Google Scholar]
- Kaixu, Z.; Maosong, S. Unified Framework of Performing Chinese Word Segmentation and Part-of-Speech Tagging. China Commun. 2012, 1, 1–9. [Google Scholar]
- Kostić, M.; Batanović, V.; Nikolić, B. Monolingual, multilingual and cross-lingual code comment classification. Eng. Appl. Artif. Intell. 2023, 124, 106485. [Google Scholar] [CrossRef]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2017. [Google Scholar]
- Dongmei, Z.; Mengzhen, S. Integrated Development of Tea and Tourism in Taishan Mountain Tea Valley in the Context of Rural Revitalization. Asian Agric. Res. 2024, 16, 1–9. [Google Scholar]
- Li, L.; Li, J.; Wang, H.; Nie, J. Application of the transformer model algorithm in chinese word sense disambiguation: A case study in chinese language. Sci. Rep. 2024, 14, 6320. [Google Scholar] [CrossRef] [PubMed]
- Pressel, D.; Liu, W.; Johnston, M.; Chen, M. Lightweight transformers for conversational ai. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Industry Track, Seattle, WA, USA, 10–15 July 2022; pp. 221–229. [Google Scholar]
- Zhou, J.; Lin, Q.; Feng, X.; Ren, D.; Teng, J.; Wu, X.; Wu, D.; Zhang, X.; Yuan, X.; Chen, Z.; et al. Evaluating the performance of genomic selection on purebred population by incorporating crossbred data in pigs. J. Integr. Agric. 2024, 23, 639–648. [Google Scholar] [CrossRef]
- Liu, A.; Han, X.; Wang, Y.; Tsvetkov, Y.; Choi, Y.; Smith, N.A. Tuning Language Models by Proxy. arXiv 2024, arXiv:2401.08565. [Google Scholar] [CrossRef]
- Liu, Z.; Yao, W.; Zhang, J.; Yang, L.; Liu, Z.; Tan, J.; Choubey, P.K.; Lan, T.; Wu, J.; Wang, H.; et al. AgentLite: A Lightweight Library for Building and Advancing Task-Oriented LLM Agent System. arXiv 2024, arXiv:2402.15538. [Google Scholar]
- Thawakar, O.; Vayani, A.; Khan, S.; Cholakal, H.; Anwer, R.M.; Felsberg, M.; Baldwin, T.; Xing, E.P.; Khan, F.S. MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT. arXiv 2024, arXiv:2402.16840. [Google Scholar]
- He, C.; Luo, R.; Hu, S.; Zhao, Y.; Zhou, J.; Wu, H.; Zhang, J.; Han, X.; Liu, Z.; Sun, M. UltraEval: A Lightweight Platform for Flexible and Comprehensive Evaluation for LLMs. arXiv 2024, arXiv:2404.07584. [Google Scholar]
- Shi, Z.; Xu, X.; Liu, X.; Chen, J.; Yang, M.H. Video frame interpolation transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 17482–17491. [Google Scholar]
- Scoones, I. The Politics of Global Assessments: The Case of the International Assessment of Agricultural Knowledge, Science and Technology for Development (IAASTD). J. Peasant. Stud. 2009, 36, 547–571. [Google Scholar] [CrossRef]
- He, S.; Xin, J.; Peng, H.; Zhang, E. Research on Malicious URL Detection Based on Feature Contribution Tendency. In Proceedings of the 2021 IEEE 6th International Conference on Cloud Computing and Big Data Analytics (ICCCBDA), Chengdu, China, 24–26 April 2021; pp. 576–581. [Google Scholar]
- Chiang, S.-Y.; Lin, T.-Y. Low-Brightness Object Recognition Based on Deep Learning. Comput. Mater. Contin. 2024, 79, 1757–1773. [Google Scholar] [CrossRef]
- Soutner, D.; Müller, L. Application of LSTM neural networks in language modelling. In Proceedings of the 16th International Conference on Text, Speech, and Dialogue (TSD 2013), Pilsen, Czech Republic, 1–5 September 2013; Springer: Berlin/Heidelberg, Germany, 2013; pp. 105–112. [Google Scholar]
- Yu, Y.; Si, X.; Hu, C.; Zhang, J. A Review of Recurrent Neural Networks: LSTM Cells and Network Architectures. Neural Comput. 2019, 31, 1235–1270. [Google Scholar] [CrossRef]
- Genet, R.; Inzirillo, H. Tkan: Temporal Kolmogorov-Arnold Networks. arXiv 2024, arXiv:2405.07344. [Google Scholar] [CrossRef]
- Ansari, A.S. A Review on the Recent Trends of Image Steganography for VANET Applications. CMC-Comput. Mater. Contin. 2024, 78, 2865–2892. [Google Scholar] [CrossRef]
- Xu, M.; Shen, C.; Zhang, J.; Wang, Z.; Ruan, Z.; Poslad, S.; Xu, P. Improved HardNet and Stricter Outlier Filtering to Guide Reliable Matching. Comput. Mater. Contin. 2023, 75, 4785–4803. [Google Scholar] [CrossRef]
- Hajikhani, A.; Cole, C. A Critical Review of Large Language Models: Sensitivity, Bias, and the Path Toward Specialized AI. Quant. Sci. Stud. 2024, 1–22. [Google Scholar] [CrossRef]
- Hsu, H.H.; Huang, N.F. Xiao-Shih: A Self-Enriched Question Answering Bot with Machine Learning on Chinese-Based MOOCs. IEEE Trans. Learn. Technol. 2022, 15, 223–237. [Google Scholar] [CrossRef]
- Roy, P.K.; Saumya, S.; Singh, J.P.; Banerjee, S.; Gutub, A. Analysis of Community Question-Answering Issues via Machine Learning and Deep Learning: State-of-the-Art Review. CAAI Trans. Intell. Technol. 2023, 8, 95–117. [Google Scholar] [CrossRef]
- Sheng, Y.; Zheng, L.; Yuan, B.; Li, Z.; Ryabinin, M.; Chen, B.; Liang, P.; Re, C.; Stoica, I.; Zhang, C. FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU. In Proceedings of the 40th International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; Volume 202, pp. 31094–31116. [Google Scholar] [CrossRef]
Model | Supported Languages | Deployment Environment | Model Type | Parameter Volume | Key Features |
---|---|---|---|---|---|
ChatGPT v3.5 | English, Chinese, French, German, etc. | High-performance distributed GPU integrated computing | General-purpose large language model | Approximately 175 billion | Strong multi-language processing capability, supports broad domain conversations, resource-intensive, requiring a high-performance computing environment for deployment. |
ChatGLM v6B | Chinese, English | Quantized model with minimum support of 6 cores and 6 GB VRAM | General-purpose large language model | Approximately 6 billion | Lower hardware threshold, suitable for a wider range of deployment environments while maintaining good language processing capabilities. |
Llama3 v70B | English, Chinese, etc. | Quantized model with minimum support of 8 cores and 12 GB VRAM | General-purpose large language model | Approximately 70 billion | Larger parameter volume brings stronger expressive ability, but the hardware requirement is reduced through quantization technology. |
Wenxin One-Sentence v3.5 | Chinese, English, etc. | High-performance distributed GPU integrated computing | General-purpose large language model | Approximately 100 billion | Optimized for Chinese, with powerful language understanding and generation abilities, requiring higher hardware specifications. |
iFlytek Spark v4.0 | Chinese, English, etc. | High-performance distributed GPU integrated computing | General-purpose large language model | Approximately 100 billion | Highly optimized for speech recognition and processing, especially excelling in Chinese scenarios, with high hardware requirements. |
Metric Names | Loss | Accuracy | Recall | F1 Score |
---|---|---|---|---|
Result | 0.067 | 0.984 | 0.984 | 0.984 |
Metric Names | Loss | Accuracy | Recall | F1 Score |
---|---|---|---|---|
Result | 0.140 | 0.687 | 0.831 | 0.831 |
Metric Names | Loss | Accuracy | Recall | F1 Score |
---|---|---|---|---|
Result | 0.346 | 0.653 | 0.653 | 0.653 |
Metric Names | Loss | Accuracy | Recall | F1 Score |
---|---|---|---|---|
Result | 0.066 | 0.986 | 0.986 | 0.986 |
Model Name | Loss | Accuracy | Recall | F1 Score | Train Time | Data Type |
---|---|---|---|---|---|---|
Transformer | 0.067 | 0.984 | 0.984 | 0.984 | 3.9 | Chinese and Uyghur |
LSTM | 0.140 | 0.687 | 0.831 | 0.831 | 5.2 | Chinese and Uyghur |
TKAN | 0.346 | 0.653 | 0.653 | 0.653 | 7.1 | Chinese and Uyghur |
Transformer | 0.066 | 0.986 | 0.986 | 0.986 | 4.1 | Chinese and Tibetan |
Model Name | Deployment Environment | Model Type | Number of Parameters | Model Size |
---|---|---|---|---|
ChatGPT v3.5 | High-performance distributed GPU integrated computing | General-purpose large language model | Approximately 175 billion | >20 GB |
ChatGLM v6B | Quantized model with minimum support of 6 cores and 6 GB VRAM | General-purpose large language model | Approximately 6 billion | 12.4 GB |
Llama3 v70B | Quantized model with minimum support of 8 cores and 12 GB VRAM | General-purpose large language model | Approximately 70 billion | >14 GB |
Wenxin One-Sentence v3.5 | High-performance distributed GPU integrated computing | General-purpose large language model | Approximately 100 billion | >20 GB |
iFlytek Spark v4.0 | High-performance distributed GPU integrated computing | General-purpose large language model | Approximately 100 billion | >20 GB |
TaLiMu GPT v1.0 | Supports a minimum of 4 cores and 4 GB VRAM | Domain-specific large language model | Approximately 78.52 million | 1.5 GB |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).