Research on Compressed Input Sequences Based on Compiler Tokenization
Abstract
1. Introduction
- Because code contains many symbols and identifiers that do not occur in natural language, pretrained models tokenize it more finely than natural language text. This finer tokenization consumes more of the context window, lowering the execution efficiency of intelligent code-related tasks.
- Pretrained models primarily use the WordPiece [3] or BPE tokenization algorithms [4] during the tokenization stage, which do not account for semantic information in the code [5,6]. As a result, the same variable may be tokenized differently in different places, leading to inconsistent understanding during subsequent model processing.
- The model also lacks any notion of variables: when it encounters an unfamiliar identifier, it cannot treat it as a single variable and instead decomposes it into subwords that exist in the vocabulary.
- We use a targeted compiler for preliminary tokenization of the code and then replace the tokens with lexical units for input into the LLM. With this method, our new model reduces the token input length by 33.7% compared to the baseline model.
- We perform word restoration on the code generated by the LLM, achieving output accuracy close to that of the baseline model while using compressed input tokens. With compiler preprocessing but without token compression, we achieve accuracy exceeding that of the baseline model, with input token lengths approximately equal to the baseline’s.
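As a rough illustration of this fragmentation, the sketch below counts the subword tokens a BPE tokenizer produces for a single identifier. It assumes the publicly available codellama/CodeLlama-7b-hf tokenizer from the Hugging Face transformers library, which may not match the paper’s exact setup.

```python
# Illustrative only: how a BPE tokenizer fragments one code identifier.
# Assumes the `transformers` package and the public CodeLlama-7b tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-hf")

identifier = "GroovyScriptNodeFactory"      # one lexical unit to a compiler
subwords = tokenizer.tokenize(identifier)   # several subword tokens to the LLM

print(subwords)
# e.g. ['▁Gro', 'ovy', 'Script', 'Node', 'Factory'] -- the exact split depends on
# the tokenizer version; the point is that one identifier costs several tokens.
print(f"1 lexical unit -> {len(subwords)} llama tokens")
```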
2. Related Work
3. Algorithm Architecture
3.1. Challenges
- Compiler-based tokenizer: The appropriate compiler is selected based on the type of source code and serves to break it down into lexical units.
- Secondary tokenization based on LLM: This module tokenizes the lexical units into llama tokens and then extracts tokens according to the principle of maximum similarity. During extraction, a context dictionary is generated for later restoration.
- Generation and recovery of the source code: The extracted token list is input into the code-llama model for generation; subsequently, the output is restored using a context dictionary.
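A minimal sketch of the first module, assuming Java input and using the third-party javalang package as a stand-in for the compiler front end (the paper does not name a specific lexer in this extract, so this choice is ours):

```python
# Compiler-based tokenization (module 1): split Java source into lexical units.
# `javalang` is an assumed stand-in; the appropriate front end depends on the language.
import javalang

def lexical_units(source: str) -> list[str]:
    """Return compiler lexical units (keywords, identifiers, literals, separators),
    each kept intact as a single string."""
    return [tok.value for tok in javalang.tokenizer.tokenize(source)]

snippet = "public class GroovyScriptNodeFactory { private int nodeCount; }"
print(lexical_units(snippet))
# ['public', 'class', 'GroovyScriptNodeFactory', '{', 'private', 'int', 'nodeCount', ';', '}']
```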
3.2. Tokenization of Input Sequences Based on Compiler Lexical Analysis
3.3. Secondary Tokenization of Lexical Unit Sequences
3.4. Contextual Subword Recovery of Output Sequences
- “GroovyScriptNodeFactory” is a lexical unit, used as input.
- The lexical unit is tokenized using the llama tokenizer; the number following each token represents its fuzz ratio to the input. We select the most similar llama token, “Factory”, as the representative for this lexical unit.
- “Factory” and “GroovyScriptNodeFactory” are stored in the context dictionary as key–value pairs for subsequent output recovery.
- The code-llama model performs inference, producing a code snippet containing the “Factory” llama token.
- The model output is restored using the context dictionary, resulting in an improved output.
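The steps above can be sketched in a few lines. The snippet assumes the CodeLlama tokenizer and uses rapidfuzz for the fuzz ratio; both are stand-ins for whatever tooling the paper actually used, and the final string replacement is a simplified version of the recovery step.

```python
# Selection and recovery for one lexical unit, following the walkthrough above.
from rapidfuzz import fuzz
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-hf")

unit = "GroovyScriptNodeFactory"                         # compiler lexical unit
subwords = [s.lstrip("▁") for s in tokenizer.tokenize(unit)]

# Keep the subword most similar to the whole unit (highest fuzz ratio).
representative = max(subwords, key=lambda s: fuzz.ratio(s, unit))
context_dict = {representative: unit}                    # remembered for recovery

# After inference, restore the representative token in the model output.
model_output = "return new Factory();"                   # toy output for illustration
restored = model_output
for short, full in context_dict.items():
    restored = restored.replace(short, full)

print(representative)   # expected: "Factory"
print(restored)         # "return new GroovyScriptNodeFactory();"
```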
4. Experimental Analysis
4.1. Dataset
4.2. Experimental Model
- Baseline Model: The codellama-7b model was used as the comparison benchmark.
- Full Token Addition Model: The source code was first processed by the compiler into lexical units and then tokenized by the llama tokenizer, with all resulting tokens input into the model.
- Contextual Recovery Model: Building on the full addition model, the llama token sequence was filtered so that, for each lexical unit, only the most similar token was input into the model, improving the token compression ratio. A context dictionary completion algorithm was applied to the model output to restore the generated tokens, improving output quality.
4.3. Metrics
4.4. Analysis of the Results
4.5. Ablation Experiments
- w/o Compiler: In this variant, the compiler-based tokenization module was removed. Because compiler tokenization is the foundation of the algorithm and cannot be removed in isolation, this variant reduces to the baseline model.
- Select and Recovery: This refers to the context recovery model proposed in this work, which combines the selection of the most similar token and the context dictionary recovery algorithm.
- w/o Select: In this variant, the step that involved selecting the most similar token from the secondary lexical units was removed. In this model, a token is randomly selected to represent the original lexical unit.
- w/o Recovery: In this variant, the context dictionary recovery algorithm was removed and the model’s output was used directly.
- w/o Select and Recovery: In this variant, both the method of selecting the most similar token and the context dictionary recovery algorithm were removed.
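The difference between the Select and w/o Select variants amounts to one choice per lexical unit; a hypothetical helper (names ours, not from the paper) makes the contrast explicit:

```python
# Select: keep the subword with the highest fuzz ratio to the lexical unit.
# w/o Select: keep a randomly chosen subword instead.
import random
from rapidfuzz import fuzz

def pick_representative(subwords: list[str], unit: str, use_select: bool = True) -> str:
    if use_select:
        return max(subwords, key=lambda s: fuzz.ratio(s, unit))
    return random.choice(subwords)
```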
4.6. Case Study
4.6.1. A Study on the Applicability of the Context Recovery Model in Long Input Contexts
4.6.2. Application Examples of Context Recovery Model
- Application of the context recovery model: The model proposed in this study can be integrated into the main operation process of the coding assistant (a minimal integration sketch follows this list). The key stages involved in this case are as follows.
  - Vector Database Construction: To build the RAG system, a vector database based on the code repository needs to be created. First, the code must be chunked; moreover, to improve the retrieval performance, the chunked code should be subjected to natural language enhancement, which involves generating comments for the source code using an LLM. At this stage, the input data type is the source code; thus, the context recovery model can be applied.
  - Enhanced Result Inference Stage: At this stage, the coding assistant combines the filtered code snippets with the context information provided by the user and inputs them into an LLM for inference, generating the final answer to be returned to the user. The input data in this stage mainly consist of source code and a small number of natural language questions provided by the user, so the context recovery model can be applied.
- Anticipated effects:
  - Enhanced system operation speeds: With the input remaining constant, a reduction in the number of tokens input into the LLM will decrease the time required for the inference process, thereby accelerating the inference speed. This improvement will significantly enhance the efficiency of vector database construction and result inference.
  - Reduced token costs: As LLMs continue to evolve, their inference costs are rising. The inference time and computational resource consumption increase substantially with the increase in model parameters. In the absence of context length limitations, a smaller number of tokens included in the input sequence will lead to lower token costs.
  - Scalability and flexibility: The context recovery model is applicable to input data consisting of source code, without altering the original workflow. This approach reduces the development and integration costs associated with existing code assistants, making it more adaptable to various applications.
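A minimal sketch of such an integration is shown below. The helpers compress_source and llm_generate are hypothetical placeholders for the compression pipeline of Section 3 and for whatever LLM client the coding assistant already uses; the point is that compression wraps the existing inference call without changing the surrounding workflow.

```python
# Compress the source-code part of the prompt, run inference, restore the answer.
def answer_with_context_recovery(code_snippets, user_question, llm_generate, compress_source):
    compressed_parts, context_dict = [], {}
    for snippet in code_snippets:
        compressed, mapping = compress_source(snippet)  # lexical units -> representative tokens
        compressed_parts.append(compressed)
        context_dict.update(mapping)

    prompt = "\n".join(compressed_parts) + "\n" + user_question
    raw_answer = llm_generate(prompt)                   # fewer input tokens than raw code

    # Restore compressed identifiers in the answer before returning it to the user.
    for short, full in context_dict.items():
        raw_answer = raw_answer.replace(short, full)
    return raw_answer
```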
5. Conclusions
- The introduction of compiler tokenization provides the model with a more extensive and nuanced information set, leading to improved inference performance. Compared to the baseline model, the full addition model has demonstrated significant enhancements in the BLEU score, the fuzz ratio, and the fuzz partial ratio.
- The introduction of secondary extraction substantially reduces the token compression ratio. However, relying solely on the secondary extraction of word tokenization results yields inference outcomes with suboptimal similarity. Restoring the inference results after secondary extraction not only maintains a relatively low compression ratio but also significantly improves the BLEU score, the fuzz ratio, and the fuzz partial ratio. Moreover, during secondary extraction, employing the highest-similarity selection algorithm, as opposed to random selection, results in a BLEU score, fuzz ratio, and fuzz partial ratio that more closely align with those of the baseline model.
- The context recovery model performs token sequence compression by selecting the highest-similarity tokens based on the full addition model, followed by inference and restoration using the compressed output. This approach achieves a significant reduction in the token compression ratio with only a minor decrease in output quality. Compared to the codellama-7b baseline model, the context recovery model reduces the number of tokens by 33.7% while maintaining comparable output quality.
- The effectiveness of the context recovery model was evaluated using a Java code completion task. The results indicated that the baseline model, constrained by the limited context input length, failed to capture the overall architectural information in the source code. In contrast, the context recovery model successfully captured this architectural information and performed the completion task more accurately. This suggests that when processing long input texts, the context recovery model provides higher-quality output compared to the baseline model.
- The lexical analyzer utilized in this study must satisfy two critical requirements: firstly, it must support the generation of lexical unit sequences; secondly, it must be capable of exception handling (a minimal illustrative sketch follows this list). At present, lexical analyzers that meet these criteria still require manual configuration and are not adaptable to multiple programming languages. Consequently, this increases the cost associated with applying the proposed method in multi-language environments. To address the aforementioned challenges, we propose the following research directions for further investigation. It is necessary to construct a training dataset comprising diverse types of source code and their corresponding lexical unit sequences. It would also be beneficial to train a model to learn tokenization rules across multiple programming languages and to capture exceptions when syntax errors arise. This approach could contribute to the development of a robust model that is capable of performing lexical analysis on various programming languages.
- The approach proposed in this study involves performing LLM inference after selecting tokens derived from tokenization. Another potential research direction consists of enhancing the accuracy and efficiency of model inference through the selection of specific tokens. We suggest that a balance between high-quality model inference and computational efficiency can be achieved by fine-tuning the screening strategy, considering methods such as the inclusion of all tokens, the selective addition of individual tokens, or the removal of redundant tokens.
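As an illustration of the two lexer requirements named above (emitting lexical-unit sequences and handling exceptions), a tolerant wrapper might look as follows; javalang again stands in for a Java-specific front end, and the fallback behaviour is an assumption of ours rather than the paper’s design.

```python
import javalang

def tolerant_lexical_units(source: str) -> list[str]:
    try:
        return [tok.value for tok in javalang.tokenizer.tokenize(source)]
    except javalang.tokenizer.LexerError:
        # Source that does not lex cleanly: fall back to whitespace splitting so the
        # downstream pipeline still receives a unit sequence instead of failing.
        return source.split()
```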
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
Abbreviation | Meaning |
---|---|
LLMs | Large language models |
OOV | Out of vocabulary |
References
- Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. Gpt-4 technical report. arXiv 2023, arXiv:2303.08774. [Google Scholar]
- Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. Llama 2: Open foundation and fine-tuned chat models. arXiv 2023, arXiv:2307.09288. [Google Scholar]
- Wu, Y. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv 2016, arXiv:1609.08144. [Google Scholar]
- Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language models are unsupervised multitask learners. OpenAI Blog 2019, 1, 9. [Google Scholar]
- Bostrom, K.; Durrett, G. Byte pair encoding is suboptimal for language model pretraining. arXiv 2020, arXiv:2004.03720. [Google Scholar]
- Brown, T.B. Language models are few-shot learners. arXiv 2020, arXiv:2005.14165. [Google Scholar]
- Liu, Y.; Zhang, M. Neural network methods for natural language processing. Comput. Linguist. 2018, 44, 193–195. [Google Scholar] [CrossRef]
- Schütze, H.; Manning, C.D.; Raghavan, P. Introduction to Information Retrieval; Cambridge University Press: Cambridge, UK, 2008; Volume 39. [Google Scholar]
- Hindle, A.; Barr, E.T.; Gabel, M.; Su, Z.; Devanbu, P. On the naturalness of software. Commun. ACM 2016, 59, 122–131. [Google Scholar] [CrossRef]
- Xu, F.F.; Alon, U.; Neubig, G.; Hellendoorn, V.J. A systematic evaluation of large language models of code. In Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming, San Diego, CA, USA, 13 June 2022; pp. 1–10. [Google Scholar]
- Wong, M.F.; Guo, S.; Hang, C.N.; Ho, S.W.; Tan, C.W. Natural language generation and understanding of big code for AI-assisted programming: A review. Entropy 2023, 25, 888. [Google Scholar] [CrossRef]
- Hellendoorn, V.J.; Devanbu, P. Are deep neural networks the best choice for modeling source code? In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, Paderborn, Germany, 4–8 September 2017; pp. 763–773. [Google Scholar]
- Dagan, G.; Synnaeve, G.; Rozière, B. Getting the most out of your tokenizer for pre-training and domain adaptation. arXiv 2024, arXiv:2402.01035. [Google Scholar]
- Karampatsis, R.M.; Sutton, C. Maybe deep neural networks are the best choice for modeling source code. arXiv 2019, arXiv:1903.05734. [Google Scholar]
- Feng, D.; Zhang, Y.; Xu, Z. IGOT: Information Gain Optimized Tokenizer on Domain Adaptive Pretraining. arXiv 2024, arXiv:2405.09857. [Google Scholar]
- Sachidananda, V.; Kessler, J.S.; Lai, Y.A. Efficient domain adaptation of language models via adaptive tokenization. arXiv 2021, arXiv:2109.07460. [Google Scholar]
- Rabin, M.R.I.; Hellendoorn, V.J.; Alipour, M.A. Understanding neural code intelligence through program simplification. In Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, New York, NY, USA, 23–28 August 2021; pp. 441–452. [Google Scholar]
- Rabin, M.R.I.; Hussain, A.; Alipour, M.A. Syntax-guided program reduction for understanding neural code intelligence models. In Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming, New York, NY, USA, 13 June 2022; pp. 70–79. [Google Scholar]
- Svyatkovskiy, A.; Lee, S.; Hadjitofi, A.; Riechert, M.; Franco, J.V.; Allamanis, M. Fast and memory-efficient neural code completion. In Proceedings of the 2021 IEEE/ACM 18th International Conference on Mining Software Repositories (MSR), Madrid, Spain, 28 June 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 329–340. [Google Scholar]
- Li, Y.; Qi, S.; Gao, C.; Peng, Y.; Lo, D.; Xu, Z.; Lyu, M.R. A closer look into transformer-based code intelligence through code transformation: Challenges and opportunities. arXiv 2022, arXiv:2207.04285. [Google Scholar]
- Zheng, Y.; Suneja, S.; Zhuang, Y.; Morari, A.; Laredo, J.A. Probing Model Signal Awareness. U.S. Patent App. 17/315,701, 10 November 2022. [Google Scholar]
- Sennrich, R. Neural machine translation of rare words with subword units. arXiv 2015, arXiv:1508.07909. [Google Scholar]
- Gilda, S. Source code classification using Neural Networks. In Proceedings of the 2017 14th International Joint Conference on Computer Science and Software Engineering (JCSSE), NakhonSiThammarat, Thailand, 7 September 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 1–6. [Google Scholar]
- Lu, S.; Guo, D.; Ren, S.; Huang, J.; Svyatkovskiy, A.; Blanco, A.; Clement, C.; Drain, D.; Jiang, D.; Tang, D.; et al. Codexglue: A machine learning benchmark dataset for code understanding and generation. arXiv 2021, arXiv:2102.04664. [Google Scholar]
- Allamanis, M.; Sutton, C. Mining source code repositories at massive scale using language modeling. In Proceedings of the 2013 10th Working Conference on Mining Software Repositories (MSR), San Francisco, CA, USA, 18–19 May 2013; IEEE: Piscataway, NJ, USA, 2013; pp. 207–216. [Google Scholar]
- Karampatsis, R.M.; Babii, H.; Robbes, R.; Sutton, C.; Janes, A. Big code != big vocabulary: Open-vocabulary models for source code. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering, New York, NY, USA, 27 June–19 July 2020; pp. 1073–1085. [Google Scholar]
- Roziere, B.; Gehring, J.; Gloeckle, F.; Sootla, S.; Gat, I.; Tan, X.E.; Adi, Y.; Liu, J.; Sauvestre, R.; Remez, T.; et al. Code llama: Open foundation models for code. arXiv 2023, arXiv:2308.12950. [Google Scholar]
- Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, 6–12 July 2002; pp. 311–318. [Google Scholar]
- Bird, S.; Klein, E.; Loper, E. Natural Language Processing with Python; O’Reilly Media, Inc.: Sebastopol, CA, USA, 2009; ISBN 978-0-596-51649-9. [Google Scholar]
- Levenshtein, V. Binary codes capable of correcting deletions, insertions, and reversals. Dokl. Akad. Nauk. SSSR 1965, 10, 707–710. [Google Scholar]
Data Statistics | GitHub Java Corpus |
---|---|
Number of code fragments | 7176 |
Number of tokens | 3.8 M |
Number of characters | 17 M |
Model | Token Compression Ratio |
---|---|
Baseline Model | 0.313 |
Full Token Addition Model | 0.296 |
Contextual Recovery Model | 0.208 |
Model | BLEU | Fuzz Ratio | Fuzz Partial Ratio |
---|---|---|---|
Baseline Model | 0.241 | 51.7 | 52.1 |
Full Token Addition Model | 0.309 | 56.6 | 56.9 |
Contextual Recovery Model | 0.213 | 50.7 | 51.1 |
Model | Token Compression Ratio | BLEU | Fuzz Ratio | Fuzz Partial Ratio |
---|---|---|---|---|
w/o Compiler | 0.313 | 0.241 | 51.7 | 52.1 |
Select and Recovery | 0.208 | 0.213 | 50.7 | 51.1 |
w/o Select | 0.208 | 0.196 | 49.3 | 49.7 |
w/o Recovery | 0.208 | 0.150 | 49.0 | 49.4 |
w/o Select and Recovery | 0.208 | 0.142 | 47.3 | 47.7 |