Article

Research on Compressed Input Sequences Based on Compiler Tokenization

School of Software, Beihang University, Beijing 100191, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Information 2025, 16(2), 73; https://doi.org/10.3390/info16020073
Submission received: 13 November 2024 / Revised: 18 January 2025 / Accepted: 20 January 2025 / Published: 21 January 2025
(This article belongs to the Section Artificial Intelligence)

Abstract

Current applications of large language models (LLMs) in the field of code intelligence suffer from low tokenization efficiency. Source code inputs produce longer token sequences, which wastes the contextual resources of large models. In addition, existing LLM tokenization technology struggles to ensure that the same variable is represented consistently across its context. To address these problems, we propose a compiler-based compressed input sequence method. The compiler’s lexical analyzer first performs preliminary tokenization of the input statements, followed by tokenization and filtering through the large model’s tokenizer. This approach yields shorter, semantically clearer, and higher-quality embedded token sequences. Then, using a contextual dictionary, the reduced tokens are restored to their original form in the output statements. The experimental results show that our compressed input sequence method runs smoothly in code generation scenarios. Compared to the baseline model, the compiler-based tokenization method reduces the input token count by 33.7%. This study provides new insights for the application of LLMs in the field of code intelligence.


1. Introduction

In the past few decades, software has been integrated into all aspects of society. With the increase in the demand for software development, improving the efficiency of this process is more important than ever, and LLMs offer a promising tool to support human programmers. The use-cases of LLMs in major enterprises to improve their productivity in software development include code generation, code interpretation, code repair, unit testing and documentation generation, application modernization, vulnerability detection, and code translation.
In recent years, LLMs have experienced rapid enhancements in their ability to generate and modify code [1,2]. At present, a range of models with impressive coding capabilities are available. However, the current LLMs face limitations in terms of the context length. The context length determines the model’s ability to capture the relationships between input statements. Currently, the performance of LLMs is generally limited by the size of the context window; an example is the 4096 token limit of llama2 [2]. For most code generation scenarios, the LLM’s context length is still insufficient.
In terms of code generation scenarios, we have identified the following issues:
  • Due to the presence of many non-natural language symbols and text in code, pretrained models tend to tokenize sentences more finely compared to natural language text. This finer tokenization consumes more context resources, leading to lower execution efficiency for intelligent code-related tasks.
  • Pretrained models primarily use the WordPiece [3] or BPE tokenization algorithms [4] during the tokenization stage, which do not account for semantic information in the code [5,6]. As a result, the same variable may be tokenized differently in different places, leading to inconsistent understanding during subsequent model processing.
  • The model also fails to understand the concept of variables. When encountering unfamiliar words, it cannot treat them as ordinary variables but, instead, decomposes them based on subwords existing in the vocabulary, as illustrated in the sketch following this list.
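To make the first and third issues concrete, the short sketch below tokenizes a single developer-defined identifier with a BPE-style tokenizer. It assumes the Hugging Face transformers package and the codellama/CodeLlama-7b-hf checkpoint; the exact subword split depends on the tokenizer’s vocabulary.

```python
# A minimal illustration of how a BPE-style tokenizer fragments a
# developer-defined identifier. Assumes the Hugging Face `transformers`
# package and access to the codellama/CodeLlama-7b-hf tokenizer; the exact
# subword split depends on the tokenizer's vocabulary.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-hf")

identifier = "GroovyScriptNodeFactory"  # an out-of-vocabulary variable name
statement = "GroovyScriptNodeFactory factory = new GroovyScriptNodeFactory();"

# The identifier is not in the vocabulary, so it is decomposed into several
# subword pieces rather than being treated as a single variable token.
print(tokenizer.tokenize(identifier))

# In a full statement, operators, parentheses, and the repeated identifier
# each occupy additional positions in the context window.
print(len(tokenizer.tokenize(statement)), "tokens for one short statement")
```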
Therefore, this work focuses on addressing the aforementioned three issues. The contributions of this study are as follows:
  • We use a targeted compiler for preliminary tokenization of the code and then replace the tokens with lexical units for input into the LLM. With this method, our new model reduces the token input length by 33.7% compared to the baseline model.
  • We perform word restoration on the code generated by the LLM, achieving an output accuracy close to that of the baseline model with compressed tokens. Through compiler preprocessing with non-compressed tokens, we achieve accuracy exceeding that of the baseline model, with input token lengths approximately equal to those of the baseline.
The remainder of this paper is divided into four sections. Section 2 reviews the related literature. Section 3 details each part of the algorithm. Section 4 presents the experimental results. Finally, Section 5 provides the conclusions.

2. Related Work

Statistical language modeling is the task of developing a probabilistic model to predict the next tokens in a sequence given its preceding tokens [7]. For simpler language models, the context is a short sequence of words; for more complex models, the context can be an entire sentence or paragraph [8]. Language models can be used for tasks involving text generation, as well as language understanding. Programming languages also contain predictable statistical properties; therefore, LMs can be used to learn these properties [9].
Tokenization is a crucial step when LLMs handle source code. Different models employ different tokenization methods, which can affect the model’s performance [10,11]. In the field of code intelligence, language models still face certain challenges. Unlike natural language, source code has an unlimited vocabulary as developers can introduce new terms within new scopes [12]. This contrasts with natural language, where the vocabulary size is typically limited during training and does not expand during testing. Therefore, source code models must capture this characteristic.
Due to the problems posed by new vocabulary in source code, the number of tokens generated by the source code after tokenization is often higher than that in natural language. Source code also requires a longer context length because distant information is often critical. Additionally, modern programs often consist of hundreds of thousands of lines of code; this creates stricter requirements regarding the context lengths of LLMs.
Several solutions have been proposed by the academic community to improve the tokenizers in language models for code intelligence, addressing the issue of poor tokenizer performance in this field. Gautier Dagan et al. [13] explored the significance of tokenizers in language models, discovering that the tokenizer design significantly impacts the model’s generation speed, memory usage, and context size. They proposed methods to optimize tokenizers for code generation tasks. Hellendoorn et al. introduced a count-based dynamic language model capable of effectively handling an unlimited vocabulary, demonstrating good performance and flexibility. Their approach achieved low entropy in source code language modeling tasks, comparable to the best models with a limited vocabulary [12]. Karampatsis et al. proposed an open vocabulary neural language model based on subword units, effectively solving the unknown word problem posed by new identifier names in the code [14]. Feng Dawei et al. proposed the Information Gain Optimized Tokenizer, which constructs a custom tokenizer by analyzing downstream task data to identify the most effective technical terms, significantly improving the performance of LLMs in specific domains [15]. Sachidananda Vin et al. identified domain-specific subword sequences from the conditional token distributions of both general and domain-specific corpora, proposing a method to efficiently transfer pretrained language models to new domains [16].
In the above study, the optimized tokenizer was shown to improve the efficiency of source code tokenization. However, the tokens generated by optimized tokenizers cannot be understood by the original language models. For example, the tokenization methods proposed by Gautier Dagan et al. [13] and Hellendoorn et al. [12] cannot be directly applied to LLMs. The Open Vocabulary Neural Language Model [14] and the Information Gain Optimized Tokenizer [15] require fine-tuning for the original language models. Meanwhile, the adaptive tokenization of contextual embeddings [16] requires the training and transfer of domain-specific corpora.
Regarding the issue of insufficient context lengths in LLMs in the field of code intelligence, several authors have discussed how to optimize the representation of source code tokens. Rabin et al. [17,18] implemented a code simplification program that identifies the key features on which the model depends; this helps to improve the interpretability and reliability of these models. Svyatkovskiy et al. [19] significantly reduced the memory consumption of code completion models through a neural reordering model based on static analysis. Yaoxian Li et al. [20] explored the impact of code transformations on code intelligence tasks, finding that models based on abstract syntax trees performed more robustly under most transformations. Yunhui Zheng et al. proposed a systematic approach called prediction-preserving input minimization to assess the ability of language models to capture critical signals in source code understanding tasks [21].
In the above study, the length of the input was reduced by extracting key features from the source code, but there were problems in the downstream task. Sivand-Perses [18] simplified the source code through syntax-guided methods, significantly reducing the input length. However, this simplification process also removes code variables and other information, making it unrecoverable. The reranking neural model [19] reorders the code completion results obtained through static analysis using a language model, but it is only applicable in IDE scenarios. Prediction-preserving input minimization [21] can be used to assess the ability of LLMs to capture critical information in the source code, but it cannot be used to optimize a model’s performance.

3. Algorithm Architecture

This section introduces the base framework applied to optimize the tokenization of input statements in the field of code intelligence.

3.1. Challenges

In the field of code generation, enhancing the processing efficiency of LLMs is critical. Tokenization is the basis of LLM inference, and increasing the tokenization accuracy can enhance this processing efficiency to some extent. The context length of an LLM indicates the maximum number of input text tokens that the LLM can take into account and analyze; it also determines the ability of the LLM to capture the relationships among the tokenization results. Therefore, extending the context length would also contribute to improving the processing efficiency of LLMs.
Out-of-vocabulary (OOV) terms represent an essential factor affecting the accuracy of tokenization. The OOV problem refers to the situation in which certain words are not incorporated into the pretraining corpus of the model and therefore cannot be identified and analyzed. This problem is highly prevalent in the tokenization of LLMs, especially when new vocabularies, technical terms, or unusual words are present, and it significantly reduces the tokenization accuracy. At present, the most widely used tokenization algorithms for LLMs regarding natural language are byte pair encoding (BPE) [22] and byte-level BPE (BBPE). These algorithms originated from the field of data compression and enable the representation of variable-length words within a fixed-sized vocabulary, minimizing the occurrence of the OOV problem within such vocabularies.
Source code has a larger vocabulary compared to natural language. Additionally, as words such as variable names, function names, and constant keywords can be created by developers, source code is more likely to be affected by out-of-vocabulary words compared to natural language. LLMs usually adopt the BBPE algorithm for tokenization, which enables any word to be divided into subwords from a fixed vocabulary. Although the BBPE algorithm can effectively address the OOV problem, it does not enable the semantic information within the code to be acquired, and it only considers the statistical word frequencies of the training set. As a result, the same word may be divided into different subwords, thereby diminishing the tokenization accuracy for the source code. Moreover, once the source code has been processed via the BBPE algorithm, it is more likely to produce longer token sequences, and the combination patterns of the subwords in the training set diverge from the inference of the model. Furthermore, the grammatical rules of source code vary across different language types. Despite the fact that the BBPE algorithm has been trained based on multiple language types and can be employed with the same tokenization logic to handle source code from different language types, the tokenization accuracy declines as the application scope is enlarged.
Based on the above, the BBPE algorithm exhibits low efficiency when processing source code corpora. The fundamental reason for this inefficiency is that the BBPE algorithm segments text based on word frequencies; this not only hinders the acquisition of accurate semantic information but also results in an excessive number of tokens.
In contrast to the BBPE algorithm, compiler lexical analysis is grounded in semantics. Theoretically, as long as the input source code is accurate, it can be precisely segmented. Additionally, compiler lexical analysis allows the accurate and complete acquisition of the semantic information of the source code, resulting in shorter token sequences after the analysis. However, as the vocabulary that is generated through compiler lexical analysis is theoretically infinite in size, it is impractical and inefficient to employ this vocabulary directly for the training of LLMs.
The context length refers to the maximum number of tokens of input text that an LLM can analyze. Typically, the context length for such models is fixed. Due to the extensive use of function nesting, function calls, module definitions, and class hierarchies in programming languages, source code exhibits a high degree of information correlation over extended contexts. However, the fixed context length limits the ability of LLMs to effectively capture and analyze the overall features and structured information inherent in source code. Furthermore, source code contains a significant number of symbols, such as assignment operators, arithmetic operators, delimiters, and parentheses. Parentheses are commonly used in function calls and conditional expressions. After tokenization, these symbols often appear as individual tokens, thereby occupying valuable positions within the context length. Extending the context lengths of LLMs can enhance their processing efficiency. However, in order to handle longer contexts, computers require substantial computing power and storage capacities, thereby imposing higher demands on the hardware infrastructure.
For LLMs with a specified context length, it is advantageous to have a fixed-size vocabulary, ensure precise semantic representation, and utilize shorter token sequences.
To achieve tokenization results with the aforementioned characteristics, a two-step tokenization approach is employed. Initially, the source code undergoes compiler-based lexical analysis to generate a shorter sequence of tokens enriched with accurate semantic information. Subsequently, BBPE is applied to transform this token sequence into a fixed vocabulary format that is suitable for LLMs. Moreover, the results undergo selection to compress the number of tokens generated during the initial processing to satisfy the predefined context length of the LLM. This process aims to enhance the effective information density of the token sequence while maintaining the total number of tokens read by the model. Finally, the output text is generated through model inference, and the compressed tokens in the output text are restored in the final step.
The overall structure of the algorithm is shown in Figure 1.
  • Compiler-based tokenizer: The appropriate compiler is selected based on the type of source code and serves to break it down into lexical units.
  • Secondary tokenization based on LLM: This module tokenizes the lexical units into llama tokens and subsequently extracts tokens according to the principle of maximum similarity. During the extraction process, a context dictionary will be generated for restoration.
  • Generation and recovery of the source code: The extracted token list is input into the code-llama model for generation; subsequently, the output is restored using a context dictionary.

3.2. Tokenization of Input Sequences Based on Compiler Lexical Analysis

A lexical unit is the smallest meaningful unit in programming language processing. Lexical units are the basic elements identified and extracted by a compiler or interpreter from the source code during the lexical analysis phase. These basic elements form the foundation of the program syntax. The primary task of the lexical analysis phase is to break down the source code into lexical units. Although the basic principles of this process are the same across different programming languages, the implementation and processing details will vary depending on the characteristics of the language. Therefore, it is necessary to choose the appropriate lexical analyzer according to the type of language.
Source code classification is the process of categorizing code based on criteria such as the functionality and programming language. Initially, this can be determined by the file extension. In the absence of file names, the programming language must be identified through the source code’s content. The academic community has developed several methods for the identification of code types [23]. As the experimental data considered in this study are in Java format, the language identification logic is omitted. Instead, a Java compiler is directly used for the lexical analysis. At present, there are two lexical analysis libraries that are compatible with the proposed approach, namely javalang and Tree-sitter. Among them, javalang is primarily utilized to parse Java code and does not support other programming languages. Nevertheless, its strength lies in its ability to offer high-quality lexical analysis procedures and generate sequences of lexical units; this makes it highly suitable for this study. Hence, javalang is adopted here.
The lexical analyzer carries certain requirements regarding the input source code. If there are formatting errors, the tokenizer will immediately stop working. However, in the practical usage of LLMs, inputs with formatting errors are permitted. This situation is addressed in this study. First, the input data are preprocessed, correcting the code format through multiple regular expression matches. Next, the logic of the lexical analyzer is modified. The lexical analyzer processes the input sequentially, reading the characters one by one from the input buffer and matching them according to defined lexical rules (usually described by regular expression patterns) to identify tokens. If there are still errors after correction, the incomplete tokens are directly outputted, and the tokenizer skips the anomalous characters to begin the next round of tokenization. After these steps, a sequence of lexical units can be obtained.
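As a sketch of this stage, the snippet below uses the javalang tokenizer to produce a sequence of lexical units and tolerates malformed input by keeping whatever has been read before a lexical error; the full skip-and-resume behaviour described above is simplified here.

```python
# A sketch of compiler-based preliminary tokenization for Java source code.
# Assumes the `javalang` package; the error handling shown here is a
# simplified stand-in for the skip-and-resume logic described in the text.
import javalang
from javalang.tokenizer import LexerError

def lex_java(source: str) -> list:
    """Return the lexical-unit values of `source`, tolerating malformed input."""
    units = []
    try:
        # javalang yields lexical units (keywords, identifiers, literals,
        # separators, operators) together with their textual values.
        for tok in javalang.tokenizer.tokenize(source):
            units.append(tok.value)
    except LexerError:
        # Incomplete or malformed input: keep the units read so far instead of
        # aborting. (The full method additionally skips the offending
        # characters and resumes tokenization.)
        pass
    return units

code = "public int add(int a, int b) { return a + b; }"
print(lex_java(code))
# -> ['public', 'int', 'add', '(', 'int', 'a', ',', 'int', 'b', ')', '{', ...]
```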
Although the model described in this work supports the Java language, it can be readily extended to multiple programming languages after selecting an appropriate lexical analyzer. The chosen lexical analyzer must satisfy two key criteria: firstly, it should support the generation of a sequence of lexical units, as with javalang; secondly, it should be capable of capturing exceptions during the lexical analysis phase so as to prevent interruptions in the model’s inference process. The method proposed in this study is flexible regarding the specific programming language used.

3.3. Secondary Tokenization of Lexical Unit Sequences

The obtained lexical units may not be present in code-llama’s vocabulary, making it impossible to perform token embedding calculations. Therefore, it is necessary to conduct the secondary tokenization of the lexical unit list, breaking down each token into a form that is understandable by code-llama; we call these code-llama tokens. In this way, it is possible to convert the lexical unit list into a code-llama token list, where each element can be understood and converted into an embedding vector by code-llama.
The direct use of this token list produces similar results to those obtained when using code-llama’s tokenizer directly, and it fails to achieve the effective compression of the input. Therefore, we next perform a secondary extraction step based on the type of tokens. Through this process, we not only leverage the structural features of the code but also exploit code-llama’s ability to understand tokens, aiming to achieve more accurate semantic parsing and information compression.
We select a token from each lexical unit to represent the original lexical unit. Various selection methods can be used, such as finding the token that is most similar to the original token or randomly selecting a token. Experimental validation shows that selecting the token with the highest similarity is the most suitable method.
In addition to selecting individual tokens, it is also possible to add all tokens. Although this approach does not allow us to compress the input sequence, adding all tokens is beneficial for code generation in code-llama. In the experiments described in the next section, we also consider the scenario in which all tokens are added.
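The sketch below illustrates both variants of this step: each lexical unit is tokenized with the model’s tokenizer, and either all resulting tokens are kept (full addition) or only the token with the highest fuzz ratio to the original unit is retained. The Hugging Face tokenizer checkpoint and the use of the rapidfuzz package as the similarity measure are assumptions for illustration.

```python
# A sketch of secondary tokenization of lexical units with optional
# most-similar-token selection. Assumes the Hugging Face `transformers` and
# `rapidfuzz` packages; the checkpoint name and helper are illustrative.
from transformers import AutoTokenizer
from rapidfuzz import fuzz

tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-hf")

def secondary_tokenize(lexical_units, select_most_similar=True):
    """Map each lexical unit to code-llama tokens; if selection is enabled,
    keep only the subword with the highest fuzz ratio to the original unit."""
    output_tokens = []
    for unit in lexical_units:
        pieces = tokenizer.tokenize(unit)
        if not select_most_similar or len(pieces) <= 1:
            output_tokens.extend(pieces)  # full token addition
            continue
        # Strip the SentencePiece word-boundary marker before comparing.
        best = max(pieces, key=lambda p: fuzz.ratio(p.lstrip("▁"), unit))
        output_tokens.append(best)
    return output_tokens

units = ["GroovyScriptNodeFactory", "factory", "=", "new",
         "GroovyScriptNodeFactory", "(", ")", ";"]
print(secondary_tokenize(units))                             # compressed form
print(secondary_tokenize(units, select_most_similar=False))  # full addition
```

In the selection variant, each lexical unit contributes exactly one embedding position, which is what drives the reduction in input length reported in Section 4.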

3.4. Contextual Subword Recovery of Output Sequences

Regarding the subword sequences obtained after the initial screening and adjustment, it is crucial to consider the possibility of their reappearance during the model’s generation process. In this case, it is necessary to perform appropriate restoration operations on these subwords to ensure that they function correctly within the context and convey the necessary variable information during generation. The restoration operation ensures the continuity and completeness of the contextual semantics so that the simplification process does not compromise the quality of the model-generated content. Additionally, this subword handling helps to maintain the original sentence structure and intent, avoiding serious deviations from the expected output due to the absence of subwords. Through this optimization strategy, we aim to maintain high consistency in the generated text while achieving the effective compression of the input sequence.
In an LLM, during a single inference process, a context dictionary is constructed. In this dictionary, the keys are llama tokens selected after secondary word segmentation, and the values are the original lexical units prior to simplification. The construction of the context dictionary occurs during the secondary extraction phase. For each lexical unit requiring secondary word segmentation, after obtaining the token sequence, the most similar token is identified and paired with the original lexical unit in the form of key–value pairs within the context dictionary. After the model generates the token sequence, each token is searched in the dictionary in sequence. If a match is found, the lexical unit in the value is replaced with the current llama token.
An example of contextual subword recovery is shown in Figure 2. The subsequent explanations are provided according to the numbering indicated in the figure, and a short code sketch after the list mirrors the same steps.
  • “GroovyScriptNodeFactory” is a lexical unit, used as input.
  • The lexical unit is tokenized using the llama tokenizer; the number following each token represents its fuzz ratio with respect to the input. We select the most similar llama token, “Factory”, as the representative for this lexical unit.
  • “Factory” and “GroovyScriptNodeFactory” are stored in the context dictionary as key–value pairs for subsequent output recovery.
  • The code-llama model performs inference, producing a code snippet containing the “Factory” llama token.
  • The model output is restored using the context dictionary, resulting in an improved output.
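The sketch below mirrors these steps in code: the context dictionary is built from the (selected token, original lexical unit) pairs recorded during secondary extraction, and the generated output is then scanned so that compressed tokens are replaced by their original lexical units. The word-level replacement is a simplification of the actual restoration logic.

```python
# A sketch of context dictionary construction and output recovery. The
# word-level replacement is a simplification of the actual restoration logic.
def build_context_dict(selections):
    """selections: iterable of (chosen_token, original_lexical_unit) pairs,
    recorded while the most similar token is selected for each lexical unit."""
    return {token: unit for token, unit in selections}

def recover_output(generated_text, context_dict):
    """Replace compressed tokens in the model output with the original
    lexical units stored in the context dictionary."""
    return " ".join(context_dict.get(word, word)
                    for word in generated_text.split())

# Following the Figure 2 example: "Factory" stands in for "GroovyScriptNodeFactory".
context = build_context_dict([("Factory", "GroovyScriptNodeFactory")])
generated = "Factory node = new Factory ( ) ;"
print(recover_output(generated, context))
# -> "GroovyScriptNodeFactory node = new GroovyScriptNodeFactory ( ) ;"
```

Because the dictionary is keyed by the selected tokens, any occurrence of a selected token in the output is expanded back to the full identifier, which is how variable names remain consistent across contexts.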

4. Experimental Analysis

4.1. Dataset

In this section, we introduce the experimental setup, the dataset used, and the baseline model and verify the effectiveness of the compressed input sequences using the experimental results. We employed the public dataset released by CodeXGLUE [24] to conduct the research task. This dataset was derived from the research by Allamanis et al. [25] and was preprocessed in [26]. The basic information of the dataset is shown in Table 1.
In order to adapt to the lexical analysis process of the compiler, this study has removed special identifiers and adopted the original source code text format. The preprocessed dataset is provided in the Supplementary Materials. For the generation of input and validation data, each piece of data is evenly divided into two parts: the first half serves as the input, while the second half serves as the reference for evaluating the model output.
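A minimal sketch of this preparation step is shown below; splitting each sample at its whitespace-token midpoint is an assumption about the exact splitting granularity.

```python
# A sketch of building (input, reference) pairs: each sample is split evenly,
# with the first half used as model input and the second half as the
# reference for evaluation.
def split_sample(source: str):
    tokens = source.split()
    mid = len(tokens) // 2
    return " ".join(tokens[:mid]), " ".join(tokens[mid:])

sample = "public int add ( int a , int b ) { return a + b ; }"
model_input, reference = split_sample(sample)
print("input:    ", model_input)
print("reference:", reference)
```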

4.2. Experimental Model

In the experiments described in this section, code-llama [27] was used as the baseline model. Four models were considered.
  • Baseline Model: The codellama-7b model was used as the comparison benchmark.
  • Full Token Addition Model: The tokens were first processed by the compiler and then tokenized by the llama tokenizer before being input into the model.
  • Contextual Recovery Model: Based on the full addition model, the llama token sequence was filtered, where, for each lexical unit, only the most similar token was chosen to be input into the model; this was implemented in order to improve the token compression ratio. A context dictionary completion algorithm was added during model output to complement the generated tokens, improving the quality of the output.

4.3. Metrics

The token compression ratio is defined as the ratio of the number of tokens generated by the model’s tokenizer to the number of characters in the original text. It is calculated using Formula (1).

$$\text{Token Compression Ratio} = \frac{L_{\text{model}}(s)}{L_{\text{char}}(s)} \quad (1)$$

Here, $L_{\text{model}}(s)$ denotes the number of tokens generated by the model’s tokenizer from the original text $s$, and $L_{\text{char}}(s)$ denotes the total number of characters in the original text.
The Bilingual Evaluation Understudy (BLEU) score is derived from metrics used to evaluate the quality of machine translation [28]. The calculation is presented in Formula (2).

$$\text{BLEU} = \text{BP} \cdot \exp\left( \sum_{n=1}^{N} w_n \log P_n \right) \quad (2)$$

$$\text{BP} = \begin{cases} 1 & \text{if } c > r \\ e^{\,1 - r/c} & \text{if } c \le r \end{cases}$$

The brevity penalty (BP) is used to evaluate the length difference between the generated result and the reference result, where c is the length of the generated result and r is the length of the reference result. The smaller the generated length c, the smaller the BP value; this lowers the overall BLEU score, thereby imposing a penalty. $P_n$ denotes the n-gram precision, i.e., the proportion of sequences of n consecutive words in the generated text that also appear in the reference text, and $w_n$ is the corresponding weight (uniform, $w_n = 1/N$, in the standard formulation). In this study, we primarily adopt the 4-gram method, using the BLEU calculation provided by the NLTK package [29]. The BLEU score ranges from 0 to 1, with 1 indicating perfect similarity. To calculate the BLEU score, both the generated text and the reference text are first tokenized, and the score is then computed over subword units. Therefore, this metric evaluates the model’s generation quality at the subword level.
The fuzz ratio is calculated with a fuzzy string-matching algorithm based on the Levenshtein distance [30]. The Levenshtein distance between two strings is the minimum number of single-character edits (insertions, deletions, or substitutions) required to transform one string into the other; the resulting similarity score ranges from 0 to 100, with 100 indicating an exact match. The Levenshtein distance is presented in Formula (3).

$$\operatorname{lev}(a, b) = \begin{cases} |a| & \text{if } |b| = 0 \\ |b| & \text{if } |a| = 0 \\ \operatorname{lev}(\operatorname{tail}(a), \operatorname{tail}(b)) & \text{if } \operatorname{head}(a) = \operatorname{head}(b) \\ 1 + \min \begin{cases} \operatorname{lev}(\operatorname{tail}(a), b) \\ \operatorname{lev}(a, \operatorname{tail}(b)) \\ \operatorname{lev}(\operatorname{tail}(a), \operatorname{tail}(b)) \end{cases} & \text{otherwise} \end{cases} \quad (3)$$

Let both a and b be strings, where $|a|$ denotes the length of string a. When either a or b has a length of 0, the Levenshtein distance equals the length of the other string. $\operatorname{head}(a)$ represents the first character of string a, and $\operatorname{tail}(a)$ represents the string obtained by removing the first character of a. This distance measures the edit distance between the two strings a and b.
The fuzz ratio is defined in Formula (4).
$$\operatorname{ratio}(a, b) = 100 \left( 1 - \frac{2\,\operatorname{lev}(a, b)}{|a| + |b|} \right) \quad (4)$$
The fuzz ratio is calculated based on the lengths of strings a and b, as well as their Levenshtein distance. The upper limit of the Levenshtein distance is max( | a | , | b | ). When the Levenshtein distance between a and b reaches the maximum, ratio ( a , b ) equals 0. Therefore, this metric can be used to assess the generation quality of the model at the character level.
The fuzz partial ratio is determined using a string-matching algorithm that extends the fuzz ratio by considering only the best-matching substring between the two strings. The similarity score is calculated over the best-matching substring of the longer string rather than over the entirety of both strings, which is useful when one string is a subset or prefix of the other. Assuming that string a is longer than string b, let $a_i$ denote a substring of a. Mathematically, this can be expressed as Formula (5).

$$\operatorname{partial\_ratio}(a, b) = \max_{i} \operatorname{ratio}(a_i, b) \quad (5)$$
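The three metrics can be computed with standard Python libraries, as sketched below; the tokenizer checkpoint and the BLEU smoothing function are assumptions rather than settings specified in the paper.

```python
# A sketch of the three evaluation metrics. Assumes the `transformers`,
# `nltk`, and `rapidfuzz` packages; the tokenizer checkpoint and the BLEU
# smoothing choice are assumptions.
from transformers import AutoTokenizer
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rapidfuzz import fuzz

tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-hf")

def token_compression_ratio(text: str) -> float:
    """Formula (1): tokens produced by the model's tokenizer / characters."""
    return len(tokenizer.tokenize(text)) / len(text)

def bleu_4(reference: str, generated: str) -> float:
    """Formula (2): 4-gram BLEU over subword units, via NLTK."""
    ref_tokens = tokenizer.tokenize(reference)
    gen_tokens = tokenizer.tokenize(generated)
    return sentence_bleu([ref_tokens], gen_tokens,
                         weights=(0.25, 0.25, 0.25, 0.25),
                         smoothing_function=SmoothingFunction().method1)

reference = "int sum = a + b ; return sum ;"
generated = "int sum = a + b ;"
print(token_compression_ratio(reference))
print(bleu_4(reference, generated))
print(fuzz.ratio(reference, generated))          # Formula (4), character level
print(fuzz.partial_ratio(reference, generated))  # Formula (5), best substring
```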

4.4. Analysis of the Results

We conducted experiments based on the four models proposed above, and the corresponding experimental results are given below.
The experimental results regarding the token compression ratio are presented in Table 2. The two models employing similar-word selection demonstrated a significant compression effect, reducing the number of input sequence tokens by 33.7%. Even the Full Token Addition Model, which retains all tokens from the secondary tokenization, produced a token count lower than that of the original model’s input. This is likely because the compiler’s segmentation incorporates more information about the code than the model’s tokenizer does, resulting in more effective segmentation.
The experimental results regarding the model generation quality are presented in Table 3. The models using compiler tokenization performed well. The Full Token Addition Model showed an increase in accuracy. The models employing the contextual subword recovery method maintained their baseline model accuracy, even with a reduced input token length.
In the above-mentioned experiment, the token length for the model’s output text was set to 50. Experiments were also conducted with other text lengths, and the specific results are shown in Figure 3 and Figure 4.
In this study, experiments were conducted on six different maximum output lengths: 10, 20, 30, 50, 75, and 100. The experimental results show that, as the maximum output length increased, both the model’s BLEU score and fuzz ratio gradually decreased. This phenomenon was expected, as longer output texts pose greater difficulty in matching the target text, resulting in a decline in similarity. However, under different parameter settings, the score rankings of the three models remained consistent, indicating that the proposed method is stable and effective.

4.5. Ablation Experiments

This section further discusses the contributions of each component to the output results in the context completion model based on compiler tokenization. The following model variants were designed for the ablation experiments.
  • w/o Compiler: In this variant, the compiler-based tokenization module was removed. As the compiler tokenization module was the foundation of the algorithm and could not be isolated, this model is presented as the baseline model.
  • Select and Recovery: This refers to the context recovery model proposed in this work, which combines the selection of the most similar token and the context dictionary recovery algorithm.
  • w/o Select: In this variant, the step that involved selecting the most similar token from the secondary lexical units was removed. In this model, a token is randomly selected to represent the original lexical unit.
  • w/o Recovery: In this variant, the context dictionary recovery algorithm was removed and the model’s output was used directly.
  • w/o Select and Recovery: In this variant, both the method of selecting the most similar token and the context dictionary recovery algorithm were removed.
The results of the ablation experiments are shown in Table 4. It can be observed that both the highest similarity selection algorithm and the context recovery algorithm proposed in this study have a positive impact on the model’s output quality. Specifically, when the context recovery algorithm was removed, both the BLEU score and similarity decreased significantly, indicating the significant contribution of the context recovery algorithm to the model. The BLEU score reflects the generation quality from the perspective of the lexical units, while the similarity operates at the character level. The context recovery algorithm helps to maintain consistency in the variable names across different contexts, thereby improving the BLEU score. Meanwhile, the highest similarity selection algorithm, applied in the secondary extraction process, also contributes to the final result. Through the selection of tokens with higher similarity, this algorithm enables the better representation of the lexical units, thus improving the output. With the exception of the baseline model, all models were based on compiler pre-tokenization and secondary extraction, so the number of tokens in the input was the same. Therefore, with the exception of the baseline model, all models had the same token compression rate.

4.6. Case Study

4.6.1. A Study on the Applicability of the Context Recovery Model in Long Input Contexts

This study conducts a Java code completion task to demonstrate the superiority of the context recovery model in handling long input contexts. The incomplete code and the reference output code are illustrated in Figure 5 and Figure 6, respectively. This code includes multiple methods and intricate control flows, providing a rigorous test of the model’s ability to understand and reason within extended contexts. Specifically, the content within the main function has been concealed, and the model is required to complete the code.
Given that the hardware device and model support a maximum input token length of 550, the source code in the case study was tokenized into 739 tokens by the baseline model. Due to the context length limitation, the baseline model could only process the latter part of the code, resulting in the class definition and data reading function being omitted. In contrast, after tokenization by the context recovery model, the number of tokens was reduced to 529, enabling the model to handle the entire context effectively.
Through the actual operation of model inference, the results of the two models are presented in Figure 7. Specifically, Figure 7a shows the output of the baseline model, while Figure 7b illustrates the output of the context restoration model. Our analysis reveals the following:
Baseline model: Due to its limited context length, this model performs suboptimally in code completion tasks and fails to capture the complete workflow required for data analysis.
Context recovery model: This model accurately identifies the data analysis process and successfully completes the code by incorporating all necessary steps for generating a comprehensive report.
These findings indicate that the context restoration model demonstrates significant advantages in handling long input texts. In code completion tasks, it effectively leverages the full context, thereby providing higher-quality outputs.

4.6.2. Application Examples of Context Recovery Model

To further explore the role of the context recovery model in practical applications, a hypothetical case study is presented that demonstrates how a code generation tool based on LLMs—namely a coding assistant program—can benefit from the proposed model. Suppose that there exists a coding assistant program based on RAG technology, which supports retrieval from a pre-built code repository to provide richer context for the user’s code generation needs, thus enabling more accurate outputs. The operation process of this coding assistant can be divided into three stages: user input, code repository retrieval, and enhanced result generation.
  • Application of the context recovery model
    The model proposed in this study can be integrated into the main operation process of the coding assistant. The key stages involved in this case are as follows.
    Vector Database Construction: To build the RAG system, a vector database based on the code repository needs to be created. First, the code must be chunked; moreover, to improve the retrieval performance, the chunked code should be subjected to natural language enhancement, which involves generating comments for the source code using an LLM. At this stage, the input data type is the source code; thus, the context recovery model can be applied.
    Enhanced Result Inference Stage: At this stage, the coding assistant combines the filtered code snippets with the context information provided by the user and inputs them into an LLM for inference, generating the final answer to be returned to the user. The input data in this stage mainly consist of source code and a small number of natural language questions provided by the user, so the context recovery model can be applied.
  • Anticipated effects
    Enhanced system operation speeds: With the input remaining constant, a reduction in the number of tokens input into the LLM will decrease the time required for the inference process, thereby accelerating the inference speed. This improvement will significantly enhance the efficiency of vector database construction and result inference.
    Reduced token costs: As LLMs continue to evolve, their inference costs are rising. The inference time and computational resource consumption increase substantially with the increase in model parameters. In the absence of context length limitations, a smaller number of tokens included in the input sequence will lead to lower token costs.
    Scalability and flexibility: The context recovery model is applicable to input data consisting of source code, without altering the original workflow. This approach reduces the development and integration costs associated with existing code assistants, making it more adaptable to various applications.

5. Conclusions

This work discusses the limitations of the current LLMs in the field of code generation and explores the feasibility of introducing compiler-based tokenization to provide richer semantic information for model inference and conducting secondary extraction after tokenization to reduce the number of input tokens, thereby enhancing computational efficiency.
We propose a context restoration model based on compiler tokenization and input sequence compression and validate its effectiveness through experiments. This model employs a two-step tokenization process: First, it performs lexical analysis on the source code to generate a concise token sequence with accurate semantic information. Next, it applies the BBPE algorithm to transform this token sequence into a fixed vocabulary suitable for LLMs. Following this, the results undergo selection to compress the number of tokens generated during the initial processing to satisfy the predefined context length of the LLM, aiming to enhance the effective information density of the token sequence without altering the number of tokens processed by the model. Finally, after model inference to produce the output text, the compressed tokens in the output are restored.
  • The introduction of compiler tokenization provides the model with a more extensive and nuanced information set, leading to improved inference performance. Compared to the baseline model, the full addition model has demonstrated significant enhancements in the BLEU score, the fuzz ratio, and the fuzz partial ratio.
  • The introduction of secondary extraction substantially reduces the token compression rate. However, relying solely on the secondary extraction of word tokenization results yields inference outcomes with suboptimal similarity. Restoring the inference results after secondary extraction not only maintains a relatively low compression rate but also significantly improves the BLEU score, the fuzz ratio, and the fuzz partial ratio. Moreover, during secondary extraction, employing the highest similarity selection algorithm, as opposed to random selection, results in a BLEU score, fuzz ratio, and fuzz partial ratio that more closely align with those of the baseline model.
  • The context recovery model performs token sequence compression by selecting the highest similarity tokens based on the full addition model, followed by inference and restoration using the compressed output. This approach achieves a significant reduction in the token compression rate with only a minor decrease in output quality. Compared to the codellama-7b baseline model, the context recovery model reduces the number of tokens by 33.7% while maintaining comparable output quality.
  • The effectiveness of the context recovery model was evaluated using a Java code completion task. The results indicated that the baseline model, constrained by the limited context input length, failed to capture the overall architectural information in the source code. In contrast, the context recovery model successfully captured this architectural information and performed the completion task more accurately. This suggests that when processing long input texts, the context recovery model provides higher-quality output compared to the baseline model.
This study provides a research foundation for the extension of the application of LLMs in the domain of code intelligence and introduces novel perspectives and methodologies regarding the maximization of the context length in LLMs. However, our work exhibits the following limitations:
  • The lexical analyzer utilized in this study must satisfy two critical requirements: firstly, it is required to support the generation of lexical unit sequences; secondly, it must be capable of exception handling. At present, lexical analyzers that meet these criteria still require manual configuration and are not adaptable to multiple programming languages. Consequently, this increases the cost associated with applying the proposed method in multi-language environments.
    To address the aforementioned challenges, we propose the following research directions for further investigation. It is necessary to construct a training dataset comprising diverse types of source code and their corresponding lexical unit sequences. It would also be beneficial to train a model to learn tokenization rules across multiple programming languages and to capture exceptions when syntax errors arise. This approach could contribute to the development of a robust model that is capable of performing lexical analysis on various programming languages.
  • The approach proposed in this study involves performing LLM inference after selecting tokens derived from tokenization. Another potential research direction consists of enhancing the accuracy and efficiency of model inference through the selection of specific tokens. We suggest that a balance between high-quality model inference and computational efficiency can be achieved by fine-tuning the screening strategy, considering methods such as the inclusion of all tokens, the selective addition of individual tokens, or the removal of redundant tokens.
In conclusion, this work describes a method that can be used to reduce the token lengths of input sequences through compiler tokenization and a context subword recovery method that helps to improve the output quality of the model. The experimental results validate the feasibility and effectiveness of our approach. This research provides new insights into the application of LLMs in the field of code intelligence.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/info16020073/s1.

Author Contributions

Conceptualization, Z.L.; methodology, Z.L.; software, Z.L.; validation, Z.L.; investigation, Z.L.; data curation, Z.L.; writing—original draft preparation, Z.L.; writing—review and editing, X.L.; visualization, Z.L.; supervision, X.L.; project administration, X.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original data presented in the study are openly available in CodeXGLUE at https://github.com/microsoft/CodeXGLUE (accessed on 10 November 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
LLMsLarge language models
OOVOut of vocabulary

References

  1. Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. Gpt-4 technical report. arXiv 2023, arXiv:2303.08774. [Google Scholar]
  2. Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. Llama 2: Open foundation and fine-tuned chat models. arXiv 2023, arXiv:2307.09288. [Google Scholar]
  3. Wu, Y. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv 2016, arXiv:1609.08144. [Google Scholar]
  4. Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language models are unsupervised multitask learners. OpenAI Blog 2019, 1, 9. [Google Scholar]
  5. Bostrom, K.; Durrett, G. Byte pair encoding is suboptimal for language model pretraining. arXiv 2020, arXiv:2004.03720. [Google Scholar]
  6. Brown, T.B. Language models are few-shot learners. arXiv 2020, arXiv:2005.14165. [Google Scholar]
  7. Liu, Y.; Zhang, M. Neural network methods for natural language processing. Comput. Linguist. 2018, 44, 193–195. [Google Scholar] [CrossRef]
  8. Schütze, H.; Manning, C.D.; Raghavan, P. Introduction to Information Retrieval; Cambridge University Press: Cambridge, UK, 2008; Volume 39. [Google Scholar]
  9. Hindle, A.; Barr, E.T.; Gabel, M.; Su, Z.; Devanbu, P. On the naturalness of software. Commun. ACM 2016, 59, 122–131. [Google Scholar] [CrossRef]
  10. Xu, F.F.; Alon, U.; Neubig, G.; Hellendoorn, V.J. A systematic evaluation of large language models of code. In Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming, San Diego, CA, USA, 13 June 2022; pp. 1–10. [Google Scholar]
  11. Wong, M.F.; Guo, S.; Hang, C.N.; Ho, S.W.; Tan, C.W. Natural language generation and understanding of big code for AI-assisted programming: A review. Entropy 2023, 25, 888. [Google Scholar] [CrossRef]
  12. Hellendoorn, V.J.; Devanbu, P. Are deep neural networks the best choice for modeling source code? In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, Paderborn, Germany, 4–8 September 2017; pp. 763–773. [Google Scholar]
  13. Dagan, G.; Synnaeve, G.; Rozière, B. Getting the most out of your tokenizer for pre-training and domain adaptation. arXiv 2024, arXiv:2402.01035. [Google Scholar]
  14. Karampatsis, R.M.; Sutton, C. Maybe deep neural networks are the best choice for modeling source code. arXiv 2019, arXiv:1903.05734. [Google Scholar]
  15. Feng, D.; Zhang, Y.; Xu, Z. IGOT: Information Gain Optimized Tokenizer on Domain Adaptive Pretraining. arXiv 2024, arXiv:2405.09857. [Google Scholar]
  16. Sachidananda, V.; Kessler, J.S.; Lai, Y.A. Efficient domain adaptation of language models via adaptive tokenization. arXiv 2021, arXiv:2109.07460. [Google Scholar]
  17. Rabin, M.R.I.; Hellendoorn, V.J.; Alipour, M.A. Understanding neural code intelligence through program simplification. In Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, New York, NY, USA, 23–28 August 2021; pp. 441–452. [Google Scholar]
  18. Rabin, M.R.I.; Hussain, A.; Alipour, M.A. Syntax-guided program reduction for understanding neural code intelligence models. In Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming, New York, NY, USA, 13 June 2022; pp. 70–79. [Google Scholar]
  19. Svyatkovskiy, A.; Lee, S.; Hadjitofi, A.; Riechert, M.; Franco, J.V.; Allamanis, M. Fast and memory-efficient neural code completion. In Proceedings of the 2021 IEEE/ACM 18th International Conference on Mining Software Repositories (MSR), Madrid, Spain, 28 June 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 329–340. [Google Scholar]
  20. Li, Y.; Qi, S.; Gao, C.; Peng, Y.; Lo, D.; Xu, Z.; Lyu, M.R. A closer look into transformer-based code intelligence through code transformation: Challenges and opportunities. arXiv 2022, arXiv:2207.04285. [Google Scholar]
  21. Zheng, Y.; Suneja, S.; Zhuang, Y.; Morari, A.; Laredo, J.A. Probing Model Signal Awareness. U.S. Patent App. 17/315,701, 10 November 2022. [Google Scholar]
  22. Sennrich, R. Neural machine translation of rare words with subword units. arXiv 2015, arXiv:1508.07909. [Google Scholar]
  23. Gilda, S. Source code classification using Neural Networks. In Proceedings of the 2017 14th International Joint Conference on Computer Science and Software Engineering (JCSSE), NakhonSiThammarat, Thailand, 7 September 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 1–6. [Google Scholar]
  24. Lu, S.; Guo, D.; Ren, S.; Huang, J.; Svyatkovskiy, A.; Blanco, A.; Clement, C.; Drain, D.; Jiang, D.; Tang, D.; et al. Codexglue: A machine learning benchmark dataset for code understanding and generation. arXiv 2021, arXiv:2102.04664. [Google Scholar]
  25. Allamanis, M.; Sutton, C. Mining source code repositories at massive scale using language modeling. In Proceedings of the 2013 10th Working Conference on Mining Software Repositories (MSR), San Francisco, CA, USA, 18–19 May 2013; IEEE: Piscataway, NJ, USA, 2013; pp. 207–216. [Google Scholar]
  26. Karampatsis, R.M.; Babii, H.; Robbes, R.; Sutton, C.; Janes, A. Big code!= big vocabulary: Open-vocabulary models for source code. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering, New York, NY, USA, 27 June—19 July 2020; pp. 1073–1085. [Google Scholar]
  27. Roziere, B.; Gehring, J.; Gloeckle, F.; Sootla, S.; Gat, I.; Tan, X.E.; Adi, Y.; Liu, J.; Sauvestre, R.; Remez, T.; et al. Code llama: Open foundation models for code. arXiv 2023, arXiv:2308.12950. [Google Scholar]
  28. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, 6–12 July 2002; pp. 311–318. [Google Scholar]
  29. Bird, S.; Klein, E.; Loper, E. Natural Language Processing with Python; O’Reilly Media, Inc.: Sebastopol, CA, USA, 2009; ISBN 978-0-596-51649-9. [Google Scholar]
  30. Levenshtein, V. Binary codes capable of correcting deletions, insertions, and reversals. Dokl. Akad. Nauk. SSSR 1965, 10, 707–710. [Google Scholar]
Figure 1. Overall structure of the algorithm.
Figure 2. An example of the restoration of a contextual subword.
Figure 3. BLEU score results regarding model output.
Figure 4. Fuzz ratio results regarding model output.
Figure 5. Input data for code completion task.
Figure 6. Reference output data for code completion task.
Figure 7. Model output results. (a) Baseline model output. (b) Context recovery model output.
Table 1. The basic information of the dataset.

Data Statistics | GitHub Java Corpus
Number of code fragments | 7176
Number of tokens | 3.8 M
Number of characters | 17 M

Table 2. Token compression ratio.

Model | Token Compression Ratio
Baseline Model | 0.313
Full Token Addition Model | 0.296
Contextual Recovery Model | 0.208

Table 3. Model generation quality.

Model | BLEU | Fuzz Ratio | Fuzz Partial Ratio
Baseline Model | 0.241 | 51.7 | 52.1
Full Token Addition Model | 0.309 | 56.6 | 56.9
Contextual Recovery Model | 0.213 | 50.7 | 51.1

Table 4. The results of the ablation experiment.

Model | Token Compression Ratio | BLEU | Fuzz Ratio | Fuzz Partial Ratio
w/o Compiler | 0.313 | 0.241 | 51.7 | 52.1
Select and Recovery | 0.208 | 0.213 | 50.7 | 51.1
w/o Select | 0.208 | 0.196 | 49.3 | 49.7
w/o Recovery | 0.208 | 0.150 | 49.0 | 49.4
w/o Select and Recovery | 0.208 | 0.142 | 47.3 | 47.7
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
