Article

A Prompt Example Construction Method Based on Clustering and Semantic Similarity

Ding Chen and Jun Wang
1 School of Economics and Management, Beihang University, No. 37 Xueyuan Road, Haidian District, Beijing 100191, China
2 Key Laboratory of Complex System Analysis, Management and Decision, Beihang University, Ministry of Education, Beijing 100191, China
* Author to whom correspondence should be addressed.
Systems 2024, 12(10), 410; https://doi.org/10.3390/systems12100410
Submission received: 25 August 2024 / Revised: 23 September 2024 / Accepted: 1 October 2024 / Published: 3 October 2024

Abstract

With the launch of OpenAI’s ChatGPT, large language models (LLMs) have garnered significant attention, and applications based on these models have proliferated. A critical challenge has emerged: how to rapidly enhance the capabilities of general LLMs in specialized domains. Compared with fine-tuning and other methods, prompt engineering has proven to be a cost-effective approach for improving the performance of LLMs on specific tasks, yielding remarkable results. However, current prompt example construction methods are numerous, and no universally applicable approach spans different models and tasks. Furthermore, existing research is predominantly tested and evaluated on a limited range of specific datasets, failing to explore the broader impact of these methods on a wider array of tasks. This paper proposes a prompt example construction method based on clustering and semantic similarity, which combines clustering algorithms with semantic similarity techniques to significantly improve the quality of prompt examples. In comparative tests conducted on six LLMs and seven datasets, the proposed method significantly outperforms five other common methods in overall accuracy and stability, demonstrating broad applicability and the potential to enhance the output performance of all LLMs. Through comparative experiments, this paper also finds that as the parameter scale of LLMs increases, the improvement that prompt example construction methods bring to LLM output performance tends to diminish, and that diversified prompt example sets provide a more pronounced enhancement in LLM output performance.

1. Introduction

For a long time, scholars in the field of artificial intelligence have been striving to enable machines to perfectly understand natural language and communicate with humans through it. Traditional language models have consistently fallen short of achieving human-like communication proficiency [1]. However, the launch of OpenAI’s ChatGPT has brought significant attention to large language models. ChatGPT, based on the GPT-3 model architecture, was trained through a three-step process involving unsupervised pre-training, fine-tuning, and reinforcement learning from human feedback (RLHF) [2]. This process endowed it with powerful natural language understanding and generation capabilities. Following ChatGPT, numerous other LLMs based on the Transformer architecture have emerged, such as the Llama model [3], the ChatGLM model [4], and the Bloom model [5]. Many of these models are now evolving towards multimodal capabilities, aiming to better serve multimodal dialogue scenarios [6], with GPT-4 being a prominent example that supports multimodal input as a visual language model. Additionally, many LLMs have been developed for specific vertical domains. These models are generally subjected to incremental pre-training or fine-tuning on domain-specific datasets, thereby enhancing their knowledge and content generation capabilities in particular fields. Examples include the ChatLaw model trained by Peking University for the legal domain [7], and BloombergGPT, an LLM trained by Bloomberg for the financial sector [8].
Currently, LLMs of various architectures and parameter scales are emerging in rapid succession. Through evaluations by various LLM capability testing platforms and comparative experiments by many researchers, detailed conclusions have been drawn regarding the advantages and disadvantages of different foundational architectures of LLMs. Recently, research results focused on optimizing the underlying architectures of LLMs have yielded diminishing returns in terms of performance improvement. Therefore, rather than enhancing the performance of LLMs by perfecting or modifying their underlying architectures, exploring how to better unleash the full capabilities of existing models is likely a more important research direction in the current field of LLM research. For foundational LLMs, without altering their underlying architectures, performance on specific downstream tasks can be further enhanced through methods such as unsupervised incremental pre-training, supervised fine-tuning, reinforcement learning based on human feedback, and prompt engineering. Among these, prompt engineering has the lowest requirements for domain-specific datasets and computational power, making it the easiest and fastest method to improve LLM performance in specific domains. Many prompt engineering methods do not require collecting training datasets for specific tasks to update the model parameters; alternatively, they may require only a small amount of data to update a portion of the model’s parameters to achieve effects similar to fine-tuning. Consequently, prompt engineering has become one of the best methods for improving the performance of LLMs on downstream tasks in various vertical applications.
Currently, there is no unified conclusion regarding the impact of prompt example construction methods on LLM performance, as findings from various studies exhibit significant discrepancies. These differences primarily stem from the diversity in experimental conditions and the complexity of evaluating these methods. For example, Brown et al. found that increasing the number of prompt examples enhances the accuracy of the model’s outputs [2], while Min et al. emphasized the importance of example format, noting that omitting any critical component (such as labels) can lead to a performance drop comparable to that of zero-shot prompts [9]. Although these studies offer valuable insights in specific contexts, their findings cannot be universally applied across different LLMs, tasks, and datasets. The conflicting conclusions across various experiments highlight the need for more comprehensive investigations to systematically evaluate the impact of different prompt example construction methods, thus providing clearer guidance for practical applications.
Moreover, existing studies on prompt example construction methods are often limited to a small number of LLMs and datasets, which restricts the generalizability and applicability of their conclusions. For example, many experiments are conducted on models like GPT-3 or similar, with task datasets often confined to specific domains. While these experiments reveal useful insights for particular cases, they do not account for the performance of these methods across a wider range of models and tasks. Therefore, evaluations of prompt example construction methods need to be extended to a broader set of models, particularly those of varying parameter scales, and more diverse task datasets. This broader scope of experimentation would provide a more comprehensive understanding of the strengths and weaknesses of different methods, thus offering more relevant guidance for practical applications across diverse contexts.
Recent research also indicates that smaller-scale models sometimes outperform larger ones when using the same computational resources, especially for certain tasks [3]. For these smaller models, the construction method of prompt examples becomes critical due to their limited internal knowledge base. Prompt examples serve as essential external cues to help the model understand and complete tasks effectively. In contrast, larger-scale models, with their stronger contextual learning abilities, are less reliant on prompt example construction methods for performance improvements. As a result, prompt example construction has a more pronounced effect on the output quality of smaller models. Although some strategies for prompt example construction have been proposed, there is still a lack of universally applicable and effective methods for optimizing the performance of smaller LLMs. This gap underscores the need for future research to focus on designing prompt example construction strategies that can significantly enhance the accuracy and generalization capabilities of smaller LLMs.
Therefore, we selected multiple datasets and multiple LLMs to conduct comparative tests on several commonly used prompt example construction methods, aiming to explore whether there are general patterns or conclusions regarding the impact of these methods on the accuracy of LLM-generated answers. At the same time, we proposed a new prompt example construction method, and conducted comparative tests on the aforementioned datasets and LLMs to verify its effectiveness and generalizability. The main contributions of this paper are as follows:
  • This paper proposes a novel prompt example construction method, called the clustering–semantic similarity prompt example construction method, which innovatively combines clustering and semantic similarity techniques. The method was compared with five other prompt example construction methods across seven datasets and six LLMs; its overall accuracy and stability significantly outperform those of the other methods, effectively improving the accuracy of LLM-generated answers.
  • The paper conducts comparative testing on seven datasets, evaluating the impact of six prompt example construction methods on the answer accuracy of six LLMs. Through an in-depth analysis of the experimental results, the study summarizes key patterns regarding how different prompt example construction methods affect LLM output performance, providing important insights for selecting prompt example construction methods to enhance LLMs in practical applications:
    As the parameter scale of LLMs increases, the impact of different prompt example construction methods on the accuracy of LLM-generated answers diminishes.
    In comparative tests across multiple datasets and LLMs, the semantic similarity, clustering, and clustering–semantic similarity methods performed the best overall, followed by the identical and random methods, with the zero-shot method performing the worst.
    The random method outperformed the identical method overall, indicating that diverse prompt examples lead to more significant improvements in the accuracy of LLM-generated answers.
The remainder of this paper includes: Section 2, a literature review summarizing the current state of prompt example construction methods; Section 3, an introduction to the clustering–semantic similarity prompt example construction method; Section 4, details of the experiments and an in-depth analysis of the experimental results; and Section 5, the conclusion.

2. Literature Review

Through reviewing the current literature on prompt example construction, this section summarizes the main exploration directions into two broad categories: individual features and overall features, each with three sub-directions. Individual features mainly focus on the selection and construction of each example to ensure that each example provides better information to the LLM. Overall features mainly focus on the overall structure of all examples within a prompt, optimizing the overall structure to enhance the accuracy of the LLM’s generated content.

2.1. Individual Features

Regarding the individual features of examples, the focus is on an in-depth study of a single candidate example, which can be divided into three sub-directions: example structure, example content, and example quality.

2.1.1. Example Structure

The example structure direction mainly explores how to optimize the structure of examples to enhance their effectiveness, such as adding or removing certain components of the examples. Wei et al. were the first to propose that providing examples with intermediate reasoning steps as part of the prompts can significantly improve the LLM’s ability to perform complex reasoning tasks [10], thus expanding the structure of the examples into three parts: the question, the reasoning process, and the answer. Min et al. further demonstrated through experiments that the label (i.e., the answer) part of the example is indispensable, and its absence significantly reduces the effectiveness of the prompt [9].

2.1.2. Example Content

The example content direction mainly explores selecting examples with certain features or meeting specific criteria to enhance their effectiveness. For instance, a common approach is to retrieve examples that are semantically similar to the test question to improve the performance of the LLM’s content generation [11,12,13]. However, Zhang et al. found through experiments that examples with diversified content perform better than those selected based on semantic similarity [14]. Additionally, some researchers use information content as an evaluation metric to filter prompt examples [15,16].

2.1.3. Example Quality

The example quality direction examines how the correctness of each example component and their relationships influence effectiveness, such as the reasoning process. Min et al. found that the correctness of answers in examples does not significantly affect LLM accuracy [9]. However, other researchers have disputed this, claiming that correct answers do impact LLM outputs [16,17]. Additionally, Wei et al. showed that reversing labels in examples can disrupt prior knowledge in large-scale models, but smaller models are not affected [18]. In contrast, Pawelczyk et al. found that label reversal still impacts smaller models’ outputs [19].

2.2. Overall Features

As for the overall features of examples, the focus is on an in-depth study of the overall structure of a selected group of examples, which can be divided into three sub-directions: example order, example quantity, and example distribution.

2.2.1. Example Order

The example order direction examines how arranging examples affects performance. It involves sorting examples based on metrics like semantic similarity or information content. Zhao et al. found that example order significantly affects LLM accuracy [20]. Liu et al. found that changing the order of semantically similar examples has a minor impact [11]. Milios et al. observed that larger models are less sensitive to example order [21]. Lu et al. noted that the effectiveness of example orders varies across different models [22].

2.2.2. Example Quantity

The example quantity direction mainly studies the relationship between the number of examples and the output performance of LLMs. Brown et al. found through experiments that an increase in the number of examples is positively correlated with the accuracy of the LLM’s generated answers [2]. Li et al. found that examples obtained through random sampling may have slightly lower average performance compared to zero-shot scenarios [16]. This indicates that, under the influence of other factors in prompt example construction, an increase in the number of examples may not always achieve the desired effect.

2.2.3. Example Distribution

The example distribution direction explores how the distribution of examples and their components affects the accuracy of LLM-generated answers. For instance, Zhang et al. found that using diverse content in prompt examples, rather than using only semantically similar ones, improves performance [14]. Zhang et al. noted that a balanced distribution of labels in examples enhances accuracy [23]. Similarly, Zhao et al. identified a “Majority Label Bias”, where LLM predictions are heavily influenced by imbalanced label distribution in the prompt [20].
Some studies have used ensemble methods that jointly determine suitable examples and an appropriate overall structure for the example group; for instance, Wu et al. introduced an adaptive mechanism to help find a set of examples (including their selection and ordering) that leads to correct predictions, thereby maximizing model performance [24]. Nevertheless, each of the sub-directions above remains foundational to these ensemble methods.
In summary, research on example construction methods shows many contradictory conclusions and complex methods. The main reason for this is that these methods have been validated on a limited number of datasets and models, lacking comparative testing across multiple models and task datasets. This has resulted in varying findings about the generalizability of different example construction methods for LLM generation.

3. Method

This section provides a detailed introduction to the prompt example construction methods proposed in this paper, including the definition of the target problem and the detailed principles and implementation steps of the method.

3.1. Problem Formulation

The goal of prompt example construction methods is to construct or select the optimal set of prompt examples for a specific problem to enhance the quality of the LLM’s responses. Specifically, consider a downstream task $t \in T$ (where $T$ is the set of downstream tasks) with training set $D_{\mathrm{train}}^{t} = \{(q_{i,\mathrm{train}}^{t}, a_{i,\mathrm{train}}^{t})\}\ (1 \le i \le n)$ and test set $D_{\mathrm{test}}^{t} = \{(q_{i,\mathrm{test}}^{t}, a_{i,\mathrm{test}}^{t})\}\ (1 \le i \le m)$, where $q$ denotes the text or question of an example and $a$ denotes its label or answer. For the current question $q_{i,\mathrm{test}}^{t}$ to be answered, the optimization goal of the prompt example construction method is to find the optimal set of prompt examples $S^{*}$:
$$S^{*} = \arg\max_{S \subseteq D_{\mathrm{train}}^{t}} Q\left(\mathrm{LLM}\left(q_{i,\mathrm{test}}^{t} \mid S\right)\right)$$
where $S$ denotes a candidate set of prompt examples, $\mathrm{LLM}(q \mid S)$ denotes the output generated by the LLM given the input question $q$ and the prompt example set $S$, and $Q$ denotes the evaluation function applied to the LLM’s output.
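To make the notation concrete, the following minimal Python sketch (our illustration, not code from the paper) treats an example as a question–answer pair and implements the evaluation function $Q$ as exact-match accuracy over the test set; the `llm_answer` callable is a hypothetical stand-in for querying an LLM with a prompt built from the example set $S$.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Example:
    question: str  # q: the text or question of the example
    answer: str    # a: the label or answer of the example

def build_prompt(examples: List[Example], question: str) -> str:
    """Concatenate the prompt example set S with the current test question."""
    shots = "\n".join(f"Q: {e.question}\nA: {e.answer}" for e in examples)
    return f"{shots}\nQ: {question}\nA:"

def evaluate_Q(llm_answer: Callable[[str], str],
               examples: List[Example],
               test_set: List[Example]) -> float:
    """Evaluation function Q: exact-match accuracy of LLM(q | S) over the test set."""
    correct = sum(
        llm_answer(build_prompt(examples, t.question)).strip().lower()
        == t.answer.strip().lower()
        for t in test_set
    )
    return correct / len(test_set)
```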

3.2. Clustering–Semantic Similarity Prompt Example Construction Method

The method proposed in this paper for constructing prompt examples based on clustering and semantic similarity consists of three main parts: input question processing, sample clustering, and prompt example retrieval (as shown in Figure 1). This method converts the user’s input question into a semantic vector and then, through cluster analysis and vector search, selects the samples most relevant to the input question as prompt examples. The following is a detailed description of the process:

3.2.1. Input Question Processing

Firstly, the text of the user’s input question $q_{i,\mathrm{test}}$ is received and vectorized using the BGE embedding model (bge-large-en-v1.5) to generate a 1024-dimensional semantic vector $v_{q_{i,\mathrm{test}}}$. The advantage of this is that the semantic information of the question text is represented as a high-dimensional vector, which facilitates subsequent vector-based similarity searches. Since the context length limit of the bge-large-en-v1.5 model is 512 tokens, question texts $q_{i,\mathrm{test}}$ that exceed this length are chunked. Assuming the chunked text is $\{q_1, q_2, \ldots, q_n\}$ and the semantic vector generated for each chunk is $\{v_{q_1}, v_{q_2}, \ldots, v_{q_n}\}$, the final semantic vector of the question text, $v_{q_{i,\mathrm{test}}}$, is obtained through average pooling:
$$v_{q_{i,\mathrm{test}}} = \frac{1}{n} \sum_{k=1}^{n} v_{q_k}$$
This method of average pooling considers the semantic information of each chunk comprehensively, avoiding the loss of important information.
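As a sketch of this step (our own illustration, not the authors’ code), the snippet below loads bge-large-en-v1.5 through the sentence-transformers library to embed a question, using simple character-based chunking and average pooling as an approximation of the token-level 512-limit handling described above; the chunk size in characters is an assumed stand-in.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# bge-large-en-v1.5 produces 1024-dimensional embeddings with a 512-token context limit.
embedder = SentenceTransformer("BAAI/bge-large-en-v1.5")

def embed_question(text: str, max_chars: int = 1500) -> np.ndarray:
    """Vectorize a question; long texts are chunked and the chunk vectors average-pooled.

    Chunking here is by characters for simplicity; the paper's limit is 512 tokens.
    """
    chunks = [text[i:i + max_chars] for i in range(0, len(text), max_chars)] or [text]
    vectors = embedder.encode(chunks, normalize_embeddings=True)  # shape: (n_chunks, 1024)
    return np.asarray(vectors).mean(axis=0)  # average pooling over chunks

v_q = embed_question("The film is a masterclass in slow-burn tension.")
print(v_q.shape)  # (1024,)
```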

3.2.2. Sample Clustering

Next, the bge-large-en-v1.5 model is used to vectorize the sample texts in the task training set $D_{\mathrm{train}}^{t}$, converting each sample $s_j \in D_{\mathrm{train}}^{t}$ into a semantic vector $v_{s_j}$. This allows all sample semantic information to be represented in a unified vector form, facilitating subsequent clustering. Then, the K-means clustering algorithm is applied to cluster the semantic vectors of all samples, dividing them into $N$ clusters $\{C_1, C_2, \ldots, C_N\}$, where each cluster $C_i$ contains samples with similar semantics:
$$\{C_1, C_2, \ldots, C_N\} = \mathrm{K\text{-}means}(\{v_{s_j}\})$$
The number of clusters $N$ is equal to the number of prompt examples that need to be provided.
The advantage of K-means clustering is that it aggregates samples with similar semantics, reducing the search space in the subsequent retrieval process and improving retrieval efficiency and effectiveness. For samples with excessively long text, they are still processed with chunking and average pooling before clustering, as described above.
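A minimal sketch of the clustering step follows (our illustration, with scikit-learn’s KMeans standing in for whatever K-means implementation the authors used); `embed_question` is the embedding helper sketched above, and the toy `train_texts` list is a placeholder for the real training-set sample texts.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_training_set(train_texts, n_clusters: int = 4, seed: int = 0):
    """Embed all training samples and partition them into N semantic clusters.

    Returns the sample vectors and, for each sample, the index of its cluster.
    """
    vectors = np.stack([embed_question(t) for t in train_texts])  # (n_samples, 1024)
    kmeans = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10)
    labels = kmeans.fit_predict(vectors)  # cluster assignment C_1 ... C_N
    return vectors, labels

# Toy stand-in texts; N is set to the number of prompt examples to provide (4-shot => 4 clusters).
train_texts = ["great movie", "terrible plot", "loved the acting", "boring and slow",
               "a triumph of style", "lifeless and dull", "wonderfully acted", "a waste of time"]
sample_vectors, cluster_labels = cluster_training_set(train_texts, n_clusters=4)
```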

3.2.3. Prompt Example Retrieval

Finally, the semantic vector $v_{q_{i,\mathrm{test}}}$ of the user’s input question is used to search for similar samples within each cluster $C_i$ in turn. All samples in the training set are stored in a Milvus vector database, and the IVF_FLAT index is used for the vector search; the semantic similarity $\mathrm{Sim}(v_{q_{i,\mathrm{test}}}, v_{s_j})$ is calculated as
$$\mathrm{Sim}(v_{q_{i,\mathrm{test}}}, v_{s_j}) = \frac{v_{q_{i,\mathrm{test}}} \cdot v_{s_j}}{\lVert v_{q_{i,\mathrm{test}}} \rVert \, \lVert v_{s_j} \rVert}$$
IVF_FLAT is a vector retrieval index that first divides the high-dimensional vectors into multiple clusters (subspaces) and then performs a linear (flat) scan within each cluster during search. This improves search efficiency by reducing the search space while maintaining accuracy. The sample $s_j$ with the highest semantic similarity in each cluster is selected as a prompt example, and together these form the prompt example set $S_{\mathrm{prompt}}$:
$$S_{\mathrm{prompt}} = \{s_1, s_2, \ldots, s_N\}$$
This retrieval method based on semantic similarity ensures that the selected prompt examples are highly semantically relevant to the user’s input question, thereby enhancing the effectiveness of the prompt examples. Ultimately, these prompt examples are sorted from highest to lowest based on their semantic similarity to the user’s input question, forming the final prompt example set, which is used to support the model’s answer generation.
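The following sketch is our own illustration of this retrieval step, using a brute-force cosine-similarity search in NumPy rather than the Milvus IVF_FLAT index mentioned above; `v_q`, `sample_vectors`, `cluster_labels`, and `train_texts` are the objects produced in the sketches of the previous two subsections.

```python
import numpy as np

def select_prompt_examples(v_query: np.ndarray,
                           sample_vectors: np.ndarray,
                           cluster_labels: np.ndarray,
                           train_texts):
    """Pick the most query-similar sample from every cluster, then sort by similarity."""
    # Cosine similarity of the query against every training-sample vector.
    sims = sample_vectors @ v_query / (
        np.linalg.norm(sample_vectors, axis=1) * np.linalg.norm(v_query) + 1e-12
    )
    chosen = []
    for c in np.unique(cluster_labels):
        idx_in_cluster = np.where(cluster_labels == c)[0]
        best = idx_in_cluster[np.argmax(sims[idx_in_cluster])]
        chosen.append((sims[best], train_texts[best]))
    # Final prompt example set, ordered from most to least similar to the query.
    chosen.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in chosen]

prompt_examples = select_prompt_examples(v_q, sample_vectors, cluster_labels, train_texts)
```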
This method is a novel prompt example construction approach proposed in this paper, combining the semantic similarity-based and clustering-based construction methods. On one hand, by ensuring that each prompt example comes from a different cluster, the method guarantees the diversity of the prompt example set, avoiding the narrow distribution that a purely semantic similarity-based method may produce. On the other hand, by selecting within each cluster the sample most semantically similar to the test question, the method ensures that every prompt example remains relevant to the test question, addressing the tendency of clustering-based methods to ignore the relevance between the prompt example set and the test question. The resulting prompt example set therefore balances diversity and relevance, better assisting LLMs in answering the current test question.

4. Experiment and Performance Analysis

This section will present the experimental setup (including datasets, LLM, baseline methods, etc.) as well as a detailed analysis of the experimental results.

4.1. Dataset Description and Preprocessing

To ensure the generalizability and credibility of the experimental conclusions, this paper collects multiple datasets for testing and evaluation, including SST2 [25], SST5 [25], MR [26], Amazon [27], AgNews [28], TREC [29,30], and DBPedia [28,31] (See Table 1 for details).
All of these datasets are for text classification tasks: SST2, SST5, MR, and Amazon focus on sentiment classification, while AgNews, TREC, and DBPedia are used for topic or ontology classification. These are fundamental natural language processing tasks that all LLMs, regardless of parameter scale or special fine-tuning, should be able to handle effectively, which is why these datasets are chosen for evaluation in this paper.
Due to differences in the original data formats, sample sizes, label types, and data columns among these datasets, they cannot be directly input into LLMs for uniform testing. Therefore, before conducting the formal experiments, it is necessary to preprocess all the original datasets to standardize their formats. Further details of the datasets and prompt design can be found in Appendix A.

4.2. LLMs for Comparison

To ensure the generalizability of the final experimental conclusions, this paper selects multiple representative LLMs from both domestic and international sources, including the LLaMA2 Chat series models developed by Meta Platform Inc. [32] (LLaMA2-70B-Chat, LLaMA2-13B-Chat, LLaMA2-7B-Chat), the Baichuan2 Chat series models developed by Baichuan Intelligence Inc. [33] (Baichuan2-13B-Chat, Baichuan2-7B-Chat), and the ChatGLM3-6B model developed by Beijing Knowledge Atlas Technology Co., Ltd. [4,34] (for detailed information, see Table 2).

4.3. Prompt Example Construction Methods for Comparison

In the experimental section, we will evaluate the following six types of prompt example construction methods, including the clustering–semantic similarity prompt example construction method proposed in this paper.
Zero-Shot Method. This method does not provide the LLM with any prompt examples. Instead, it directly inputs the description and requirements of the downstream task along with the test questions into the LLM. This method was one of the initial approaches used in prompt engineering. In the experiments conducted in this paper, this method mainly serves as a baseline.
Random Prompt Example Construction Method. This method selects several non-repeating examples from the same domain as the test question to serve as prompt examples. It was one of the initial approaches in prompt engineering. Its main function is to provide the LLM with concrete examples, helping the model understand the format of the questions and answers. For the current question $q_{i,\mathrm{test}}^{t}$, this method randomly selects $k$ examples from the training set $D_{\mathrm{train}}^{t}$ without repetition to form the prompt example subset $S_{k}^{\mathrm{random}} = \{s_1, s_2, \ldots, s_k\}$.
Identical Prompt Example Construction Method. This method involves randomly selecting a single example from the training set of a given downstream task and repeating it several times in the prompt template. This method, not previously used in prompt engineering research, is designed to explore whether the main factor affecting the large language model’s output accuracy is the content of the prompt examples or their quantity. For the current question $q_{i,\mathrm{test}}^{t}$, this method randomly selects one prompt example $s = s_{j,\mathrm{train}}^{t}$ from the training set $D_{\mathrm{train}}^{t}$ and replicates it $k$ times to form the prompt example subset $S_{k}^{\mathrm{same}} = \{s, s, \ldots, s\}$.
Semantic Similarity Prompt Example Construction Method. This method selects the most semantically similar examples from the same domain’s dataset. For the current question $q_{i,\mathrm{test}}^{t}$, it uses the embedding model bge-large-en-v1.5 to vectorize the question text and then computes the cosine similarity between this vector and the vectors of the question texts in the training set $D_{\mathrm{train}}^{t}$. The top $k$ examples with the highest cosine similarity scores form the prompt example subset $S_{k}^{\mathrm{similarity}} = \{s_1, s_2, \ldots, s_k\}$, arranged in descending order of similarity.
Clustering Prompt Example Construction Method. This method clusters the training set samples and selects the cluster centers as prompt examples. For the current question $q_{i,\mathrm{test}}^{t}$, the method applies the K-means clustering algorithm to the training set $D_{\mathrm{train}}^{t}$, resulting in $k$ clusters $\{C_1, C_2, \ldots, C_k\}$. Distances are measured by the cosine similarity between question text vectors generated by the embedding model bge-large-en-v1.5. The sample closest to each cluster center forms the prompt example subset $S_{k}^{\mathrm{cluster}} = \{s_1, s_2, \ldots, s_k\}$.
Clustering–Semantic Similarity Prompt Example Construction Method. This method, proposed in this paper, clusters the training set samples and then selects the most semantically similar sample to the test question from each cluster. It combines the advantages of both the semantic similarity and clustering methods.
In the experiment, for each LLM, four examples (i.e., 4-shot) will be provided for each question using the aforementioned prompt example construction methods (excluding the zero-shot method). For methods utilizing clustering algorithms, the number of clusters $k$ is set to 4. Additionally, we randomly select 500 unique samples from the test set of each dataset, repeat the process three times, and take the average of the three batches as the final result to avoid experimental randomness.
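For comparison, the two simplest baselines can be sketched in a few lines (our illustration; `Example` and `build_prompt` are the hypothetical helpers from the sketch in Section 3.1): the random method draws k distinct examples, while the identical method repeats one randomly drawn example k times.

```python
import random
from typing import List

def random_examples(train_set: List[Example], k: int = 4, seed: int = 0) -> List[Example]:
    """Random prompt example construction: k distinct examples drawn without repetition."""
    rng = random.Random(seed)
    return rng.sample(train_set, k)

def identical_examples(train_set: List[Example], k: int = 4, seed: int = 0) -> List[Example]:
    """Identical prompt example construction: one random example repeated k times."""
    rng = random.Random(seed)
    s = rng.choice(train_set)
    return [s] * k

# 4-shot prompt for a test question, e.g. with the random baseline:
# prompt = build_prompt(random_examples(train_set, k=4), test_question)
```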

4.4. Performance Analysis

This section presents and analyzes all the experimental results. Table 3, Table 4, Table 5, Table 6, Table 7 and Table 8 show the test results of different LLMs on various datasets, including the performance of each prompt example construction method and the average values (AVG column) across all datasets. The accuracy of each method on each dataset is the average of three batches of data to ensure stability and reliability. Some results lack a specific average accuracy because the number of valid outputs was insufficient (fewer than 500), primarily for the LLaMA2-7B-Chat model; this does not significantly affect the overall experimental results.

4.4.1. LLaMA2 Chat Series Results

Table 3, Table 4 and Table 5 present the results of the LLaMA2 Chat series models (LLaMA2-7B-Chat, LLaMA2-13B-Chat, LLaMA2-70B-Chat). Due to the shorter context length and weaker instruction-following ability of the LLaMA2-7B-Chat model, there were many missing values in Table 3. In contrast, the LLaMA2-70B-Chat and LLaMA2-13B-Chat models were more stable. As the parameter scale of the LLaMA2 Chat models increased, the impact of prompt example construction methods on the accuracy of generated answers decreased. The accuracy difference between the best and worst methods was 24.54% for the LLaMA2-7B-Chat model, 11.17% for the LLaMA2-13B-Chat model, and only 6.71% for the LLaMA2-70B-Chat model. This indicates that larger models (70B and above) rely far less on prompt example construction methods for most tasks.
Specifically, the zero-shot method performed the worst in the LLaMA2-7B-Chat model, confirming the necessity of prompt examples. The identical and random prompt methods performed moderately, but the random method was superior, showing that the model could learn more from diverse prompt examples. The semantic similarity, clustering, and clustering–semantic similarity methods performed the best, with the clustering–semantic similarity method excelling in both accuracy and stability.

4.4.2. Baichuan2 Chat Series and ChatGLM3-6B Results

Table 6, Table 7 and Table 8 display the results of the Baichuan2-13B-Chat, Baichuan2-7B-Chat, and ChatGLM3-6B models. Larger models generally showed less dependence on prompt example construction methods: the accuracy differences between the best and worst methods were 26.74% for ChatGLM3-6B, 14.60% for Baichuan2-7B-Chat, and 17.80% for Baichuan2-13B-Chat.
In the Baichuan2-7B-Chat model, the average accuracy of the identical prompt method was higher than the random prompt method, mainly because the identical prompt method lacked results for the AgNews dataset, resulting in a higher average accuracy. Excluding the AgNews dataset, the accuracy of the random prompt method was close to the identical prompt method.
Overall, these results indicate that as the model parameter scale increases, the impact of prompt example construction methods on model performance decreases. However, for smaller models, optimizing prompt example construction methods remains crucial.

4.4.3. Overall Analysis

Figure 2 uses bar charts to display the average accuracy of each prompt example construction method on different LLMs, allowing a clear comparison of their impact on answer accuracy. It shows that the clustering–semantic similarity method outperformed the other methods on every model except LLaMA2-70B-Chat. The performance of the six prompt example construction methods formed three distinct tiers across all models, consistent with the conclusions drawn from Table 3, Table 4, Table 5, Table 6, Table 7 and Table 8.
Specifically:
  • Zero-shot method: Overall the worst performer; its poor performance confirms that prompt examples can significantly enhance the accuracy of LLMs. These models indeed have contextual learning abilities, learning from the provided examples how to solve similar problems and how to standardize output formats.
  • Identical and random prompt methods: Performed better than the zero-shot method, with the random method outperforming the identical method, indicating that models benefit more from diverse prompt examples. Although the identical prompt method is straightforward, its repetition of a single example provides insufficient information, making it less effective than the random prompt method.
  • Semantic similarity, clustering, and clustering–semantic similarity methods: Performed the best, particularly the clustering–semantic similarity method, which showed more stable performance and the highest average accuracy across different datasets. These methods consider the semantic information and diversity of prompt examples, providing richer reference information and improving model performance.
The analysis of Figure 2 further underscores the importance of optimizing prompt example construction methods. The clustering–semantic similarity method consistently performed well across most models, significantly enhancing accuracy, especially in smaller models. This demonstrates that through the careful design of prompt examples, the application effectiveness of LLMs can be significantly improved, particularly when resources are limited. Additionally, as the parameter scale of LLMs increases, the impact of different prompt example construction methods on the accuracy of generated answers decreases. This is primarily because larger models have stronger fundamental capabilities and thus rely less on prompt examples.

5. Conclusions

With the advent of large language models (LLMs) such as OpenAI’s ChatGPT, enhancing model capabilities in specific domains has become a critical challenge. Existing prompt example construction methods are numerous but lack universality, and current research is often based on limited datasets, failing to explore the broader impact of these methods across various tasks. Therefore, this paper proposes a prompt example construction method based on clustering and semantic similarity, which combines clustering algorithms with semantic similarity techniques to significantly improve the quality of prompt examples. Comparative tests on six LLMs and seven datasets demonstrate that our method significantly outperforms five other common methods in overall accuracy and stability, effectively enhancing the accuracy of LLM-generated answers.
Additionally, this paper explores the impact of different prompt example construction methods on LLM output performance, revealing patterns in how they influence accuracy. The study shows that as the parameter scale of the models increases, the impact of different construction methods on the accuracy of generated answers diminishes. Among the datasets and models compared, the semantic similarity, clustering, and clustering–semantic similarity methods performed best, and the random method outperformed the identical method, indicating that more diverse prompt examples lead to larger improvements in the accuracy of LLM-generated answers. In practical applications, therefore, selecting high-quality prompt examples can effectively enhance the output performance of LLMs, especially for smaller-scale models, and more diverse prompt example sets can further amplify this effect.
However, there are some limitations in this study. While the proposed method performs excellently across multiple models and datasets, its effectiveness in unconventional or extreme task scenarios still needs further validation. Furthermore, the current experimental datasets and model scales may not cover all practical application scenarios, so future research should consider a broader range of models and datasets to further validate the generalizability and applicability of the proposed method.

Author Contributions

Conceptualization, D.C. and J.W.; methodology, D.C. and J.W.; validation, D.C. and J.W.; formal analysis, D.C.; investigation, D.C.; resources, D.C.; data curation, D.C.; writing—original draft preparation, D.C.; writing—review and editing, J.W.; visualization, D.C.; supervision, J.W.; project administration, J.W.; funding acquisition, J.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China [grant number 72171008]; and the National Civil Aircraft Project Foundation of China [grant number MJZ2-3N21].

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

The data preprocessing steps are as follows:
  • The training set, test set, and validation set (if applicable) from the original dataset are merged, and the entire dataset is randomly shuffled before being split again to ensure uniformity.
  • The columns of the original dataset are inspected, and based on their structure, columns are either merged or split to form two standardized columns: text and label_text. For example, if the original dataset separates titles and body text, they are combined into a single text column for consistency in later experiments.
  • Each sample is processed by adjusting the original label values according to a designed mapping, ensuring large language models can better recognize the data. The text length is also checked, and samples with text that is too long or too short are discarded based on predefined limits.
  • All valid samples are consolidated into a new dataset, from which 2000 samples are drawn as the training set and 1000 as the test set. If the processed dataset contains fewer than 3000 samples, it is split into training and test sets in a 2:1 ratio. The final dataset is saved as a JSON file at a specified path.
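A condensed sketch of this preprocessing pipeline is shown below (our own illustration, written with pandas as assumed tooling; the label mapping, length limits, and the assumption that each split already exposes text and label columns are placeholders):

```python
import json
import pandas as pd

def preprocess(splits, label_map, min_len=5, max_len=2000, out_path="dataset.json", seed=0):
    """Merge splits, standardize to text/label_text columns, filter by length, resample.

    `splits` is a list of DataFrames that already expose 'text' and 'label' columns;
    `label_map` maps raw label values to natural-language label_text strings.
    """
    df = pd.concat(splits, ignore_index=True).sample(frac=1.0, random_state=seed)
    df["label_text"] = df["label"].map(label_map)
    df = df[df["text"].str.len().between(min_len, max_len)][["text", "label_text"]]

    if len(df) >= 3000:
        train, test = df.iloc[:2000], df.iloc[2000:3000]
    else:  # small datasets are split into training and test sets at a 2:1 ratio
        cut = len(df) * 2 // 3
        train, test = df.iloc[:cut], df.iloc[cut:]

    with open(out_path, "w", encoding="utf-8") as f:
        json.dump({"train": train.to_dict("records"), "test": test.to_dict("records")}, f)
    return train, test
```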
After preprocessing, the results of all datasets are shown in Table A1.
Table A1. Preprocessed dataset information.
Dataset Name | Label Range | Number of Training Samples | Number of Test Samples
SST2 | positive/negative | 2000 | 1000
SST5 | very positive/positive/neutral/negative/very negative | 2000 | 1000
MR | positive/negative | 2000 | 1000
Amazon | positive/negative | 2000 | 1000
AgNews | World/Sports/Business/Technology | 2000 | 1000
TREC | Abbreviation/Entity/Description/Person/Location/Number | 1000 | 500
DBPedia | Company/School/Artist/Athlete/Politician/Transportation/Building/Nature/Village/Animal/Plant/Album/Film/Book | 2000 | 1000
All datasets have been properly processed, and experiments will be conducted on these processed datasets.
In addition, it is necessary to design general prompt templates. To enable the LLM to understand the specific downstream tasks represented by each dataset, it is crucial to describe the downstream task requirements (problem description, answer format, answer label range, etc.) in natural language to form corresponding prompt templates. These prompt templates should be able to embed any given set of prompt examples and test questions.
Table A2 provides the prompt templates corresponding to all datasets. Each prompt template consists of five parts: role positioning, task description, prompt examples, test question, and input requirements. Role positioning sets an appropriate role for the LLM, allowing it to play this role when answering questions, which helps improve the specificity and accuracy of the answers. The task description includes the requirements and the answer label range of the specific downstream task corresponding to the dataset. Prompt examples are the series of prompt strings produced by the different prompt example construction methods, and the test question is the current question to be answered. Input requirements strictly regulate the output of the LLM.
Table A2. Dataset task prompt template list.
SST2:
You are an expert in sentiment analysis, please classify the sentiment of the sentence into positive/negative.
{examples}
Please answer:
{question}
You only need to output the answer, no additional explanation is needed.
SST5:
You are an expert in sentiment analysis, please classify the sentiment of the sentence into very positive/positive/neutral/negative/very negative.
{examples}
Please answer:
{question}
You only need to output the answer, no additional explanation is needed.
MR:
You are an expert in sentiment analysis, please classify the sentiment of the movie review into positive/negative.
{examples}
Please answer:
{question}
You only need to output the answer, no additional explanation is needed.
Amazon:
You are an expert in sentiment analysis, please classify the sentiment of the product review into positive/negative.
{examples}
Please answer:
{question}
You only need to output the answer, no additional explanation is needed.
AgNews:
You are an expert in topic classification, please classify the topic of the academic news article into world/sports/business/technology.
{examples}
Please answer:
{question}
You only need to output the answer, no additional explanation is needed.
TREC:
You are an expert in topic classification, please classify the topic of the question into abbreviation/entity/description/person/location/number.
{examples}
Please answer:
{question}
You only need to output the answer, no additional explanation is needed.
DBPedia:
You are an expert in ontology classification, please classify the ontology of the sentence into company/school/artist/athlete/politician/transportation/building/nature/village/animal/plant/album/film/book.
{examples}
Please answer:
{question}
You only need to output the answer, no additional explanation is needed.
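To illustrate how the templates in Table A2 are instantiated, the following sketch (our own, not the authors’ code) fills the SST2 template with a set of prompt example strings and a test question; the sentence/answer formatting of each example is an assumption.

```python
SST2_TEMPLATE = (
    "You are an expert in sentiment analysis, please classify the sentiment "
    "of the sentence into positive/negative.\n"
    "{examples}\n"
    "Please answer:\n"
    "{question}\n"
    "You only need to output the answer, no additional explanation is needed."
)

def fill_template(template: str, examples, question: str) -> str:
    """Embed the prompt example strings and the test question into a dataset template."""
    example_block = "\n".join(
        f"Sentence: {q}\nAnswer: {a}" for q, a in examples  # assumed example formatting
    )
    return template.format(examples=example_block, question=f"Sentence: {question}")

prompt = fill_template(
    SST2_TEMPLATE,
    examples=[("A gripping, beautifully shot film.", "positive"),
              ("Tedious from start to finish.", "negative")],
    question="The plot meanders but the performances shine.",
)
```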

References

  1. Zhao, W.X.; Zhou, K.; Li, J.; Tang, T.; Wang, X.; Hou, Y.; Min, Y.; Zhang, B.; Zhang, J.; Dong, Z.; et al. A Survey of Large Language Models. arXiv 2023, arXiv:2303.18223. [Google Scholar]
  2. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. In Proceedings of the Advances in Neural Information Processing Systems, New York, NY, USA, 6–12 December 2020; Volume 33, pp. 1877–1901. [Google Scholar]
  3. Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. LLaMA: Open and Efficient Foundation Language Models. arXiv 2023, arXiv:2302.13971. [Google Scholar]
  4. Du, Z.; Qian, Y.; Liu, X.; Ding, M.; Qiu, J.; Yang, Z.; Tang, J. GLM: General Language Model Pretraining with Autoregressive Blank Infilling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, Dublin, Ireland, 22–27 May 2022; Volume 1: Long Papers, pp. 320–335. [Google Scholar] [CrossRef]
  5. BigScience Workshop; Scao, T.L.; Fan, A.; Akiki, C.; Pavlick, E.; Ilić, S.; Hesslow, D.; Castagné, R.; Luccioni, A.S.; Yvon, F.; et al. BLOOM: A 176B-Parameter Open-Access Multilingual Language Model. arXiv 2023, arXiv:2211.05100. [Google Scholar]
  6. Wu, C.; Yin, S.; Qi, W.; Wang, X.; Tang, Z.; Duan, N. Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models. arXiv 2023, arXiv:2303.04671. [Google Scholar]
  7. Cui, J.; Ning, M.; Li, Z.; Chen, B.; Yan, Y.; Li, H.; Ling, B.; Tian, Y.; Yuan, L. Chatlaw: A Multi-Agent Collaborative Legal Assistant with Knowledge Graph Enhanced Mixture-of-Experts Large Language Model. arXiv 2024, arXiv:2306.16092. [Google Scholar]
  8. Wu, S.; Irsoy, O.; Lu, S.; Dabravolski, V.; Dredze, M.; Gehrmann, S.; Kambadur, P.; Rosenberg, D.; Mann, G. BloombergGPT: A Large Language Model for Finance. arXiv 2023, arXiv:2303.17564. [Google Scholar]
  9. Min, S.; Lyu, X.; Holtzman, A.; Artetxe, M.; Lewis, M.; Hajishirzi, H.; Zettlemoyer, L. Rethinking the Role of Demonstrations: What Makes In-Context Learning Work? In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 11048–11064. [Google Scholar] [CrossRef]
  10. Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Ichter, B.; Xia, F.; Chi, E.; Le, Q.V.; Zhou, D. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 28 November–9 December 2022; Volume 35, pp. 24824–24837. [Google Scholar]
  11. Liu, J.; Shen, D.; Zhang, Y.; Dolan, B.; Carin, L.; Chen, W. What Makes Good In-Context Examples for GPT-3? In Proceedings of the Deep Learning Inside Out (DeeLIO 2022): The 3rd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures, Dublin, Ireland, 27 May 2022; pp. 100–114. [Google Scholar] [CrossRef]
  12. Rubin, O.; Herzig, J.; Berant, J. Learning To Retrieve Prompts for In-Context Learning. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Seattle, WA, USA, 10–15 July 2022; pp. 2655–2671. [Google Scholar] [CrossRef]
  13. Su, H.; Kasai, J.; Wu, C.H.; Shi, W.; Wang, T.; Xin, J.; Zhang, R.; Ostendorf, M.; Zettlemoyer, L.; Smith, N.A.; et al. Selective Annotation Makes Language Models Better Few-Shot Learners. arXiv 2022, arXiv:2209.01975. [Google Scholar]
  14. Zhang, Z.; Zhang, A.; Li, M.; Smola, A. Automatic Chain of Thought Prompting in Large Language Models. arXiv 2022, arXiv:2210.03493. [Google Scholar]
  15. Liu, H.; Wang, Y. Towards Informative Few-Shot Prompt with Maximum Information Gain for In-Context Learning. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, 6–10 December 2023; pp. 15825–15838. [Google Scholar] [CrossRef]
  16. Li, X.; Qiu, X. Finding Support Examples for In-Context Learning. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, 6–10 December 2023; pp. 6219–6235. [Google Scholar] [CrossRef]
  17. Yoo, K.M.; Kim, J.; Kim, H.J.; Cho, H.; Jo, H.; Lee, S.-W.; Lee, S.-G.; Kim, T. Ground-Truth Labels Matter: A Deeper Look into Input-Label Demonstrations. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 2422–2437. [Google Scholar] [CrossRef]
  18. Wei, J.; Wei, J.; Tay, Y.; Tran, D.; Webson, A.; Lu, Y.; Chen, X.; Liu, H.; Huang, D.; Zhou, D.; et al. Larger Language Models Do In-Context Learning Differently. arXiv 2023, arXiv:2303.03846. [Google Scholar]
  19. Pawelczyk, M.; Neel, S.; Lakkaraju, H. In-Context Unlearning: Language Models as Few Shot Unlearners. arXiv 2024, arXiv:2310.07579. [Google Scholar]
  20. Zhao, Z.; Wallace, E.; Feng, S.; Klein, D.; Singh, S. Calibrate Before Use: Improving Few-shot Performance of Language Models. In Proceedings of the 38th International Conference on Machine Learning, Virtual, 18–24 July 2021; PMLR, Proceedings of Machine Learning Research. Volume 139, pp. 12697–12706. [Google Scholar]
  21. Milios, A.; Reddy, S.; Bahdanau, D. In-Context Learning for Text Classification with Many Labels. In Proceedings of the 1st GenBench Workshop on (Benchmarking) Generalisation in NLP, Singapore, 6 December 2023; pp. 173–184. [Google Scholar] [CrossRef]
  22. Lu, Y.; Bartolo, M.; Moore, A.; Riedel, S.; Stenetorp, P. Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, Dublin, Ireland, 22–27 May 2022; Volume 1: Long Papers, pp. 8086–8098. [Google Scholar] [CrossRef]
  23. Zhang, Y.; Feng, S.; Tan, C. Active Example Selection for In-Context Learning. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 9134–9148. [Google Scholar] [CrossRef]
  24. Wu, Z.; Wang, Y.; Ye, J.; Kong, L. Self-Adaptive In-Context Learning: An Information Compression Perspective for In-Context Example Selection and Ordering. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, Toronto, ON, Canada, 9–14 July 2023; Volume 1: Long Papers, pp. 1423–1436. [Google Scholar] [CrossRef]
  25. Socher, R.; Perelygin, A.; Wu, J.; Chuang, J.; Manning, C.D.; Ng, A.; Potts, C. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, WA, USA, 18–21 October 2013; pp. 1631–1642. [Google Scholar]
  26. Pang, B.; Lee, L. Seeing Stars: Exploiting Class Relationships for Sentiment Categorization with Respect to Rating Scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05), Ann Arbor, MI, USA, 25–30 June 2005; pp. 115–124. [Google Scholar] [CrossRef]
  27. McAuley, J.; Leskovec, J. Hidden factors and hidden topics: Understanding rating dimensions with review text. In Proceedings of the 7th ACM Conference on Recommender Systems, New York, NY, USA, 12–16 October 2013; RecSys ’13. pp. 165–172. [Google Scholar] [CrossRef]
  28. Zhang, X.; Zhao, J.; LeCun, Y. Character-level Convolutional Networks for Text Classification. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; Volume 28. [Google Scholar]
  29. Hovy, E.; Gerber, L.; Hermjakob, U.; Lin, C.Y.; Ravichandran, D. Toward semantics-based answer pinpointing. In Proceedings of the First International Conference on Human Language Technology Research, San Diego, CA, USA, 18–21 March 2001; HLT ’01. pp. 1–7. [Google Scholar] [CrossRef]
  30. Li, X.; Roth, D. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics, Taipei, Taiwan, 26–30 August 2002; COLING ’02. Volume 1, pp. 1–7. [Google Scholar] [CrossRef]
  31. Lehmann, J.; Isele, R.; Jakob, M.; Jentzsch, A.; Kontokostas, D.; Mendes, P.N.; Hellmann, S.; Morsey, M.; van Kleef, P.; Auer, S.; et al. DBpedia—A large-scale, multilingual knowledge base extracted from Wikipedia. Semant. Web 2015, 6, 167–195. [Google Scholar] [CrossRef]
  32. Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv 2023, arXiv:2307.09288. [Google Scholar]
  33. Yang, A.; Xiao, B.; Wang, B.; Zhang, B.; Bian, C.; Yin, C.; Lv, C.; Pan, D.; Wang, D.; Yan, D.; et al. Baichuan 2: Open Large-scale Language Models. arXiv 2023, arXiv:2309.10305. [Google Scholar]
  34. Zeng, A.; Liu, X.; Du, Z.; Wang, Z.; Lai, H.; Ding, M.; Yang, Z.; Xu, Y.; Zheng, W.; Xia, X.; et al. GLM-130B: An Open Bilingual Pre-trained Model. arXiv 2023, arXiv:2210.02414. [Google Scholar]
Figure 1. Framework of clustering–semantic similarity prompt example construction method.
Figure 2. Overall results.
Table 1. Information on evaluation datasets.
Dataset Name | Task Type | Number of Classes | Training Set Size | Validation Set Size
SST2 | Text Sentiment Classification | 2 | 6920 | 872
SST5 | Text Sentiment Classification | 5 | 8544 | 1101
MR | Movie Review Classification | 2 | 10,662 | -
Amazon | Product Review Classification | 2 | 3,600,000 | 400,000
AgNews | News Topic Classification | 4 | 120,000 | 7600
TREC | Question Text Classification | 6 | 5452 | 500
DBPedia | Text Ontology Classification | 14 | 560,000 | 70,000
Table 2. Brief information on LLMs to be tested.
Model Name | Parameters | Type | Development Organization | Release Date
LLaMA2-70B-Chat | 70B | Chat | Meta Platform Inc. | July 2023
LLaMA2-13B-Chat | 13B | Chat | Meta Platform Inc. | July 2023
LLaMA2-7B-Chat | 7B | Chat | Meta Platform Inc. | July 2023
Baichuan2-13B-Chat | 13B | Chat | Baichuan Intelligence Inc. | September 2023
Baichuan2-7B-Chat | 7B | Chat | Baichuan Intelligence Inc. | September 2023
ChatGLM3-6B | 6B | Chat | Beijing Knowledge Atlas Technology Co., Ltd. | October 2023
Table 3. LLaMA2-7B-Chat model experiment results.
Construction Method | SST2 | SST5 | MR | Amazon | AgNews | TREC | DBPedia | AVG
Zero-Shot | 50.36% | - | 33.76% | 58.31% | 45.41% | 15.67% | 63.78% | 44.55%
Identical | 71.01% | - | 55.66% | - | 31.24% | 42.12% | 29.24% | 45.85%
Random | 80.20% | - | - | - | 26.32% | 36.99% | - | 47.84%
Semantic Similarity | 86.30% | - | 78.13% | - | 57.49% | 54.14% | - | 69.01%
Clustering | 84.11% | - | 86.90% | - | 26.42% | 49.21% | - | 61.66%
Clustering–Semantic Similarity | 87.87% | - | 78.24% | - | 57.27% | 52.98% | - | 69.09%
Note: - indicates that the current evaluation item cannot calculate a reliable average accuracy due to the insufficient number of valid results. The best performance in each column is bolded, and the second-best performance is underlined.
Table 4. LLaMA2-13B-Chat model experiment results.
Construction Method | SST2 | SST5 | MR | Amazon | AgNews | TREC | DBPedia | AVG
Zero-Shot | 93.77% | 34.55% | 88.75% | 96.67% | 57.98% | 52.39% | 76.05% | 71.45%
Identical | 89.39% | 39.80% | 87.54% | 94.14% | 53.24% | 62.74% | 77.26% | 72.02%
Random | 94.04% | 43.12% | 92.25% | 95.86% | 48.40% | 57.89% | 85.65% | 73.89%
Semantic Similarity | 92.95% | 42.94% | 88.34% | 96.54% | 64.95% | 71.74% | 87.95% | 77.92%
Clustering | 92.64% | 48.27% | 89.64% | 88.37% | 69.02% | 62.90% | 77.57% | 75.49%
Clustering–Semantic Similarity | 93.84% | 49.87% | 89.99% | 96.43% | 81.98% | 71.28% | 94.96% | 82.62%
Note: The best performance in each column is bolded, and the second-best performance is underlined.
Table 5. LLaMA2-70B-Chat model experiment results.
Construction Method | SST2 | SST5 | MR | Amazon | AgNews | TREC | DBPedia | AVG
Zero-Shot | 94.87% | 53.50% | 90.11% | 97.33% | 74.25% | 63.70% | 85.31% | 79.87%
Identical | 93.75% | 51.96% | 90.63% | 94.77% | 72.20% | 62.42% | 87.28% | 79.00%
Random | 95.86% | 51.03% | 91.70% | 94.72% | 74.93% | 67.53% | 89.95% | 80.82%
Semantic Similarity | 95.67% | 50.10% | 90.85% | 96.65% | 84.47% | 85.04% | 97.19% | 85.71%
Clustering | 95.60% | 52.85% | 91.89% | 96.86% | 74.93% | 76.29% | 81.69% | 81.44%
Clustering–Semantic Similarity | 95.20% | 50.13% | 91.37% | 96.46% | 86.40% | 83.40% | 93.96% | 85.27%
Note: The best performance in each column is bolded, and the second-best performance is underlined.
Table 6. Baichuan2-13B-Chat model experiment results.
Construction Method | SST2 | SST5 | MR | Amazon | AgNews | TREC | DBPedia | AVG
Zero-Shot | 78.17% | 36.64% | 79.03% | 80.61% | - | - | - | 68.61%
Identical | 75.81% | 41.38% | 69.95% | 79.07% | 41.30% | - | - | 61.50%
Random | 70.40% | 45.08% | 79.04% | 87.60% | 72.78% | 50.28% | - | 67.53%
Semantic Similarity | 90.12% | 46.58% | 79.78% | 93.22% | 87.64% | 61.27% | 92.97% | 78.80%
Clustering | 93.32% | 46.77% | 85.61% | 92.69% | 66.87% | 45.42% | 43.60% | 67.76%
Clustering–Semantic Similarity | 90.91% | 48.33% | 83.46% | 92.65% | 86.05% | 62.58% | 91.10% | 79.30%
Note: - indicates that the current evaluation item cannot calculate a reliable average accuracy due to the insufficient number of valid results. The best performance in each column is bolded, and the second-best performance is underlined.
Table 7. Baichuan2-7B-Chat model experiment results.
Construction Method | SST2 | SST5 | MR | Amazon | AgNews | TREC | DBPedia | AVG
Zero-Shot | 80.91% | 34.05% | 69.60% | 88.23% | 39.42% | 15.19% | 72.32% | 57.10%
Identical | 78.67% | 39.32% | 71.59% | 86.70% | - | 40.67% | 78.19% | 65.86%
Random | 65.34% | 29.75% | 70.42% | 92.35% | 38.63% | 38.12% | 84.73% | 59.90%
Semantic Similarity | 79.20% | 36.77% | 71.90% | 92.10% | 74.07% | 48.14% | 89.95% | 70.31%
Clustering | 86.78% | 30.94% | 86.43% | 90.46% | 64.11% | 22.78% | 77.94% | 65.64%
Clustering–Semantic Similarity | 83.20% | 38.12% | 77.72% | 92.33% | 80.80% | 40.91% | 88.83% | 71.70%
Note: - indicates that the current evaluation item cannot calculate a reliable average accuracy due to the insufficient number of valid results. The best performance in each column is bolded, and the second-best performance is underlined.
Table 8. ChatGLM3-6B model experiment results.
Construction Method | SST2 | SST5 | MR | Amazon | AgNews | TREC | DBPedia | AVG
Zero-Shot | 81.22% | 40.23% | 68.02% | 83.18% | 33.21% | 42.71% | 53.00% | 57.37%
Identical | 91.46% | 45.27% | 84.69% | 92.79% | 69.68% | 63.32% | 84.54% | 75.96%
Random | 94.27% | 48.19% | 89.60% | 96.60% | 67.83% | 58.73% | 83.92% | 77.02%
Semantic Similarity | 92.07% | 47.26% | 87.40% | 95.19% | 76.09% | 81.10% | 92.98% | 81.73%
Clustering | 95.47% | 50.23% | 89.80% | 97.06% | 78.49% | 72.25% | 82.87% | 80.88%
Clustering–Semantic Similarity | 94.67% | 49.63% | 89.00% | 95.47% | 82.96% | 83.41% | 93.67% | 84.11%
Note: The best performance in each column is bolded, and the second-best performance is underlined.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
