Leveraging Large Language Models for Clinical Trial Eligibility Criteria Classification

Ray, Sujan; Sarker, Arpita Nath; Chatterjee, Neelakshi; Bhowmik, Kowshik; Dey, Sayantan

doi:10.3390/digital5020012

Open AccessArticle

Leveraging Large Language Models for Clinical Trial Eligibility Criteria Classification

by

Sujan Ray

^1,2,*

,

Arpita Nath Sarker

³,

Neelakshi Chatterjee

⁴

,

Kowshik Bhowmik

¹ and

Sayantan Dey

¹

Computer Science and Engineering (EECS), University of Cincinnati, Cincinnati, OH 45221, USA

²

Business School, University of the Cumberlands, Williamsburg, KY 40769, USA

³

Department of Biology, University of West Georgia, Carrollton, GA 30118, USA

⁴

Department of Biostatistics, Health Informatics and Data Science, University of Cincinnati, Cincinnati, OH 45221, USA

^*

Author to whom correspondence should be addressed.

Digital 2025, 5(2), 12; https://doi.org/10.3390/digital5020012

Submission received: 24 January 2025 / Revised: 23 March 2025 / Accepted: 27 March 2025 / Published: 8 April 2025

(This article belongs to the Topic Artificial Intelligence Models, Tools and Applications)

Download

Browse Figures

Versions Notes

Abstract

:

The advent of transformer technology and large language models (LLMs) has further broadened the already extensive application field of artificial intelligence (AI). A large portion of medical records is stored in text format, such as clinical trial texts. Part of these texts is information regarding eligibility criteria. We aimed to harness the immense capabilities of an LLM by fine-tuning an open-source LLM (Llama-2) to develop a classifier from the clinical trial data. We were interested in investigating whether a fine-tuned LLM could better decide the eligibility criteria from the clinical trial text and compare the results with a more traditional method. Such an investigation can help us determine the extent to which we can rely on text-based applications developed from large language models and possibly open new avenues of application in the medical domain. Our results are comparable to the best-performing methods for this task. Since we used state-of-the-art technology, this research has the potential to open new avenues in the field of LLM application in the healthcare sector.

Keywords:

artificial intelligence; clinical trial; deep neural networks; fine-tuning; large language models; machine learning

1. Introduction

The introduction of transformer architecture has revolutionized the field of Natural Language Processing (NLP) by providing an efficient and effective way to model natural language, leveraging self-attention mechanisms and parallel processing capabilities for improved performance, addressing the challenge of capturing relationships between distant words. This advancement propelled the emergence of language models that, due to their number of parameters and the amount of data they were trained on, are known as large language models (LLMs). These LLMs, such as Llama2 and GPT-4, have shown capabilities for text understanding and generation tasks that closely resemble a human’s.

Such capabilities have sparked the interest of professionals across various fields, including the healthcare community. In healthcare, the appeal of LLMs lies in their ability to process and synthesize vast amounts of data, including, but not limited to, medical literature and patient records. The scale of healthcare data is not only daunting but also complex due to its heterogeneous nature. Healthcare professionals increasingly view LLMs as powerful tools to help manage information overload. The application of LLMs in healthcare has the potential to drive cutting-edge research and improve patient care by analyzing available data, extracting essential insights, and applying them to real-world scenarios [1].

While the application of LLMs in the healthcare domain has the potential to transform the sector, researchers have identified several issues that must be addressed for future advancements in this area. These include patient record privacy, bias in training data, the lack of explainability in responses provided by LLMs, and the significant carbon footprint associated with the high memory requirements for training healthcare-specific LLMs. Although LLMs have demonstrated a high level of generalization, making them suitable for tasks beyond their original training, the high-risk nature of the healthcare domain necessitates fine-tuning pre-trained models to ensure optimal performance on new datasets and downstream tasks.

Parameter-efficient fine-tuning (PEFT) techniques enable the efficient adjustment of large models over various downstream tasks [2]. Large language models, as their name suggests, comprise a huge number of parameters ranging in the billions. Fully fine-tuning a pre-trained model for a specific domain and task not only requires significant computational power but mayalso lead to “catastrophic forgetting” where the model forgets the information it was previously trained on while acquiring new knowledge [3]. PEFT allows for the adjustment of the parameters of a pre-trained model to adapt it to a specific task or domain, at the same time minimizing the number of additional parameters introduced to the base model and the necessary computational resources. PEFT techniques, such as Low-Rank Adaptation (LoRA) and Quantized Low-Rank Adaptation (QLoRA), also enable modularity by allowing fine-tuned weights to be stored independently and merged with the base model during deployment.

In this paper, we investigate the applicability of a fine-tuned LLM to a sensitive task in the healthcare domain: predicting the eligibility of cancer patients for clinical trials based on short, free-text statements. For the experiments, we utilized the open-source LLaMA-2 7B chat model and fine-tuned it using QLoRA techniques. We hypothesized that the fine-tuned language model would excel at this task. The results demonstrate that our approach achieved 93% accuracy and 93% for F1-score, both of which are comparable to the machine learning and deep learning methods previously applied to this task. These encouraging results motivate us to propose more high-performing, modular, fine-tuned models for specific high-impact tasks in the healthcare domain, paving the way toward addressing many existing challenges in employing large language models in healthcare. The rest of this article is divided into five sections. Section 2 provides an explanation of the background and related work, followed by a detailed explanation of the methodology in Section 3. The experiment part is discussed in Section 4. Then, this paper discusses the results in Section 5. Section 6 of this article presents the conclusions of this study.

2. Background and Related Work

Bustos and Pertusa [4] explored the application of deep learning models to predict the eligibility of cancer patients for clinical trials based on short, free-text statements. They addressed the issue of restrictive eligibility criteria that often exclude patients with comorbidities, prior treatments, or older age, which limits the applicability of trial results. To address this, the authors created a dataset from public cancer trial protocols over 18 years and used deep neural networks (DNNs) to classify inclusion and exclusion criteria. Training on a dataset of 6 million short clinical texts, they tested several classifiers, finding that convolutional neural networks (CNNs) and k-nearest neighbor (KNN) achieved over 90% accuracy in predicting eligibility. Specifically, CNNs reached 91% accuracy and KNN 93%, with Cohen’s Kappa scores indicating nearly perfect agreement. The models showed strong generalization to unseen datasets and could cluster related medical terms, demonstrating promise for clinical decision support systems. The research highlights the potential of AI to enhance clinical trial-matching and improve patient outcomes by better navigating complex eligibility criteria.

Jasmir et al. [5] focused on enhancing the performance of k-nearest neighbor (KNN) to classify clinical trial text data. The study explored the integration of a fine-grained algorithm (FGA) with KNN to improve computational efficiency, especially with large datasets of clinical trials. The findings indicate that incorporating FGA significantly boosted the performance of KNN. For example, the computational time for KNN dropped from 388,274 s to 260,641 s for a dataset of 1 million samples. KNN paired with FGA showed better precision, recall, and F1 score than KNN alone. However, when the FGA was applied to another supervised learning model, SVM (support vector machine), the results were not as favorable, with SVM’s performance decreasing compared to its base model without FGA. This suggests that while FGA effectively improves KNN’s performance, it may only sometimes apply to other supervised learning models. Overall, the study emphasizes the potential to combine FGA with KNN to achieve more efficient and scalable classification, particularly in clinical trials where text data are extensive and complex. Future research could explore integrating FGA with other machine learning techniques or optimizing the FGA further for broader applications.

Menger et al. [6] compared deep learning models with traditional machine learning methods to predict inpatient violence using clinical text records. The study evaluated neural networks (CNN, RNN, LSTM) against support vector machines (SVMs), decision trees, and Naive Bayes on a hospital dataset. The results indicate that deep learning models outperform classical ML approaches in identifying risk factors for inpatient violence, but they require larger training datasets and greater computational resources. The study underscores the importance of feature representation in clinical NLP tasks.

Sutanto and Nayak [7] introduced a fine-grained clustering algorithm to enhance document ranking and grouping in social media analytics. Unlike traditional clustering methods (e.g., k-means, hierarchical clustering), this approach incorporated ranking-based techniques to better distinguish nuanced topics in large datasets. The proposed model was applied to Twitter and Facebook datasets, demonstrating superior performance in topic detection and content recommendation. The study underscores the role of fine-grained clustering in improving information retrieval from dynamic high-volume text streams.

Another study [8] explored the use of large language models (LLMs) for text classification through ensemble approaches. It combined predictions from multiple LLMs to enhance classification accuracy and robustness. This method took advantage of the strengths of various pre-trained models, utilizing their contextual understanding to classify text effectively. LLMs were either fine-tuned or used as feature extractors, contributing to the ensemble’s decision-making process. This strategy improves performance in situations where individual models may face challenges.

Research on LLMs is becoming increasingly important in the legal sector due to their ability to lower costs and risks in legal document review through improved text classification precision. Fine-tuning LLMs has been shown to enhance performance significantly for specialized applications, such as legal text classification. One study [9] assessed fine-tuned LLMs and found that they consistently outperformed standard pre-trained models in classification accuracy at both document and snippet levels. The research also explored snippet-level classification, which could sometimes achieve higher precision than document-level approaches, depending on project needs. These improvements are valuable for legal reviews, where precision translates into cost savings. Traditional models like logistic regression remain relevant, performing competitively at various recall rates alongside fine-tuned LLMs, suggesting that they can complement each other in effective text classification strategies. Additionally, the study identified infrastructure challenges, including the need for substantial GPU resources and time for fine-tuning. By utilizing Hugging Face’s default tuning parameters, the research indicated opportunities for further optimization, such as layer-specific tuning. This focus on snippet-level classification and layer tuning highlights promising directions for future research, emphasizing the potential of fine-tuned LLMs to drive innovation and efficiency in legal text classification.

Another study [10] examined how large language models (LLMs) can be integrated with classical machine learning techniques for classification tasks. In this approach, LLMs served as feature extractors, transforming raw or multimodal input into enriched representations that are then used by traditional supervised classifiers. This hybrid method improved prediction accuracy, particularly for binary classification and transfer learning tasks, where the distribution of the test data differed from that of the training data. The experiments demonstrated that models enhanced with LLMs outperformed those that rely solely on classical machine learning methods.

3. Methodology

Large Language Model: A large language model (LLM) is a form of artificial intelligence (AI) that specializes in recognizing and generating text, along with performing various other tasks. LLMs are trained on extensive datasets, which is why they are referred to as “large”. They are based on machine learning and employ a specific type of neural network called a transformer model [11]. Figure 1 is a thematic illustration of how an LLM is trained and applied in downstream tasks.

LLMs employ a type of machine learning called deep learning to understand how characters, words, and sentences relate to each other. This approach involves the probabilistic analysis of unstructured data, enabling the deep learning model to distinguish between different pieces of content without any human intervention.

LLMs are subsequently refined through processes such as fine-tuning or prompt-tuning, which are designed to tailor the models for specific tasks. These tasks may include interpreting questions and generating responses, as well as translating text between different languages. However, existing top-performing methods still have high forgetting rates, lacking intra-domain knowledge extraction and inter-domain common prompting strategies. To address these challenges, Feng et al. [13] proposed CP-Prompt, which introduced a framework that compositionally inserts personalized prompts into multi-head self-attention layers, enhancing the model’s ability to retain and integrate knowledge across domains.

LLaMA 2: LLaMA2, introduced by Meta in 2023, is an open-source LLM [14]. It belongs to the LLaMA (large language model) family, which includes a variety of models with different capacities, ranging from 7 billion to 70 billion parameters [15].

Architecture Overview of LLaMA 2: LLaMA 2 is built upon the transformer architecture, akin to other prominent large language models like GPT-4. Its notable architectural features include the following [14,16]:

1.: Decoder-Only Transformer: LLaMA 2 utilizes a decoder-only architecture specifically optimized for autoregressive text generation.
2.: Multi-Head Self-Attention (MHSA): This component allows the model to focus simultaneously on various parts of the input text.
3.: Rotary Position Embedding (RoPE): This enhancement significantly improves the model’s capability to effectively manage longer context windows.
4.: Layer Normalization (Pre-Norm): This technique is applied before the self-attention and feed-forward layers, contributing to more stable training.
5.: Feed-Forward Networks (FFNs): The model features Gated Linear Units (GLUs), which enhance its expressiveness and computational efficiency.

The number of parameters is a crucial aspect of LLMs as it influences their ability to learn from data and generate responses. Generally, a higher number of parameters allows for more nuanced and complex capabilities in the model. The LLaMA series of models stands out because it offers several variants, each with a different number of parameters, catering to various use cases.

LLaMA2 has been trained on a vast dataset of 2 trillion tokens, offering an impressive context length of 4096 tokens, twice that of its predecessor, LLaMA1. This context length indicates the amount of input text the model can process at any given moment, which is vital for generating coherent and contextually appropriate responses.

LLaMA-2 was selected for this research because of its open-source availability, and the 7B variant was chosen because its size is more manageable compared to the 13B or 70B options [16]. This balance of performance and resource efficiency is critical for demanding application domains. Additionally, the chat variant has been refined using reinforcement learning from human feedback (RLHF), which enhances both its response quality and safety. These features make it particularly well suited for real-world applications where ethical considerations and reliability are paramount. Moreover, unlike some open-source models that come with more restrictive licenses, LLaMA-2 is available for both research and commercial use, subject to an acceptable use policy [14].

Fine-tuning in machine learning involves adapting a pre-trained model for specific tasks. This technique is essential in deep learning, especially for training foundation models used in generative artificial intelligence, allowing for improved performance and specificity [17].

Fine-tuning LLMs entails adapting a pre-trained model to enhance its performance on specific tasks or deepen its understanding of particular domains. This adaptation is achieved by training the model on a new dataset that is closely aligned with the target task or domain. During the fine-tuning process, the weights of the model’s neural network are adjusted, which enables it to make more precise predictions or generate more relevant responses based on the new information [15]. Below are several key concepts commonly associated with LLM fine-tuning:

1.: Supervised Fine-Tuning (SFT)
2.: Reinforcement Learning from Human Feedback (RLHF)
3.: Prompt Templates
4.: Parameter-Efficient Fine-Tuning (PEFT) utilizing methods such as LoRA or QLoRA

Low-Rank Adaptation (LoRA): Full fine-tuning of a model involves retraining all parameters of a given model. It is becoming less feasible as the latest large language models pre-trained on general domain data are consistently getting larger. Low-Rank Adaptation (LoRA) reduces the number trainable parameters by freezing the weights of the pre-trained model and injecting trainable rank decomposition matrices into the layers of the transformer architecture [18]. This reduces the memory footprint required in adapting the pre-trained model for downstream tasks in particular domains. The LoRA technique also improves the modularity as the pre-trained model can have different LoRA modules adapted to different tasks. This reduces the storage requirements and enables task switching as these smaller modules can be merged with the frozen weights when deployed. There is also no additional latency during inference when compared to a fully fine-tuned model.

QLoRA is a parameter-efficient fine-tuning technique that extends LoRA by quantizing the precision of the weight parameters of a pre-trained LLM [19]. Usually, the weight parameters of general-purpose, pre-trained models are stored in 32 bit format. QLoRA reduces the memory footprint even further than LoRA by compressing the weight parameter precision and storing them in a 4 bit format. The innovations that QLoRA introduces include a new data type, 4 bit NormalFloat (NF4). This data type is based on a quantization method known as “Quantile Quantization”.

Quantiles can be understood as cutoff points, which, based on the number of bits in a data type, divide the given data into equal parts. Since NF4 uses 4 bits to quantize the weight parameters of a pre-trained neural network, it creates 16 bins (categories) between the range of [−1, 1]. They are normalized and eventually represented by the bin between the range that the normalized value is closest to. The weights of the pre-trained neural network are assumed to follow a zero-centered normal distribution, which ensures that each bin or category will have an equal number of weights of the network. However, outliers present in the input data can greatly impact the quantization process. To reduce the impact of outliers in the quantization process, the researchers propose block-wise quantization, which divides the input tensors into smaller blocks and then quantizes these blocks independently.

Fully fine-tuning large models is not only computationally expensive but also susceptible to “catastrophic forgetting”. In contrast, QLoRA offers a highly efficient alternative that significantly reduces both memory and computational demands while preserving performance. Although methods like P-tuning [20] and prefix-tuning [21] update only a tiny fraction of the model (typically by adjusting prompt embeddings), they still rely on full-precision weights. As a result, even with fewer trainable parameters, the underlying model uses higher-precision arithmetic, which can lead to increased computing costs when scaling to very large models. In QLoRA, however, the pre-trained model’s weights are quantized to 4 bit precision using the NF4 format and kept frozen during fine-tuning, with only a small set of low-rank adapter parameters being updated. Empirical results show that QLoRA achieves performance nearly on par with full 16 bit fine-tuning but at a fraction of the computational cost [19].

4. Experiment

4.1. Dataset

In this sub-section, we introduce the dataset “Clinical Trials on Cancer” from Kaggle. The dataset is centered around the diagnostics related to an interventional clinical trial. Interventional clinical trials are designed to capture the intervention of a new treatment/drug that is being introduced. In an interventional clinical trial, researchers keep a close tab on the observations and measure outcomes to learn about the efficacy of the treatment/drug, with the standard treatment/drug as the baseline. Cancer clinical trials are more restrictive than others due to their correlation with quite a few conditions like comorbidity, side effects, age limits, etc. Interventional cancer clinical trials add to the layers of restriction since such trials also eliminate concomitant treatments.

The dataset contains 6,186,572 labeled clinical statements. All the clinical statements are derived from an interventional cancer clinical trial. The labeled statements were extracted from 49,201 interventional clinical trial protocols on cancer.

The dataset gave us labels of eligibility for a clinical trial on cancer. There were two labels for the data. We had 49,999 data points for label 0 and 49,999 data points for label 1. Label 0 signified eligible candidates and label 1 signified not eligible candidates. In the following sub-section, we describe the implementation of different text analyses to see the most common words used in the study intervention of both labels separately. We also conducted certain data visualization separately for both labels to see the differences in the used words in the study interventions.

4.2. Data Visualization

We started by performing a textual analysis of the data. Creating a list of the most frequently used words in the study intervention, which was segregated into two labels, seemed to add value to our research question. We visualized the differences in the word count by plotting a bar plot for both labels. This was achieved by first creating the frequency table for study interventions of both labels and then taking the intersection of the commonly used words in the study interventions of the labels. Thereafter, we sorted the words with the highest frequencies and plotted the bar plots for the top 30 most frequently used in both labels. Figure 2 shows us the commonly used words in both labeled data along with the differences in the word count.

To gain a deeper understanding of the differences in the word counts for different labels, we tabulated the common words used in the study interventions of both the labels, sorted by the highest differences in word counts between interventions of the two labels. Table 1 gives us the snapshot of the word count differences in tabular format.

4.3. Statistical Tests

We wanted to statistically confirm that there were significant differences between the study interventions of the two different labels of the dataset. We fitted a generalized linear model (GLM). A GLM is an extension of traditional linear regression that allows response variables to follow distributions from the exponential family, such as normal, binomial, and Poisson distributions [22]. In a GLM, the response is typically assumed to have a distribution in the exponential family. In our scenario, we were dealing with word counts, so we assumed that the response followed a Poisson distribution. This is also known as Poisson regression. The model [23] was as follows:

μ = E (y | x) = λ

(1)

λ = exp β^{'} x

(2)

y | x, β \sim P o i s s o n (λ = exp β^{'} x)

(3)

The summary of the model given in Table 2 gives us an idea whether word counts of label 0 were significantly different from word counts of label 1.

In a GLM, for categorical variables, one category is treated as the baseline. Here,

L a b e l_{0}

was treated as the reference level in the model we fitted. The intercept represented the predicted value for the baseline category

L a b e l_{0}

. The negative coefficient for

L a b e l_{1}

indicated that, on average, the predicted values for

L a b e l_{1}

were slightly lower than for

L a b e l_{0}

. The p-value (0.00101) reflected a real difference between

L a b e l_{1}

and

L a b e l_{0}

. This difference was statistically significant, implying a meaningful distinction between the two labels. The significance suggests that the label variable contributed to the model in a meaningful way [24].

4.4. Google Colab Pro

Google Colab Pro is a premium version of Google Colab [25], a cloud-based platform developed by Google Research for Python programming and machine learning. It provides enhanced resources, including powerful NVIDIA GPUs (like Tesla T4), up to 32 GB of RAM, and session durations of up to 24 h.

With priority resource allocation, users experience reduced wait times, making it suitable for demanding tasks such as training deep learning models. Colab Pro also allows background execution, enabling notebooks to run even after the browser is closed, and includes Type-C quick charging for compatible devices [25].

However, resource availability may be limited during peak times, and privacy concerns can arise with sensitive datasets. Overall, Google Colab Pro is a cost-effective solution for researchers and developers seeking scalable cloud infrastructure for AI and machine learning projects [26].

The installed version of Python is 3.10.0, which introduces several important features. One of the key additions is structural pattern matching, allowing for more expressive and readable code when handling various data structures. Additionally, Python 3.10 improves error messages, making it easier for developers to identify and resolve issues. This version also includes performance optimizations and enhancements to type hinting, such as the new TypeGuard feature. In general, these improvements make Python more powerful and user-friendly for developers working on complex applications [27].

4.5. Data Transformation

As the base model, we selected the LLaMA-2 7B chat model, which expects input data to be in a conversational format. To adapt the model for classifying eligibility criteria in cancer clinical trials, we developed a data transformation pipeline that reformatted the raw data into this expected format while incorporating task-specific context through a system message.

The original dataset consisted of two fields: Study Intervention and Label. To enhance the granularity of the data, we split the Study Intervention column at period signs, resulting in three fields: Study Intervention, Condition, and Label, where the label indicated whether the eligibility criteria fell under “Eligible” (__label__0) or “Not Eligible” (__label__1). The transformation process for fine-tuning involved the following steps:

Field Extraction: The Study Intervention, Condition, and Label fields were extracted and stripped of unnecessary whitespace. Invalid or incomplete entries were discarded to ensure data quality.

Context Incorporation: A predefined system message was appended to each example to provide task-specific context. This message described the problem domain: predicting whether short clinical statements are inclusion or exclusion criteria in interventional cancer clinical trials. Additionally, the message defined the assistant’s role and explained how eligibility criteria were split into the Study Intervention and Condition fields.

Prompt Construction: The Study Intervention and Condition fields were combined into a single prompt for the model. For training samples, the prompt included the corresponding label as the assistant’s response, while, for evaluation samples, only the prompt was provided, enabling inference. The transformed data were formatted in the conversational style expected by LLaMA-2, comprising system («SYS»), human ([INST]), and assistant messages. The dataset was split using stratified sampling to allocate 70% of the data for training and 30% for testing, ensuring balanced label representation. Training examples included the label as the assistant’s response, while testing examples omitted the response to allow the fine-tuned LLM to generate predictions.

This transformation ensured that the data were structured to provide task-specific context while adhering to the input requirements of LLaMA-2. The conversational formatting enhanced the model’s ability to comprehend the task, utilize its pre-trained knowledge, and deliver optimal performance. Algorithm 1 illustrates the data transformation and training process, which will be discussed next.

Algorithm 1 Workflow for fine-tuning LLM on clinical trial data

1:: Input: Raw clinical trial data file
2:: Output: Fine-tuned LLM and evaluation metrics
3:: procedure Preprocessing
4:: raw_data ← LOAD_DATA(“clinical_trials_raw_data.csv”)
5:: for each record in raw_data do
6:: study ← EXTRACT_FIELD(record, “Study Intervention”)
7:: condition ← EXTRACT_FIELD(record, “Condition”)
8:: label ← EXTRACT_FIELD(record, “Label”)
9:: if study is empty or condition is empty then
10:: continue
11:: system_message ← “Context: Interventional cancer clinical trials have strict inclusion and exclusion criteria…”
12:: human_prompt ← FORMAT(“Study Intervention: {study}\nCondition: {condition}”)
13:: if record is for training then
14:: conversation ← FORMAT(“<s>[INST] «SYS»{system_message}«/SYS»\n{human_prompt} [/INST] {label} </s>”)
15:: else
16:: conversation ← FORMAT(“<s>[INST] «SYS»{system_message}«/SYS»\n{human_prompt} [/INST]”)
17:: transformed_data.ADD(conversation, label)
18:: SAVE_DATA(transformed_data, “transformed_data.pkl”)
19:: procedure DataSplitting
20:: $(t r a i n i n g_s e t, t e s t i n g_s e t) \leftarrow SPLIT_DATA (transformed_data, stratify_by = “ Label ”, train_ratio = 0.70)$
21:: procedure FineTuning
22:: base_model ← LOAD_MODEL(“LLaMA-2_base”)
23:: qlora_config ← SET_QLORA_CONFIG(rank=64, alpha=16, dropout=0.1, quant_type=“nf4”, use_4bit=True)
24:: training_params ← SET_TRAINING_PARAMS(batch_size=4, learning_rate=0.0002, epochs=1, scheduler=”cosine”, …)
25:: fine_tuned_model ← FINE_TUNE_MODEL(base_model, training_set, qlora_config, training_params)
26:: SAVE_MODEL(fine_tuned_model, “fine_tuned_model_path”)
27:: procedure Evaluation
28:: for each sample in testing_set do
29:: prompt ← EXTRACT_PROMPT(sample)
30:: prediction ← GENERATE_RESPONSE(fine_tuned_model, prompt)
31:: sample.prediction ← prediction
32:: metrics ← COMPUTE_METRICS(testing_set, true_labels=”Label”, predicted_labels=”prediction”)
33:: PRINT(metrics)
34:: Main:
35:: Preprocessing
36:: Datasplitting
37:: Finetuning
38:: Evaluation

4.6. Training

The open-source Llama 2 7B model (NousResearch/Llama-2-7b-chat-hf, New York, NY, USA) was chosen as the base model. This pre-trained model from Hugging Face was fine-tuned with 700,000 data points with the view to adapting it for a classification task in the domain of interventional cancer clinical trials. The fine-tuning was achieved using QLoRA (Quantized Low-Rank Adaptation), which allows for the fine-tuning of LLMs with limited computational resources by applying low-rank updates to specific model weights. The fine-tuning process utilized the following parameters, which will be discussed under five broad categories:

QLoRA Parameters
−
The lora_r value denoted the rank of the low-rank matrices that were used in the parameter-efficient LoRA updates. While a lower-value could significantly lower the computation cost and memory footprint, choosing a higher value ensured that the model captured more fine-grained relationship in the data, which was crucial for the task and application domain in question. In our experiment, the value for lora_r was chosen as 64.
−
A scaling factor, lora_alpha, which determined how much the low-rank weights contributed to the model updates, was set to 16.
−
The lora_droupout value was set to 0.1 in order to randomly deactivate a fraction of the neurons during the training process. This technique prevented the model from being overfit to the training data.
Quantization Parameters
−
To reduce memory consumption and allow the fine-tuning process to be completed on a consumer-grade GPU, 4 bit quantization was enabled for the base model with use_4bit = True.
−
To strike a balance between memory efficiency and numerical stability, the precision used for computation during training was set using bnb_4bit_compute_dtype = “float16”.
−
To use NF4 (Normal Float 4), bnb_4bit_quant_type was set to “nf4”. By using NF4 for quantization purposes instead of FP4, precision was improved.
−
Nested quantization was disabled in order to simplify training.
Training Hyperparameters
−
Since the training dataset was large, the model was trained for one epoch, which ensured adequate training while managing the computational cost for training.
−
Each GPU processed four samples per batch during training.
−
The gradients were accumulated over two steps to simulate a larger batch size without requiring additional memory.
−
The sequence input sequence length was limited to 512 tokens, while the learning rate was set to 0.0002. The chosen input sequence length was intended to align the requirements of the task as well as the hardware constraints, and the chosen learning rate was meant to balance the need for effective updates and ensure training stability.
−
In order to prevent overfitting, model weights were regularized by penalizing large weights by setting the weight_decay value to 0.001.
−
Gradient clipping was applied to stabilize training and avoid exploding gradients.
−
The cosine learning rate schedule was used for smooth decay, improving training convergence.
−
In order to stabilize early training, the learning rate was gradually increased from 0 to the target rate over the first 3% of steps.

The values of the hyperparameters were chosen based on insights from the existing literature, our understanding of the involved trade-offs, and the practical constraints of our computational resources, rather than from dedicated pilot tests. We selected a value of 64 for lora_r because it struck a reasonable balance between computational efficiency and model expressiveness. A lower value could have risked overfitting to the training data, while a higher value would have increased computational cost and memory usage. Similarly, the batch size was partly limited by the available GPU memory since training was performed on consumer-grade hardware (e.g., using Google Colab Pro with a Tesla T4). A smaller batch size was necessary to manage memory constraints, and we mitigated potential issues by accumulating gradients over multiple steps to simulate a larger effective batch size. Overall, these hyperparameter choices—especially the combination of a moderate rank and a small batch size—kept the fine-tuning memory usage within feasible limits while achieving robust classification metrics.

5. Results and Discussion

Figure 3 illustrates the training loss with respect to the number of training steps during the fine-tuning process. The plot indicates that the training loss declined steeply in the initial phase. This was an indication of rapid convergence as the model adjusted its weights. Beyond approximately 25,000 steps, the loss stabilized and converged, which suggested that the model reached a steady state with minimal fluctuations. The fluctuation at around 62,000th step was an indication of the change in batch size during training. Following this brief fluctuation, the training loss resumed its convergence, demonstrating the model’s ability to adapt and maintain stability.

Figure 4 is a visualization of the learning rate schedule during model fine-tuning, plotted against training steps. To ensure early training stability, the learning rate was made to gradually rise from 0 to a peak value of 0.0002 at approximately 10,000 steps. Following the peak, it decreased smoothly, following a cosine decay schedule, ensuring gradual fine-tuning of the model’s parameters. This approach balanced rapid convergence in early training with stability and precision during later stages.

Table 3 shows the class-wise metrics, while Table 4 shows the overall metrics of our experiments. Table 5 shows that our fine-tuned LLaMA-2 (7B) model with QLoRA achieved an impressive accuracy of 93% and an F1-score of 0.93, showcasing well-balanced precision and recall across various classes. While the KNN model with FGA [5] exhibited slightly superior performance, attaining 94.5% accuracy and a 0.945 F1-score, it tended to be more computationally intensive and less scalable for larger datasets. In comparison, deep neural networks (DNNs) [4] recorded an accuracy of 91% and an F1-score of 0.91. Although they demonstrated strong performance, they fell short of the overall effectiveness seen with both LLaMA-2 and KNN. Importantly, LLaMA-2 excelled in its ability to understand context, capturing nuanced relationships in free-text eligibility criteria, a challenge that traditional numerical similarity-based models like KNN and feature extraction-based DNNs struggled to address.

In terms of computational efficiency, our fine-tuning techniques effectively balanced high accuracy with optimized resource utilization, thereby outperforming the heavy computational demands of KNN and the extensive hyperparameter tuning necessary for DNNs. While KNN is well suited for smaller, structured datasets and DNNs can achieve reliable results with considerable tuning, LLaMA-2 emerges as the most versatile and scalable solution. Its exceptional performance in both accuracy and contextual reasoning underscores the transformative potential of fine-tuned LLMs in healthcare applications. This sets a new standard for clinical text classification tasks and paves the way for more robust, AI-driven healthcare solutions.

We acknowledge that our study relied on a single, publicly available Kaggle dataset, which may inherently contain biases related to demographic or clinical factors. Although the dataset is de-identified and meets standard privacy requirements, these limitations could affect the generalizability of our findings. Future work will involve validation across more diverse datasets and the incorporation of advanced bias mitigation techniques to further enhance the robustness and fairness of our model. We have also reached out to the authors of [4,5] to obtain more information in connection with the details of misclassification results based on each label (label 0 and label 1) since this was not reported in either paper. This information will help us study comparisons with other models that have been used already and the LLaMA2 model studied in detail in this research using some statistical tests such as the chi-square test (two proportions) [28], McNemar’s test, etc.

6. Conclusions

In this study, we explored the effectiveness of fine-tuning a large language model (LLM) for the classification of clinical trial eligibility criteria. LLMs have demonstrated success across a wide range of applications, including those in the medical domain. However, given the high sensitivity of this domain and task, we opted against using prompt-engineering techniques with a general-purpose LLM. While few-shot learning can elicit responses that approximate the desired output format, relying on a black-box architecture for such a critical application was deemed inadvisable.

At the same time, the available dataset, though reasonably large, was insufficient for training an LLM from scratch. Moreover, building a domain-specific LLM from the ground up demands substantial computational resources and infrastructure, which were beyond our means. Consequently, we chose to fine-tune an open-source LLM: the Llama 2-7B chat model. To facilitate the fine-tuning process, 70% of the dataset was transformed into a conversational format compatible with the model’s architecture. We employed the parameter-efficient fine-tuning (PEFT) technique, QLoRA, which significantly reduced computational demands while enabling the development of lightweight modular adapters with minimal storage requirements. During inference, these adapters could be seamlessly integrated with the base LLM.

The results of our fine-tuning are highly promising. The reported metrics suggest that this approach is both reliable and effective, making it a viable alternative to general-purpose LLMs for specialized tasks in the medical domain. Furthermore, the compact size of the adapters underscores the practicality of fine-tuning general-purpose LLMs for various applications. This modular approach allows for the efficient deployment of tailored models across different tasks, leveraging the strengths of open-source LLMs while ensuring domain-specific reliability.

Author Contributions

Conceptualization, S.R., A.N.S. and K.B.; methodology, S.R. and A.N.S.; experiment, S.R., K.B. and S.D.; investigation, A.N.S. and N.C.; resources, S.R. and K.B.; data collection, S.R. and A.N.S.; dataset analysis: N.C. and S.D.; writing—original draft preparation, S.R., A.N.S., N.C., K.B. and S.D.; writing—review and editing, S.R. and K.B.; supervision, S.R. and A.N.S.; project administration, S.R. and N.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset used in this study is freely available for download at https://www.kaggle.com/datasets/auriml/eligibilityforcancerclinicaltrials/data (accessed on 12 March 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

Nazi, Z.A.; Peng, W. Large language models in healthcare and medical domain: A review. Informatics 2024, 11, 57. [Google Scholar] [CrossRef]
Han, Z.; Gao, C.; Liu, J.; Zhang, J.; Zhang, S.Q. Parameter-efficient fine-tuning for large models: A comprehensive survey. arXiv 2024, arXiv:2403.14608. [Google Scholar]
Luo, Y.; Yang, Z.; Meng, F.; Li, Y.; Zhou, J.; Zhang, Y. An empirical study of catastrophic forgetting in large language models during continual fine-tuning. arXiv 2023, arXiv:2308.08747. [Google Scholar]
Bustos, A.; Pertusa, A. Learning Eligibility in Cancer Clinical Trials Using Deep Neural Networks. Appl. Sci. 2018, 8, 1206. [Google Scholar] [CrossRef]
Jasmir, J.; Nurmaini, S.; Tutuko, B. Fine-Grained Algorithm for Improving KNN Computational Performance on Clinical Trials Text Classification. Big Data Cogn. Comput. 2021, 5, 60. [Google Scholar] [CrossRef]
Menger, V.; Scheepers, F.; van Bolhuis, W.; Schoevers, R.; van der Lei, J.; Wang, Y. Comparing Deep Learning and Classical Machine Learning Approaches for Predicting Inpatient Violence Incidents from Clinical Text. Appl. Sci. 2018, 8, 981. [Google Scholar] [CrossRef]
Sutanto, T.; Nayak, R. Fine-grained Document Clustering via Ranking and its Application to Social Media Analytics. Soc. Netw. Anal. Min. 2018, 8, 29. [Google Scholar] [CrossRef]
Abburi, H.; Suesserman, M.; Pudota, N.; Veeramani, B.; Bowen, E.; Bhattacharya, S. Generative AI Text Classification Using Ensemble LLM Approaches. arXiv 2023, arXiv:2309.07755. [Google Scholar]
Wei, F.; Keeling, R.; Huber-Fliflet, N.; Zhang, J.; Dabrowski, A.; Yang, J.; Qin, H. Empirical Study of LLM Fine-Tuning for Text Classification in Legal Document Review. In Proceedings of the 2023 IEEE International Conference on Big Data (BigData), Sorrento, Italy, 15–18 December 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 2786–2792. [Google Scholar] [CrossRef]
Wu, Y.; Wang, Y.; Wang, C.; Zheng, Z. Large Language Model Enhanced Machine Learning Estimators for Classification. arXiv 2024, arXiv:2405.05445. [Google Scholar]
Cloudflare. What Is a Large Language Model? Available online: https://www.cloudflare.com/learning/ai/what-is-large-language-model/ (accessed on 15 July 2024).
CivilsDaily. Generative AI Systems. Available online: https://www.civilsdaily.com/news/crafting-safe-generative-ai-systems/ (accessed on 15 July 2024).
Feng, Y.; Tian, Z.; Zhu, Y.; Han, Z.; Luo, H.; Zhang, G.; Song, M. CP-Prompt: Composition-Based Cross-modal Prompting for Domain-Incremental Continual Learning. arXiv 2024, arXiv:2407.21043. [Google Scholar]
Meta. LLaMA 2: Open Foundation and Fine-Tuned Chat Models. Available online: https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/ (accessed on 15 July 2024).
Run.ai. LLaMA 2 Fine-Tuning Guide. Available online: https://www.run.ai/guides/generative-ai/llama-2-fine-tuning (accessed on 18 September 2024).
Face, H. LLaMA 2-7B Chat Model. Available online: https://huggingface.co/meta-llama/Llama-2-7b-chat-hf (accessed on 13 March 2025).
IBM. Fine-Tuning. Available online: https://www.ibm.com/topics/fine-tuning (accessed on 21 September 2024).
Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. Lora: Low-rank adaptation of large language models. arXiv 2021, arXiv:2106.09685. [Google Scholar]
Dettmers, T.; Pagnoni, A.; Holtzman, A.; Zettlemoyer, L. Qlora: Efficient finetuning of quantized llms. Adv. Neural Inf. Process. Syst. 2024, 36, 14314. [Google Scholar]
Liu, X.; Ji, K.; Fu, Y.; Tam, W.L.; Du, Z.; Yang, Z.; Tang, J. P-tuning v2: Prompt tuning can be comparable to fine-tuning universally across scales and tasks. arXiv 2021, arXiv:2110.07602. [Google Scholar]
Li, X.L.; Liang, P. Prefix-tuning: Optimizing continuous prompts for generation. arXiv 2021, arXiv:2101.00190. [Google Scholar]
McCullagh, P.; Nelder, J.A. Generalized Linear Models. In Monographs on Statistics and Applied Probability, 2nd ed.; Chapman and Hall: London, UK, 1989; Volume 37. [Google Scholar] [CrossRef]
Chen, G. Generalized Linear Models (GLMs). Available online: https://www.sjsu.edu/faculty/guangliang.chen/Math261a/Ch13slides-generalized-linear-models.pdf (accessed on 23 July 2024).
Kianifard, F.; Gallo, P. Poisson regression analysis in clinical research. J. Biopharm. Stat. 1995, 5, 115–129. [Google Scholar] [CrossRef] [PubMed]
Research, G. Google Colab. Available online: https://colab.research.google.com/ (accessed on 12 October 2024).
Cloud, G. Google Colab Pro. Available online: https://colab.research.google.com/signup (accessed on 12 October 2024).
Foundation, P.S. Python 3.10.0. Available online: https://www.python.org/downloads/release/python-3100/ (accessed on 19 October 2024).
BMJ Publishing Group. Chi-Squared Tests. Available online: https://www.bmj.com/about-bmj/resources-readers/publications/statistics-square-one/8-chi-squared-tests (accessed on 24 February 2025).

Figure 1. Large language model [12].

Figure 2. Comparison of word counts for labels 0 and 1 using bar plots.

Figure 3. Training and evaluation loss.

Figure 4. Learning rate schedule.

Table 1. Differences in word counts between label 0 and label 1.

Words	Differences in Counts
two	157
doc	127
no	121
lymphoma	108
prior	108
carcinoma	91
three	70
iv	60
bilirubin	49
cancer	39
melanoma	39
recurrent	35
doxorubicin	35
greater	33
paclitaxel	24
twenty	24
renal	23
neoplasm	23
creatinine	21
diffuse	21
criteria	20
mm	18
allowed	18
phosphate	17
concurrent	17
ten	16
chemotherapy	15
primary	15
iii	14
weeks	14
squamous	14
urine	14
trastuzumab	14
up	14
months	13
thousand	13
drug	13
systemic	13
transplantation	13
negative	12
sarcoma	12
hydrochloride	12
female	12
group	11
malignant	11
related	11
clinical	11
immunoblastic	11
syndrome	11

Table 2. Summary of the GLM model.

Variable	Estimate	Std. Error
Intercept	7.806082	0.002883
$L a b e l_{1}$	−0.013446	0.004091
Variable	z Value	Pr (>\|z\|)
Intercept	2707.726	<2 × 10⁻¹⁶ ***
$L a b e l_{1}$	−3.287	0.00101 **

The asterisks in Table 2 also indicate that the variables are significant. *** suggests p-value less than 0.001 and ** suggests p-value less than 0.01.

Table 3. Class-wise metrics.

Class	Precision	Recall	F1-Score	Support
__label__0	0.92	0.93	0.93	150,000
__label__1	0.93	0.92	0.93	150,000

Table 4. Overall metrics.

Metric	Value
Accuracy	0.92946
Precision	0.9295117192432504
Recall	0.92946
F1-Score	0.9294578764343995

Table 5. Comparison of model performance.

Model	Accuracy (%)	F1-Score
LLaMA-2	93.0	0.93
KNN with FGA [5]	94.5	0.945
DNN (deep neural networks) [4]	91.0	0.91

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Ray, S.; Sarker, A.N.; Chatterjee, N.; Bhowmik, K.; Dey, S. Leveraging Large Language Models for Clinical Trial Eligibility Criteria Classification. Digital 2025, 5, 12. https://doi.org/10.3390/digital5020012

AMA Style

Ray S, Sarker AN, Chatterjee N, Bhowmik K, Dey S. Leveraging Large Language Models for Clinical Trial Eligibility Criteria Classification. Digital. 2025; 5(2):12. https://doi.org/10.3390/digital5020012

Chicago/Turabian Style

Ray, Sujan, Arpita Nath Sarker, Neelakshi Chatterjee, Kowshik Bhowmik, and Sayantan Dey. 2025. "Leveraging Large Language Models for Clinical Trial Eligibility Criteria Classification" Digital 5, no. 2: 12. https://doi.org/10.3390/digital5020012

APA Style

Ray, S., Sarker, A. N., Chatterjee, N., Bhowmik, K., & Dey, S. (2025). Leveraging Large Language Models for Clinical Trial Eligibility Criteria Classification. Digital, 5(2), 12. https://doi.org/10.3390/digital5020012

Article Menu

Leveraging Large Language Models for Clinical Trial Eligibility Criteria Classification

Abstract

1. Introduction

2. Background and Related Work

3. Methodology

4. Experiment

4.1. Dataset

4.2. Data Visualization

4.3. Statistical Tests

4.4. Google Colab Pro

4.5. Data Transformation

4.6. Training

5. Results and Discussion

6. Conclusions

Author Contributions

Funding

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI