Article

Addressing Semantic Variability in Clinical Outcome Reporting Using Large Language Models

by Fatemeh Shah-Mohammadi * and Joseph Finkelstein
Department of Biomedical Informatics, School of Medicine, University of Utah, Salt Lake City, UT 84108, USA
*
Author to whom correspondence should be addressed.
BioMedInformatics 2024, 4(4), 2173-2185; https://doi.org/10.3390/biomedinformatics4040116
Submission received: 18 September 2024 / Revised: 16 October 2024 / Accepted: 21 October 2024 / Published: 28 October 2024

Abstract

Background/Objectives: Clinical trials frequently employ diverse terminologies and definitions to describe similar outcomes, leading to ambiguity and inconsistency in data interpretation. Addressing the variability in clinical outcome reports and integrating semantically similar outcomes is important in healthcare and clinical research. Variability in outcome reporting not only hinders the comparability of clinical trial results but also poses significant challenges in evidence synthesis, meta-analysis, and evidence-based decision-making. Methods: This study investigates variability reduction in outcome measure reporting using rule-based and large language model-based approaches, and aims to mitigate the challenges associated with variability in outcome reporting by comparing the two. The first approach is rule-based and leverages well-known ontologies; the second exploits sentence-bidirectional encoder representations from transformers (SBERT) to identify semantically similar outcomes, along with the Generative Pre-trained Transformer (GPT) to refine the results. Results: The results show that only a relatively low percentage of outcomes could be linked to established ontologies. Analysis of outcomes by word count highlighted the near-absence of ontological linkage for outcomes of three or more words, indicating potential gaps in semantic representation. Conclusions: This study demonstrates that large language models (LLMs) can identify semantically similar outcomes, even those longer than three words, suggesting a crucial role for LLMs in outcome harmonization efforts, potentially reducing redundancy and enhancing data interoperability.

1. Introduction

ClinicalTrials.gov (CTG) is a comprehensive and publicly accessible online database and registry of clinical trials conducted worldwide. It is a service provided by the United States National Library of Medicine (NLM) and the National Institutes of Health (NIH). CTG serves as a valuable resource for researchers, healthcare professionals, patients, and the public interested in clinical research. Upon completion of a registered clinical trial, primary outcomes must be submitted to CTG as part of the information mandated by current regulations. Because these outcomes are provided independently by each clinical trial team, variability and inconsistency in outcome reporting are common.
On the other hand, the dissemination of clinical trial outcomes is vital for advancing evidence-based medicine, with results traditionally published in peer-reviewed scientific journals to validate their reliability and relevance. This publication process enriches the scientific discourse, aids healthcare providers, guides policymakers, and educates the public on medical advancements. Importantly, published results enable meta-analysts and researchers to access and integrate diverse study outcomes, enhancing the breadth and depth of systematic reviews and meta-analyses. This comprehensive approach enhances the examination of intervention effectiveness by aggregating and analyzing patterns and overall treatment effects across multiple studies. By incorporating data from published trials, meta-analyses can integrate varied research findings, thereby increasing the statistical robustness and reliability of conclusions drawn from the aggregated data. However, variation in outcome definitions can make the comparison of outcomes between studies difficult and their combination in meta-analysis impossible. Addressing this variability is crucial for data harmonization and greatly facilitates transparency, reproducibility, cross-trial comparison, and systematic reviews. Linking clinical trial outcomes to well-known ontologies can certainly help address variability in outcome reporting. Ontologies such as Medical Subject Headings (MeSH) provide a standardized and structured vocabulary for categorizing medical concepts and outcomes, which can significantly reduce the ambiguity and inconsistency often present in outcome descriptions across different studies. Ontologies establish a common language for describing outcomes, ensuring that researchers, healthcare professionals, and reviewers can readily understand and compare outcomes across studies. This promotes clarity and consistency in reporting.
The objective of this study is to investigate the effectiveness of two distinct approaches, one rule-based and one machine learning-based, in identifying and aligning outcome measures that share semantic similarity. This research aims to mitigate the challenges associated with variability in outcome reporting by comparing these two methods. The first approach, which is rule-based, leverages well-known ontologies, and the second exploits large language models (LLMs). Advancements in artificial intelligence (AI), especially in the fields of natural language processing (NLP) and LLMs, have marked a new era in human–computer interaction. These developments have greatly enhanced the capacity to analyze and interpret extensive textual data, allowing machines to understand and produce language that closely mimics human communication in both fluency and precision. Exhibiting remarkable capability in various domains such as machine translation, content creation, and virtual assistance, these models have significantly improved interactions between humans and machines. Their application has brought significant changes across multiple sectors, including healthcare, education, and customer service, by streamlining processes, delivering critical insights, and driving innovation. LLMs are deep learning transformer-based sequence models with billions of parameters [1]. The transformer architecture has become the foundation for many state-of-the-art LLMs, such as the Generative Pre-trained Transformer (GPT) and bidirectional encoder representations from transformers (BERT). In this study, we use the GPT-4 model for generative question answering and a BERT-based sentence encoder (SBERT) to produce semantically meaningful sentence embeddings. The details of how these models are utilized are explained in the following section.
The goal is to address the variability in outcome measures, thereby facilitating cross-trial comparison and metadata analysis, by introducing a fully automated NLP-assisted pipeline that identifies semantically identical outcomes in the original sources. Our proposed pipeline first extracts outcomes of clinical trials and then finds similar outcomes. It also standardizes outcomes by linking them to well-known ontologies, including MeSH (Medical Subject Headings), the National Library of Medicine (NLM) controlled vocabulary thesaurus used for indexing articles for PubMed [2], the Clinical Data Interchange Standards Consortium (CDISC), and the National Cancer Institute (NCI) ontologies. We specifically incorporated the MeSH vocabulary because NLM provides biomedical keywords for all of the literature in PubMed as MeSH terms. This can facilitate the scalability of our pipeline to integrate not only studies on CTG but also any published articles indexed in PubMed.
We summarize our contributions as follows: we utilize cutting-edge methods to harmonize outcome measures and introduce a fully automated, time-efficient system to identify trials that report semantically similar outcomes. This facilitates cross-trial compatibility and enriches trial metadata.

2. Materials and Methods

The main data source for analysis in this paper was formed using CTG via the RESTful application programming interface (API) to execute a search query with specific search terms. This API streamlines the process of submitting search queries from computer programs, enhancing the ability to locate relevant clinical trials. It offers several search functionalities, including the option to search by ‘condition or disease’, allowing users to pinpoint trials specific to certain medical conditions or diseases. Additionally, the ‘intervention/treatment’ feature lets users filter trials based on particular medical interventions or treatments, such as drugs, therapies, or medical devices. The resultant output can be selected to be formatted as CSV or XML for further processing. In the following, every component of our developed pipeline will be described.
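For illustration, the sketch below shows how such a search query could be issued programmatically. The endpoint, parameter names, and CSV preamble handling are assumptions (the CTG API has been revised over time and is not fully specified in the text), so they should be checked against the current API documentation before use.

```python
import io

import pandas as pd
import requests

# Illustrative query against the CTG search API; endpoint and parameters are assumptions.
SEARCH_URL = "https://clinicaltrials.gov/api/query/study_fields"

params = {
    "expr": "COVID-19 AND Hydroxychloroquine",  # condition and intervention search terms
    "fields": "NCTId,InterventionName,OverallStatus,LocationCountry",
    "fmt": "csv",                               # CSV output for downstream processing
    "min_rnk": 1,
    "max_rnk": 1000,
}

response = requests.get(SEARCH_URL, params=params, timeout=30)
response.raise_for_status()

# The CSV payload may carry preamble lines before the column header;
# adjust skiprows to match the actual response format.
trials = pd.read_csv(io.StringIO(response.text), skiprows=10)
print(trials[["NCTId", "OverallStatus"]].head())
```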

2.1. Step 1: Connecting to the CTG Search API Endpoint

As the test use case, the condition and intervention search terms selected were “COVID-19” and “Hydroxychloroquine”, respectively. Hydroxychloroquine gained widespread attention in early 2020 as a potential treatment for COVID-19. This drug, originally developed to treat malaria and certain autoimmune conditions, showed promise in laboratory studies for its antiviral properties. However, its effectiveness in treating COVID-19 became a subject of intense debate and controversy [3]. Various clinical trials were conducted to evaluate its safety and efficacy [4], but the results were mixed and often inconclusive. COVID-19 trials were also chosen as a use case for our analysis because all of them were conducted relatively recently and reported a large number of identical clinical outcomes which were worded differently in CTG. The format of the output was selected to be CSV. The output file contained 261 unique trials at the time of our inquiry, along with general information about the trials such as national clinical trial (NCT) number, list of interventions, a link to the trial on CTG for more detailed information, and the locations and current statuses of each clinical trial. Figure 1 presents data on the current statuses of clinical trials and identifies the countries where the highest proportions of these trials are conducted. According to this figure, just over 35% of trials were completed. The highest percentage of conducted trials belongs to the United States (around 20% of trials). The contribution of other countries is less than 7%.
The NCT numbers were utilized once more to interface with the CTG, allowing for the downloading of each trial’s complete records in XML format. We employed Python’s native xml.etree module to parse the XML data, from which we extracted the primary outcomes. These extracted primary outcomes, along with their corresponding NCT numbers, were cataloged in a dataframe. Table 1 presents a snapshot of the pipeline’s output at this stage. Reviewing the outputs at this stage of the pipeline revealed that the most frequently reported outcomes appear in less than 1% of the trials. Remarkably, a large number of outcomes are unique to just one trial each. This highlights the extensive variability in how outcomes are reported across studies.
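A minimal sketch of this step is shown below; the record-download URL and the XML tag names (primary_outcome/measure) follow the legacy CTG full-record layout and are assumptions to be verified against the records actually retrieved.

```python
import xml.etree.ElementTree as ET

import pandas as pd
import requests

def fetch_primary_outcomes(nct_id: str) -> list:
    """Download one trial's full record as XML and return its primary outcome texts."""
    # URL pattern and tag names are assumptions based on the legacy CTG XML layout.
    url = f"https://clinicaltrials.gov/ct2/show/{nct_id}?displayxml=true"
    root = ET.fromstring(requests.get(url, timeout=30).content)
    # Each <primary_outcome> element carries a <measure> child with the outcome text.
    return [m.text.strip() for m in root.findall("./primary_outcome/measure") if m.text]

records = []
for nct_id in ["NCT04333550"]:  # in the pipeline, loop over all NCT numbers from the search
    for outcome in fetch_primary_outcomes(nct_id):
        records.append({"nct_id": nct_id, "outcome": outcome})

outcomes_df = pd.DataFrame(records)
print(outcomes_df.head())
```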

2.2. Step 2: Text Normalization and Outcome Alignment

To normalize the texts associated with primary outcomes, we applied simple NLP techniques: converting the strings to lowercase, stripping spaces from both ends, replacing runs of two or more spaces with a single space, and deduplication. The number of unique primary outcomes after normalization and deduplication was 475. After manual inspection of the extracted outcomes, we noticed that some of them are semantically similar even though they are written differently. For instance, “positive pcr” and “positive for SARS-CoV-2” are semantically the same, although they read differently. The lack of consistency in outcome measure reporting introduces variability into clinical trial data, making it challenging to compare results across different studies. Implementing the automatic alignment of semantically similar outcomes can significantly reduce this variability, thus enhancing harmonization, cross-trial comparison, and metadata analysis. To achieve the alignment of semantically similar outcomes, our methodology incorporates both a rule-based approach and a machine learning-based approach.
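The normalization and deduplication described above can be expressed in a few lines; the snippet below is a sketch that uses a small illustrative stand-in for the dataframe of extracted outcomes from step 1.

```python
import re

import pandas as pd

# outcomes_df: (NCT number, primary outcome) pairs produced in step 1;
# a tiny illustrative stand-in is used here.
outcomes_df = pd.DataFrame({
    "nct_id": ["NCT04333550", "NCT04902157"],
    "outcome": ["  Mortality  Rate ", "Prognosis"],
})

def normalize_outcome(text: str) -> str:
    """Lowercase, trim, and collapse runs of whitespace into a single space."""
    return re.sub(r"\s{2,}", " ", text.lower().strip())

outcomes_df["outcome_norm"] = outcomes_df["outcome"].apply(normalize_outcome)

# Deduplicate: keep one entry per distinct normalized outcome
# (475 unique primary outcomes remained in the use case reported here).
unique_outcomes = sorted(outcomes_df["outcome_norm"].drop_duplicates())
print(unique_outcomes)
```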

2.2.1. Rule-Based Approach: Ontology Linkage

This approach takes the outcomes and links them to well-known ontologies. The number of ontologies in the healthcare domain is continually growing, and there are various ontologies tailored to specific aspects of healthcare and medical knowledge. These ontologies serve various purposes, from organizing medical concepts and terminology to supporting data integration and knowledge representation. Among all well-known healthcare ontologies, we selected MeSH and the Clinical Data Interchange Standards Consortium (CDISC). MeSH terms and CDISC both play important roles in the organization and standardization of data and information in the field of healthcare and clinical research. MeSH offers a standardized vocabulary that enhances the indexing and retrieval of the biomedical and healthcare literature. Concurrently, CDISC provides frameworks for clinical trial data collection, management, and reporting. Together, these systems facilitate interoperability and streamline the exchange of data across various platforms. We also engaged with the National Cancer Institute (NCI) Thesaurus, which, although not classified as a traditional ontology, functions as a comprehensive controlled vocabulary and terminology system. It parallels the purpose of an ontology by supplying standardized terminology and concepts specifically tailored for oncology, cancer research, and associated biomedical fields. The NLM provides several APIs that allow MeSH terms to be accessed programmatically, among which we used the MeSH browser API. Regarding CDISC, we used the CDISC Library API [5] to obtain search results across the CDISC Library and extracted the concept ID of the highest-scored search result. NCI also provides REST API search endpoints to search the NCI Thesaurus and find NCI IDs for medical concepts [6]. As mentioned, the goal behind linking outcome measures to the ontologies is that outcomes with similar IDs can be mapped to each other, which then leads to reduced variability in outcome reporting. It should be noted that, among the different matching criteria that can be employed to refine the search results, the “exact match” search strategy was used.
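The sketch below illustrates the exact-match linkage pattern using the NLM MeSH lookup service; the endpoint, parameter names, and response fields are assumptions based on public NLM documentation, and the CDISC Library [5] and NCI [6] lookups in the pipeline follow the same pattern (send a search term, keep the ID of the best exact match).

```python
import requests

# Illustrative exact-match lookup against the NLM MeSH lookup service (assumed endpoint).
MESH_LOOKUP = "https://id.nlm.nih.gov/mesh/lookup/descriptor"

def mesh_exact_match(outcome: str):
    """Return a MeSH descriptor ID for an outcome, or None if no exact match is found."""
    params = {"label": outcome, "match": "exact", "limit": 1}
    hits = requests.get(MESH_LOOKUP, params=params, timeout=30).json()
    if not hits:
        return None
    # The service is expected to return descriptor URIs such as .../mesh/D009026;
    # keep only the trailing identifier.
    return hits[0]["resource"].rsplit("/", 1)[-1]

# Outcomes that resolve to the same descriptor ID can then be mapped to each other.
print(mesh_exact_match("mortality"))
print(mesh_exact_match("time to clinical improvement"))  # expected: None (no exact match)
```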

2.2.2. Machine Learning-Based Approach

According to Table 2 and Table 3, the significant majority of outcomes consist of more than three words. However, none of these are captured by MeSH or NCI, and only 1% by CDISC. This discrepancy suggests that while longer outcome phrases are common, they are largely not standardized according to these terminologies. As a result, the machine learning-based approach was investigated to assess the possibility of aggregating many more outcomes (especially the ones with more than three words). This approach incorporated the use of LLMs, not only for finding similar outcomes but also for aligning them.
LLMs are first pretrained on a large corpus with trillions of words in order to predict the next token, which results in base models such as GPT, PaLM, and LLaMA [7,8,9]. These base models are then fine-tuned to better understand human instructions for various purposes [10,11], resulting in the development of assistant models like ChatGPT [12]. LLMs are capable of learning to solve new tasks at inference time using in-context learning (ICL) [7]. It was shown that ICL makes LLMs excel on many NLP tasks [13,14], as well as biomedical tasks such as literature retrieval [15], question answering [2,16,17], and clinical trial design [18]. The utilization of LLMs also offers a promising solution to address the pervasive issue of variability in reporting outcomes in clinical trials. LLMs excel in their ability to process vast amounts of the medical literature, trial protocols, and patient records, enabling them to identify and align outcome measures that are semantically similar. To achieve this alignment of semantically similar outcomes, LLMs leverage powerful techniques such as embeddings, which play a critical role in processing and comparing vast amounts of textual data. Embeddings are a cornerstone technology in NLP that transform words, phrases, or even entire documents into numerical vectors within a continuous vector space. This transformation facilitates the comparison and analysis of textual data, enabling algorithms to detect and utilize linguistic patterns effectively. Embeddings are crucial for a range of NLP applications, including text classification, sentiment analysis, named entity recognition, and topic modeling, as cited in the literature [19,20,21]. These embedding techniques, which encode semantic and syntactic meanings of text, vary from conventional fixed representations to more sophisticated contextual models. The earlier embedding methods such as static or distributional embeddings provide a single, context-independent representation for each word. These models, while useful, do not account for the varying meanings words can have in different contexts. In contrast, the advent of contextual embeddings has revolutionized this approach. Contextual models like ELMo [22], GPT, and BERT [23] dynamically adjust word vectors based on the surrounding text, allowing for a richer and more precise understanding of word usage in varied contexts. These advancements have gained significant attention recently, pushing the boundaries of what’s possible in NLP by enabling more nuanced interpretations and applications of language data.
The ELMo model introduces a sophisticated approach to word representation through its use of a deep, bidirectional LSTM architecture, pretrained on expansive text corpora to generate context-sensitive embeddings. This design allows ELMo to capture complex linguistic nuances by considering the entire sentence context, which enhances its ability to understand varying word meanings based on surrounding text. BERT and GPT, on the other hand, leverage the powerful transformer architecture to process text, with BERT built on the transformer’s encoder component and GPT on its decoder. The transformer’s ability to handle sequences of words in parallel allows these models to efficiently manage large-scale language understanding tasks. RoBERTa, a variant of BERT, has further advanced these capabilities, achieving strong benchmarks in semantic textual similarity and setting new standards for NLP models. However, BERT’s architecture, while robust for many tasks, is not optimized for semantic similarity searches or unsupervised tasks like clustering, which require a different approach to handle semantic nuances effectively [24,25]. In response to these limitations, this study employs Sentence-BERT (SBERT), an adaptation of BERT that is specifically tailored to produce semantically meaningful sentence embeddings. SBERT modifies the standard BERT model to generate embeddings that are directly comparable via cosine similarity, significantly reducing time complexity while preserving BERT’s high accuracy [25]. This method is particularly useful in our study for aligning similar outcomes across datasets, aiming to reduce the variability of reported outcome measures and enhance the comparability of clinical trial data.
At this phase of the study, we incorporated Sentence-BERT (SBERT) into our analytical pipeline to automate the extraction of contextualized sentence embeddings for each outcome, followed by the identification of the “n” most semantically similar outcomes. This variable “n”, representing the number of similar outcomes to retrieve, is user-defined, allowing for flexibility in the breadth of outcome comparison. We employed cosine similarity as the metric to assess the closeness between these embeddings. For illustration, Table 4 displays the five outcomes most semantically similar to “mortality” and “time to clinical improvement”. Notably, terms like “time of clinical improvement” appear alongside “time to clinical improvement”, highlighting their semantic proximity despite slight variations in phrasing. This underscores the effectiveness of SBERT in recognizing outcomes that, although articulated differently, share equivalent meanings. Grouping such outcomes together facilitates their consolidation into unified categories for enhanced analytical clarity and consistency in outcome reporting. At this point in the pipeline, the results are organized by frequency, listing outcomes from the most common to the rarest. Additionally, for each specific outcome, the pipeline identifies a set of “n” semantically similar outcomes, alongside the NCT numbers of trials that have declared these outcomes as their primary focus.
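A minimal sketch of this SBERT step, using the sentence-transformers library, is given below; the pretrained checkpoint name is an assumption, since the text does not state which SBERT model was used.

```python
from sentence_transformers import SentenceTransformer, util

# Checkpoint name is an assumption; any SBERT-style sentence encoder can be substituted.
model = SentenceTransformer("all-MiniLM-L6-v2")

# In the pipeline this is the full deduplicated outcome list from step 2;
# a short illustrative stand-in is used here.
unique_outcomes = [
    "mortality", "overall mortality", "mortality rate", "reduce mortality",
    "time to clinical improvement", "days to clinical improvement",
    "time of clinical improvements (days)",
]
corpus_embeddings = model.encode(unique_outcomes, convert_to_tensor=True)

def top_n_similar(query: str, n: int = 5) -> list:
    """Return the n outcomes most similar to the query by cosine similarity."""
    query_embedding = model.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=n + 1)[0]
    # Drop the query itself if it appears in the corpus, then keep the top n.
    similar = [unique_outcomes[h["corpus_id"]] for h in hits
               if unique_outcomes[h["corpus_id"]] != query]
    return similar[:n]

print(top_n_similar("mortality", n=5))
print(top_n_similar("time to clinical improvement", n=5))
```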
The results in Table 4 underline SBERT’s effectiveness across various outcome descriptions. The SBERT model successfully matched outcomes such as “Mortality” and “Overall mortality,” with specific outcomes like “Mortality rate” even linked directly to a relevant MeSH term, demonstrating precise semantic understanding. This indicates SBERT’s utility in bridging similar concepts that differ in phrasing, which enriches the metadata for clinical research. The absence of linked MeSH terms for some outcomes like “Reduce mortality” or “Time of clinical improvements (days)” confirms the limitations of the rule-based approach of directly linking every outcome to existing ontological frameworks. However, the model’s capacity to identify similar outcomes across varying lengths—from one-word outcomes like “mortality” linked to “mortality rates” to longer phrases like “time of clinical improvement” associated with “number of days to clinical improvement”—shows its potential to reduce semantic variability within clinical datasets. In our previous work [26], SBERT was also applied for aligning outcomes. However, in this current work, the scope was changed by selecting different search terms. Additionally, a key innovation in this study is the incorporation of CDISC and NCI as new rule-based approaches. These are well-established ontologies in clinical research, and their inclusion significantly enhances the robustness and comprehensiveness of our outcome harmonization research process.
To assess SBERT’s efficacy in identifying semantically similar clinical outcomes, we conducted an evaluation involving a domain expert who reviewed both the 100 most frequent and the 100 least frequent outcomes, along with their associated lists of similar outcomes. The accuracy of the outcome clustering was determined based on whether outcomes with similar semantic meanings were appropriately grouped together, which resulted in an accuracy of 40%. For example, SBERT incorrectly clustered “mortality” and “mortality outcome” as similar. While “mortality” refers to the state of being subject to death, “mortality outcome” refers to a measured result or conclusion related to death in a clinical context. This highlights a key limitation of SBERT in distinguishing nuanced meanings. Another clear example of SBERT’s error was the grouping of “mortality” and “reduce mortality” as similar. “Mortality” refers to the concept of death, while “reduce mortality” refers to efforts or interventions aimed at decreasing the number of deaths. This distinction is crucial, as one refers to an event and the other to an action aimed at altering that event’s frequency. SBERT failed to differentiate between the concept and the action.
To enhance the accuracy of clustering, we employed GPT, recognized for its optimized performance in chat-based tasks. In our study, we leveraged the GPT-4 model via the chat completions API [27] to automatically assess the SBERT results. GPT-4’s advanced natural language processing capabilities allow it to better understand contextual nuances and subtle semantic differences between terms, which can enhance the accuracy of outcome clustering. By assessing SBERT’s output, GPT-4 ensures that outcomes with similar meanings are grouped more accurately by refining SBERT’s initial clustering. Additionally, GPT-4 excels in handling diverse language patterns, which is crucial in clinical terminology where multiple terms may describe similar outcomes. GPT’s effectiveness was previously highlighted in the literature, where it was used to evaluate medical responses, achieving a median score of 5.5 out of 6 across 248 medical questions, translating to an accuracy rate of approximately 92% [28]. Moreover, the incorporation of GPT automates the assessment of the SBERT output, which offers a significant advantage over manual checks by reducing time and effort and ensuring the scalable evaluation of semantic similarity across large datasets.
Through meticulous prompt engineering, we refined the inputs to ensure precise, relevant, and effective responses from GPT, ultimately enhancing the robustness of our outcome verification process. The finalized prompt was as follows: “Are <first outcome> and <second outcome> semantically the same?”. One of the prompts we experimented with was “Are <first outcome> and <second outcome> related?”. This prompt was broader and often led to less precise responses, as it inquired about a general relationship rather than meaning; such a prompt can elicit answers that address aspects beyond meaning, which is less useful when the goal is semantic comparison. Since we needed to check semantic similarity between terms, the finalized prompt was the more appropriate choice. Moreover, GPT’s response to this prompt starts with a simple “yes” or “no”, followed by a rationale. This format allows us to apply simple NLP techniques to extract the words “yes” and “no” and then effectively filter and confirm the outcomes listed by SBERT as semantically analogous.
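A sketch of this verification step is shown below, using the OpenAI chat completions client; the model identifier and decoding settings are assumptions, as the text reports only that GPT-4 was accessed via the chat completions API [27].

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def semantically_same(outcome_a: str, outcome_b: str) -> bool:
    """Ask GPT whether two outcomes are semantically the same and parse the yes/no answer."""
    prompt = f"Are {outcome_a} and {outcome_b} semantically the same?"
    response = client.chat.completions.create(
        model="gpt-4",           # model name and temperature are assumptions
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    answer = response.choices[0].message.content.strip().lower()
    # The response is expected to open with "yes" or "no", followed by a rationale.
    return answer.startswith("yes")

# Keep only SBERT candidates that GPT confirms as semantically equivalent.
candidates = ["overall mortality", "mortality outcome", "reduce mortality"]
confirmed = [c for c in candidates if semantically_same("mortality", c)]
print(confirmed)
```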
Table 5 provides examples of three responses generated by GPT in response to our finalized prompt. This table illustrates the level of accuracy and completeness in the responses produced by GPT to these specific queries. Reviewing the results in this table shows that the incorporation of GPT-4 for the verification of SBERT’s output enhances the precision in grouping semantically similar clinical outcomes. By automating this verification process, GPT-4 adds a layer of validation that is crucial for applications in clinical research and practice. This dual-model approach leverages GPT-4’s advanced language comprehension abilities to confirm the accuracy of SBERT’s semantic analysis and ensure that the outcomes grouped together truly share similar meanings.

3. Results

Utilizing “COVID-19” as a condition and “Hydroxychloroquine” as an intervention in our search parameters resulted in the retrieval of 261 unique clinical trials. These trials collectively reported a total of 475 distinct primary outcomes (after normalization and deduplication), demonstrating the range of research interests and investigative focuses within the context of this specific treatment and disease interaction.
Table 4 encapsulates the results of applying SBERT to discover semantically similar outcomes for two distinct clinical outcome measures: “Mortality” and “Time of clinical improvement”. For the outcome “Mortality”, SBERT identified several related outcomes, including “Mortality outcome”, “Overall mortality”, “Mortality rate”, “Reduce mortality”, and “Mortality rates”. Despite the semantic similarity identified by SBERT among these outcomes, the linkage to MeSH terms demonstrates variability in standardization; only “Mortality rate” and “Mortality rates” were successfully mapped to a MeSH term (D009026), indicating an inconsistency in ontology matching for the remaining terms. In the case of “Time of clinical improvement”, the SBERT model generated outcomes such as “Time of clinical improvements (days)”, “Time to clinical improvement pivotal stage”, “Days to clinical improvement”, and “Number of days to clinical improvement”. None of these outcomes, however, were linked to any MeSH terms, highlighting a significant gap in the existing ontology for adequately capturing and standardizing terms related to clinical improvement timelines. This underscores the challenge of aligning more complex outcomes with standardized ontological entries, reinforcing the need for advanced semantic analysis tools like SBERT to enhance outcome alignment and facilitate more robust meta-analyses.
The results in this table are consistent with the data presented in Table 3 in underscoring the limitations of current ontology systems like MeSH, NCI, and CDISC in handling complex and lengthy clinical trial outcomes. This observation is particularly evident in the abrupt drop-off in coverage as the length of the outcome descriptions increases. For outcomes consisting of only one word, there is substantial coverage with MeSH, NCI, and CDISC terms, with high alignment rates of 75%, 88%, and 88%, respectively. This suggests that standardized terms are effective at capturing simple, single-word outcomes. However, as the complexity of the outcome descriptions increases to two and three words, there is a notable decrease in the alignment with these ontologies. For two-word outcomes, the percentage of alignment with MeSH falls drastically to 13%, while NCI and CDISC show reduced but still moderate alignment. This trend is more pronounced for three-word outcomes, where the alignment with MeSH drops further to a mere 5%, and NCI and CDISC show minimal to no alignment for most entries. The most striking observation is for outcomes containing more than three words, where the alignment with MeSH, NCI, and CDISC plummets to 0%, 0%, and 1%, respectively, indicating almost no coverage. These data explicitly illustrate the challenges that current ontologies face in effectively cataloging more complex and verbose outcomes, highlighting a significant gap that necessitates the development of more advanced tools like SBERT for semantic alignment and integration in clinical metadata analysis.
Since the majority of outcomes had not been assigned to any of the ontologies considered in this study, SBERT was integrated to align many more similar outcomes, including outcomes with more than three words. According to Table 4, outcomes spanning from single words to phrases exceeding three words were successfully matched with semantically similar outcomes, underscoring the model’s robust capability to discern and link related clinical terms. By incorporating GPT, we added an automated layer of verification of the SBERT results. Table 5 shows examples of GPT’s output (response) at this stage for outcomes composed of different numbers of words. According to this table, an outcome similar to “hospitalization”, as a one-word outcome, is “hospitalization due to SARS-CoV-2 infection”; the three-word outcome “duration of hospitalization” is semantically similar to “duration of hospitalization (in days)”; and an outcome similar to the two-word outcome “clinical improvement” is “improvement of clinical status”. Table 6 details GPT’s responses, verifying the semantic accuracy of SBERT’s suggestions (previously listed in Table 4) by determining whether pairs of terms are semantically equivalent. This table illustrates how GPT is applied to the SBERT output to ensure accuracy. It can be seen that for the outcome “Mortality”, GPT’s responses highlight a discerning approach to semantic similarity. While “overall mortality” is affirmed as semantically similar to “mortality”, more nuanced phrases like “mortality outcome” and “reduce mortality” are correctly identified as not equivalent, showcasing the complex interplay between semantic similarity and the precision of term matching in clinical outcomes. Similarly, the outcomes related to “Time of Clinical Improvement” reveal a mixture of accurate matches and semantic discrepancies. GPT confirms that phrases like “time to clinical improvement” and “days to clinical improvement” reflect the same concept, emphasizing the importance of duration in clinical improvement metrics. However, it also points out that similar-looking phrases like “time to clinical improvement pivotal stage” and “time of clinical improvements (days)” do not carry the same meaning. According to Table 2, after harnessing GPT to refine the output of SBERT, 68% of outcomes were assigned at least one similar outcome, whereas this value was at most 10% when the rule-based approach was adopted.
We manually reviewed the performance of GPT in refining the SBERT output; the evaluated accuracy of GPT’s verification was 90%. The total time required by the pipeline to download and parse the trials, prompt the user to enter “n” (the number of similar outcomes), find semantically similar outcomes using SBERT, query GPT to refine the final list of similar outcomes, and identify the trials that listed semantically similar outcomes as their primary outcome was 20 min.

4. Discussion

The results presented in this study provide insights into the realm of clinical trial outcomes and the challenges associated with their standardization and semantic integration. The initial query focusing on trials using hydroxychloroquine as a treatment for COVID-19 yielded 261 unique trials and 475 unique primary outcomes. However, the relatively low percentages of outcomes linked to well-known ontologies considered in this study, i.e., MeSH, NCI, and CDISC, highlighted the fragmentation and heterogeneity in outcome reporting within the clinical research community.
The distribution of outcomes by word count puts further emphasis on the diversity of outcome descriptions. While a significant portion of outcomes consist of more than three words, most one-word outcomes are associated with MeSH, NCI, and CDISC IDs. This indicates a higher level of standardization for shorter descriptors. However, the reverse is true for longer outcomes, which exhibit much lower rates of ontology linkage. The near-absence of MeSH and NCI IDs for outcomes of three or more words is noteworthy, suggesting potential gaps in semantic representation for more complex outcome descriptions.
The study’s objective in using a rule-based approach, i.e., linking outcomes to ontologies, was to establish semantic similarity between outcomes sharing common IDs. However, the findings raised questions about the effectiveness of existing ontological frameworks in addressing the variability in clinical trial outcomes. Accordingly, a machine learning-based model, i.e., SBERT and GPT, was used. This approach exhibits significantly higher coverage, with 68% of outcomes aligned with at least one semantically similar outcome, while, according to Table 2, at most 10% of outcomes were covered by well-known ontologies. This underscores the strength of leveraging LLMs in capturing nuanced semantic relationships among outcomes, surpassing the coverage achieved by the rule-based method. The substantial difference in percentages between machine learning-based and rule-based ontological mappings highlights the potential of natural language processing techniques, especially LLMs, to comprehensively address outcome variability in clinical research. This suggests that LLMs’ language understanding capabilities can contribute to outcome harmonization efforts by recognizing semantically similar outcomes, thus potentially reducing redundancy and enhancing data interoperability. Our study extends previous work on outcome extraction from ClinicalTrials.gov [29,30] and on the alignment of heterogeneous clinical trial data in a harmonized dataset for systematic analyses [31].

5. Conclusions

This study focused on outcome variability in clinical research, presenting valuable insights into the challenges and opportunities associated with different methodologies for outcome assignment. The rule-based approach, relying on ontological mappings to MeSH, NCI, and CDISC, reveals limitations in capturing the diverse landscape of clinical outcomes, as indicated by the low percentages of outcomes assigned to unique identifiers within these ontologies. This emphasizes the need for complementary strategies to enhance coverage and address the complexity of clinical outcome reporting. The results further illustrated the distribution of outcomes based on the number of words. Notably, a significant majority (83%) consists of outcomes with more than three words, emphasizing the importance of handling complex and nuanced clinical descriptions. The LLM-based approach, in which SBERT’s output is refined by GPT, stands out with a remarkable 68% coverage in identifying at least one semantically similar outcome, underscoring the efficacy of leveraging large language models in addressing outcome variability, particularly in capturing nuanced semantic relationships among outcomes with varying word lengths. The findings collectively suggest that a hybrid approach, combining the strengths of rule-based ontological mappings and machine learning-based semantic similarity identification, could offer a comprehensive solution to mitigate outcome variability in clinical research. By acknowledging the limitations of rule-based methods in capturing diverse clinical language and leveraging the capabilities of advanced language models, researchers can enhance the robustness and inclusivity of outcome assignment strategies.

Author Contributions

Conceptualization, J.F.; methodology, J.F. and F.S.-M.; software, F.S.-M.; validation, J.F. and F.S.-M.; formal analysis, F.S.-M.; investigation, F.S.-M.; resources, J.F.; data curation, F.S.-M.; writing—original draft preparation, F.S.-M.; writing—review and editing, J.F.; visualization, F.S.-M.; supervision, J.F.; project administration, J.F. and F.S.-M.; funding acquisition, F.S.-M. and J.F. All authors have read and agreed to the published version of the manuscript.

Funding

This project was supported by a grant R33HL143317 from the National Institutes of Health, as well as a grant from the University of Utah’s One Data Science Hub.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are contained within the article. All data used in this project are publicly available at ClinicalTrials.gov.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30. arXiv:1706.03762v7. [Google Scholar]
  2. Liévin, V.; Hother, C.E.; Winther, O. Can large language models reason about medical questions? arXiv 2022, arXiv:2207.08143. [Google Scholar] [CrossRef] [PubMed]
  3. Deng, J.; Zhou, F.; Heybati, K.; Ali, S.; Zuo, Q.K.; Hou, W.; Dhivagaran, T.; Ramaraju, H.B.; Chang, O.; Wong, C.Y.; et al. Efficacy of chloroquine and hydroxychloroquine for the treatment of hospitalized COVID-19 patients: A meta-analysis. Future Virol. 2022, 17, 95–118. [Google Scholar] [CrossRef] [PubMed]
  4. Gautret, P.; Lagier, J.C.; Parola, P.; Meddeb, L.; Mailhe, M.; Doudier, B.; Courjon, J.; Giordanengo, V.; Vieira, V.E.; Dupont, H.T.; et al. Hydroxychloroquine and azithromycin as a treatment of COVID-19: Results of an open-label non-randomized clinical trial. Int. J. Antimicrob. Agents 2020, 56, 105949. [Google Scholar] [CrossRef] [PubMed]
  5. CDISC Library API Documentation. Available online: https://www.cdisc.org/cdisc-library/api-documentation (accessed on 12 December 2023).
  6. NCI REST API Documentation. Available online: https://api-evsrest.nci.nih.gov/swagger-ui/index.html#/ (accessed on 16 December 2023).
  7. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901. [Google Scholar]
  8. Chowdhery, A.; Narang, S.; Devlin, J.; Bosma, M.; Mishra, G.; Roberts, A.; Barham, P.; Chung, H.W.; Sutton, C.; Gehrmann, S.; et al. Palm: Scaling language modeling with pathways. arXiv 2022, arXiv:2204.02311. [Google Scholar]
  9. Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. Llama: Open and efficient foundation language models. arXiv 2023, arXiv:2302.13971. [Google Scholar]
  10. Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A.; et al. Training language models to follow instructions with human feedback. Adv. Neural Inf. Process. Syst. 2022, 35, 27730–27744. [Google Scholar]
  11. Wei, J.; Bosma, M.; Zhao, V.Y.; Guu, K.; Yu, A.W.; Lester, B.; Du, N.; Dai, A.M.; Le, Q.V. Finetuned language models are zero-shot learners. arXiv 2021, arXiv:2109.01652. [Google Scholar]
  12. Schulman, J.; Zoph, B.; Kim, C.; Hilton, J.; Menick, J.; Weng, J.; Uribe, J.F.; Fedus, L.; Metz, L.; Pokorny, M.; et al. ChatGPT: Optimizing Language Models for Dialogue. OpenAI blog. 2022. Available online: https://autogpt.net/chatgpt-optimizing-language-models-for-dialogue/ (accessed on 12 December 2023).
  13. Anil, R.; Dai, A.M.; Firat, O.; Johnson, M.; Lepikhin, D.; Passos, A.; Shakeri, S.; Taropa, E.; Bailey, P.; Chen, Z.; et al. Palm 2 technical report. arXiv 2023, arXiv:2305.10403. [Google Scholar]
  14. OpenAI; Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; et al. GPT-4 technical report. arXiv 2023, arXiv:2303.08774. [Google Scholar]
  15. Jin, Q.; Leaman, R.; Lu, Z. Retrieve, Summarize, and Verify: How will ChatGPT impact information seeking from the medical literature? J. Am. Soc. Nephrol. 2023, 34, 1302–1304. [Google Scholar] [CrossRef] [PubMed]
  16. Jin, Q.; Yuan, Z.; Xiong, G.; Yu, Q.; Ying, H.; Tan, C.; Chen, M.; Huang, S.; Liu, X.; Yu, S. Biomedical question answering: A survey of approaches and challenges. ACM Comput. Surv. (CSUR) 2022, 55, 35. [Google Scholar] [CrossRef]
  17. Singhal, K.; Azizi, S.; Tu, T.; Mahdavi, S.S.; Wei, J.; Chung, H.W.; Scales, N.; Tanwani, A.; Cole-Lewis, H.; Pfohl, S.; et al. Large language models encode clinical knowledge. arXiv 2022, arXiv:2212.13138. [Google Scholar] [CrossRef]
  18. Wang, Z.; Xiao, C.; Sun, J. AutoTrial: Prompting Language Models for Clinical Trial Design. arXiv 2023, arXiv:2305.11366. [Google Scholar]
  19. Asudani, D.S.; Nagwani, N.K.; Singh, P. Impact of word embedding models on text analytics in deep learning environment: A review. Artif. Intell. Rev. 2023, 56, 10345–10425. [Google Scholar] [CrossRef] [PubMed]
  20. Oubenali, N.; Messaoud, S.; Filiot, A.; Lamer, A.; Andrey, P. Visualization of medical concepts represented using word embeddings: A scoping review. BMC Med. Inform. Decis. Mak. 2022, 22, 83. [Google Scholar] [CrossRef]
  21. Naseem, U.; Razzak, I.; Khan, S.K.; Prasad, M. A comprehensive survey on word representation models: From classical to state-of-the-art word representation language models. Trans. Asian Low-Resour. Lang. Inf. Process. 2021, 20, 74. [Google Scholar] [CrossRef]
  22. Peters, M.E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; Zettlemoyer, L. Deep contextualized word representations. arXiv 2018, arXiv:1802.05365. [Google Scholar]
  23. Devlin, J. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  24. Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. Roberta: A robustly optimized bert pretraining approach. arXiv 2019, arXiv:1907.11692. [Google Scholar]
  25. Reimers, N.; Gurevych, I. Sentence-bert: Sentence embeddings using siamese bert-networks. arXiv 2019, arXiv:1908.10084. [Google Scholar]
  26. Shah-Mohammadi, F.; Finkelstein, J. Contextualized Large Language Model-Based Architecture for Outcome Measure Alignment in Clinical Trials. In Proceedings of the 2023 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Istanbul, Turkey, 5–8 December 2023; pp. 4411–4417. [Google Scholar] [CrossRef]
  27. OpenAI Chat Completion API. Available online: https://platform.openai.com/docs/api-reference/chat/create (accessed on 13 May 2024).
  28. Johnson, D.; Goodman, R.; Patrinely, J.; Stone, C.; Zimmerman, E.; Donald, R.; Chang, S.; Berkowitz, S.; Finn, A.; Jahangir, E.; et al. Assessing the accuracy and reliability of AI-generated medical responses: An evaluation of the Chat-GPT model. Res. Sq. 2023. [Google Scholar] [CrossRef]
  29. Finkelstein, J.; Chen, Q.; Adams, H.; Friedman, C. Automated Summarization of Publications Associated with Adverse Drug Reactions from PubMed. AMIA Jt. Summits Transl. Sci. Proc. 2016, 2016, 68–77. [Google Scholar] [PubMed] [PubMed Central]
  30. Elghafari, A.; Finkelstein, J. Automated Identification of Common Disease-Specific Outcomes for Comparative Effectiveness Research Using ClinicalTrials.gov: Algorithm Development and Validation Study. JMIR Med. Inform. 2021, 9, e18298. [Google Scholar] [CrossRef] [PubMed] [PubMed Central]
  31. Borziak, K.; Parvanova, I.; Finkelstein, J. ReMeDy: A platform for integrating and sharing published stem cell research data with a focus on iPSC trials. Database 2021, 2021, baab038. [Google Scholar] [CrossRef] [PubMed] [PubMed Central]
Figure 1. Distribution of the status of the 261 chosen clinical trials in CTG and of the countries with the highest numbers (by percent) of these trials.
Table 1. Pipeline output overview in step 1.
NCT Number | Outcome
NCT05766176 | Quantitative anti-SARS-CoV-2
NCT04902157 | Prognosis
NCT04333550 | Mortality rate
Table 2. Overview of semantic linking approaches.
Number of trials considered in this study using the query terms “COVID-19” and “Hydroxychloroquine”: 261
Number of unique primary outcomes: 475
Rule-based approach
  Percentage of outcomes assigned a MeSH ID (MeSH terms): 4%
  Percentage of outcomes assigned an NCI ID (NCI): 7%
  Percentage of outcomes assigned a CDISC concept ID (CDISC): 10%
Machine learning-based approach
  Percentage of outcomes assigned at least one semantically similar outcome (SBERT + GPT): 68%
Table 3. Results of rule-based approach.
Length of Outcome | Percentage of Outcomes (%) | MeSH (%) | NCI (%) | CDISC (%)
One word | 3 | 75 | 88 | 88
Two words | 6 | 13 | 60 | 67
Three words | 8 | 5 | 5 | 30
More than three words | 83 | 0 | 0 | 1
Table 4. Overview of SBERT output.
SBERT Output for the Outcome “Mortality” | MeSH Term
Mortality outcome | (none)
Overall mortality | (none)
Mortality rate | D009026
Reduce mortality | (none)
Mortality rates | D009026
SBERT Output for the Outcome “Time to clinical improvement” | MeSH Term
Time of clinical improvements (days) | (none)
Time to clinical improvement pivotal stage | (none)
Days to clinical improvement | (none)
Number of days to clinical improvement | (none)
Time of clinical improvement | (none)
Table 5. GPT’s responses to the following prompt: are <first outcome> and <second outcome> semantically the same?
One-word outcomes
Answer: Yes, “hospitalization” and “hospitalization due to SARS-CoV-2 infection” are semantically the same.
Answer: Yes, “mortality” and “death” are semantically the same.
Two-word outcomes
Answer: Yes, “disease severity” and “severity of COVID-19 disease” are semantically the same.
Answer: Yes, “clinical improvement” and “improvement of clinical status” are semantically the same.
Three-word outcomes
Answer: Yes, “duration of hospitalization” and “duration of hospitalization (in days)” are semantically the same.
Answer: Yes, “in hospital mortality” and “in hospital mortality (binary yes/no)” are semantically the same.
More-than-three-word outcomes
Answer: Yes, “proportion of patients requiring invasive mechanical ventilation or dying” and “invasive mechanical ventilation or mortality” are semantically the same.
Answer: Yes, “length of total stay at the hospital” and “duration of hospitalization (in days)” are semantically the same.
Table 6. Overview of SBERT + GPT output.
SBERT Output for the Outcome “Mortality” | GPT Response
Mortality outcome | No, “mortality” and “mortality outcome” are not the same terms.
Overall mortality | Yes, “mortality” and “overall mortality” are the same.
Mortality rate | No, “mortality” and “mortality rate” are not the same things.
Reduce mortality | No, “mortality” and “reduce mortality” are not the same.
Mortality rates | No, “mortality” and “mortality rates” are not the same, although they are related concepts.
SBERT Output for the Outcome “Time to clinical improvement” | GPT Response
Time to clinical improvements (days) | Yes, “time to clinical improvement” and “time to clinical improvement (days)” generally refer to the same concept.
Time to clinical improvement pivotal stage | No, “time to clinical improvement” and “time to clinical improvement pivotal stage” are not the same.
Days to clinical improvement | Yes, “time to clinical improvement” and “days to clinical improvement” generally refer to the same concept—the duration or number of days it takes for a patient’s clinical condition to improve.
Number of days to clinical improvement | Yes, “time to clinical improvement” and “number of days to clinical improvement” generally refer to the same concept.
Time of clinical improvement | No, the phrases “time to clinical improvement” and “time of clinical improvement” have different meanings.