1. Introduction
Recent advances in the field of artificial intelligence (AI) have enabled the development of increasingly powerful language models capable of generating fluent, coherent text. Among these models, “large language models” (LLMs) stand out for their imposing size and their ability to learn from enormous amounts of textual data. Models like GPT-3.5 [1] and GPT-4 [2], developed by OpenAI [3], or Bard, created by Google [4], have billions of parameters and have demonstrated impressive language comprehension and generation skills.
These fascinating advancements in natural language processing (NLP) have promising implications in many fields, including healthcare [5,6,7]. Indeed, they offer new perspectives for improving patient care [8,9], accelerating medical research [10,11], supporting decision-making [12,13], accelerating diagnosis [14,15], and making health systems more efficient [16,17].
These models could also provide valuable assistance to healthcare professionals by helping them interpret complex patient records, develop personalized treatment plans [18], and manage the growing volume of medical literature.
In the clinical domain, these models could help make more accurate diagnoses by analyzing patient medical records [19]. They could also serve as virtual assistants to provide personalized health information or even simulate therapeutic conversations [20,21,22]. The automatic generation of medical record summaries or examination reports is another promising application [23,24,25].
For biomedical research, the use of large language models paves the way for rapid information extraction from huge databases of scientific publications [26,27]. They can also generate new research hypotheses by making novel connections in the literature. These applications could significantly accelerate the discovery process in biomedicine [28,29].
Additionally, these advanced language models could improve the administrative efficiency of health systems [30]. They would be able to extract key information from massive medical databases, automate the production of certain documents, or even support decision-making for the optimal allocation of resources [31,32,33].
LLMs such as GPT-3.5, GPT-4, Bard, LLaMA, and Bloom have shown impressive results in various activities related to clinical language comprehension. Nevertheless, it is essential to evaluate them in detail in the medical context, which is characterized by its distinct nuances and complexities compared to common texts.
Further studies are therefore needed to verify the capabilities of these models in real clinical situations, whether for decision-making, diagnostic assistance, or the personalization of treatments. Rigorous evaluation is the key to taking full advantage of this technology and transforming medicine through better mastery and use of clinical language.
Although promising, the use of these AI systems in healthcare also raises ethical and technological challenges that must be addressed. Nevertheless, their potential for accelerating medical progress and improving the quality of care seems immense. The coming years will tell us to what extent these revolutionary models are capable of transforming medicine and health research.
The main contributions of this paper are the following:
We analyze major large language model (LLM) architectures such as ChatGPT, Bloom, and LLaMA, which are composed of billions of parameters and have demonstrated impressive capabilities in natural language understanding and generation.
We present recent trends in the medical datasets used to train such models. We classify these datasets according to different criteria, such as their size, source (e.g., patient files, scientific articles), and subject matter.
We highlight the potential of LLMs to improve patient care through applications like assisted diagnosis, accelerate medical research by analyzing literature at scale, and optimize the efficiency of health systems through automation.
We discuss key challenges for practically applying LLMs in medicine, particularly important ethical issues around privacy, confidentiality, and the risk of algorithmic biases negatively impacting patient outcomes or exacerbating health inequities. Addressing these challenges will be critical to ensuring that LLMs can safely and equitably benefit public health.
As depicted in Figure 1, this study follows a well-defined structure that aims to enlighten the reader on the different facets of LLMs and their relevance to medical diagnosis. We begin our exploration in Section 2 with a brief history of LLMs. Section 3 serves as an introduction to the transformer architecture, laying the foundation for understanding the following sections. Building on this foundation, Section 4 delves deeper into the specific architecture of LLMs. Section 5 illustrates the practical applications of LLMs, with an emphasis on their use in medical diagnosis. Section 6 presents a comprehensive review of medical datasets, segmenting them into three key categories. Section 7 focuses on a major innovation, the advent of foundation models in the AI landscape, highlighting their relevance in the clinical domain. Section 8 offers critical reflections, weighing the benefits of LLMs against their inherent challenges, before Section 9 concludes the article.
3. Transformers and Their Architecture
Transformers, a category of deep learning (DL) models, are predominantly employed for tasks related to NLP. These models were first introduced by Google in 2017 through a paper titled “Attention is All You Need” [53]. The fundamental element of transformers is the attention mechanism (see Section 3.1). This mechanism enables the model to comprehend the contextual associations between words (or other elements) within a sentence.
Transformers use multi-head self-attention to analyze the relationships between words in a sentence. Self-attention means that each word attends to every other word in the same sequence, irrespective of their relative or absolute positions.
The basic architecture of transformers consists of an encoder and a decoder. The encoder processes the input sequence, while the decoder generates the output sequence. Common transformer architectures include BERT [4], GPT [54], and T5 [55]. BERT (Bidirectional Encoder Representations from Transformers) introduced bidirectional training, which looks at the context from both left and right. GPT (Generative Pre-trained Transformer) models like GPT-2 [56], GPT-3 [3], and GPT-4 [2] are able to generate new text. T5 (Text-To-Text Transfer Transformer) can perform a wide range of text-based tasks like summarization, QA, and translation [57].
All transformer models follow the same basic structure—embedding, encoding, and decoding. However, they differ in pre-training objectives, model size, number of encoder/decoder stacks, attention types, etc. Later models like T5, GPT-3, and GPT-4 have billions of parameters to handle more complex tasks through transfer learning from massive text corpora.
3.1. The Attention Mechanism
The attention mechanism aims to mitigate the loss of information transmitted to the decoder, since in a basic encoder-decoder model only the hidden state produced at the encoder’s final step is provided as input to the decoder.
The original work of Larochelle and Hinton [58] introduced this approach in the field of computer vision: by examining several regions of an image separately, a learning algorithm can gradually accumulate knowledge about the shapes and objects present. By analyzing each segment in turn, the model builds up a global understanding of the image as a whole, which ultimately enables it to assign a relevant category to it.
According to Equation (1), the attention mechanism, proposed by [59,60], has played a crucial role in improving the performance of machine translation systems. This approach offers the model the ability to focus on essential segments of the input sequence.
The main idea behind attention is to evaluate the relationship between parts of two different sequences. In NLP, especially in a sequence-to-sequence framework, the attention mechanism signals to the model which word in sequence “B” should be given priority in relation to a specific word in sequence “A”.
A model with attention differs from a classic seq2seq model in two major respects. First, instead of transmitting only the final hidden state from the encoder to the decoder, it transmits all the hidden states, thus enriching the information available. Second, before producing its output, the decoder performs an additional step: it scores each of the hidden states received from the encoder, normalizes these scores with a softmax, and uses the resulting weights to focus on the crucial elements of the input.
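As an illustration, the scoring-and-weighting step just described can be sketched in a few lines of Python. This is a toy example with hand-picked two-dimensional vectors and a simple dot-product score; a real model uses learned, high-dimensional hidden states:

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(decoder_state, encoder_states):
    # Score each encoder hidden state against the decoder state
    # (a plain dot product is used here for simplicity).
    scores = [sum(d * h for d, h in zip(decoder_state, enc))
              for enc in encoder_states]
    weights = softmax(scores)
    # Context vector: weighted sum of all encoder hidden states.
    dim = len(encoder_states[0])
    context = [sum(w * enc[i] for w, enc in zip(weights, encoder_states))
               for i in range(dim)]
    return weights, context

# Toy example: three encoder hidden states, one decoder state.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [1.0, 0.0]
weights, context = attend(decoder_state, encoder_states)
```

The softmax turns the raw scores into weights that sum to one, so the context vector is a convex combination of the encoder states, dominated by the states most relevant to the current decoder step.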
For each word in the input sequence, a contextualized attention layer (self-attention) generates a vector representative of its importance (attention vector). Although explanations are presented here as vectors for simplicity’s sake, calculations actually involve matrix representations (several vectors juxtaposed), with each word associated with a row of a matrix. Thus, attention vectors actually describe the relative influence of each row of a matrix representing the sequence as a whole, allowing contextual interactions to be captured at all levels.
To calculate the attention vector for each encoded input, we need to consider three types of vectors:
The query vector, q, represents the encoded sequence up to that point.
The key vector, denoted k, corresponds to a projection of the encoded entry under consideration.
The value vector, v, contains the information relative to this same encoded entry.
Thus, to obtain the attention vector associated with a given encoder input, the mechanism takes into account these three vectors: q, k, and v, respectively, the request vector, key vector, and value vector.
These vectors q, k, and v are indexed by the position of the word in the processed sequence. For example, for the first word, we have the vectors q₁, k₁, and v₁. They are generated by projecting the vector representation x of each word using three matrices:
The Q matrix produces the q query vectors.
Matrix K produces key vectors k.
Matrix V generates value vectors v.
These three matrices, Q, K, and V, are learned during the training of the transformer model [53] so as to optimize the calculation of contextual attention. The vectors q, k, and v for each word are thus derived by multiplying the vector representation x with these matrices.
Attention(Q, K, V) = softmax(QKᵀ/√dₖ)V  (1)

where dₖ is the hidden dimensionality of the keys and Kᵀ is the transpose of the K matrix.
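As a runnable sketch of this computation, the following plain-Python implementation applies softmax(QKᵀ/√dₖ)V to a toy two-token sequence. The 2 × 2 matrices are hand-picked for readability; real models use learned projections with far higher dimensionality:

```python
import math

def matmul(A, B):
    # Multiply two matrices given as lists of rows.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def softmax_rows(M):
    # Row-wise softmax.
    out = []
    for row in M:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        s = sum(exps)
        out.append([e / s for e in exps])
    return out

def scaled_dot_product_attention(Q, K, V):
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]            # transpose of K
    scores = matmul(Q, K_T)                         # Q K^T
    scaled = [[v / math.sqrt(d_k) for v in row] for row in scores]
    weights = softmax_rows(scaled)                  # softmax(Q K^T / sqrt(d_k))
    return matmul(weights, V)                       # ... V

# Toy sequence of two tokens with d_k = 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
out = scaled_dot_product_attention(Q, K, V)
```

Each output row is a softmax-weighted mixture of the value vectors, so every token’s new representation blends information from the whole sequence.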
3.6. Transformers
The development of the digital world would not have been possible without advancements in automatic NLP. However, for a long time, NLP techniques remained limited due to insufficient progress in AI.
Recurrent neural networks (RNNs) [61,63,65] and convolutional neural networks (CNNs) [53], widely used in the past for NLP tasks, had the disadvantage of processing text sequentially. This approach proved inefficient on massively parallel computing units such as GPUs, which became essential with the advent of DL on large datasets.
It was against this backdrop that the transformer model emerged in 2017, proposed by Vaswani et al. [53] of Google Brain and Google Research. Built on attention as its basic primitive, it enabled sequences to be processed in a truly parallel way for the first time, thanks to multi-headed computations.
In terms of performance, the transformer proved superior to RNNs and CNNs on natural language understanding and generation tasks. It learns more efficiently and in parallel, better capturing long-range dependencies. Its excellent evaluation results have made the transformer an essential component of modern NLP.
It has revolutionized the field by enabling language processing on a very large scale, paving the way for major advancements such as human-quality machine translation.
Encoders and decoders are the two essential components of the original transformer model (see Figure 2). Each part is composed of a stack of N = 6 identical layers, where the output of each layer feeds the input of the next, enabling a progressive transformation of information until the final representation (prediction) is formulated [53].
The encoder, which forms the left-hand side of this architecture, is structured into two main blocks, both of which are neural networks:
A self-attention layer, which preserves the dependencies between words in a sequence by analyzing each word in relation to the others in order to identify its context.
A feed-forward neural network, which performs complex transformations on the data received from the self-attention layer.
Together, these blocks transform an input sequence into a series of continuous representations, which then serve as inputs for the decoder.
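The feed-forward block applies the same small two-layer network to each position’s vector independently. A minimal sketch of the common FFN(x) = max(0, xW1 + b1)W2 + b2 form, with toy weights and dimensions (model dimension 2, hidden dimension 4) chosen only for readability:

```python
def relu(v):
    # Element-wise rectified linear unit.
    return [max(0.0, x) for x in v]

def feed_forward(x, W1, b1, W2, b2):
    # Position-wise feed-forward block: FFN(x) = max(0, xW1 + b1)W2 + b2,
    # applied independently to each position's vector.
    hidden = relu([sum(xi * w for xi, w in zip(x, col)) + b
                   for col, b in zip(zip(*W1), b1)])
    return [sum(hi * w for hi, w in zip(hidden, col)) + b
            for col, b in zip(zip(*W2), b2)]

# Toy weights: model dimension 2, hidden dimension 4.
W1 = [[0.5, -0.5, 1.0, 0.0],
      [1.0, 0.5, 0.0, -1.0]]
b1 = [0.0, 0.0, 0.0, 0.0]
W2 = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-0.5, 0.5]]
b2 = [0.1, -0.1]
out = feed_forward([1.0, 2.0], W1, b1, W2, b2)   # one position's vector in, one out
```

Because the block sees one position at a time, all cross-word interaction in a transformer layer happens in the self-attention step, not here.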
The decoder, located on the right-hand side of the architecture, is structured in a similar way to the encoder but also incorporates an “Encoder-Decoder Attention” layer. The latter plays a crucial role: it applies the attention mechanism between the input sequence (once encoded) and the output sequence (during its decoding phase), ensuring that each word in the output sequence takes account of all the words in the input sequence.
The decoding process is based on the combination of the encoder output and the output generated by the decoder during the previous time step, enabling the output sequence to be built up progressively.
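This step-by-step construction of the output sequence can be sketched as a greedy decoding loop. The decoder_step function below is a hypothetical stand-in for a real decoder, stubbed with a toy rule so the loop is runnable:

```python
VOCAB = ["<eos>", "hello", "world"]

def decoder_step(encoder_output, generated):
    # Hypothetical decoder: returns a score per vocabulary token given the
    # encoder output and the tokens generated so far. Stubbed with a toy
    # rule here: emit "hello", then "world", then end the sequence.
    if len(generated) == 0:
        return [0.0, 1.0, 0.0]
    if len(generated) == 1:
        return [0.0, 0.0, 1.0]
    return [1.0, 0.0, 0.0]

def greedy_decode(encoder_output, max_len=10):
    generated = []
    for _ in range(max_len):
        scores = decoder_step(encoder_output, generated)
        token = VOCAB[scores.index(max(scores))]  # pick the highest-scoring token
        if token == "<eos>":                      # stop when the model ends the sequence
            break
        generated.append(token)
    return generated

print(greedy_decode(encoder_output=None))  # ['hello', 'world']
```

The key point is that each step conditions on both the encoder output and everything generated so far, which is exactly the progressive build-up described above.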
One of the main strengths of transformer architectures such as LLMs lies in their ability to extract essential intermediate features, such as syntactic structures and semantic integration of words. This ability enables LLMs to represent human language knowledge accurately and richly, making them powerful in a variety of tasks such as QA, sentiment classification, and machine translation. Moreover, LLMs are versatile, as they can be adapted to new challenges by reusing the same pre-trained model, which often puts them head and shoulders above previous methods. Their ability to extract and exploit relevant linguistic information makes LLMs powerful and versatile tools in NLP.
In the following section, we explain the architecture of LLMs in detail, highlighting the key mechanisms that contribute to their interesting results in text understanding and generation.
4. LLMs’ Architecture
LLMs have made significant advancements in recent years. First, the transformer models introduced a novel attention-based neural architecture for NLP.
These advanced models, with their billions of parameters, have become feasible thanks to recent advancements in computational capabilities and model architecture [66], as shown in Figure 3. For example, GPT-3, which relies on extensive data, has around 175 billion parameters [67]. Meanwhile, the open-source LLaMA model series ranges between 7 and 70 billion parameters [68,69].
Table 1 summarizes the main LLMs, giving their general architecture as well as an indication of the number of parameters in some of the basic versions used in research. This enables a quick comparison of the size and specific features of each model.
The initial phase of LLM training, called pre-training, uses a self-supervised method on large amounts of unlabeled data, which include sources such as general web content, Wikipedia, GitHub repositories, social media, and BooksCorpus [68,70]. The main training objective is to predict the next word in a sequence, which requires significant resources [71]. This involves converting the text into tokens before feeding it into the model [72]. The outcome of this process is a base model designed primarily for basic language generation but unable to perform complex tasks.
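The next-word prediction objective can be illustrated by how training pairs are built from a token sequence. This is a minimal sketch; real pipelines use subword tokenizers and batched tensors rather than word lists:

```python
def make_next_token_pairs(tokens):
    # For causal language modeling, each position's target is the next token:
    # the model sees tokens[:i+1] and must predict tokens[i+1].
    return [(tokens[:i + 1], tokens[i + 1]) for i in range(len(tokens) - 1)]

tokens = ["the", "patient", "reported", "chest", "pain"]
pairs = make_next_token_pairs(tokens)
for context, target in pairs:
    print(context, "->", target)
```

Every position in the corpus thus yields a training example for free, which is what makes self-supervised pre-training possible on unlabeled text.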
After initial pre-training, the model undergoes a fine-tuning phase to specialize it for certain tasks [72]. Here, the model can be trained on narrower domain-specific datasets, like medical records for healthcare applications.
This fine-tuning can be improved through various techniques. A constitutional AI approach embeds predefined rules or principles directly into the model architecture [73]. Reward-based training involves human evaluators assessing the quality of multiple model outputs to provide feedback [70,74]. Reinforcement learning from human feedback uses a comparison-based system to optimize responses through iterative human feedback [74].
This fine-tuning step requires less computational power but more human resources compared to pre-training. It customizes the model to perform a specific task, like a chatbot, with controlled, targeted results. The fine-tuned model resulting from this phase is then deployed for flexible applications in its domain of specialization.
The goal is to leverage general pre-training and then fine-tune the model’s capabilities for the nuances of its intended use case through various focused training approaches during fine-tuning. The result is a model that is both adapted and versatile for its specialist field of application.
LLMs have an impressive ability to adapt to unfamiliar tasks and demonstrate remarkable reasoning skills [75,76]. However, to fully exploit their potential in specialized fields such as medicine, more specific training strategies are essential. These strategies include direct prompting methods such as few-shot learning [3], in which a limited set of task examples provided at test time guides the model’s output, and zero-shot learning [77], in which the model operates without any prior task-specific examples. In addition, refined techniques such as chain-of-thought prompting [78], which prompts the model to sequentially decompose its reasoning, and self-consistency checks [79], which ask the model to confirm the consistency of its responses, are also crucial.
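These prompting strategies amount to different ways of constructing the model’s input text, which can be sketched as follows. The prompt formats are illustrative only, not a specific model’s required template:

```python
def build_prompt(question, examples=None, chain_of_thought=False):
    # Zero-shot: just the question. Few-shot: prepend worked examples.
    # Chain-of-thought: ask the model to reason step by step.
    parts = []
    for q, a in (examples or []):
        parts.append(f"Q: {q}\nA: {a}")
    suffix = " Let's think step by step." if chain_of_thought else ""
    parts.append(f"Q: {question}{suffix}\nA:")
    return "\n\n".join(parts)

zero_shot = build_prompt("What does BMI stand for?")
few_shot = build_prompt(
    "What does BMI stand for?",
    examples=[("What does ECG stand for?", "Electrocardiogram")],
)
cot = build_prompt("A patient takes 2 tablets twice daily; how many per week?",
                   chain_of_thought=True)
```

No model weights change in any of these variants; the adaptation happens entirely through the text placed in the context window.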
Reinforcement learning (RL) is also a complementary technique in the fine-tuning process of LLMs, aimed at improving and better aligning pre-trained models. To enhance the performance of these LLMs, optimization methods inspired by or directly derived from RL are implemented. Notable examples include RL from human feedback (RLHF), direct preference optimization (DPO), and proximal policy optimization (PPO).
RLHF [2,69] integrates RL techniques with human feedback to refine the language model. DPO, introduced by Rafailov et al. [80], focuses on direct preference optimization over model-generated responses. PPO, initially conceived by Schulman et al. [81] and later adapted by Tunstall et al. [82], employs proximal policy optimization for LLM fine-tuning.
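The comparison-based feedback underlying these methods is commonly modeled with a Bradley-Terry formulation, in which the probability that one response is preferred over another is a sigmoid of their reward difference. A minimal sketch, with toy scalar rewards standing in for a hypothetical reward model:

```python
import math

def preference_probability(reward_chosen, reward_rejected):
    # Bradley-Terry model: probability that the "chosen" response is
    # preferred, as a sigmoid of the reward difference.
    return 1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected)))

# Toy rewards assigned by a hypothetical reward model to two responses.
p = preference_probability(2.0, 0.5)
print(round(p, 3))  # 0.818
```

Reward models in RLHF are trained to maximize this probability on human-labeled comparison pairs, and DPO optimizes a closely related objective directly on the policy.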
These RL-based approaches have proven effective, particularly when combined with instruction tuning, in improving the relevance and quality of responses generated by LLMs. However, it is important to note that, like instruction tuning, these methods are primarily aimed at improving the quality of responses and their behavioral appropriateness, without necessarily increasing the breadth of knowledge that the model can demonstrate.
Prompt tuning, developed by Lester et al. [83], is a promising technique for efficiently updating model parameters, enhancing performance on many downstream medical tasks, and offering advantages over few-shot prompting methods. In general, these methods enrich the initial model fine-tuning processes, enhancing their suitability for medical problems. A recent example is the Flan-PaLM model [84].
Figure 4 compares classical fine-tuning and prompt tuning for adapting pre-trained language models to specific tasks. In classical fine-tuning, each task requires the creation of a dedicated model, leading to multiple versions of the model with different configurations for each task. This approach is complex and requires significant resources. Prompt tuning, on the other hand, simplifies the adaptation process. Instead of creating separate models, we simply provide the original pre-trained model with a “prompt” specific to each task. This “prompt” acts as an instruction that allows the model to handle different tasks without requiring major modifications.
The major advantage of prompt tuning lies in its efficiency. For example, with a large T5 model [55], traditional fine-tuning for each task can require around 11 billion parameters, which is equivalent to creating a new model for each task. In contrast, prompt tuning only requires about 20,480 parameters per task “prompt”, a reduction of five orders of magnitude. It is therefore possible to use a single adaptable model with minimal modifications to accomplish various tasks.
Consequently, prompt tuning allows for the adaptation of a single pre-trained model to a multitude of tasks, significantly reducing the computational resources required and improving scalability compared to traditional fine-tuning. This breakthrough paves the way for more widespread use of language models in various domains.
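The arithmetic behind this comparison is simple. The prompt length of 20 tokens and embedding dimensionality of 1024 below are assumed values chosen to reproduce the 20,480-parameter figure quoted above:

```python
# Full fine-tuning of a large T5 model touches all ~11 billion weights,
# while prompt tuning only learns a short sequence of soft-prompt embeddings.
model_params = 11_000_000_000   # T5 (11B variant)
prompt_tokens = 20              # length of the learned soft prompt (assumed)
embedding_dim = 1024            # embedding dimensionality (assumed)

prompt_params = prompt_tokens * embedding_dim
print(prompt_params)                  # 20480 parameters per task
print(model_params // prompt_params)  # roughly five orders of magnitude fewer
```

Since each task adds only a soft prompt of this size, dozens of tasks can share one frozen backbone at negligible storage cost.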
Understanding training methods and their ongoing evolution provides essential insights for assessing the current capabilities of models and identifying their potential future applications. In specialized fields such as healthcare, where precision and expertise are essential, an in-depth analysis of model specialization techniques is crucial to assessing their full potential.
Among these techniques, retrieval-augmented generation (RAG) was introduced by Lewis et al. [85]. This method aims to enhance the capabilities of LLMs, especially in knowledge-intensive tasks, by integrating external sources of information. Originally, the application of RAG required task-specific training. However, subsequent research [86] revealed that a pre-trained model incorporating the RAG method could improve its performance without the need for additional training.
The fundamental principle of RAG is based on the use of an auxiliary knowledge base and an input query. The RAG architecture is then used to identify relevant documents in this database that are close to the query. These retrieved documents are then merged with the initial query, providing the model with an enriched, in-depth context on the queried subject.
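The retrieve-then-augment flow can be sketched as follows. This uses a toy word-overlap retriever over a hypothetical three-document knowledge base; production systems retrieve with dense vector embeddings instead:

```python
def score(query, doc):
    # Toy relevance score: word overlap between query and document.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d)

def rag_prompt(query, knowledge_base, top_k=2):
    # Retrieve the top-k most relevant documents, then merge them with the
    # query to give the model an enriched context.
    ranked = sorted(knowledge_base, key=lambda doc: score(query, doc),
                    reverse=True)
    context = "\n".join(ranked[:top_k])
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

# Hypothetical auxiliary knowledge base.
kb = [
    "Metformin is a first-line treatment for type 2 diabetes.",
    "Aspirin is commonly used for pain relief.",
    "Type 2 diabetes is managed with diet, exercise, and medication.",
]
prompt = rag_prompt("What is the first-line treatment for type 2 diabetes?", kb)
```

The model then answers from the retrieved context rather than from its parameters alone, which is what makes RAG effective on knowledge-intensive tasks.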
The success of LLMs is attributed to their ability to generalize from the massive amounts of data that they are exposed to during training. This enables them to perform a wide array of language-related tasks with impressive accuracy, including text completion, language translation, sentiment analysis, and more. LLMs have also demonstrated proficiency in understanding context and generating contextually appropriate responses in conversational settings. Notable examples of LLMs include PaLM [87], GPT-3 [3], LLaMA [68], PaLM2 [88] (used in the Bard chatbot), Bloom [89], and GPT-4 [2]. These models are trained on billions of tokens sourced from datasets such as Common Crawl, WebText2, Wikipedia, Stack Exchange, PUBMED, ArXiv, GitHub, and many other diverse repositories.
Additionally, researchers are exploring the use of LLMs to create systems capable of understanding and applying complex instructions from text. In this context, models such as Alpaca [90], StableLM [91], and Dolly [92] stand out. Alpaca, based on Meta’s LLaMA 7B model, showed that a lighter model can compete with heavier models like OpenAI’s text-davinci-003. In addition, Alpaca is more economical and easy to reproduce thanks to its open structure.
This indicates that it is entirely possible for less demanding models to compete in performance while remaining transparent and accessible, thanks to their open nature. StableLM and Dolly are also working in this direction, seeking to better respond to requests formulated in natural language.
LLMs are now trained on trillions of tokens, and their parameter counts have grown rapidly [68]. Rather than improvements in architectural design, the amount of data, parameters, and computational resources appears to be the driving factor in the capabilities of LLMs [93].
While LLMs have opened up new possibilities for understanding and generating natural language, they also pose challenges and ethical considerations. The risk of bias in learning data, ethical concerns relating to content generation, and issues of data privacy are some areas that researchers and practitioners continue to address.
4.1. ChatGPT
ChatGPT, developed by OpenAI and based on the GPT architecture [54], was officially released in November 2022. This model, which has undergone several updates, was trained on a vast corpus of textual sources such as books, web content, and articles, giving it the ability to capture the complexities of human language.
ChatGPT is able to produce texts that reflect human articulation in response to specific prompts. This prowess makes it invaluable in a variety of NLP applications, from chatbots to translation to text summarization. Furthermore, its adaptability means that it can be fine-tuned for particular tasks using more concentrated datasets.
To trace its origins, the first model, GPT-1 [54], was trained on only a few gigabytes of text from the BooksCorpus dataset. Its successors, GPT-2 [56], GPT-3 [3], and GPT-4 [2], benefited from much larger datasets; GPT-2, for example, was trained on around 40 GB of web text. This progressive increase in training data propelled ChatGPT’s effectiveness, with GPT-4’s output becoming practically indistinguishable from human-written text.
Table 2 illustrates the evolution of the different versions of GPT, specifically in the context of the healthcare sector.
Each GPT version has marked a significant advancement over the previous one, with improvements in model complexity, language understanding, and versatility of application domains. GPT-4’s capability to handle multimodal inputs (text and images) is a notable evolution over previous versions [2].
Researchers have conducted investigations to assess the usefulness of ChatGPT and InstructGPT [74] for healthcare, as well as their suitability for specific medical tasks. For example, Luo et al. [94] developed BioGPT, a model based on the GPT-2 framework pre-trained on 15 million PUBMED abstracts. BioGPT outperformed other state-of-the-art models in a range of tasks, including QA, relationship extraction, and document classification. Likewise, BioMedLM 2.7B (formerly called PUBMEDGPT [95]), which is pre-trained on both PUBMED abstracts and full-text articles, illustrates ongoing progress in this area. In addition, researchers have used GPT-4 to create multimodal medical LLMs, and early results are promising in this area.
Current research in the biomedical sector focuses mainly on the use of ChatGPT for data assistance. These approaches train small models using knowledge distilled or translated from ChatGPT. A notable example is ChatDoctor [32], which marks the first initiative to adapt large language models (LLMs) to the biomedical domain. Its authors fine-tuned the LLaMA model on dialogue simulations generated via ChatGPT.
Another example is DoctorGLM [96], which uses ChatGLM-6B as a base model and fine-tunes it using the Chinese translation of the ChatDoctor dataset, which was obtained using ChatGPT. In addition, Chen et al. [97] developed an improved Chinese medical language model in their LLM collection.
These works collectively demonstrate the potential of LLMs to be successfully applied in the biomedical domain. They highlight the adaptability of LLMs to meet the specific needs of medical assistance and underline the importance of using synthesized and translated data to form specialized models in this field.
4.2. PaLM
The Pathways Language Model (PaLM), introduced in 2022 [87], is a powerful architecture built upon a dense, decoder-only transformer language model. PaLM stands out from other models in that it prioritizes text generation, resulting in exceptional efficiency for generative tasks. Google developed the Pathways approach specifically for training large-scale machine learning models, and PaLM benefits from this methodology. Pathways enables flexible and scalable model training, a critical factor in handling the immense volumes of data necessary to train such models.
PaLM has been trained on a massive dataset consisting of 780 billion tokens. This dataset includes a variety of textual sources, such as web pages, Wikipedia articles, source code, conversations on social networks, press articles, and books. By incorporating these different sources, PaLM is exposed to various languages, knowledge, and text styles. This enables it to acquire a thorough understanding of language, ranging from structured and specialized knowledge to informal expressions and literary narratives.
With 540 billion parameters, PaLM has a significant capacity to model complex relationships, which is advantageous for analyzing and synthesizing detailed medical information. It can potentially be used to generate and summarize medical information, help interpret health data, and answer general medical questions [98]. One of its technical innovations, the “chain-of-thought prompting” method [78], enables it to process complex queries in several stages, which could prove beneficial for solving complex medical problems or interpreting health data in several phases.
A standout feature of PaLM is its competence in addressing out-of-vocabulary (OOV) words, which are terms not seen during its training phase. Rather than stumbling over these unfamiliar words, PaLM can generate fitting contextual replacements, thereby elevating its overall performance in text generation.
Researchers first adapted the PaLM model to medical QA, leading to the creation of Flan-PaLM [84], a version of PaLM specially designed for following instructions, which established state-of-the-art benchmarks for QA. Building on these results, the Med-PaLM model was introduced using instruction tuning, proving its effectiveness in areas such as clinical expertise, scientific consensus, and medical reasoning processes [99]. This approach has been extended to develop a multimodal medical LLM. These PaLM-centric models highlight the benefits of adapting base models to the specific needs of medical scenarios. The latest version, called Med-PaLM 2, was unveiled at Google Health’s annual event [98]. It was developed on top of PaLM 2, the foundation language model behind Google’s chatbot, Bard.
Recently, Google developed a multimodal model called Med-PaLM M [100], which follows on from its PaLM-E vision-language model [101]. Med-PaLM M can synthesize and communicate information from medical images such as chest X-rays, dermatology images, and pathology slides, along with other biomedical data, aiding the diagnostic process by understanding and discussing these visuals through a natural-language dialogue with clinicians.
5. Applications of Transformer-Based LLMs in Healthcare
When a patient meets a clinician for the first time, it is essential to establish an accurate diagnosis and create a relationship of trust. Unfortunately, time constraints often limit the ability to thoroughly review medical histories and provide tailored care to each patient. LLMs can improve this process by extracting relevant information from electronic health records [108], allowing a concise overview of the patient’s medical history to be presented. This helps optimize consultation productivity.
By highlighting past medical conditions, treatments, medications, and results of previous visits, LLMs eliminate the need to review numerous records, allowing for more targeted interactions that address specific patient issues. By focusing on key concerns, LLMs help ensure that diagnosis is personalized to each patient. Additionally, these models can be continually improved as health data grow, helping to maintain their accuracy. By ensuring data transparency and implementing continuous monitoring, the potential of LLMs can be further enhanced, resulting in faster diagnoses and greater patient satisfaction.
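As a rough illustration of how a summary request might be assembled from an electronic record for an LLM, the sketch below builds an instruction prompt from selected fields. The field names and prompt wording are invented for illustration; they are not a standard EHR schema.

```python
# Sketch: assembling a summarization prompt from a structured EHR record.
# Field names and prompt wording are illustrative assumptions.

def build_summary_prompt(record: dict) -> str:
    """Turn selected EHR fields into an instruction prompt for an LLM."""
    lines = [
        "Summarize this patient's history in 3 sentences for a clinician.",
        f"Conditions: {', '.join(record.get('conditions', [])) or 'none recorded'}",
        f"Medications: {', '.join(record.get('medications', [])) or 'none recorded'}",
        f"Last visit: {record.get('last_visit', 'unknown')}",
    ]
    return "\n".join(lines)

record = {
    "conditions": ["type 2 diabetes", "hypertension"],
    "medications": ["metformin", "lisinopril"],
    "last_visit": "2023-11-02",
}
prompt = build_summary_prompt(record)
```

Keeping the prompt restricted to explicitly listed fields is one way to limit what patient information is exposed to the model.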
Transformer-based language models, such as LLaMA, ChatGPT [
109], and GPT-4 [
2], have shown great potential in a variety of fields, including medical practice. These models are particularly effective at understanding and producing text that resembles human writing, opening up possibilities for supporting healthcare professionals in the diagnostic process, as shown in
Figure 5. The rest of this section illustrates how these models can be applied for this purpose.
5.1. Text Analysis and Summarization
Electronic medical records (EMRs) have transformed the way healthcare professionals access and use patient data [
110], profoundly altering their approach to decision-making. A major advantage of EMRs lies in their ability to summarize clinical observations [
18], giving doctors a clear view of potential health hazards and facilitating informed medical choices. This data summarization not only minimizes diagnostic errors but also optimizes therapeutic outcomes thanks to up-to-date, targeted patient information [
111].
Nevertheless, manual summarization of clinical reports is laborious and error-prone [
112]. The abundance and density of data mean that even seasoned professionals can miss crucial details. It is therefore imperative to develop automated synthesis techniques to increase the efficiency and reliability of medical care.
Relying on these automated summarization techniques, medical staff could rapidly extract the essential elements from the clinical notes provided, thereby reducing the risk of errors and enhancing the quality of the care provided.
The rise of LLMs in NLP has opened new possibilities for streamlining automated clinical report summarization. These models, recognized for their capability to align with input directions, benefit from the integration of text prompts [
113,
114].
Indeed, LLMs like BERT, BART, LLaMA, BLOOM, GPT-3.5, and GPT-4 have demonstrated a remarkable ability to analyze, understand, and summarize large quantities of unstructured text, such as clinical notes, patient histories, and scientific articles [
115,
116]. This makes them helpful in extracting meaningful information from textual medical data [
117].
Critical applications include automating the extraction of crucial information from detailed medical records [
118,
119,
120,
121], creating customized summaries to assist healthcare professionals [
122,
123], supporting report-based coding and billing [
123,
124], and semantically grouping symptoms to facilitate diagnostic processes [
125].
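The semantic grouping of symptoms mentioned above can be sketched with a similarity threshold. A real system would use LLM embeddings; the toy version below substitutes word-overlap (Jaccard) similarity so the idea stays self-contained.

```python
# Sketch: grouping symptom descriptions by similarity. A real system would
# use LLM embeddings; toy Jaccard similarity over words stands in here.

def jaccard(a: str, b: str) -> float:
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

def group_symptoms(symptoms: list[str], threshold: float = 0.25) -> list[list[str]]:
    groups: list[list[str]] = []
    for s in symptoms:
        for g in groups:
            # Join the first group containing a sufficiently similar member.
            if any(jaccard(s, member) >= threshold for member in g):
                g.append(s)
                break
        else:
            groups.append([s])  # no match: start a new group
    return groups

symptoms = [
    "sharp chest pain",
    "chest pain on exertion",
    "persistent dry cough",
    "dry cough at night",
]
groups = group_symptoms(symptoms)
```

With these inputs, the chest-pain descriptions cluster together and the cough descriptions form a second group.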
Large amounts of medical literature can also be filtered and analyzed by LLMs, which can then dynamically summarize data from many sources for doctors [
126,
127]. Succinctly synthesizing ideas saves crucial time.
LLMs can analyze clinical text and identify clinical concepts, entities, and their interrelationships. This enables them to generate differential diagnoses [
128,
129], predict specific diagnoses [
130,
131], and provide answers to physicians’ queries [
98,
132]. Moreover, LLMs’ multilingual capabilities can facilitate global healthcare [
133].
In capturing patient details and symptoms, LLMs offer a promising solution for the extensive documentation that physicians and clinical experts face daily [
134]. These models can produce detailed clinical summaries and diagnostic reports, effectively alleviating the time-intensive burden of these professionals [
135]. A concrete example of their potential is an LLM specially designed to condense the results of radiology reports, highlighting its applicability in similar medical fields [
136].
To optimize time management during consultations, LLMs can be exploited to generate concise summaries of each patient’s medical history. These summaries include information on comorbidities, previous consultations or admissions, medication lists, as well as progress and response to previous treatments [
19,
137]. These synthesized summaries draw the doctor’s attention to relevant information about the patient, which is crucial for an accurate diagnosis of the disease. By concisely providing this information, LLMs promote more efficient and comprehensive consultations. This approach allows doctors to devote more time to patient interaction, which in turn increases patient satisfaction.
Joseph et al. [
138] recently introduced FACTPICO, a benchmark for evaluating the factuality of plain-language summaries of medical texts describing randomized controlled trials (RCTs). RCTs are essential to evidence-based medicine and play a direct role in patients’ treatment decisions. FACTPICO consists of 345 plain-language abstracts generated by three LLMs (GPT-4, Llama-2, and Alpaca), along with fine-grained evaluations and natural language justifications provided by experts.
As a result, using LLMs to generate concise summaries optimizes time management during consultations, enables physicians to focus on patient interaction, and improves patient care [
139]. In this way, the summary process helps clinicians to efficiently review patient histories, research studies, and guidelines [
140].
5.2. Diagnostic Assistance
Using LLMs to aid medical professionals in making diagnoses has shown potential [
141]. LLMs, trained on a massive medical corpus, have acquired important clinical knowledge. This latent expertise gives them several assets that can support the diagnostic process.
LLMs can synthesize information from a variety of sources, such as medical literature, clinical guidelines, and case histories, to guide decision-making. Using a patient’s data, these models can automatically generate differential diagnosis hypotheses, ranked according to their estimated probability, for consideration by the clinician [
142,
143,
144].
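The ranking idea can be sketched with a toy symptom-overlap score: each candidate condition is scored by the fraction of its typical findings present in the patient’s data. The condition table below is invented for illustration; an LLM would draw on far richer knowledge and reasoning.

```python
# Sketch: ranking candidate diagnoses by symptom overlap. The disease/symptom
# table is invented for illustration only.

KNOWLEDGE = {  # hypothetical condition -> typical findings
    "influenza": {"fever", "cough", "myalgia", "fatigue"},
    "common cold": {"cough", "sneezing", "sore throat"},
    "pneumonia": {"fever", "cough", "dyspnea", "chest pain"},
}

def rank_differentials(findings: set[str]) -> list[tuple[str, float]]:
    # Score = fraction of a condition's typical findings observed.
    scored = [
        (disease, len(findings & typical) / len(typical))
        for disease, typical in KNOWLEDGE.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

ranked = rank_differentials({"fever", "cough", "myalgia"})
```

The clinician then reviews the ranked list rather than a single answer, which matches the supporting (not deciding) role envisioned for these tools.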
Medical doctors (MDs) can quickly obtain pertinent decision support by interacting with LLMs through conversational interfaces using natural language queries. Related signs and symptoms may be semantically grouped to identify patterns and delineate clinical syndromes [
145,
146].
LLMs can also assist in clinical decision-making by suggesting suitable medications [
147], recommending relevant imaging services based on symptoms [
148], or pinpointing disease causes from diverse clinical documents. When combined with tools like medical imaging, these models can offer a holistic view, helping doctors diagnose and define diseases [
149]. Moreover, by evaluating cases of individuals with comparable symptoms, LLMs can forecast potential disease outcomes, empowering both doctors and patients to make well-informed treatment choices.
By extracting the most important elements from narrative records, LLMs can simplify the review process for healthcare professionals by creating structured record segments [
150,
151]. In addition, the synthesis of information from exchanges or expert research helps to improve data assimilation.
Supplementary examinations often play a crucial role in medical diagnosis, helping to reinforce clinical hypotheses and confirm diagnoses. In this context, LLMs can act as clinical decision support tools, helping physicians choose the most relevant radiological examinations for given clinical cases [
8]. This approach has the potential to minimize pressure on limited resources, particularly in public hospitals or resource-constrained areas where imaging equipment and technical support are limited. Furthermore, LLMs can help avoid unnecessary examinations, such as those involving radiation or contrast agents, for patients who do not need them.
The increasing use of DL models to create computer-aided diagnosis (CAD) systems has automated the interpretation of medical images, facilitating the detection and classification of pathologies. However, their integration into clinical practice is sometimes hampered by a lack of transparency in the decisions made by these models. To remedy this, the integration of LLMs into CAD systems can enable clinicians to ask questions about specific images or patient cases, thereby clarifying the CAD system’s decision-making process. This human–machine interaction can make models more interpretable and encourage their use in routine diagnostic procedures. In addition, this approach may reveal new insights or biomarkers in disease imaging.
A recent study [
31] presents a new paradigm called generalist medical AI (GMAI), in which models are trained on large, diversified medical datasets. This approach allows models to perform a wide range of tasks, such as diagnosis, prognosis, and treatment planning. The paper evaluates GMAI models on different medical tasks and finds that they outperform conventional specialized medical AI models in several of these areas.
In the future, the integration of different data sources, such as images and laboratory tests, could provide more comprehensive diagnostic information [
152]. Throughout this process, however, it is essential to maintain the interpretability of models and to assign responsibility for results to human practitioners. The integration of multimodal data will pave the way for in-depth diagnostic analysis [
153].
Despite the significant progress made by LLMs, their use to assist medical diagnosis has several important limitations. First, these models may lack the clinical precision necessary for specific diagnoses, as they may fail to capture all essential details or misinterpret complex medical information. Additionally, they generally struggle to fully understand the clinical context of a patient, including crucial elements such as medical history or current symptoms [
154].
The data used to train these models can also contain biases or inaccuracies, which can translate into erroneous recommendations. It is important to emphasize that LLMs can in no way replace human clinical expertise, especially regarding physical examination and interpretation of medical test results.
Data privacy and security issues are also a major concern, particularly when dealing with sensitive patient information. Furthermore, LLMs are not always effective in interpreting complex medical test results, such as radiological images or laboratory tests. Their use in the medical field therefore requires rigorous clinical validation to ensure their safety and efficacy before deployment in a healthcare setting.
Finally, given the rapidly evolving nature of the medical field, these models need to be constantly updated to incorporate new discoveries and practices, which necessitates continuous maintenance.
5.3. Answering Medical Queries
QA is a task that automatically provides an answer to a given question. Indeed, LLMs have greatly progressed the field of QA in NLP, enabling machines to comprehend and respond to natural language queries with precision. These extensive language models find applications in virtual assistants and chatbots, delivering precise and pertinent responses to user questions. Such systems leverage LLMs to grasp the context and meaning of inquiries and formulate suitable replies. For instance, Google Assistant, underpinned by LLMs, adeptly addresses an array of user queries spanning general knowledge, weather updates, directions, and more [
57].
Advancements in DL, exemplified by transformers, now open up new possibilities [
53]. With their enhanced capacities, LLMs [
4,
55] have the potential to gain a finer-grained understanding of complex relationships within enormous corpora. The recent emergence of transformers and LLMs has breathed new life into research exploring AI’s potential to tackle the great challenge [
155,
156] of medical QA.
Previously, most work relied on smaller language models trained specifically on domain-specific healthcare text corpora [
94,
157,
158,
159]. This has driven steady improvements in benchmark performance on reference tests such as MedQA [
160], MEDMCQA [
161], SentiMedQAer [
162], and DATLMedQA [
163].
Answering medical queries using LLMs has shown significant advancements in recent times. With the emergence of larger general-purpose LLMs like GPT-3 [
3], GPT-Neo [
164], GPT-3.5 [
1], OPT [
165], LLaMA [
68], and Flan-PaLM [
84,
87], trained on massive internet-scale datasets using extensive computational resources, there has been a remarkable improvement in their performance on medical QA benchmarks.
For instance, GPT-3.5 achieved an accuracy of
on the MedQA (USMLE) dataset, demonstrating its ability to provide accurate responses to medical queries [
98,
166]. Flan-PaLM, another powerful LLM, achieved an even higher accuracy of
on the same dataset. The performance of GPT-4-base [
2] was even more impressive, achieving an accuracy of
on medical QA tasks [
167,
168].
These advancements in LLMs have been observed within a relatively short time, showcasing their rapid progress and potential for accurate and reliable medical QA. It is important to note that these accuracy figures are specific to the mentioned datasets and models, and the performance may vary depending on the specific task and dataset involved.
The critical question remains whether they can generate responses meeting the demands of the clinical context—reliability, interpretability, and adherence to ethical standards, prerequisites for actual utility at the point of care [
169,
170,
171,
172].
One of the major challenges encountered in medical QA systems is the problem of hallucinations leading to incorrect answers [
173]. An approach commonly used to solve this problem is retrieval augmentation. This approach consists of combining LLMs with a search system such as New Bing for general domains, or Almanac for clinical domains [
174]. It involves first retrieving relevant documents as support, and then using LLMs to answer a specific question based on these retrieved documents. The underlying idea is that LLMs are able to effectively summarize content, which can reduce hallucinations. It is important to stress, however, that these systems are not error-free [
175] and require thorough and systematic evaluation to ensure their quality [
176]. Further studies are needed to rigorously evaluate the performance of these systems and identify possible sources of error.
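The retrieve-then-answer flow described above can be sketched as follows. The word-overlap scoring and two-document corpus are toy stand-ins for a real clinical search backend such as the Almanac-style retriever cited above; the prompt wording is an illustrative assumption.

```python
# Sketch of retrieval augmentation: retrieve supporting text first, then
# instruct the LLM to answer only from it, which helps curb hallucination.

def retrieve(query: str, corpus: dict[str, str]) -> str:
    """Return the corpus document sharing the most words with the query."""
    q = set(query.lower().split())
    best = max(corpus, key=lambda doc_id: len(q & set(corpus[doc_id].lower().split())))
    return corpus[best]

def grounded_prompt(question: str, corpus: dict[str, str]) -> str:
    context = retrieve(question, corpus)
    return (
        "Answer using ONLY the context below; say 'unknown' otherwise.\n"
        f"Context: {context}\nQuestion: {question}"
    )

corpus = {
    "d1": "Metformin is a first line treatment for type 2 diabetes",
    "d2": "Ibuprofen is a nonsteroidal anti inflammatory drug",
}
prompt = grounded_prompt("what is a first line treatment for type 2 diabetes", corpus)
```

The explicit "only the context" instruction and the fallback answer are the two levers this pattern relies on; neither eliminates errors, which is why systematic evaluation remains necessary.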
A promising approach to solving the hallucination problem is to augment LLMs with additional tools. For example, recent work has explored the use of LLM augmentation by combining data from specific sources [
177,
178,
179,
180]. A concrete example is the GeneTuring dataset, which contains search questions for information on specific single nucleotide polymorphisms (SNPs). However, it is important to note that autoregressive LLMs do not possess specific knowledge about these SNPs, and most commercial search engines are unable to provide relevant results for such queries. Consequently, retrieval augmentation may prove ineffective in such cases.
In such situations, relevant information is often only accessible via specialized databases such as NCBI dbSNP. For this reason, the approach of enriching LLMs using web database utility APIs, such as those provided by NCBI, offers the potential for solving the problem of hallucinations related to specific entities in biomedical databases [
178].
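A minimal sketch of this tool-routing idea follows: queries mentioning a dbSNP rsID are deferred to a database lookup instead of the model’s parametric memory. The `rs` + digits pattern is real dbSNP syntax, but `lookup_dbsnp` here is a hypothetical placeholder for an actual NCBI E-utilities call.

```python
# Sketch: route entity-specific queries to a specialized database tool
# rather than answering from the LLM's parametric memory.
import re

def lookup_dbsnp(rsid: str) -> str:
    # Hypothetical placeholder for a real request to NCBI dbSNP.
    return f"[dbSNP record for {rsid}]"

def route_query(question: str) -> str:
    match = re.search(r"\brs\d+\b", question)
    if match:                       # entity-specific: defer to the database
        return lookup_dbsnp(match.group())
    return "[answer from LLM]"      # general question: model answers directly

routed = route_query("What gene is rs12345 located in?")
```

The routing decision itself is cheap and deterministic; the value comes from the specialized database answering questions the model and general search engines cannot.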
While computational progress hints at promising avenues, rigorous validation in real-world healthcare environments remains essential before these systems can meaningfully impact patient outcomes [
176]. Overall, advancements in NLP promise to improve the ability of systems to answer medical questions. However, this is only possible if rigorous evaluations are carried out to ensure the safety, transparency, and responsibility of such solutions. Indeed, as healthcare is a crucial area where the smallest failure could have serious consequences, it is essential to ensure through thorough testing that digital medical assistance systems respect the highest standards in terms of ethics, protection of sensitive data, and patient well-being.
5.4. Image Captioning
Developing precise and dependable automated report-generation systems for medical images presents various obstacles. These include the analysis of limited medical image datasets using machine learning techniques and the generation of informative captions for images that involve multiple organs.
The generation of image captions has been the subject of considerable work in computer vision. Early models focused on characterizing the objects, attributes, and relations present in the image via visual primitive extraction methods [
181,
182,
183,
184].
The emergence of DL enabled the development of end-to-end neural architectures that encode an image as a vector representation and then generate the caption word by word [
185,
186]. Since then, multiple improvements have been made to encoding [
187,
188,
189,
190,
191,
192] and decoding [
193,
194,
195] blocks, as well as to attention mechanisms [
196,
197,
198,
199].
It has been shown that encoding by object region instead of global image representation increases performance [
200]. These advancements in image encoders and decoders, together with the development of attention networks, have led to significant improvements in the quality of automatically generated captions.
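The word-by-word decoding scheme described above can be sketched with a greedy decoder: at each step, the most probable next word is emitted given the previous one. The transition table below is a toy stand-in for a trained decoder conditioned on image features, and the vocabulary is invented for illustration.

```python
# Sketch: greedy word-by-word caption decoding over a toy next-word table.

def greedy_decode(start: str, table: dict[str, dict[str, float]],
                  max_len: int = 6) -> list[str]:
    words = [start]
    while len(words) < max_len:
        options = table.get(words[-1])
        if not options:            # no continuation known: stop decoding
            break
        words.append(max(options, key=options.get))  # pick the top word
    return words

# Hypothetical next-word scores a decoder might produce for one image.
TABLE = {
    "<start>": {"fetal": 0.9, "adult": 0.1},
    "fetal": {"head": 0.7, "femur": 0.3},
    "head": {"circumference": 0.8, "<end>": 0.2},
    "circumference": {"<end>": 1.0},
}
caption = greedy_decode("<start>", TABLE)
```

Real systems replace the table with a neural decoder and often use beam search rather than greedy selection, but the step-by-step generation loop is the same.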
Captioning medical images is an especially difficult task due to the complexity and variability of images such as fetal ultrasound images. The latter are usually noisy, of low resolution, and dependent on factors such as fetal position, gestational age, or imaging plane. Furthermore, the availability of extensive annotated datasets is essential for tackling this task effectively. In response to these challenges, researchers have put forward DL-based approaches that seamlessly merge visual and textual data, enabling the creation of informative captions for both images and videos.
Alsharid et al. [
201] presented a model for image captioning specifically designed for fetal ultrasound images. Alsharid et al. [
202,
203] took their studies further by introducing a curriculum learning approach to train such image captioning models. They also proposed a captioning framework in which an image is first classified, and then one of several image captioning models, each associated with an anatomical structure, is employed to generate the corresponding caption.
In a separate investigation [
204], researchers addressed the difficulties encountered in generating captions for medical images, particularly those involving multiple organs. In response, they propose a solution that combines DL and transfer learning techniques, specifically employing a Multilevel Transfer Learning Technique (MLTL) and an LSTM framework. The authors introduce a foundational MLTL framework consisting of three models designed to detect and classify images from severely limited datasets by leveraging knowledge gained from readily available datasets. The first model utilizes non-medical images to acquire generalized features, which are subsequently transferred to the second model, responsible for the intermediate and auxiliary domains related to the target domain. The study concludes by discussing the potential applications of this approach in medical diagnosis and its potential to enhance patient care.
A new model for automatic clinical image caption generation has been developed [
205]. It combines radiological scan analysis with structured patient information to generate comprehensive and detailed radiology reports. The model utilizes two language models, Show-Attend-Tell [
206] and GPT-3, to generate captions that contain important information about identified pathologies, their locations, and 2D heatmaps highlighting the pathologies on the scans. This approach improves the interpretation and communication of radiological results, facilitating clinical decision-making.
The application of caption generation is not limited solely to images; it extends to videos as well. As part of another research project [
207], the authors introduced an innovative method for generating captions for fetal ultrasound videos, which can be valuable in aiding healthcare professionals in their diagnosis and treatment decision-making processes. This approach leverages a three-way multimodal deep neural network that merges gaze-tracking technology with NLP. The primary objective is to develop a comprehensive understanding of ultrasound images, enabling the generation of detailed captions encompassing nouns, verbs, and adjectives. The potential applications of these DL models include integration into systems designed to assist in the interpretation of ultrasound scan video frames and clips. The article also explores previous studies in the field of image and video captioning and presents quantitative assessment findings.
Li et al. [
208] developed a novel end-to-end video understanding system called VideoChat, which focuses on chat interaction. The system incorporates video foundation models and LLMs via a learnable neural interface. Its main feature is its ability to perform spatiotemporal reasoning, locate events in the video, and infer causal relationships between them. To fine-tune the system, the researchers developed a specific dataset focused on video instructions. This comprehensive dataset includes thousands of videos with detailed descriptions and associated conversations. It places particular emphasis on spatiotemporal reasoning and captures the causal relationships between events in the videos.
Although important advancements have been made in generating captions for medical images and videos using LLMs, certain challenges remain. First, the limited size of the datasets used in some studies is a key difficulty, as it may compromise the generalizability of the proposed approaches to larger datasets. In addition, the evaluation measures used in some studies do not always provide a comprehensive picture of the quality of the generated captions. Future work could explore more sophisticated measures that enable more effective evaluation of model performance. Improving the representativeness of the data used and refining the evaluation criteria are promising avenues for progress in this field.
Another challenge lies in the need for large quantities of annotated data for model training, which can prove complicated in the field of medical imaging where annotation requires expertise. In the future, research could explore new ways of reducing dependence on massive quantities of pre-annotated data, for example by developing methods that enable better generalization of the proposed approaches. The integration of additional contextual information such as clinical metadata or patient medical history into the caption generation process also represents a promising avenue for providing richer context and improving both the relevance and accuracy of the captions produced. This type of multimodal approach would go some way to overcoming the lack of massive annotated data.
Future research in medical image and video captioning can focus on two key areas: advancing transformer-based word embedding models and spatiotemporal visual and textual feature extractors and utilizing larger and more diverse datasets. These advancements aim to improve the accuracy, precision, and usefulness of generated captions in medical applications. By developing more sophisticated models and incorporating richer datasets, the goal is to overcome limitations such as dataset size and lack of diversity, ultimately enhancing the performance of automatic report generation systems in the medical field.
8. Discussion
LLMs, such as GPT-4 or LLaMA, have demonstrated significant potential in advancing NLP, and their application in medical diagnosis promises to transform this field.
One of the main strengths of LLMs is their ability to quickly analyze clinical data. A doctor could, for example, provide a model like GPT-4 with a complex set of symptoms and, in return, receive suggestions for possible differential diagnoses. This is even more important when considering that these models can access large, up-to-date medical databases, providing valuable information, especially for rare diseases or complex cases.
The potential of LLMs does not stop at large medical institutions. In remote or underserved areas where access to a specialist may be difficult, an LLM-based tool could serve as a first line of diagnostic advice. Of course, this could never replace real medical advice, but it could effectively guide initial care.
Furthermore, the information burden is a constant challenge in medicine. MDs are inundated with data, and LLMs could play a critical role in filtering and prioritizing this information. At the same time, medical students could benefit from these models as interactive learning tools, allowing them to ask questions and receive detailed answers about various medical conditions.
What is also interesting is the ability of LLMs to process natural language. Patients can simply describe their symptoms, and models like BERT could translate those verbal descriptions into precise medical terms. Going further, by combining LLMs with computer vision technologies, we could envision these models helping to interpret medical images.
However, it is essential to remember that although LLMs offer impressive benefits, they should not be used as the final word in diagnosis. They can serve as valuable assistants, but human clinical judgment remains essential. Additionally, these models would require extensive validation before widespread adoption to ensure their accuracy in a clinical setting.
Although they offer advantages, the use of LLMs in medical diagnosis also presents serious challenges. First, reliability is a major concern. Although an LLM like GPT-4 can provide insights based on a vast amount of data, it is not immune to errors. For example, a model could misinterpret symptoms described by a patient and suggest an incorrect diagnosis, with potentially serious consequences for the patient’s health. Additionally, LLMs’ ability to generalize from the data they were trained on can sometimes lead them to make mistakes in situations they did not explicitly “see” during their training.
Then, there is the challenge of over-reliance. Although LLMs can be useful as support tools, there is a risk that healthcare professionals will begin to rely too heavily on them to the detriment of their clinical judgment. For example, if an LLM suggests a particular diagnosis based on the information provided, an MD might be tempted to follow that suggestion without questioning it, even if their clinical expertise suggests otherwise.
The issue of data privacy is also crucial. LLMs require huge amounts of data to train. If these data come from actual medical records, this could pose privacy concerns, even if the data are anonymized. It is essential to ensure that sensitive patient information remains protected.
Another challenge is that of transparency and interpretability. Decisions made by LLMs can often appear to be a “black box”, meaning that doctors and patients may struggle to understand how a particular diagnosis was suggested. Without a clear explanation, it can be difficult to trust the model’s recommendations.
Finally, medical ethics are another area of concern. If an incorrect diagnosis is made based on an LLM’s suggestions, who is responsible? The doctor, the hospital, or the creator of the model? These questions require careful consideration before the widespread adoption of LLMs in medical diagnosis.
In sum, although LLMs have the potential to transform medical diagnosis, it is crucial to approach these challenges with caution and diligence to ensure patient safety and well-being.
8.1. Techniques for Mitigating Ethical Risks
The ethical analysis of LLMs in healthcare is crucial, given their potential impact on many aspects of care. Several techniques can be employed to mitigate the ethical risks associated with these models.
In this field,
differentially private learning is an important technique for protecting the privacy of the data used in LLM training [
285,
286]. It ensures that adding or deleting a single data point in the training dataset does not significantly affect the model’s output. This reduces the risk of the model revealing sensitive information about the patients from whom the data were collected. A key challenge of differentially private learning is finding a balance between privacy and model accuracy: as the level of privacy (i.e., the amount of noise added) increases, the accuracy of the model tends to decrease.
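The privacy–accuracy trade-off just described can be sketched with the Laplace mechanism, a standard differential-privacy primitive: noise with scale sensitivity/ε is added to a query answer, so a smaller ε (stronger privacy) means more noise. The parameter values below are illustrative.

```python
# Sketch of the Laplace mechanism: add noise scaled by sensitivity/epsilon
# to a query answer so that any single record's influence is masked.
import math
import random

def laplace_mechanism(true_value: float, sensitivity: float, epsilon: float,
                      rng: random.Random) -> float:
    scale = sensitivity / epsilon      # noise scale grows as epsilon shrinks
    u = rng.random() - 0.5             # uniform draw in [-0.5, 0.5)
    # Inverse-CDF sampling of Laplace noise.
    noise = -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))
    return true_value + noise

# Count query over patient records: true count 120, sensitivity 1
# (adding or removing one patient changes the count by at most 1).
noisy = laplace_mechanism(120.0, sensitivity=1.0, epsilon=0.5,
                          rng=random.Random(0))
```

Halving ε doubles the expected noise magnitude, which is exactly the accuracy cost the paragraph above refers to.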
Improving the
interpretability of LLMs in healthcare is fundamental to enabling professionals to understand the decision-making process [
287,
288]. In a sector where decisions have far-reaching consequences, such as healthcare, it is essential to understand the basis of a model’s recommendations or predictions. This contributes to the confidence and adoption of these technologies by healthcare professionals. Interpretability techniques aim to explain why and how a model has reached a specific conclusion [
289]. This can be done by identifying the characteristics or data that were decisive in the model’s decision. For example, in the case of a diagnosis, the model might indicate which symptoms or test results were most influential in its conclusion. Visual tools such as heat maps make explanations accessible. Improved interpretability contributes to the transparency and reliability of recommendations, encouraging their informed integration without substituting for human judgment [
290]. It complements expertise for enriched decision-making. Ultimately, this approach ensures that models not only perform well but are also understandable and trusted by professionals, enabling safer use of AI in healthcare.
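One simple attribution scheme behind such explanations can be sketched as leave-one-out scoring: each input token’s importance is the drop in the model’s score when that token is removed. The scoring function and its weights below are toy stand-ins for a real model’s output.

```python
# Sketch: leave-one-out attribution over a toy risk-scoring function.
# The term weights are invented for illustration only.

RISK_TERMS = {"chest": 0.4, "pain": 0.3, "smoker": 0.3}

def score(tokens: list[str]) -> float:
    return sum(RISK_TERMS.get(t, 0.0) for t in tokens)

def attributions(tokens: list[str]) -> dict[str, float]:
    """Importance of each token = score drop when that token is removed."""
    base = score(tokens)
    return {
        t: base - score(tokens[:i] + tokens[i + 1:])
        for i, t in enumerate(tokens)
    }

attr = attributions(["chest", "pain", "mild", "smoker"])
```

Rendering these per-token importances as colors over the input text is essentially what the heat-map visualizations mentioned above do.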
LLMs in the healthcare field can inadvertently learn and propagate biases present in training data. To mitigate this risk, it is crucial to conduct regular
audits to detect and rectify these biases [
291]. These audits should include performance analyses on a variety of datasets, and focus on identifying any results that may be unfair or discriminatory.
To ensure that these audits are exhaustive and unbiased, they should be carried out by teams whose members have diverse profiles and come from a variety of disciplines. These teams should include not only medical experts, who have an in-depth understanding of the healthcare field, but also researchers specializing in ethics, who can provide a critical perspective on the moral implications of model results. This diversity ensures that different points of view and areas of expertise are taken into account when evaluating models. Such an interdisciplinary approach is essential to ensure that LLMs in healthcare are not only technically competent but also fair and unbiased in their applications. This helps to build confidence in the use of AI in medicine and to ensure that the benefits of these technologies accrue to all individuals equitably.
Fair learning techniques play a vital role in ensuring that LLM models are fair, unbiased, and representative in healthcare. To achieve this objective, several key strategies can be put in place [
292]. First, it is important to perform a preliminary assessment of the data to detect potential biases and assess the representativeness of demographic groups, medical conditions, and socio-economic contexts. Next,
datasets should be diversified by integrating information from diverse sources and populations, ensuring that data are collected from underrepresented or marginalized groups to ensure balanced representation.
Data rebalancing techniques, such as oversampling minorities or undersampling majority groups, can also be used to balance datasets [
293,
294].
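Random oversampling, one of the rebalancing techniques just cited, can be sketched as follows: minority-group records are resampled until every group is equally represented. The group labels are illustrative.

```python
# Sketch: random oversampling so each group reaches the majority group's size.
import random

def oversample(records: list[dict], key: str, rng: random.Random) -> list[dict]:
    groups: dict[str, list[dict]] = {}
    for r in records:
        groups.setdefault(r[key], []).append(r)
    target = max(len(g) for g in groups.values())  # size of the largest group
    balanced = []
    for g in groups.values():
        balanced.extend(g)
        balanced.extend(rng.choices(g, k=target - len(g)))  # resample shortfall
    return balanced

data = [{"group": "A"}] * 8 + [{"group": "B"}] * 2
balanced = oversample(data, "group", random.Random(0))
```

Oversampling duplicates minority records rather than creating new information, so it mitigates imbalance but cannot substitute for collecting data from underrepresented populations.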
Regularization techniques can be integrated into the learning process to minimize bias, for example by penalizing predictions that reinforce stereotypes or inequalities [
295,
296].
Multi-task learning, which involves training models on multiple tasks simultaneously, can improve their ability to generalize and reduce single-task-specific biases [
297,
298,
299]. It is also essential to conduct rigorous tests on diverse samples in order to evaluate performance and detect biases in the model’s predictions. Finally, it is important to train users on the strengths and limitations of LLMs, emphasizing the importance of their clinical expertise in interpreting the results. Interdisciplinary collaboration, involving ethics experts, health professionals, lawyers, and representatives of diverse communities, can also contribute to the balanced design and evaluation of models in healthcare.