Article

Generative AI Models (2018–2024): Advancements and Applications in Kidney Care

by Fnu Neha 1,*, Deepshikha Bhati 1 and Deepak Kumar Shukla 2

1 Department of Computer Science, Kent State University, Kent, OH 44242, USA
2 Rutgers Business School, Rutgers University, Newark, NJ 07102, USA
* Author to whom correspondence should be addressed.
BioMedInformatics 2025, 5(2), 18; https://doi.org/10.3390/biomedinformatics5020018
Submission received: 19 February 2025 / Revised: 22 March 2025 / Accepted: 27 March 2025 / Published: 3 April 2025

Abstract

Kidney disease poses a significant global health challenge, affecting millions and straining healthcare systems due to limited nephrology resources. This paper examines the transformative potential of Generative AI (GenAI), Large Language Models (LLMs), and Large Vision Models (LVMs) in addressing critical challenges in kidney care. GenAI supports research and early interventions through the generation of synthetic medical data. LLMs enhance clinical decision-making by analyzing medical texts and electronic health records, while LVMs improve diagnostic accuracy through advanced medical image analysis. Together, these technologies show promise for advancing patient education, risk stratification, disease diagnosis, and personalized treatment strategies. This paper highlights key advancements in GenAI, LLMs, and LVMs from 2018 to 2024, focusing on their applications in kidney care and presenting common use cases. It also discusses their limitations, including knowledge cutoffs, hallucinations, contextual understanding challenges, data representation biases, computational demands, and ethical concerns. By providing a comprehensive analysis, this paper outlines a roadmap for integrating these AI advancements into nephrology, emphasizing the need for further research and real-world validation to fully realize their transformative potential.

1. Introduction

Kidneys are vital, bean-shaped organs responsible for filtering waste as urine, maintaining electrolyte balance, regulating blood pressure, and producing hormones that control red blood cell production and calcium metabolism [1,2]. Disruption in kidney function can lead to severe health complications, including chronic kidney disease (CKD) or end-stage renal disease (ESRD).
CKD affects over 800 million people worldwide, approximately 10% of the global population [3]. Its incidence continues to rise, driven by factors such as diabetes, hypertension, and population aging. CKD is the 12th leading cause of death worldwide, with CKD and ESRD together accounting for nearly 2.5 million deaths annually [4].
As kidney disease cases rise globally, the demand for nephrologists and timely treatment increases. A major challenge in managing kidney disease is the shortage of nephrologists, particularly in developing countries [5]. The 2021 Global Kidney Health Atlas reported that 80% of countries fail to meet the World Health Assembly’s goal for adequate nephrology care, leading to delays in diagnosis and treatment [6]. This highlights the need for emerging technologies, such as Generative AI (GenAI), Large Language Models (LLMs), and Large Vision Models (LVMs), which offer promising solutions to address these challenges.
GenAI can create synthetic medical data for research and drug discovery. LLMs excel in processing medical texts and analyzing patient records, supporting clinical decision-making and diagnosis. LVMs enhance the analysis of medical images, improving diagnostic accuracy, especially in kidney disease detection. These technologies can address diagnostic challenges, enable earlier detection, and support personalized treatment. By analyzing medical literature and electronic health records (EHRs), LLMs offer insights and treatment suggestions, while GenAI simulates treatment responses and creates predictive models for early intervention.
The objective of this paper is to explore the current state of GenAI, LLMs, and LVMs and their applications in kidney care from 2018 to 2024. The year 2018 marks a pivotal point in AI development, with the introduction of OpenAI’s Generative Pre-trained Transformer (GPT) models, which significantly advanced natural language processing capabilities. While several papers have discussed the individual applications of GenAI, LLMs, and LVMs in healthcare, to the best of our knowledge, no study has comprehensively examined all three technologies together. This paper aims to fill that gap by analyzing the combined potential of these AI technologies in addressing the shortage of nephrologists and the growing global demand for kidney care. By presenting a unified exploration of these three technologies, this article offers a broader perspective on their collective impact on kidney care practices. The main research objectives of this paper are:
  • Understand Key Technologies: Provide an overview of GenAI, LLMs, and LVMs as of 2024. This includes explaining the core principles of each technology, their advancements over recent years, and their current capabilities.
  • Examine Applications in Kidney Care: Review current applications of these technologies for improving diagnostics, treatment, and patient management in kidney care.
  • Present Common Use Cases: Highlight use cases where these technologies can be effectively implemented.
  • Address Limitations and Future Directions: Analyze current limitations of these technologies and propose future research areas for advancing their applications in the field.
The paper is structured as follows: Section 2 outlines the research methodology. Section 3 provides background on GenAI, LLMs, and LVMs along with their advancements. Section 4 explores recent applications of GenAI, LLMs, and LVMs in kidney care. Section 5 focuses on technology-centered discussions. Section 6 illustrates related use cases. Section 7 discusses the limitations of these technologies, and Section 8 concludes the paper with future work.

2. Research Methodology

This survey explores the applications of GenAI, LLMs, and LVMs in kidney care. Relevant publications were sourced from databases such as PubMed, IEEE Xplore, Scopus, Google Scholar, ACM Digital Library, and ScienceDirect using targeted keywords, including Generative AI in kidney care, LLMs for kidney diagnostics, AI in nephrology, Kidney transplant prediction using ChatGPT, LLM in chronic kidney disease management, NLP in nephrology, LVM for patient education, GenAI in kidney imaging analysis, LLMs for kidney function prediction, Generative models for medical data, and AI-assisted kidney diagnosis using LLM and LVM. These keywords facilitated a comprehensive review of the literature to assess the role of AI in advancing kidney disease diagnosis, treatment, and management.
Note on Terminology: The keyword renal was excluded from the search strategy to align with current nomenclature recommendations favoring the term kidney [7]. This shift, supported by initiatives such as those from Kidney Disease: Improving Global Outcomes (KDIGO), aims to enhance clarity and patient understanding by using more widely recognized terminology [8].
Data Retrieval and Screening: As of December 2024, a total of 50 publications were retrieved through keyword searches. The initial screening involved removing duplicate records across databases using metadata matching based on titles, authors, DOI, and publication year, which resulted in 25 unique articles. These were then assessed for relevance based on predefined inclusion and exclusion criteria. Following a full-text review, 17 articles were selected for inclusion in the final synthesis.
  • Inclusion Criteria:
    Articles in English, published between January 2020 and December 2024, discussing the application of GenAI, LLMs, and LVMs in kidney care, including peer-reviewed journal articles, conference papers, and reputable preprints.
  • Exclusion Criteria:
    Non-English articles.
    Publications without full-text access.
    Studies not specifically focused on the application of GenAI, LLMs, and LVMs in kidney care.
    Opinion pieces, editorials, and commentaries lacking empirical data.
Synthesis of Findings: The findings were synthesized to provide an overview of the current applications of GenAI, LLMs, and LVMs in kidney care, highlighting key use cases, limitations, and potential directions for future research.

3. Background

GenAI, LLMs, and LVMs have evolved significantly over the past few years, with key advancements in technology, algorithms, and applications. This section presents a detailed history of their development up to the present. The inclusion of solutions, techniques, and models in the background is guided by a technological progression framework. Developments from 2018 to 2024 were selected based on their significant contributions to the evolution of Generative AI (GenAI), Large Language Models (LLMs), and Large Vision Models (LVMs), particularly those that marked turning points in capability, architecture, or adoption. Priority was given to foundational models, landmark architectures, and innovations. This chronological approach enables a coherent narrative of advancement, contextualizing the current state of the field and establishing the basis for their applicability in nephrology-related tasks.

3.1. The Era of Pretrained Language Models (2018–2020)

From 2018 to 2020, pretrained language models, particularly Generative Pretrained Transformer (GPT) and Bidirectional Encoder Representations from Transformers (BERT), transformed generative AI. These models, based on transformer architectures, enabled significant improvements in natural language processing (NLP) tasks such as text generation, translation, and summarization, setting the stage for future advancements [9].

3.1.1. Generative Pretrained Transformer (GPT)

Developed by OpenAI in 2018, GPT utilizes unidirectional pretraining (left-to-right) [10]. It is trained through autoregressive language modeling, where each word is predicted based on the preceding ones:
$$\mathcal{L}_{\text{GPT}} = \sum_{i=1}^{n} \log P(w_i \mid w_1, \ldots, w_{i-1})$$
where $w_i$ denotes the token predicted at position $i$ given all preceding tokens. This design enables strong performance in text generation and continuation tasks.
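To make the autoregressive objective concrete, the following minimal sketch computes the summed log-likelihood from per-position next-token distributions. It is an illustration only: the toy vocabulary and probability values are invented, and real models compute these distributions with a transformer.

```python
import numpy as np

def autoregressive_log_likelihood(token_ids, next_token_probs):
    """Sum of log P(w_i | w_1, ..., w_{i-1}) over a sequence.

    next_token_probs[i] holds the model's vocabulary distribution at
    position i, conditioned on all preceding tokens.
    """
    return sum(np.log(next_token_probs[i][token_ids[i]])
               for i in range(len(token_ids)))

# Toy example: a 3-token sequence over a 4-word vocabulary.
token_ids = [2, 0, 3]
next_token_probs = np.array([
    [0.1, 0.2, 0.6, 0.1],   # P(w_1)
    [0.7, 0.1, 0.1, 0.1],   # P(w_2 | w_1)
    [0.1, 0.1, 0.1, 0.7],   # P(w_3 | w_1, w_2)
])
print(autoregressive_log_likelihood(token_ids, next_token_probs))
```

Training maximizes this quantity (equivalently, minimizes the negative log-likelihood) over large text corpora.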

3.1.2. Bidirectional Encoder Representations from Transformers (BERT)

Introduced by Google in 2018, BERT utilizes bidirectional pretraining, considering both left and right context for a deeper understanding of language [11]. BERT’s pretraining includes:
  • Masked Language Modeling (MLM): Predicts randomly masked words from their bidirectional context (see the sketch after this list):
    $$\mathcal{L}_{\text{MLM}} = \sum_{i=1}^{n} \log P(w_i \mid x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n)$$
  • Next Sentence Prediction (NSP): Determines whether the second sentence logically follows the first:
    $$\mathcal{L}_{\text{NSP}} = \log P(\text{IsNext})$$
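As a concrete illustration of the MLM term, the sketch below scores only the masked positions, which is how BERT computes this loss in practice; the masked position and probability values are invented for the example.

```python
import numpy as np

def mlm_loss(token_ids, masked_positions, predicted_probs):
    """Negative log-likelihood of the original tokens at masked positions.

    predicted_probs[i] is the model's vocabulary distribution at masked
    position i, conditioned on the unmasked context in both directions.
    """
    return -sum(np.log(predicted_probs[i][token_ids[i]])
                for i in masked_positions)

# Toy example: position 1 of a 3-token sequence is masked (vocab size 4).
token_ids = [2, 1, 3]
predicted_probs = {1: np.array([0.05, 0.85, 0.05, 0.05])}
print(mlm_loss(token_ids, masked_positions=[1], predicted_probs=predicted_probs))
```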

3.1.3. XLNet

Introduced in 2019 by Google and Carnegie Mellon University, XLNet combines the strengths of BERT’s bidirectional pretraining and autoregressive modeling (like GPT) [12]. It addresses the limitations of BERT’s masked language modeling by using a permutation-based training method, which allows the model to consider all possible token orders during training. This approach captures long-range dependencies more effectively.
XLNet’s pretraining task maximizes the likelihood of the correct token order across all permutations of the input sequence:
$$\mathcal{L}_{\text{XLNet}} = \sum_{i=1}^{n} \log P(w_i \mid \text{permutation of previous tokens})$$

3.1.4. Text-to-Text Transfer Transformer (T5)

Introduced by Google in 2019, T5 reformulated all NLP tasks into a text-to-text format, where every task, from translation to summarization to question answering, is treated as a text generation problem [13]. This unified architecture allows T5 to handle a wide range of tasks with a single model.
T5 is pretrained using a span-corruption denoising objective, where some part of the input text is randomly corrupted by replacing spans of text with a special token, and the model is tasked with predicting the missing spans. The loss function for this task is:
$$\mathcal{L}_{\text{T5}} = \sum_{i=1}^{n} \log P(w_i \mid \text{corrupted input})$$
where $w_i$ represents the predicted words or spans, and the model is trained to recover the original input from the corrupted version.
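The sketch below illustrates span corruption under simplifying assumptions (a single corrupted span and sentinel token, whereas T5 corrupts multiple spans with distinct sentinels):

```python
def span_corrupt(tokens, span_start, span_len, sentinel="<extra_id_0>"):
    """Replace one contiguous span with a sentinel token (T5-style).

    Returns the corrupted input and the target sequence the model
    must generate to restore the original text.
    """
    corrupted = tokens[:span_start] + [sentinel] + tokens[span_start + span_len:]
    target = [sentinel] + tokens[span_start:span_start + span_len]
    return corrupted, target

tokens = "the kidneys filter waste from the blood".split()
corrupted, target = span_corrupt(tokens, span_start=2, span_len=2)
print(corrupted)  # ['the', 'kidneys', '<extra_id_0>', 'from', 'the', 'blood']
print(target)     # ['<extra_id_0>', 'filter', 'waste']
```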

3.1.5. GPT-3

Building on GPT-2’s success with 1.5 billion parameters [14], GPT-3, released in 2020 with 175 billion parameters, further scaled the model, enabling it to perform few-shot learning [15]. This allows GPT-3 to generate coherent text with minimal input, performing a wide variety of tasks without extensive retraining. It also uses the same autoregressive language modeling approach:
$$\mathcal{L}_{\text{GPT-3}} = \sum_{i=1}^{n} \log P(w_i \mid w_1, \ldots, w_{i-1})$$
Table 1 provides a comparison of text-based models—GPT, BERT, XLNet, T5, and GPT-3, highlighting their architectural differences, strengths, best use cases, and limitations.

3.2. Multimodal Models: Text, Image, and Video (2021–2023)

Between 2021 and 2023, multimodal models like DALL·E and CLIP expanded generative AI to process text, images, and video, enabling applications such as image captioning, text-to-image generation, and video analysis.

3.2.1. DALL·E

Released in 2021, DALL·E extends GPT-3’s transformer-based architecture to generate images from text descriptions [16]. It was trained on large text-image pairs, enabling it to understand the relationship between textual prompts and visual features. The model employs a VQ-VAE-2 (Vector Quantized Variational Autoencoder) architecture, which encodes the input text into a latent space and decodes it into high-quality images. Specifically, a text prompt T is passed through the transformer model, which produces a latent representation. This representation is then decoded into the corresponding image I:
$$I = \text{Decoder}(\text{Transformer}(T))$$
where T represents the input text and the decoder generates the final image I. This unified approach allows DALL·E to generate coherent and imaginative images.

3.2.2. Contrastive Language-Image Pretraining (CLIP)

Released in 2021, CLIP by OpenAI connects text and images, trained on a large dataset of text-image pairs [17]. Unlike DALL·E, which generates images from text, CLIP matches images to textual descriptions. It understands the relationship between text and corresponding visual representations, making it highly effective for multimodal tasks. CLIP employs contrastive learning to associate images with their corresponding text by minimizing the difference between their embeddings in a shared latent space. The contrastive loss function is:
$$\mathcal{L}_{\text{contrastive}} = -\sum_{i=1}^{N} \log \frac{\exp\big(\text{sim}(f_{\text{image}}(I_i), f_{\text{text}}(T_i))/\tau\big)}{\sum_{j=1}^{N} \exp\big(\text{sim}(f_{\text{image}}(I_i), f_{\text{text}}(T_j))/\tau\big)}$$
where $f_{\text{image}}(I_i)$ and $f_{\text{text}}(T_i)$ are the embeddings of image $I_i$ and text $T_i$, $\text{sim}(\cdot)$ is cosine similarity, $\tau$ is a temperature parameter, and $N$ is the number of pairs. This loss encourages the model to maximize similarity for matching image-text pairs and minimize it for non-matching pairs.
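The sketch below implements the image-to-text direction of this loss in NumPy (CLIP additionally averages in the symmetric text-to-image term); random vectors stand in for encoder outputs.

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, tau=0.07):
    """Image-to-text InfoNCE loss over a batch of N matching pairs."""
    # L2-normalize so dot products equal cosine similarities.
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = img @ txt.T / tau  # (N, N); entry (i, j) scores pair (I_i, T_j)
    # Row-wise log-softmax; the matching text for image i is column i.
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_softmax))

rng = np.random.default_rng(0)
loss = clip_contrastive_loss(rng.normal(size=(8, 32)), rng.normal(size=(8, 32)))
print(loss)
```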
The architecture uses a vision transformer model for images and a transformer model for text [9]. Both models project their respective data into a shared feature space for direct comparison. Additionally, CLIP’s zero-shot learning capability allows it to perform tasks without task-specific fine-tuning.

3.2.3. Neural Video and Image Animation (NUWA)

Released in 2021, NUWA is a multimodal, pre-trained, transformer-based model designed for tasks such as text-to-image, text-to-video, and image animation [18]. It utilizes 3D transformers to capture spatial and temporal patterns in the data. A key feature of NUWA is the use of a 3D Nearby Attention (3DNA) mechanism, which improves the model’s ability to handle spatial and temporal relationships in visual data. This attention mechanism enhances computational efficiency while preserving high-quality visual outputs, allowing the model to scale effectively for complex tasks. Given an input T, such as text or an image, the model generates a latent representation through a 3D transformer, and the decoder reconstructs the output O, which can be an image or video:
$$O = \text{Decoder}(\text{3D-Transformer}(T))$$
This architecture enables NUWA to efficiently generate high-quality visual outputs.

3.2.4. CogView2

Released in 2022, CogView2 is an advanced version of the original CogView model (2021), designed for text-to-image generation [19,20]. It employs a transformer-based architecture that integrates the Cross-Modal Language Model (CogLM). Using a masked autoregressive approach, CogView2 generates high-quality images from textual descriptions. Pretrained on large text-image pairs, it is capable of generating detailed and diverse visual content, with improved efficiency and scalability thanks to an advanced masking strategy.
Given an input text prompt T, the model generates a latent representation, which is then decoded into the corresponding image I:
$$I = \text{Decoder}(\text{Transformer}(T))$$
CogView2’s hierarchical design first generates low-resolution images, which are progressively refined through an iterative super-resolution module using local parallel autoregressive generation. This approach enables faster image generation, with CogView2 being up to 10 times faster than CogView at producing images of similar resolution, while offering improved quality. Additionally, CogView2 supports interactive, text-guided editing, allowing users to modify images based on new textual input.

3.2.5. Imagen

Released in 2022, Imagen is a text-to-image model developed by Google that integrates a two-stage diffusion process combined with a pretrained T5 transformer for encoding text [21]. The model first converts a text prompt T into a latent representation using the T5 encoder, then refines the image through a diffusion process, iteratively improving the image’s quality. Imagen focuses on generating high-resolution, photorealistic images and outperforms previous models like CogView in terms of image fidelity and realism. This architecture makes Imagen highly efficient and scalable for producing detailed, high-quality visual content from natural language descriptions.

3.2.6. DALL·E 2

Released in 2022, DALL·E 2 by OpenAI improves upon the original DALL·E by generating higher-quality images and introducing inpainting capabilities [22]. Unlike CLIP, which matches images to text, DALL·E 2 generates images directly from textual descriptions and allows for editing existing images by modifying specific areas with new text prompts.
DALL·E 2 uses a two-part model for image generation. The first part, based on CLIP, encodes the input text into a latent space representation that captures the relationship between the text and image. The second part employs a diffusion model to refine this representation into high-resolution images. The diffusion process generates more photorealistic images compared to the original DALL·E.

3.2.7. Stable Diffusion

Stable Diffusion (2022) by Stability AI is a text-to-image model that generates detailed images from textual descriptions [23]. It uses a latent diffusion model operating in a lower-dimensional latent space for efficiency and scalability. The model progressively refines images from noise using a denoising score-matching approach. The diffusion process is as follows:
Given a noisy image $x_t$ at time step $t$, the model learns to recover the clean image $x_0$ by minimizing the score-matching loss:
$$\mathcal{L}_{\text{diffusion}} = \mathbb{E}_{q(x_0, t)} \left[ \left\| \nabla_{x_t} \log p(x_t \mid t) - \nabla_{x_t} \log q(x_t \mid x_0, t) \right\|^2 \right]$$
where $p(x_t \mid t)$ is the model’s predicted distribution and $q(x_t \mid x_0, t)$ is the forward transition probability.
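In practice, such models are usually trained with the equivalent noise-prediction parameterization. The sketch below shows one training step in that simplified form; the noise-schedule value and the stand-in model are placeholders, not Stability AI’s implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def diffusion_training_step(x0, model, alpha_bar_t):
    """One DDPM-style step in the noise-prediction parameterization.

    x0: clean (latent) image; alpha_bar_t: cumulative noise-schedule
    value at the sampled timestep; model predicts the added noise.
    """
    eps = rng.normal(size=x0.shape)                    # noise to inject
    x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1 - alpha_bar_t) * eps
    eps_pred = model(x_t)                              # model's noise estimate
    return np.mean((eps - eps_pred) ** 2)              # MSE training loss

# Stand-in "model" that always predicts zero noise.
x0 = rng.normal(size=(4, 4))
print(diffusion_training_step(x0, model=lambda x: np.zeros_like(x), alpha_bar_t=0.5))
```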

3.2.8. Make-A-Video

Make-A-Video (2022) is a text-to-video model developed by Meta, capable of generating short video clips from textual descriptions [24]. It uses a transformer-based architecture for text encoding and a spatiotemporal generative model to capture video dynamics, building on a text-to-image (T2I) model fine-tuned with temporal components for video generation.
The model generates temporally consistent video frames by progressively refining video output. The architecture employs pseudo-3D convolutional layers and temporal attention mechanisms to learn how motion occurs across time, ensuring smooth transitions between frames. The loss function involves temporal coherence, represented as:
$$\mathcal{L}_{\text{video}} = \mathbb{E}_{q(x_0, t)} \left[ \left\| \nabla_{x_t} \log p(x_t \mid t) - \nabla_{x_t} \log q(x_t \mid x_0, t) \right\|^2 \right] + \lambda \left\| \Delta(x_{t-1}, x_t) \right\|^2$$
where $p(x_t \mid t)$ is the model’s predicted distribution for frame $t$, $q(x_t \mid x_0, t)$ is the transition probability conditioned on the previous frame, and the second term, $\Delta(x_{t-1}, x_t)$, penalizes temporal discontinuities between consecutive frames.
This architecture allows Make-A-Video to generate high-quality videos from text, with an added capability of frame interpolation for smooth video generation at higher frame rates.

3.2.9. MidJourney

MidJourney (2022) is a text-to-image model focused on creating artistic and stylized images from text prompts [25]. It uses a transformer-based architecture with latent diffusion for efficient generation of high-quality, surreal, and creative images.
Primarily accessed through Discord, MidJourney allows users to generate images by submitting text prompts, with control over aspects like aspect ratios and artistic styles. It employs a latent diffusion model and integrates attention mechanisms for generating coherent and detailed artistic visuals.

3.2.10. DreamFusion

DreamFusion (2022) is a text-to-3D model developed by Google that generates 3D models from textual descriptions [26]. It utilizes a Neural Radiance Field (NeRF) model and a pretrained 2D diffusion model (such as Imagen) to guide the creation of 3D objects.
The model employs Score Distillation Sampling (SDS) to optimize the 3D model. Starting with a random initialization, the model is progressively refined using 2D renderings of the model at different viewpoints. These renderings are generated from a NeRF-based representation of the scene. The text prompt is encoded by the 2D diffusion model (Imagen), which guides the model’s generation by conditioning the 3D structure on the textual description.
DreamFusion ensures 3D consistency by using multi-view consistency, which aligns the rendered images from different viewpoints to the same object, guaranteeing that the generated 3D model is visually coherent from all angles. This process is iterated multiple times, adjusting the NeRF parameters using gradient descent to minimize the difference between the rendered 2D views and the outputs conditioned on the text prompt.
By combining 2D diffusion models with NeRF optimization, DreamFusion is capable of generating photorealistic 3D models directly from text, without requiring a 3D training dataset.

3.2.11. Make-A-Scene

Make-A-Scene (2022) is an interactive text-to-image model developed by Meta, designed to allow users to specify the layout and placement of elements in the generated image [27]. It leverages a transformer-based architecture to integrate both textual descriptions and layout information for creative image generation.
The model utilizes a shared embedding space where both text inputs and spatial layout information are represented. Text is encoded using a text encoder (likely based on transformers), while the layout is encoded through a spatial encoder that processes the positioning and arrangement of elements in the scene. These encoded representations are then fused into a unified vector, which is decoded to generate the final image.
Table 2 provides a comparison between DALL·E, CLIP, NUWA, CogView2, Imagen, DALL·E 2, Stable Diffusion, Make-A-Video, MidJourney, DreamFusion, and Make-A-Scene, highlighting their architectural differences, strengths, best use cases, and limitations.

3.2.12. Large Language Model Meta AI (Llama)

Llama (2023) is a transformer-based LLM developed by Meta, designed for high efficiency in language modeling tasks [28]. The model utilizes a standard transformer architecture with multi-head self-attention and feed-forward layers, trained on a large, diverse dataset to capture a wide range of linguistic patterns. Llama models are scaled to hundreds of billions of parameters, improving their capacity for text generation and understanding. The model is pretrained using a causal language modeling task, where it predicts the next token in a sequence. Optimized with techniques like mixed precision training and the Adam optimizer, Llama offers high performance while maintaining efficiency. Llama 3.1 (2024) further enhances the architecture, improving model scaling and fine-tuning strategies, making it a competitive open-source alternative to proprietary LLMs.

3.2.13. Bard (Now Gemini)

Bard (2023), now Gemini, is a text-to-text language model developed by Google [29,30], powered by LaMDA (Language Model for Dialogue Applications) [31] and PaLM 2 (Pathways Language Model 2) [32]. It generates human-like responses from text using deep learning techniques, particularly neural networks. Bard employs a transformer-based architecture with attention mechanisms to optimize text generation. The model integrates Gemini, a multimodal system for handling text, images, audio, and video. LaMDA ensures natural dialogue, while PaLM 2 enhances language understanding. Bard uses gradient descent for optimization, minimizing the loss function during training.

3.2.14. Phenaki

Phenaki (2023) is a text-to-video model developed by Google to generate long, coherent videos from textual descriptions [33]. It uses a Causal Vision Transformer (C-ViViT) architecture for spatiotemporal processing and is trained on a large dataset of text-video pairs.
The model employs a masked bidirectional transformer approach, learning to predict missing video tokens by conditioning on the surrounding context. Text is encoded using the T5X model to guide video generation. Phenaki ensures temporal coherence by generating frames progressively, maintaining smooth transitions over time.

3.2.15. GPT-4

GPT-4 (2023) is a multimodal model developed by OpenAI capable of processing both text and images [34]. It utilizes a transformer-based architecture to integrate and generate outputs from these mixed inputs. GPT-4 is trained on large datasets that consist of both text-image pairs.
The model employs a shared embedding space where both text and image inputs are represented. Text is encoded using a text encoder (likely a transformer-based model), while images are processed through a vision model (e.g., a CNN or vision transformer). The resulting encoded representations are fused into a unified vector, which is then decoded to generate text-based outputs.
GPT-4 ensures effective multimodal understanding by learning the relationships between the textual and visual data during training, enabling it to generate contextually relevant and coherent text responses based on both modalities.

3.2.16. Mistral

Mistral (2023) is a family of open-weight, sparse and dense transformer-based models designed to optimize large language model efficiency [35]. Its sparse variant employs a Mixture of Experts (MoE) architecture, in which only a subset of expert models is activated during inference, enabling significant computational savings while maintaining performance. Mistral models are designed for dense and sparse retrieval, improving generative tasks by efficiently scaling to large datasets. With roughly 12.9 billion active parameters per token in its sparse MoE variant, Mistral improves computational efficiency, reduces latency, and provides high-quality text generation while leveraging minimal resources.
Table 3 provides a comparison between Llama, Bard/Gemini, Phenaki, GPT-4, and Mistral, highlighting their architectural differences, strengths, and limitations.

3.3. 2024—Advanced Video and Multimodal Systems

In 2024, advances in video and multimodal systems enhanced GenAI’s ability to process and generate content across text, image, and video, enabling new applications in healthcare systems.

3.3.1. SORA

SORA (2024) is a text-to-video generation model developed by OpenAI, which uses a transformer-based architecture combined with a pre-trained diffusion model to generate videos from textual descriptions [36]. The model encodes text $T$ using a transformer-based text encoder and processes images $I$ with a vision model, such as a CNN or vision transformer. These representations are fused into a unified latent vector $z_{\text{latent}} = f(z_T, z_I)$, which is then decoded to produce video frames. SORA employs a spacetime latent patch representation, where both spatial (appearance) and temporal (motion) information are encoded within patches that span space and time, represented as $z_{\text{spacetime}}(x, y, t)$, with $x$, $y$, and $t$ being the spatial and temporal dimensions. The latent vector is refined through a denoising diffusion process, iteratively reducing noise $\epsilon$ to generate clearer frames $V(t)$. Temporal consistency is maintained through motion modeling and frame-to-frame coherence, ensuring smooth transitions between video frames, expressed as:
$$V(t+1) = \text{Consistency}(V(t), V(t-1))$$
where $V(t)$ is the frame at time $t$ and $V(t+1)$ is the next frame in the sequence.

3.3.2. Gemini 1.5

Gemini 1.5, released in 2024, extends the capabilities of Gemini, which was introduced in 2023 [37]. Gemini is a multimodal AI model built on the PaLM architecture and utilizes the Pathways framework to efficiently scale across various tasks. Gemini employs transformer-based models with attention mechanisms to integrate text, image, and video generation. The attention mechanism is defined as:
$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$
where $Q$, $K$, and $V$ are the query, key, and value matrices, respectively, and $d_k$ is the dimensionality of the key vectors.
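A minimal NumPy sketch of this scaled dot-product attention (single head, no masking; random matrices stand in for learned projections):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (n_q, n_k) similarities
    scores -= scores.max(axis=-1, keepdims=True)  # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                            # weighted sum of values

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 8)) for _ in range(3))
print(attention(Q, K, V).shape)  # (3, 8)
```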
Gemini 1.5 improves on Gemini with stronger temporal coherence in video generation and introduces a Mixture-of-Experts (MoE) architecture. MoE activates different subsets of the model’s parameters depending on the task, thereby improving efficiency. The MoE output is expressed as:
$$y = \sum_{i=1}^{N} w_i \cdot f_i(x)$$
where $w_i$ are the gating weights for the expert functions $f_i(x)$, and $N$ is the number of experts. Additionally, Gemini 1.5 enhances multimodal integration, extends context windows, and supports processing of up to 1 million tokens.
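The sketch below illustrates sparse MoE routing under simplifying assumptions: the gating network is replaced by fixed scores, and each “expert” is a toy function rather than a transformer sub-network.

```python
import numpy as np

def moe_forward(x, experts, gate_scores, top_k=2):
    """Sparse Mixture-of-Experts: evaluate only the top-k experts.

    experts: list of callables f_i; gate_scores: one unnormalized
    routing score per expert (fixed here, normally computed from x).
    """
    top = np.argsort(gate_scores)[-top_k:]   # indices of the top-k gates
    w = np.exp(gate_scores[top])
    w /= w.sum()                             # renormalize over selected experts
    return sum(w_i * experts[i](x) for w_i, i in zip(w, top))

experts = [lambda x, s=s: s * x for s in (0.5, 1.0, 2.0, 3.0)]  # toy experts
x = np.array([1.0, 2.0])
print(moe_forward(x, experts, gate_scores=np.array([0.1, 2.0, 0.3, 1.5])))
```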

3.3.3. Big Sleep

Big Sleep (2024) is an AI framework developed by Google Project Zero and DeepMind for vulnerability research [38]. It integrates large language models (LLMs) like Gemini 1.5 Pro with tools such as a code browser, Python 3.8 fuzz testing tool, debugger, and reporter to automate security analysis. The system uses a multimodal architecture to analyze codebases, detect vulnerabilities, and perform variant analysis by comparing code changes to known vulnerabilities. Big Sleep’s key features include variant and root-cause analysis, proactive defense by identifying vulnerabilities before exploitation, and automated reporting. The system processes input code through advanced machine learning algorithms, analyzing patterns with its neural network to detect security issues.
At a high level, the analysis performed by Big Sleep can be represented as:
$$y = f(x; \theta)$$
where $x$ is the input code, $\theta$ represents the model parameters, and $y$ is the output identifying potential vulnerabilities. Big Sleep demonstrated its effectiveness by identifying a stack buffer underflow in SQLite, a significant milestone for AI-driven vulnerability discovery in real-world software.

3.3.4. ChatGPT 4o and Variants (2024)

In 2024, OpenAI released GPT-4o, GPT-4o Mini, and o1, each with advancements in multimodal capabilities and problem-solving.
GPT-4o is a multimodal model built on a transformer-based architecture with MoE activation and dense attention layers [39]. It processes text, images, audio, and video, using unsupervised learning and reinforcement learning with human feedback (RLHF) for improved outputs.
GPT-4o Mini is a smaller, resource-efficient version, maintaining core capabilities while optimizing for lower computational costs [40].
OpenAI o1 focuses on complex problem-solving, utilizing iterative reasoning and exploration techniques for tasks like competitive programming and mathematics [41].
These models are based on large-scale datasets, multimodal data integration, and sophisticated training methods to improve accuracy and versatility.
Table 4 provides a comparison between SORA, Gemini 1.5, Big Sleep, and ChatGPT, highlighting their architectural differences, strengths, and limitations.

3.4. Improvements in GenAI, LLMs, and LVMs

Between 2018 and 2024, numerous advancements have been made in GenAI, LLMs, and LVMs. These improvements have significantly enhanced the models’ ability to generate more relevant, accurate, and knowledge-rich responses. They address key limitations of earlier models, particularly in terms of factual accuracy and domain-specific knowledge. Below is an overview of the key changes introduced during this period.

3.4.1. Prompt Engineering

Prompt Engineering is a technique for optimizing AI model outputs by refining inputs [15]. It uses strategies like few-shot learning (where models learn tasks with minimal training examples), contextual embedding, and task-specific fine-tuning to guide models. Typically relying on transformer-based models like GPT, it processes prompt tokens to adjust the output based on specific instructions. Prompt engineering helps models understand context, intent, and constraints, improving performance in tasks like text generation, code synthesis, and complex queries. Techniques such as zero-shot and few-shot prompting enhance efficiency without extensive retraining.
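As an illustration, the template below assembles a few-shot prompt for a kidney-related question; the examples, wording, and any downstream model call are hypothetical, not drawn from a deployed clinical system.

```python
# Hypothetical few-shot prompt template for a nephrology Q&A assistant.
FEW_SHOT_PROMPT = """You are a nephrology patient-education assistant.
Answer in plain language, two sentences maximum.

Q: What does creatinine measure?
A: Creatinine is a waste product filtered by the kidneys; its blood level
helps estimate how well the kidneys are working.

Q: What is eGFR?
A: eGFR estimates how much blood the kidneys filter per minute; lower
values indicate reduced kidney function.

Q: {question}
A:"""

def build_prompt(question: str) -> str:
    """Insert the user's question after the in-context examples."""
    return FEW_SHOT_PROMPT.format(question=question)

print(build_prompt("Why do some CKD patients need to limit potassium?"))
```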

3.4.2. Retrieval-Augmented Generation (RAG)

RAG is a hybrid approach that combines retrieval-based methods with generative models to improve the factual accuracy and contextual relevance of responses [42]. The architecture integrates a retrieval component, often using dense vector search or BM25 (Best Matching 25), a ranking function that retrieves documents from an external knowledge base according to their relevance to a given query. The retrieved documents are then used to condition a generative model, producing more informed and contextually accurate outputs. RAG enables models to access a broader knowledge pool, improving performance on tasks that require up-to-date or domain-specific information and mitigating issues like hallucinations.
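The sketch below shows the retrieve-then-generate flow schematically; `retrieve` is a toy term-overlap ranker and `generate` is a placeholder for an LLM call, both invented for illustration.

```python
def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Toy lexical retriever: rank documents by query-term overlap."""
    q_terms = set(query.lower().split())
    ranked = sorted(corpus,
                    key=lambda d: len(q_terms & set(d.lower().split())),
                    reverse=True)
    return ranked[:k]

def generate(prompt: str) -> str:
    """Placeholder for a generative model call (e.g., an LLM API)."""
    return f"[LLM response conditioned on a {len(prompt)}-character prompt]"

def rag_answer(query: str, corpus: list[str]) -> str:
    # Condition the generator on the retrieved evidence, not just the query.
    context = "\n".join(retrieve(query, corpus))
    return generate(f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")

corpus = ["CKD staging uses eGFR and albuminuria categories.",
          "Dialysis options include hemodialysis and peritoneal dialysis."]
print(rag_answer("How is CKD staged?", corpus))
```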

3.4.3. Dense Retrieval and Fusion-in-Decoder (FiD)

Dense Retrieval and Fusion-in-Decoder (FiD) improve RAG by using dense vector search for efficient document retrieval [43,44]. Documents are represented as vectors, and the most relevant ones are retrieved based on semantic similarity to the query. FiD concatenates the retrieved documents with the query in the decoder of transformer models, allowing the model to attend to both the query and retrieved documents. This architecture improves contextual coherence and factuality in tasks like question answering and document generation.

3.4.4. Sparse Retrieval Models

Sparse Retrieval Models use keyword-based methods like BM25 to retrieve documents by matching terms in the query with those in the documents [45]. They rank documents based on term frequency (TF) and inverse document frequency (IDF) without generating dense embeddings. FAISS (Facebook AI Similarity Search) is often used in conjunction with dense retrieval models, but in sparse retrieval, it can also be used for indexing large collections of documents for faster search by storing precomputed term-based vectors. These models are simpler, faster, and more interpretable but less effective at capturing semantic meaning compared to dense retrieval models. They are ideal for tasks prioritizing high recall and fast retrieval.
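A minimal sketch of the Okapi BM25 scoring function described above (toy tokenized corpus; production systems use inverted indexes rather than this linear scan):

```python
import math

def bm25_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
    """BM25 score of one document for a query (minimal sketch)."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N        # average document length
    score = 0.0
    for term in query_terms:
        df = sum(term in d for d in corpus)        # document frequency
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
        tf = doc_terms.count(term)                 # term frequency in doc
        norm = tf + k1 * (1 - b + b * len(doc_terms) / avgdl)
        score += idf * tf * (k1 + 1) / norm
    return score

corpus = ["chronic kidney disease staging".split(),
          "kidney transplant outcomes".split()]
print(bm25_score("kidney staging".split(), corpus[0], corpus))
```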

3.4.5. Vector Databases

Vector Databases store and search high-dimensional vectors for tasks like semantic search and recommendation systems [46]. Using approximate nearest neighbor (ANN) algorithms like FAISS, they quickly retrieve vectors similar to query vectors. Integrating vector databases with LLMs and LVMs enables faster retrieval of relevant data from large datasets, improving generative model performance. These databases enhance RAG by providing real-time access to external knowledge, reducing hallucinations and improving factual accuracy.
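A minimal sketch using the open-source FAISS library (random vectors stand in for learned embeddings; `IndexFlatL2` performs exact search, while ANN index types such as IVF trade a little accuracy for much faster queries):

```python
import numpy as np
import faiss  # pip install faiss-cpu

d = 64                                             # embedding dimensionality
rng = np.random.default_rng(0)
xb = rng.normal(size=(1000, d)).astype("float32")  # vectors to index
xq = rng.normal(size=(3, d)).astype("float32")     # query vectors

index = faiss.IndexFlatL2(d)          # exact L2 index (simple baseline)
index.add(xb)                         # store the collection
distances, ids = index.search(xq, 5)  # 5 nearest neighbors per query
print(ids)                            # row i holds neighbor IDs for query i
```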

3.4.6. Contrastive Learning

Contrastive Learning is a technique where models learn representations by bringing similar data points (positive pairs) closer and pushing dissimilar ones (negative pairs) apart [47,48,49]. Using loss functions like InfoNCE (Information Noise Contrastive Estimation), it enhances LLMs, LVMs, and GenAI by improving semantic understanding and capturing meaningful relationships. This method boosts tasks such as semantic search, image captioning, and multi-modal alignment, enabling models to learn efficiently from unstructured data, improving accuracy, and reducing biases. Contrastive learning also enhances generative tasks by learning high-quality, aligned representations, as seen in models like CLIP.

4. Recent Applications of GenAI, LLMs, and LVMs in Kidney Care

Recent advancements in Generative AI (GenAI), Large Language Models (LLMs), and Large Vision Models (LVMs) have significantly impacted various aspects of healthcare, enhancing efficiency and patient outcomes. GenAI automates routine tasks such as data entry and appointment scheduling, allowing clinicians to focus more on direct patient care. LLMs, like ChatGPT, have demonstrated potential in drafting medical documents and assisting in diagnostics, thereby streamlining clinical workflows. LVMs contribute to improved diagnostics and personalized medicine by analyzing medical images, aiding in early disease detection and treatment planning. These technologies collectively transform healthcare delivery, making it more efficient and patient-centered. This section highlights recent research applying GenAI, LLMs, and LVMs in kidney care. These studies explore innovative approaches to improving diagnosis, treatment, and patient management through advanced AI models in nephrology.

4.1. 2018–2023: Early Developments

Between 2018 and 2023, advancements in GenAI, LLMs, and LVMs reshaped nephrology by enhancing diagnostic accuracy, patient education, and clinical workflows [50,51,52,53,54,55,56]. Domain-specific adaptations, such as AKI-BERT and MD-BERT-LGBM, improved early disease prediction by integrating textual and structured data, achieving significant accuracy in acute kidney injury (AKI) and CKD prediction. Generative models like ChatGPT demonstrated utility in dietary support for CKD patients and patient education in nephrology and urology, although inconsistencies and limitations in depth and citation accuracy necessitated human oversight. The integration of AI-powered chatbots in kidney transplant care showed promise in improving treatment adherence and personalized patient education, despite challenges in bias, interpretability, and EHR system compatibility. These efforts underscore the growing potential of AI in nephrology while highlighting the need for continued innovation to address limitations and enhance clinical reliability. Table 5 summarizes the early developments in nephrology from 2018 to 2023.

4.2. 2024: Recent Breakthroughs

In 2024, the application of GenAI, LLMs, and advanced multimodal frameworks marked significant progress in nephrology, expanding their utility in diagnosis, prediction, education, and clinical support [57,58,59,60,61,62,63,64,65,66,67]. Breakthroughs included the development of HERBERT for CKD risk stratification, multimodal models for AKI and continuous renal replacement therapy (CRRT) prediction, and innovative uses of AI-powered tools for patient education, renal cancer diagnosis, and visual communication. These advancements showcased enhanced integration of structured and unstructured data, improved prediction accuracies, and personalized education materials tailored for diverse audiences. Despite notable achievements in improving healthcare delivery and engagement, challenges such as data limitations, computational demands, and the need for human oversight persist, emphasizing the importance of iterative refinement and validation in clinical settings. Table 6 provides a summary of these works.

5. Discussion

Recent studies have identified several limitations in nephrology care; however, ongoing research has proposed various strategies to address these gaps. Emerging methods emphasize hybrid models that integrate structured electronic health record (EHR) data with unstructured clinical narratives. Additionally, lightweight architectures suited for low-resource settings and human-in-the-loop frameworks for clinical validation have been introduced to enhance accuracy, efficiency, and safety. Thus, this section serves a dual purpose: (1) It analyzes how recent solutions and methods have been developed to overcome the limitations identified in previous work, as discussed in Section 4. (2) It outlines practical applications of these advancements, which are further elaborated in Section 6.
By 2024, LLMs such as ChatGPT have proven valuable in kidney care applications, including patient education, chronic kidney disease (CKD) risk stratification, renal cell carcinoma (RCC) diagnosis, medical training, and predicting postoperative complications [68]. For example, ChatGPT has enhanced CKD dietary planning and health literacy, though challenges like limited clinical depth persist. HERBERT, a domain-specific LLM, has improved CKD risk stratification by handling temporal data, but requires broader validation across diverse populations.
Advanced LLMs, including GPT-4, Gemini, Llama, and Mistral, offer untapped potential in personalized medicine, such as symptom monitoring and tailored treatment. They could integrate with EHRs to analyze datasets and support real-time decisions, but adoption in nephrology remains limited. Additionally, techniques like retrieval-augmented generation (RAG), dense retrieval, Fusion-in-Decoder (FiD), and vector databases remain unexplored despite their ability to improve contextual accuracy.
Similarly, LVMs like DALL·E, Stable Diffusion, MidJourney, and DreamFusion, which have transformed image generation in other domains, have seen minimal application in nephrology. Studies in 2024 have only begun to explore AI-generated visuals for patient education and presentations, such as using Microsoft Copilot with DALL·E, focusing primarily on text-to-image generation. This approach neglects the integration of LVMs with medical imaging data for diagnostic purposes. The absence of generative AI applications for creating synthetic nephrology datasets further limits advancements. Synthetic data could address critical gaps in training datasets, particularly for underrepresented populations, while aiding model development for CKD progression, RCC detection, and postoperative care. Moreover, there is limited exploration of multimodal AI for combining structured, unstructured, and imaging data, which could support a comprehensive understanding of nephrology-specific challenges. Such advancements could enable the generation of 3D anatomical structures of kidneys, enhancing both diagnostic and educational applications.
Additionally, multimodal frameworks like CLIP, Make-A-Scene, SORA, and Phenaki, capable of text-to-video generation and video modeling, remain unexplored in nephrology-specific use cases. These frameworks could be transformative for applications such as surgical video analysis, treatment simulations, and patient education through dynamic visual content. They also hold potential for automating video-based training for medical professionals, enhancing procedural accuracy, and simulating complex kidney-related conditions for research and diagnostic purposes. Despite their capabilities, their integration with nephrology workflows and datasets is still lacking, limiting their impact on advancing clinical practices and patient outcomes.
Addressing these gaps could transform nephrology by leveraging synthetic datasets, integrating multimodal AI, and validating technologies across diverse populations. Thoughtful application of generative AI could enhance diagnostic accuracy, personalized treatment, and education, reducing disparities and improving outcomes globally.

6. Use Case

This section highlights various use cases demonstrating the application of LLMs and LVMs in kidney care. These applications include generating imaging data through text-to-image prompts, with or without RAG integration, creating 2D and 3D images to enhance patient health literacy and support educational initiatives, utilizing text-to-video methods to produce medical videos that visualize anatomical structures, and employing text-to-text models for generating text-based responses to common kidney-related questions.

6.1. Text-to-Image (2D)

Prompt 1: Create an image of the human kidney.
Response: DALL·E generated a detailed illustration of a human kidney, as shown in Figure 1.
Prompt 2: Create an image of the human kidney as shown in [69].
Response: DALL·E generated a detailed illustration of a human kidney, as shown in Figure 2.

6.2. Text-to-Image (3D)

Prompt 1: Create a 3D image of the human kidney.
Response: Gemini 2.0 Flash generated a 3D illustration of a human kidney, as shown in Figure 3.
Prompt 2: Create a 3D image of a kidney with a tumor.
Response: Gemini 2.0 Flash generated a 3D illustration of a kidney with a tumor, as shown in Figure 4.

6.3. Text-to-Text

Table 7 showcases the ability of ChatGPT to provide responses to text-based prompts related to kidney health. Each response demonstrates ChatGPT’s understanding and its ability to deliver medically relevant and concise information. The provided prompts focus on common questions concerning kidney function and associated symptoms.

6.4. Text-to-Video

Prompt 1: Generate an anatomical video depicting human kidney function, including blood filtration, urine production, and urine flow into the collecting ducts.
Response: SORA successfully generated a video, as illustrated in Figures 5–8.
DALL·E 2, Gemini 2.0 Flash, SORA, and ChatGPT-4o show significant potential for educational purposes. However, medical expert review remains essential to correct spelling mistakes and factual inaccuracies in the generated content.

7. Challenges of GenAI, LLMs, and LVMs in Kidney Healthcare

GenAI, LLMs, and LVMs are reshaping healthcare, yet substantial limitations hinder their full integration, particularly in kidney care. These challenges span technical, ethical, and demographic dimensions, affecting reliability, equity, and practicality.
A key challenge is the knowledge cutoff in most LLMs. Trained on static datasets, they cannot access real-time updates, leaving them unable to incorporate the latest medical research or guidelines. In nephrology, where protocols and diagnostics evolve rapidly, this can lead to outdated or incomplete recommendations. Furthermore, hallucination—where models produce incorrect or fabricated information—poses risks, as misleading insights on CKD management, biomarkers, or RCC diagnosis could lead to inappropriate clinical decisions and diminished trust.
LLMs and LVMs also face difficulties in contextual understanding, critical in kidney care, which often requires synthesizing lab results, comorbidities, and lifestyle factors. While these models handle individual data points, integrating them into actionable insights remains a challenge. Their outputs can be overly simplistic or irrelevant, and their lack of interpretability—operating as black boxes—reduces clinician trust, particularly in high-stakes scenarios.
Bias and demographic representation issues further complicate adoption. Training datasets often lack diversity, leading to racial, gender, and socioeconomic biases. In nephrology, this may result in less accurate predictions for minority groups or rare disease presentations, exacerbating health inequities. Similarly, inadequate representation of anatomical and disease diversity leads to generalized predictions that may miss diagnoses or recommend inappropriate treatments, especially for underrepresented conditions.
Regulatory and ethical challenges, such as ensuring compliance with Health Insurance Portability and Accountability Act (HIPAA), present additional barriers. Patient data use in training raises concerns about privacy, security, and consent. Moreover, the potential misuse of AI-generated outputs, including falsified records or inaccurate recommendations, underscores the need for stringent oversight to align with medical and ethical standards.
Deployment demands also limit accessibility. The significant computational resources required by LLMs and LVMs make them unattainable for under-resourced facilities, widening care disparities. In kidney care, this could deepen the divide between well-funded institutions and underserved regions.
Finally, integrating GenAI into existing healthcare workflows remains underdeveloped. Legacy systems often lack compatibility with advanced AI tools. Effective kidney care applications, such as CKD risk stratification and complication prediction, require seamless integration with electronic health records (EHRs) and clinical systems, a capability not yet fully realized. Addressing these limitations is essential for unlocking the full potential of generative AI in kidney healthcare.

8. Conclusions and Future Work

The integration of Generative AI (GenAI), Large Language Models (LLMs), and Large Vision Models (LVMs) in kidney care has seen significant advancements, contributing to risk stratification, disease diagnosis, patient education, postoperative complication prediction, treatment optimization, and continuous monitoring. However, challenges remain in integrating multimodal AI, generating synthetic datasets to address demographic underrepresentation, and overcoming issues related to interpretability, bias, and ethical compliance. Emerging tools and frameworks, including retrieval-augmented generation (RAG) and multimodal AI models such as ChatGPT, Gemini, and SORA, offer promising solutions for improving contextual accuracy, synthesizing structured and unstructured data, and expanding applications in nephrology diagnostics and education.
The rapid evolution of AI highlights the growing sophistication of these models. OpenAI’s introduction of o3, a model focused on advanced reasoning, and Google’s release of Gemini 2.0 Flash with enhanced multimodal capabilities underscore the push toward improved problem-solving, real-time decision-making, and multimodal adaptability. Additionally, NVIDIA’s advancements in long-context AI aim to reduce hallucinations and enhance logical reasoning, which could address limitations in nephrology-focused AI applications. We will incorporate DeepSeek in future reviews as more data become available.
Future research should prioritize integrating these advancements into clinical workflows while ensuring seamless compatibility with electronic health records (EHRs) and accessibility in under-resourced healthcare settings. AI-powered text generation, detection systems, and multimodal frameworks should be leveraged to develop precise, equitable, and scalable kidney care solutions. Addressing ethical, computational, and regulatory challenges will be critical to fostering widespread adoption and ultimately transforming diagnostics, personalized treatment, and medical education on a global scale.

Author Contributions

Conceptualization, F.N., D.B. and D.K.S.; Methodology, F.N.; Writing—original draft, F.N.; Writing—review & editing, F.N., D.B. and D.K.S.; Funding acquisition, F.N., D.B. and D.K.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partly supported by Kent State University’s Open Access APC Support Fund.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All data are presented in the main text.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Nair, M. The renal system. In Fundamentals of Anatomy and Physiology: For Nursing and Healthcare Students; Wiley-Blackwell: Hoboken, NJ, USA, 2016. [Google Scholar]
  2. Neha, F. Kidney Localization and Stone Segmentation from a CT Scan Image. In Proceedings of the 2023 7th International Conference on Computing, Communication, Control And Automation (ICCUBEA), Pune, India, 18–19 August 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1–6. [Google Scholar]
  3. Kovesdy, C.P. Epidemiology of chronic kidney disease: An update 2022. Kidney Int. Suppl. 2022, 12, 7–11. [Google Scholar] [CrossRef] [PubMed]
  4. Luyckx, V.A.; Tonelli, M.; Stanifer, J.W. The global burden of kidney disease and the sustainable development goals. Bull. World Health Organ. 2018, 96, 414. [Google Scholar] [PubMed]
  5. Osman, M.A.; Alrukhaimi, M.; Ashuntantang, G.E.; Bellorin-Font, E.; Gharbi, M.B.; Braam, B.; Courtney, M.; Feehally, J.; Harris, D.C.; Jha, V.; et al. Global nephrology workforce: Gaps and opportunities toward a sustainable kidney care system. Kidney Int. Suppl. 2018, 8, 52–63. [Google Scholar]
  6. Bello, A.K.; McIsaac, M.; Okpechi, I.G.; Johnson, D.W.; Jha, V.; Harris, D.C.; Saad, S.; Zaidi, D.; Osman, M.A.; Ye, F.; et al. International Society of Nephrology Global Kidney Health Atlas: Structures, organization, and services for the management of kidney failure in North America and the Caribbean. Kidney Int. Suppl. 2021, 11, e66–e76. [Google Scholar]
  7. Levey, A.S.; Eckardt, K.U.; Dorman, N.M.; Christiansen, S.L.; Cheung, M.; Jadoul, M.; Winkelmayer, W.C. Nomenclature for kidney function and disease—Executive summary and glossary from a Kidney Disease: Improving Global Outcomes (KDIGO) consensus conference. Eur. Heart J. 2020, 41, 4592–4598. [Google Scholar]
  8. Seaborg, E. What’s in a Name, and Who’s the Audience? “Kidney” vs. “Renal”. Kidney News 2021, 13, 24–25. [Google Scholar]
  9. Fnu, N.; Bansal, A. Understanding the architecture of vision transformer and its variants: A review. In Proceedings of the 2024 1st International Conference on Innovative Engineering Sciences and Technological Research (ICIESTR), Muscat, Oman, 14–15 May 2024; pp. 1–6. [Google Scholar] [CrossRef]
  10. Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I. Improving Language Understanding by Generative Pre-Training; OpenAI Technical Report; OpenAI: San Francisco, CA, USA, 2018. [Google Scholar]
  11. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; Volume 1, pp. 4171–4186. [Google Scholar]
  12. Yang, Z. XLNet: Generalized Autoregressive Pretraining for Language Understanding. arXiv 2019, arXiv:1906.08237. [Google Scholar]
  13. Roberts, A.; Raffel, C.; Lee, K.; Matena, M.; Shazeer, N.; Liu, P.J.; Narang, S.; Li, W.; Zhou, Y. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 2019, 21, 1–67. [Google Scholar]
  14. Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language models are unsupervised multitask learners. OpenAI Blog 2019, 1, 9. [Google Scholar]
  15. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901. [Google Scholar]
  16. Ramesh, A.; Pavlov, M.; Goh, G.; Gray, S.; Voss, C.; Radford, A.; Chen, M.; Sutskever, I. Zero-shot text-to-image generation. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 8821–8831. [Google Scholar]
  17. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
  18. Wu, C.; Liang, J.; Ji, L.; Yang, F.; Fang, Y.; Jiang, D.; Duan, N. Nüwa: Visual synthesis pre-training for neural visual world creation. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 720–736. [Google Scholar]
  19. Ding, M.; Yang, Z.; Hong, W.; Zheng, W.; Zhou, C.; Yin, D.; Lin, J.; Zou, X.; Shao, Z.; Yang, H.; et al. Cogview: Mastering text-to-image generation via transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 19822–19835. [Google Scholar]
  20. Ding, M.; Zheng, W.; Hong, W.; Tang, J. Cogview2: Faster and better text-to-image generation via hierarchical transformers. Adv. Neural Inf. Process. Syst. 2022, 35, 16890–16902. [Google Scholar]
  21. Saharia, C.; Chan, W.; Saxena, S.; Li, L.; Whang, J.; Denton, E.L.; Ghasemipour, K.; Gontijo Lopes, R.; Karagol Ayan, B.; Salimans, T.; et al. Photorealistic text-to-image diffusion models with deep language understanding. Adv. Neural Inf. Process. Syst. 2022, 35, 36479–36494. [Google Scholar]
  22. Ramesh, A.; Dhariwal, P.; Nichol, A.; Chu, C.; Chen, M. Hierarchical text-conditional image generation with clip latents. arXiv 2022, arXiv:2204.06125. [Google Scholar]
  23. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10684–10695. [Google Scholar]
  24. Singer, U.; Polyak, A.; Hayes, T.; Yin, X.; An, J.; Zhang, S.; Hu, Q.; Yang, H.; Ashual, O.; Gafni, O.; et al. Make-a-video: Text-to-video generation without text-video data. arXiv 2022, arXiv:2209.14792. [Google Scholar]
  25. MidJourney Team. MidJourney Explore: Top Images. Available online: https://www.midjourney.com/explore?tab=top (accessed on 22 August 2024).
  26. Poole, B.; Jain, A.; Barron, J.T.; Mildenhall, B. Dreamfusion: Text-to-3d using 2d diffusion. arXiv 2022, arXiv:2209.14988. [Google Scholar]
  27. Gafni, O.; Polyak, A.; Ashual, O.; Sheynin, S.; Parikh, D.; Taigman, Y. Make-a-scene: Scene-based text-to-image generation with human priors. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 89–106. [Google Scholar]
  28. Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. Llama: Open and efficient foundation language models. arXiv 2023, arXiv:2302.13971. [Google Scholar]
  29. Google. Bard: A Generative AI Text-to-Text Model Powered by LaMDA and PaLM 2. 2023. Available online: https://bard.google.com (accessed on 22 December 2024).
  30. Team, G.; Anil, R.; Borgeaud, S.; Alayrac, J.B.; Yu, J.; Soricut, R.; Schalkwyk, J.; Dai, A.M.; Hauth, A.; Millican, K.; et al. Gemini: A family of highly capable multimodal models. arXiv 2023, arXiv:2312.11805. [Google Scholar]
  31. Thoppilan, R.; De Freitas, D.; Hall, J.; Shazeer, N.; Kulshreshtha, A.; Cheng, H.T.; Jin, A.; Bos, T.; Baker, L.; Du, Y.; et al. Lamda: Language models for dialog applications. arXiv 2022, arXiv:2201.08239. [Google Scholar]
  32. Anil, R.; Dai, A.M.; Firat, O.; Johnson, M.; Lepikhin, D.; Passos, A.; Shakeri, S.; Taropa, E.; Bailey, P.; Chen, Z.; et al. Palm 2 technical report. arXiv 2023, arXiv:2305.10403. [Google Scholar]
33. Villegas, R.; Babaeizadeh, M.; Kindermans, P.J.; Moraldo, H.; Zhang, H.; Saffar, M.T.; Castro, S.; Kunze, J.; Erhan, D. Phenaki: Variable length video generation from open domain textual descriptions. In Proceedings of the International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  34. Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. Gpt-4 Technical Report. arXiv 2023, arXiv:2303.08774. [Google Scholar]
  35. Mistral AI. Introducing Mistral: Advancing Open-Weight Transformer Models. 2023. Available online: https://www.mistral.ai (accessed on 22 December 2024).
  36. OpenAI. Sora: A Text-to-Video Generation Model Using Transformer-Based Architecture and Pre-Trained Diffusion Models. 2024. Available online: https://openai.com (accessed on 22 December 2024).
  37. Team, G.; Georgiev, P.; Lei, V.I.; Burnell, R.; Bai, L.; Gulati, A.; Tanzer, G.; Vincent, D.; Pan, Z.; Wang, S.; et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv 2024, arXiv:2403.05530. [Google Scholar]
  38. Big Sleep Team. From Naptime to Big Sleep: Using Large Language Models To Catch Vulnerabilities In Real-World Code. 2024. Available online: https://googleprojectzero.blogspot.com/2024/10/from-naptime-to-big-sleep.html (accessed on 22 December 2024).
  39. OpenAI. Hello GPT-4o. 2024. Available online: https://openai.com/index/hello-gpt-4o/ (accessed on 1 December 2024).
  40. OpenAI. GPT-4o Mini: Advancing Cost-Efficient Intelligence. 2024. Available online: https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence (accessed on 1 December 2024).
  41. OpenAI. Introducing OpenAI o1: A Model Designed for Enhanced Reasoning. Available online: https://openai.com/index/introducing-openai-o1-preview (accessed on 1 December 2024).
  42. Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.t.; Rocktäschel, T.; et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Adv. Neural Inf. Process. Syst. 2020, 33, 9459–9474. [Google Scholar]
  43. Karpukhin, V.; Oğuz, B.; Min, S.; Lewis, P.; Wu, L.; Edunov, S.; Chen, D.; Yih, W.t. Dense passage retrieval for open-domain question answering. arXiv 2020, arXiv:2004.04906. [Google Scholar]
  44. Izacard, G.; Grave, E. Leveraging passage retrieval with generative models for open domain question answering. arXiv 2020, arXiv:2007.01282. [Google Scholar]
45. Manning, C.D.; Raghavan, P.; Schütze, H. Introduction to Information Retrieval; Cambridge University Press: Cambridge, UK, 2008. [Google Scholar]
  46. Johnson, J.; Douze, M.; Jégou, H. Billion-scale similarity search with GPUs. IEEE Trans. Big Data 2019, 7, 535–547. [Google Scholar]
  47. Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A simple framework for contrastive learning of visual representations. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 13–18 July 2020; pp. 1597–1607. [Google Scholar]
  48. He, K.; Fan, H.; Wu, Y.; Xie, S.; Girshick, R. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9729–9738. [Google Scholar]
  49. Gao, T.; Yao, X.; Chen, D. Simcse: Simple contrastive learning of sentence embeddings. arXiv 2021, arXiv:2104.08821. [Google Scholar]
  50. Mao, C.; Yao, L.; Luo, Y. A Pre-trained Clinical Language Model for Acute Kidney Injury. In Proceedings of the 2020 IEEE International Conference on Healthcare Informatics (ICHI), Oldenburg, Germany, 30 November–3 December 2020; pp. 1–2. [Google Scholar] [CrossRef]
  51. Ma, D.; Li, X.; Mou, S.; Cheng, Z.; Yan, X.; Lu, Y.; Yan, R.; Cao, S. Prediction of chronic kidney disease risk using multimodal data. In Proceedings of the 2021 5th International Conference on Compute and Data Analysis, Sanya, China, 2–4 February 2021; pp. 20–25. [Google Scholar]
  52. Qarajeh, A.; Tangpanithandee, S.; Thongprayoon, C.; Suppadungsuk, S.; Krisanapan, P.; Aiumtrakul, N.; Garcia Valencia, O.A.; Miao, J.; Qureshi, F.; Cheungpasitporn, W. AI-Powered Renal Diet Support: Performance of ChatGPT, Bard AI, and Bing Chat. Clin. Pract. 2023, 13, 1160–1172. [Google Scholar] [CrossRef]
  53. Aiumtrakul, N.; Thongprayoon, C.; Suppadungsuk, S.; Krisanapan, P.; Miao, J.; Qureshi, F.; Cheungpasitporn, W. Navigating the landscape of personalized medicine: The relevance of ChatGPT, BingChat, and Bard AI in nephrology literature searches. J. Pers. Med. 2023, 13, 1457. [Google Scholar] [CrossRef]
  54. Suppadungsuk, S.; Thongprayoon, C.; Krisanapan, P.; Tangpanithandee, S.; Garcia Valencia, O.; Miao, J.; Mekraksakit, P.; Kashani, K.; Cheungpasitporn, W. Examining the validity of ChatGPT in identifying relevant nephrology literature: Findings and implications. J. Clin. Med. 2023, 12, 5550. [Google Scholar] [CrossRef] [PubMed]
  55. Garcia Valencia, O.A.; Thongprayoon, C.; Jadlowiec, C.C.; Mao, S.A.; Miao, J.; Cheungpasitporn, W. Enhancing kidney transplant care through the integration of chatbot. Healthcare 2023, 11, 2518. [Google Scholar] [CrossRef]
  56. Szczesniewski, J.J.; Tellez Fouz, C.; Ramos Alba, A.; Diaz Goizueta, F.J.; García Tello, A.; Llanes González, L. ChatGPT and most frequent urological diseases: Analysing the quality of information and potential risks for patients. World J. Urol. 2023, 41, 3149–3153. [Google Scholar] [CrossRef] [PubMed]
  57. Moore, A.; Orset, B.; Yassaee, A.; Irving, B.; Morelli, D. HEalthRecordBERT (HERBERT): Leveraging Transformers on Electronic Health Records for Chronic Kidney Disease Risk Stratification. ACM Trans. Comput. Healthc. 2024, 5, 1–18. [Google Scholar] [CrossRef]
  58. Tan, Y.; Dede, M.; Mohanty, V.; Dou, J.; Hill, H.; Bernstam, E.; Chen, K. Forecasting Acute Kidney Injury and Resource Utilization in ICU patients using longitudinal, multimodal models. J. Biomed. Inform. 2024, 154, 104648. [Google Scholar] [CrossRef] [PubMed]
  59. Liang, R.; Zhao, A.; Peng, L.; Xu, X.; Zhong, J.; Wu, F.; Yi, F.; Zhang, S.; Wu, S.; Hou, J. Enhanced artificial intelligence strategies in renal oncology: Iterative optimization and comparative analysis of GPT 3.5 versus 4.0. Ann. Surg. Oncol. 2024, 31, 3887–3893. [Google Scholar] [CrossRef]
  60. Talyshinskii, A.; Juliebø-Jones, P.; Hameed, B.Z.; Naik, N.; Adhikari, K.; Zhanbyrbekuly, U.; Tzelves, L.; Somani, B.K. ChatGPT as a Clinical Decision Maker for Urolithiasis: Compliance with the Current European Association of Urology Guidelines. Eur. Urol. Open Sci. 2024, 69, 51–62. [Google Scholar] [CrossRef]
  61. Bersano, J. Exploring the role of Microsoft’s Copilot in visual communication: Current use and considerations through science communicators’ lens. Virus 2024, 6, 11. [Google Scholar] [CrossRef]
  62. Safadi, M.F.; Zayegh, O.; Hawoot, Z. Advancing Innovation in Medical Presentations: A Guide for Medical Educators to Use Images Generated With Artificial Intelligence. Cureus 2024, 16, e74978. [Google Scholar] [CrossRef]
  63. Goparaju, N. Picture This: Text-to-Image Models Transforming Pediatric Emergency Medicine. Ann. Emerg. Med. 2024, 84, 651–657. [Google Scholar] [CrossRef]
64. Lin, S.Y.; Jiang, C.C.; Law, K.M.; Yeh, P.C.; Kuo, H.L.; Ju, S.W.; Kao, C.H. Comparative Analysis of Generative AI in Clinical Nephrology: Assessing ChatGPT-4, Gemini Pro, and Bard in Patient Interaction and Renal Biopsy Interpretation. SSRN Preprint 2024. Available online: https://ssrn.com/abstract=4711596 (accessed on 18 February 2025).
  65. Hsueh, J.Y.; Nethala, D.; Singh, S.; Linehan, W.M.; Ball, M.W. Investigating the clinical reasoning abilities of large language model GPT-4: An analysis of postoperative complications from renal surgeries. Urol. Oncol. Semin. Orig. Investig. 2024, 42, 292.e1–292.e7. [Google Scholar] [CrossRef] [PubMed]
  66. Sexton, D.J.; Judge, C. Assessments of Generative Artificial Intelligence as Clinical Decision Support Ought to be Incorporated Into Randomized Controlled Trials of Electronic Alerts for Acute Kidney Injury. Mayo Clin. Proc. Digit. Health 2024, 2, 606–610. [Google Scholar] [CrossRef]
  67. Halawani, A.; Almehmadi, S.G.; Alhubaishy, B.A.; Alnefaie, Z.A.; Hasan, M.N. Empowering patients: How accurate and readable are large language models in renal cancer education. Front. Oncol. 2024, 14, 1457516. [Google Scholar] [CrossRef]
  68. Neha, F.; Bhati, D.; Shukla, D.K.; Amiruzzaman, M. Chatgpt: Transforming healthcare with AI. AI 2024, 5, 2618–2650. [Google Scholar] [CrossRef]
  69. Commons Wikimedia. The Kidney. 2024. Available online: https://commons.wikimedia.org/wiki/File:2610_The_Kidney.jpg (accessed on 23 November 2024).
Figure 1. Generated image based on a general request from Prompt 1.
Figure 2. Generated image inspired by an external reference (RAG), Prompt 2.
Figure 3. Generated 3D image based on a general request from Prompt 1.
Figure 4. Generated 3D image based on Prompt 2.
Figure 5. Frame 1: Illustrates blood entering the kidney via the renal artery.
Figure 6. Frame 2: Focuses on the blood filtration process in the glomerulus.
Figure 7. Frame 3: Demonstrates the filtration of waste in the form of urine.
Figure 8. Frame 4: Urine flow into the ureter.
Table 1. Comparison between text-based models: GPT, BERT, XLNet, T5, and GPT-3.

| Model | Release Date | Architecture | Pretraining Task | Main Strength | Best Use Cases | Limitations |
|---|---|---|---|---|---|---|
| GPT (OpenAI) | 2018 | Unidirectional (left-to-right) | Autoregressive language modeling | Text generation and continuation | Language generation, conversation | Struggles with long-range dependencies and context; lacks bidirectional understanding |
| BERT (Google) | 2018 | Bidirectional (left-to-right + right-to-left) | Masked Language Modeling (MLM), Next Sentence Prediction (NSP) | Text understanding (e.g., question answering, sentiment analysis) | Contextual tasks, classification, named entity recognition (NER) | Computationally expensive during inference; not designed for text generation |
| XLNet (Google/CMU) | 2019 | Bidirectional with autoregressive (permutation-based) | Permutation-based language modeling | Combines autoregressive and bidirectional learning for improved performance | Question answering, text classification, sentiment analysis, and tasks benefiting from both context and generation | More complex training procedure; requires substantial computational resources |
| T5 (Google) | 2019 | Text-to-Text Transfer Transformer (bidirectional) | Span-corruption (denoising) objective | Unified architecture for diverse tasks (translation, summarization, question answering) | Translation, summarization, question answering, text classification | Pretraining requires large-scale data and computation; struggles with very specific tasks requiring domain expertise |
| GPT-3 (OpenAI) | 2020 | Unidirectional (left-to-right) | Autoregressive language modeling | Few-shot learning; scalable to a variety of tasks | Text generation, translation, summarization, question answering, conversation | Computationally expensive; lacks true understanding; may generate biased or harmful content |
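The two pretraining objectives contrasted in Table 1 can be seen side by side in a few lines. The sketch below uses the Hugging Face transformers pipelines with stand-in checkpoints (bert-base-uncased, gpt2) purely for illustration; neither is a clinical model.

```python
# Hedged sketch contrasting the two objectives in Table 1:
# BERT-style masked-token filling vs. GPT-style left-to-right generation.
from transformers import pipeline

# Bidirectional MLM: predict a masked token using context from both sides.
fill = pipeline("fill-mask", model="bert-base-uncased")
print(fill("The kidney filters waste from the [MASK].")[0]["token_str"])

# Unidirectional autoregressive LM: continue text left to right.
generate = pipeline("text-generation", model="gpt2")
print(generate("Chronic kidney disease is", max_new_tokens=20)[0]["generated_text"])
```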
Table 2. Comparison of DALL·E, CLIP, NUWA, CogView2, Imagen, DALL·E 2, Stable Diffusion, Make-A-Video, MidJourney, DreamFusion, and Make-A-Scene.

| Model | Release Date | Architecture | Pretraining Task | Main Strength | Best Use Cases | Limitations |
|---|---|---|---|---|---|---|
| DALL·E (OpenAI) | 2021 | Transformer-based; VQ-VAE-2 for image generation | Trained on text-image pairs for generating images from text | Generates novel, high-quality images from text descriptions | Creative fields (design, art, visualization) | Image quality can vary; struggles with detailed prompts; computationally expensive |
| CLIP (OpenAI) | 2021 | Vision transformer for images, transformer for text, shared feature space | Contrastive learning on text-image pairs for matching | Matches images with text; zero-shot learning for classification | Image classification, multimodal search, text-to-image matching | Limited by biases; struggles with abstract queries |
| NUWA (Microsoft) | 2021 | 3D transformers for spatiotemporal data; VQ-VAE for encoding/decoding | Trained on text-image and text-video pairs for multimodal generation | Generates high-quality videos, animations, and images with temporal coherence | Video animation, text-to-video, creative industries, education | Limited resolution and video length; computationally expensive |
| CogView2 (Tsinghua University) | 2022 | Transformer-based; masked autoregressive model for image generation | Trained on text-image pairs for high-quality image generation | Generates high-resolution, detailed images from text prompts; improved over previous versions | Creative industries, design, conceptual art, high-quality visual generation | Struggles with very abstract or highly detailed prompts; computationally intensive |
| Imagen (Google) | 2022 | Two-stage diffusion process with T5 transformer encoder | Trained on text-image pairs with large-scale data for photorealistic generation | Generates high-resolution, photorealistic images from text | Design, art, visualization, photorealistic content | Computationally intensive; struggles with complex scenes |
| DALL·E 2 (OpenAI) | 2022 | Transformer-based; diffusion model for image generation; CLIP-based encoder | Trained on text-image pairs with added inpainting capability | Higher-quality, more photorealistic images; image inpainting | Design, concept art, image editing | Struggles with very detailed or abstract prompts; computationally intensive |
| Stable Diffusion (Stability AI) | 2022 | Latent diffusion model for image generation; transformer-based text encoder | Trained on text-image pairs with latent space modeling | High-quality, photorealistic images; fast and efficient generation | Creative industries, personalized content, rapid prototyping | Potential biases; requires computational resources for high-quality outputs |
| Make-A-Video (Meta Platforms) | 2022 | Transformer-based spatiotemporal generative model | Trained on text-video pairs with temporal coherence modeling | Generates short video clips from text prompts with temporal coherence | Content creation, advertising, creative industries, education | Limited video length, resolution, and frame rate; computationally expensive |
| MidJourney (MidJourney, Inc.) | 2022 | Transformer-based latent diffusion model | Trained on text-image pairs with a focus on artistic content | Produces highly artistic, stylized images from text descriptions | Digital artwork, conceptual designs, visual art projects | Primarily focused on artistic outputs; not suited to photorealism or highly detailed images |
| DreamFusion (Google) | 2022 | 2D diffusion model (like Imagen) + Neural Radiance Field (NeRF) | Trained on text-image pairs with 3D optimization | Generates photorealistic 3D models from text prompts without requiring 3D data | 3D model generation, game design, virtual/augmented reality | Requires substantial computational resources; limitations with complex scene generation |
| Make-A-Scene (Meta Platforms) | 2022 | Transformer-based text-to-image with interactive layout control | Trained on text-image pairs with spatial layout specification | Interactive image generation; user control over scene composition | Concept art, design mockups, visual storytelling, interactive advertising | Limited by complexity of layout inputs; challenges in highly detailed scene generation |
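For the latent-diffusion family in Table 2, a minimal text-to-image call looks roughly like the sketch below, here via the Hugging Face diffusers library with an assumed Stable Diffusion checkpoint and an invented kidney-anatomy prompt; it is not the setup used to generate this paper's figures.

```python
# Illustrative sketch of latent-diffusion text-to-image generation
# (Table 2, Stable Diffusion row). Checkpoint and prompt are assumptions.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # assumed checkpoint; any SD weights work
    torch_dtype=torch.float16,
).to("cuda")

image = pipe(
    "Labeled cross-section of a human kidney showing the renal artery, "
    "cortex, medulla, and ureter, medical-illustration style"
).images[0]
image.save("kidney_diagram.png")  # outputs a PIL image
```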
Table 3. Comparison of Llama, Bard/Gemini, Phenaki, GPT-4, and Mistral.

| Model | Release Date | Architecture | Pretraining Task | Main Strength | Best Use Cases | Limitations |
|---|---|---|---|---|---|---|
| Llama (Meta) | 2023 | Transformer-based architecture; open-weight models | Trained on diverse large-scale text datasets | Efficient scaling; high performance with open-source accessibility | Language modeling, text generation, multi-task learning | Limited for very large tasks; scaling challenges in certain applications |
| Bard/Gemini (Google) | 2023 | Transformer-based multimodal architecture (text, image, and video) | Trained on large-scale text datasets and multimodal data (text, images) | Multimodal understanding, high-quality text responses, and creative generation across formats | Conversational AI, content generation, code generation, customer support, creative industries | Computationally expensive; potential biases; limited by training data complexity |
| Phenaki (Google) | 2023 | Transformer-based; Causal Vision Transformer (C-ViViT); masked bidirectional transformer | Trained on text-video pairs with spatiotemporal modeling | Generates long-form, coherent videos from text prompts; maintains temporal coherence | Storytelling, advertising, content creation, creative industries | Requires substantial computational resources; potential challenges with complex prompts |
| GPT-4 (OpenAI) | 2023 | Transformer-based multimodal architecture (text and image inputs) | Trained on large text and image datasets | Multimodal understanding and generation; high-quality text responses from both image and text inputs | Visual question answering (VQA), image captioning, multimodal tasks | Computationally expensive; potential biases; limited by training data complexity |
| Mistral (Mistral AI) | 2023 | Sparse mixture-of-experts (MoE), transformer-based architecture | Trained on large-scale text datasets | Efficient scaling with dynamic expert selection; reduced computational costs | Language modeling, text generation, multi-task learning | Limited scalability for very large tasks; expert activation challenges |
Table 4. Comparison of SORA, Gemini 1.5, Big Sleep, and ChatGPT.

| Model | Release Date | Architecture | Pretraining Task | Main Strength | Best Use Cases | Limitations |
|---|---|---|---|---|---|---|
| SORA (OpenAI) | 2024 | Transformer-based architecture with pre-trained diffusion model | Text-to-video generation from textual descriptions | Efficient video generation with spatial and temporal consistency | Video generation for creative industries, educational content, advertising, and prototyping | Struggles with complex scenes, accurate human motion rendering, and high-quality image generation |
| Gemini 1.5 (Google) | 2024 | PaLM-based transformer architecture with Pathways framework and MoE activation | Multimodal integration, including text, image, and video generation | Advanced multimodal integration; efficient scaling with large context windows | Creative and analytical applications across text, image, and video generation, including large-scale AI tasks | Challenges with fine-grained temporal coherence in video generation; computational complexity in some tasks |
| Big Sleep (Google) | 2024 | Multimodal architecture with LLMs and specialized tools (code browser, Python tool, debugger) | Trained on codebases for vulnerability detection, variant analysis, and proactive defense | Detects and analyzes vulnerabilities; proactive defense; variant analysis | Vulnerability research, cybersecurity, automated security analysis, root-cause analysis | Requires substantial computational resources; challenges with complex vulnerabilities and large codebases |
| ChatGPT (OpenAI) | 2024 | Transformer-based architecture with multimodal capabilities (text, image, video) | Trained on large-scale text and multimodal datasets; reinforcement learning from human feedback (RLHF) | Multimodal understanding, high-quality responses, and creative generation | Conversational AI, content generation, customer support, creative tasks | Computationally expensive; potential biases; limited by training data complexity |
Table 5. Summary of early developments in nephrology from 2018 to 2023.

| Year | Focus | Contribution | Architecture & Details | Result | Strengths | Limitations |
|---|---|---|---|---|---|---|
| 2020 [50] | Acute kidney injury (AKI) prediction | AKI-BERT: a domain-specific language model for early AKI prediction using clinical notes. | Based on Clinical BioBERT; pre-trained with MLM and NSP tasks on an AKI-specific corpus (77,160 notes, 49 M tokens); fine-tuned with stratified sampling strategies. | Achieved an AUC of 76.4% (best with upsampling + pooling). | Enhanced domain-specific representation; effective handling of imbalanced datasets; improved early AKI prediction accuracy. | Limited to textual data; does not incorporate structured clinical measurements. |
| 2021 [51] | CKD prediction | MD-BERT-LGBM: a multimodal model integrating BERT with LightGBM for CKD risk prediction using structured and unstructured data. | MD-BERT for extracting information from unstructured clinical notes, coupled with LightGBM for nonlinear classification using lab results and other structured data. | Accuracy: 78.1%; recall: 75.7%; AUC: 85.2%. | Superior accuracy and effective integration of structured and unstructured data; robust for small datasets. | Requires high computational resources; limited interpretability; validation needed on larger datasets. |
| 2023 [52] | CKD dietary support | Evaluated ChatGPT 3.5, ChatGPT 4, Bard AI, and Bing Chat in identifying mineral (potassium and phosphorus) content in foods for CKD patients. | Transformer-based generative AI models tailored for text-to-information tasks; ChatGPT 4 offers improved precision and complexity handling over 3.5. | ChatGPT 4: 81.0% accuracy for potassium; Bard AI: 100.0% accuracy for phosphorus; Bing Chat: 89.0% for phosphorus. | High potential for assisting healthcare providers in CKD diet planning; efficient categorization of nutritional content. | Inconsistent performance across food categories; requires human oversight to ensure reliability in clinical application. |
| 2023 [53] | Nephrology literature search | Evaluated ChatGPT-3.5, Bing Chat, and Bard AI for citation accuracy in nephrology literature searches. | GPT-3.5, GPT-4 (ChatGPT), Bing Chat (GPT-4), and Bard AI (PaLM 2); tested with 12 nephrology topics and 20 references per topic. | ChatGPT-3.5: 38.0% accurate references; Bing Chat: 30.0%; Bard AI: 3.0%, with the highest rate of fabricated references (63.0%). | Highlights potential for automating citation generation; ChatGPT showed the highest accuracy. | High rate of fabricated and inaccurate references; inconsistent performance across platforms; requires human verification. |
| 2023 [54] | Nephrology citation validation | Assessed ChatGPT-3.5’s ability to identify and validate references in nephrology literature. | GPT-3.5. | Of 610 references, 62.0% existed, 20.0% were accurate, 31.0% were fabricated, and 7.0% were incomplete; only 20.0% included correct links. | Time-efficient for generating initial reference lists; identifies relevant references quickly. | High rate of fabricated references; poor accuracy in DOIs/links (68.0%); weak reliability in specialized domains such as peritoneal dialysis. |
| 2023 [55] | Kidney transplant care | Proposed integrating AI-powered chatbots to enhance kidney transplant care in patient education, clinical decision support, and medication management. | Leveraged NLP in GenAI chatbots such as ChatGPT; integrated with EHR systems for real-time insights and personalized patient education. | Demonstrated potential for improving adherence to treatment plans and addressing disparities in kidney transplant care. | Personalizes patient education; supports treatment compliance; enhances healthcare provider efficiency. | Risks of over-reliance on chatbot recommendations; challenges with AI biases, integration into existing EHR systems, and ethical concerns. |
| 2023 [56] | Patient education in urology | Evaluated ChatGPT-4.0 for providing information on bladder cancer, prostate cancer, renal cancer, benign prostatic hypertrophy (BPH), and urinary stones. | ChatGPT-4.0 evaluated using DISCERN and informed-consent quality assessments; analyzed for correctness and word count. | DISCERN score of 4/5 for most conditions; accurate but lacked sources and comprehensive coverage. | Clear explanations of pathologies and treatments; emphasized patient-doctor consultation; addressed risks and recovery for surgeries. | Missing bibliographic references; lacked depth for complex treatments; omitted alternative options unless explicitly queried. |
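The domain-adaptation recipe behind AKI-BERT [50], namely continued masked-language-model pretraining of a clinical BERT on disease-specific notes, can be outlined as follows. This is a schematic sketch only: the base checkpoint, corpus file, and hyperparameters are placeholders, not those of the original study.

```python
# Minimal sketch of domain-adaptive MLM pretraining in the spirit of AKI-BERT [50].
# Checkpoint, data file, and hyperparameters are illustrative assumptions.
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

checkpoint = "emilyalsentzer/Bio_ClinicalBERT"  # assumed clinical base model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# Hypothetical corpus: one de-identified nephrology note per line.
notes = load_dataset("text", data_files={"train": "aki_notes.txt"})
tokenized = notes.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"])

# Randomly mask 15% of tokens, the standard MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="aki-bert", num_train_epochs=1,
                           per_device_train_batch_size=8),
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()  # the adapted encoder is then fine-tuned for prediction tasks
```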
Table 6. Summary of recent work in 2024.

| Year | Focus | Contribution | Architecture & Details | Result | Strengths | Limitations |
|---|---|---|---|---|---|---|
| 2024 [57] | CKD risk stratification | HERBERT: a BERT-based model tailored for EHR data with custom embeddings (age, sex, visit) and a next-visit prediction task. | Transformer architecture with MLM; 2025-token vocabulary. | ROC AUC: 91.0% (1-year), 86.0% (2-year), 82.0% (5-year). | Superior temporal context handling and improved long-term prediction. | Limited dataset size; requires validation on diverse cohorts; lacks integration of lab results. |
| 2024 [58] | AKI and continuous renal replacement therapy (CRRT) prediction in the ICU | Multimodal model combining LSTM (time series) and BioMedBERT (clinical notes) to predict AKI and CRRT needs 12 h in advance. | LSTM for structured time-series data and BioMedBERT for unstructured clinical notes, integrated via a multimodal encoder; SHAP used for interpretability. | AUROC: 88.8% (AKI), 99.7% (CRRT); AUPRC: 72.7% (AKI), 84.0% (CRRT). | Effective integration of structured and unstructured data; enhanced prediction accuracy and interpretability. | Imbalanced dataset for CRRT; computationally intensive multimodal architecture. |
| 2024 [59] | Renal cell carcinoma (RCC) diagnosis | Compared GPT-3.5 and GPT-4.0 for addressing RCC-related clinical queries. | GPT-3.5 Turbo and GPT-4.0; fine-tuned GPT-3.5 with RCC-specific clinical questions using iterative training. | GPT-3.5: 67.1% accuracy; GPT-4.0: 77.5%; fine-tuned GPT-3.5: 93.8%, reaching 100.0% accuracy with iterative training. | Enhanced accuracy with iterative fine-tuning; improved handling of RCC-specific clinical queries. | Initial instability in responses due to limited RCC-specific data; requires iterative optimization for significant improvements. |
| 2024 [60] | Urolithiasis (kidney stones) diagnosis and treatment | Investigated GPT-4 for urolithiasis care, including diagnosis and urgent care. | GPT-4 evaluated using 11 clinical scenarios; responses assessed for guideline compliance by experienced urologists. | Correct in initial diagnostic scenarios; 73.0% partial compliance; errors in 6/11 responses. | Accurate diagnostic suggestions; potential for patient communication and initial treatment planning. | Poor adherence to surgical guidelines; lacked detailed drug recommendations; inconsistent response accuracy. |
| 2024 [61] | Visual communication in nephrology | Use of Microsoft Copilot with DALL-E to create educational kidney-related visuals for diagnostics and patient education. | DALL-E 3 integrated with Microsoft Copilot. | Produced accurate, engaging visuals for nephrology education and diagnostics, improving communication efficiency. | Easy to use; accessible integration with Microsoft 365; supports nephrology-specific visual content creation. | Challenges with precise anatomical accuracy; iterative refinement needed for medical-grade visuals; advanced features require premium subscriptions. |
| 2024 [62] | Medical education and presentations | Comprehensive guide for medical educators to create AI-generated images for presentations, focusing on practical and cost-effective methods. | Text-to-image models (DALL-E); emphasized prompt engineering and multimedia principles. | Generated accurate, culturally sensitive, and diverse visuals tailored to educational needs, enhancing engagement. | Cost-effective, customizable tools for creating compelling educational presentations. | Current text-to-image models lack precision for complex medical visuals; ethical concerns with patient representation and copyright. |
| 2024 [63] | Pediatric emergency medicine | Use of DALL-E to create child-friendly visuals of medical procedures to reduce anxiety and improve understanding. | DALL-E 3; text encoder and generative diffusion model for creating illustrations of medical procedures. | Reduced patient anxiety and improved compliance during procedures; increased patient satisfaction. | Enhanced patient engagement and communication, particularly in pediatric settings; helpful for non-English-speaking families. | Dependent on precise text input; potential for anatomically inaccurate or confusing images. |
| 2024 [64] | Clinical nephrology | Comparative analysis of ChatGPT-4, Gemini Pro, and Bard on nephrology-related queries and renal biopsy interpretation. | Evaluated using 21 nephrology-related questions and 3 renal biopsy reports; metrics included TF-IDF, BERTScore, ROUGE, and empathy ratings. | ChatGPT-4: highest empathy (79.9%); Gemini Pro: best biopsy appropriateness; Bard: best dialysis-related helpfulness. | Comprehensive assessment across empathy, appropriateness, and helpfulness; improved reference accuracy. | Limited to predefined questions and reports; small evaluator group; dependence on model-specific strengths limits generalizability. |
| 2024 [65] | Postoperative complication analysis for renal surgeries | Evaluated GPT-4’s ability to detect and classify postoperative complications using discharge summaries from renal surgeries. | GPT-4 with zero-shot and few-shot prompts. | 79.6% match rate with human-curated data for complication detection; accuracy: GPT-4 86.7%, human 92.9%. | Potential for automating complication detection; reduced manual workload for initial analysis. | Poor performance in advanced categorization; prone to false positives/negatives; requires human oversight. |
| 2024 [66] | Clinical decision support in AKI | Evaluated GPT-4 and GenAI integration into AKI e-alert systems for early detection and management. | Multimodal generative AI leveraging EHR inputs such as labs, vitals, and clinical notes. | Hypothetical trials indicated improved detection accuracy and management recommendations for AKI cases. | Scalable solution for real-time AKI management; improves physician efficiency and patient safety. | Challenges with specificity in complex cases; risk of alert fatigue; interpretability issues limit full adoption. |
| 2024 [67] | Renal cancer education | Evaluated ChatGPT-4, Gemini AI, and Perplexity AI against authoritative patient education materials for accuracy and readability. | ChatGPT-4, Gemini AI, and Perplexity AI. | ChatGPT-4: highest accuracy; Gemini AI: best readability. | High accuracy and improved readability in patient education; adaptable to simplifying text on request. | Variability across topics; struggled to achieve uniform 6th-grade readability; minor inaccuracies in treatment-related queries. |
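Several of the 2024 studies in Table 6, notably the postoperative-complication work [65], rely on zero- and few-shot prompting rather than fine-tuning. The sketch below shows the general pattern of assembling a few-shot prompt; the example summaries and labels are invented for illustration, and any clinical use would require validation against expert review.

```python
# Sketch of the few-shot prompting pattern used in studies like [65].
# Example summaries and labels are invented, not data from the study.
FEW_SHOT = [
    ("Patient recovered without fever; wound clean and dry at discharge.",
     "No complication"),
    ("Readmitted on day 3 with fever and purulent drainage from incision.",
     "Complication: surgical site infection"),
]

def build_prompt(summary: str) -> str:
    """Assemble a few-shot classification prompt for a chat model."""
    shots = "\n\n".join(f"Summary: {s}\nAssessment: {a}" for s, a in FEW_SHOT)
    return (
        "Classify whether the discharge summary describes a postoperative "
        "complication after renal surgery.\n\n"
        f"{shots}\n\nSummary: {summary}\nAssessment:"
    )

print(build_prompt("Transient rise in creatinine on day 2, resolved by day 4."))
```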
Table 7. ChatGPT responses to text-based prompts related to kidney health.

| Prompt | ChatGPT Response |
|---|---|
| P1: Does frequent urination mean kidneys are functioning normally? | R1 (text): Frequent urination indicates active urine production but not necessarily normal kidney function. It can result from high fluid intake, diuretics, medications, or conditions like UTIs, diabetes, or overactive bladder. Consult a healthcare professional for concerns. |
| P2: How can it be determined if someone has kidney disease (KD)? | R2 (text): Symptoms of KD include fatigue, swelling, changes in urination, foamy urine, back pain, high blood pressure, decreased appetite, nausea, and difficulty concentrating. Persistent symptoms require medical evaluation through blood tests, urine tests, and imaging. |
| P3: Can kidney problems occur without pain? | R3 (text): Yes, kidney issues such as CKD may not cause pain in early stages. Symptoms to watch for include changes in urination, fatigue, swelling, high blood pressure, and anemia. Regular check-ups and tests are essential for early detection. Consult a healthcare professional if kidney problems are suspected. |
| P4: Is anemia related to kidney failure? | R4 (text): Anemia is common in kidney failure due to reduced red blood cell production, leading to fatigue and weakness. Managing anemia is crucial when treating KD patients. |
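Interactions like those in Table 7 can also be scripted for systematic evaluation. A minimal sketch using the OpenAI Python client is shown below; the model name and system prompt are assumptions, and responses should be reviewed by clinicians before any patient-facing use.

```python
# Hedged sketch of reproducing a Table 7 interaction programmatically.
# Model name and system prompt are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system",
         "content": "You are a kidney-health educator. Answer briefly and "
                    "advise consulting a clinician for medical concerns."},
        {"role": "user",
         "content": "Does frequent urination mean kidneys are functioning normally?"},
    ],
)
print(response.choices[0].message.content)  # compare against Table 7, R1
```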