Search Results (34)

Search Parameters:
Keywords = general-purpose LLM

28 pages, 1711 KB  
Article
Identifying Literary Microgenres and Writing Style Differences in Romanian Novels with ReaderBench and Large Language Models
by Aura Cristina Udrea, Stefan Ruseti, Vlad Pojoga, Stefan Baghiu, Andrei Terian and Mihai Dascalu
Future Internet 2025, 17(9), 397; https://doi.org/10.3390/fi17090397 - 30 Aug 2025
Viewed by 147
Abstract
Recent developments in natural language processing, particularly large language models (LLMs), create new opportunities for literary analysis in underexplored languages like Romanian. This study investigates stylistic heterogeneity and genre blending in 175 late 19th- and early 20th-century Romanian novels, each classified by literary historians into one of 17 genres. Our findings reveal that most novels do not adhere to a single genre label but instead combine elements of multiple (micro)genres, challenging traditional single-label classification approaches. We employed a dual computational methodology combining analysis based on Romanian-tailored linguistic features with general-purpose LLMs. ReaderBench, a Romanian-specific framework, was utilized to extract surface, syntactic, semantic, and discourse features, capturing fine-grained linguistic patterns. In parallel, we prompted two LLMs (Llama3.3 70B and DeepSeek-R1 70B) to predict genres at the paragraph level, leveraging their ability to detect contextual and thematic coherence across multiple narrative scales. Statistical analyses using Kruskal–Wallis and Mann–Whitney tests identified genre-defining features at both novel and chapter levels. The integration of these complementary approaches enhances microgenre detection beyond traditional classification capabilities. ReaderBench provides quantifiable linguistic evidence, while LLMs capture broader contextual patterns; together, they provide a multi-layered perspective on literary genre that reflects the complex and heterogeneous character of fictional texts. Our results argue that both language-specific and general-purpose computational tools can effectively detect stylistic diversity in Romanian fiction, opening new avenues for computational literary analysis in low-resource languages. Full article
(This article belongs to the Special Issue Artificial Intelligence (AI) and Natural Language Processing (NLP))
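The Kruskal–Wallis and Mann–Whitney tests mentioned in this abstract are standard non-parametric tests available in SciPy. A minimal sketch of how genre-defining features could be screened this way; the genre names and feature values below are invented for illustration, not the study's data:

```python
# Sketch (not the authors' code): test whether a linguistic feature
# differs across genres, then locate pairwise differences.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical "average sentence length" per novel, grouped by genre
genres = {
    "adventure": rng.normal(18, 3, 30),
    "social":    rng.normal(22, 3, 30),
    "romance":   rng.normal(21, 3, 30),
}

# Kruskal-Wallis: does the feature distribution differ across any genre?
h, p = stats.kruskal(*genres.values())
print(f"Kruskal-Wallis H={h:.2f}, p={p:.4f}")

# If significant, pairwise Mann-Whitney U tests locate the differences
if p < 0.05:
    for a in genres:
        for b in genres:
            if a < b:  # visit each unordered pair once
                u, p_ab = stats.mannwhitneyu(genres[a], genres[b])
                print(f"{a} vs {b}: U={u:.0f}, p={p_ab:.4f}")
```

In practice a multiple-comparison correction (e.g. Bonferroni) would be applied to the pairwise p-values.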

36 pages, 590 KB  
Review
Machine Translation in the Era of Large Language Models: A Survey of Historical and Emerging Problems
by Duygu Ataman, Alexandra Birch, Nizar Habash, Marcello Federico, Philipp Koehn and Kyunghyun Cho
Information 2025, 16(9), 723; https://doi.org/10.3390/info16090723 - 25 Aug 2025
Viewed by 980
Abstract
Historically regarded as one of the most challenging tasks on the path to complete artificial intelligence (AI), machine translation (MT) research has seen continuous devotion over the past decade, resulting in cutting-edge architectures for the modeling of sequential information. While the majority of statistical models traditionally relied on the idea of learning from parallel translation examples, recent research exploring self-supervised and multi-task learning methods extended the capabilities of MT models, eventually allowing the creation of general-purpose large language models (LLMs). In addition to versatility in providing translations useful across languages and domains, LLMs can in principle perform any natural language processing (NLP) task given a sufficient number of task-specific examples. While LLMs now reach a point where they can both replace and augment traditional MT models, the extent of their advantages and the ways in which they leverage translation capabilities across multilingual NLP tasks remain a wide area for exploration. In this literature survey, we present an introduction to the current position of MT research with a historical look at different modeling approaches to MT, how these might be advantageous for the solution of particular problems, and which problems are solved or remain open in regard to recent developments. We also discuss the connection of MT models leading to the development of prominent LLM architectures, how they continue to support LLM performance across different tasks by providing a means for cross-lingual knowledge transfer, and the redefinition of the task with the possibilities that LLM technology brings. Full article
(This article belongs to the Special Issue Human and Machine Translation: Recent Trends and Foundations)

26 pages, 1810 KB  
Article
A Memetic and Reflective Evolution Framework for Automatic Heuristic Design Using Large Language Models
by Fubo Qi, Tianyu Wang, Ruixiang Zheng and Mian Li
Appl. Sci. 2025, 15(15), 8735; https://doi.org/10.3390/app15158735 - 7 Aug 2025
Viewed by 463
Abstract
The increasing complexity of real-world engineering problems, ranging from manufacturing scheduling to resource optimization in smart grids, has driven demand for adaptive and high-performing heuristic methods. Automatic Heuristic Design (AHD) and neural-enhanced metaheuristics have shown promise in automating strategy development, but often suffer from limited flexibility and scalability due to static operator libraries or high retraining costs. Recently, Large Language Models (LLMs) have emerged as a powerful alternative for exploring and evolving heuristics through natural language and program synthesis. This paper proposes a novel LLM-based memetic framework that synergizes LLM-driven exploration with domain-specific local refinement and memory-aware reflection, enabling a dynamic balance between heuristic creativity and effectiveness. In the experiments, the developed framework outperforms other LLM-based state-of-the-art approaches across the designed AGV-drone scheduling scenario and two benchmark combinatorial problems. The findings suggest that LLMs can serve not only as general-purpose optimizers but also as interpretable heuristic generators that adapt efficiently to complex and heterogeneous domains. Full article
(This article belongs to the Section Computing and Artificial Intelligence)

23 pages, 2002 KB  
Article
Precision Oncology Through Dialogue: AI-HOPE-RTK-RAS Integrates Clinical and Genomic Insights into RTK-RAS Alterations in Colorectal Cancer
by Ei-Wen Yang, Brigette Waldrup and Enrique Velazquez-Villarreal
Biomedicines 2025, 13(8), 1835; https://doi.org/10.3390/biomedicines13081835 - 28 Jul 2025
Viewed by 710
Abstract
Background/Objectives: The RTK-RAS signaling cascade is a central axis in colorectal cancer (CRC) pathogenesis, governing cellular proliferation, survival, and therapeutic resistance. Somatic alterations in key pathway genes—including KRAS, NRAS, BRAF, and EGFR—are pivotal to clinical decision-making in precision oncology. However, the integration of these genomic events with clinical and demographic data remains hindered by fragmented resources and a lack of accessible analytical frameworks. To address this challenge, we developed AI-HOPE-RTK-RAS, a domain-specialized conversational artificial intelligence (AI) system designed to enable natural language-based, integrative analysis of RTK-RAS pathway alterations in CRC. Methods: AI-HOPE-RTK-RAS employs a modular architecture combining large language models (LLMs), a natural language-to-code translation engine, and a backend analytics pipeline operating on harmonized multi-dimensional datasets from cBioPortal. Unlike general-purpose AI platforms, this system is purpose-built for real-time exploration of RTK-RAS biology within CRC cohorts. The platform supports mutation frequency profiling, odds ratio testing, survival modeling, and stratified analyses across clinical, genomic, and demographic parameters. Validation included reproduction of known mutation trends and exploratory evaluation of co-alterations, therapy response, and ancestry-specific mutation patterns. Results: AI-HOPE-RTK-RAS enabled rapid, dialogue-driven interrogation of CRC datasets, confirming established patterns and revealing novel associations with translational relevance. Among early-onset CRC (EOCRC) patients, the prevalence of RTK-RAS alterations was significantly lower compared to late-onset disease (67.97% vs. 79.9%; OR = 0.534, p = 0.014), suggesting the involvement of alternative oncogenic drivers. 
In KRAS-mutant patients receiving Bevacizumab, early-stage disease (Stages I–III) was associated with superior overall survival relative to Stage IV (p = 0.0004). In contrast, BRAF-mutant tumors with microsatellite-stable (MSS) status displayed poorer prognosis despite higher chemotherapy exposure (OR = 7.226, p < 0.001). Among EOCRC patients treated with FOLFOX, RTK-RAS alterations were linked to worse outcomes (p = 0.0262). The system also identified ancestry-enriched noncanonical mutations—including CBL, MAPK3, and NF1—with NF1 mutations significantly associated with improved prognosis (p = 1 × 10⁻⁵). Conclusions: AI-HOPE-RTK-RAS exemplifies a new class of conversational AI platforms tailored to precision oncology, enabling integrative, real-time analysis of clinically and biologically complex questions. Its ability to uncover both canonical and ancestry-specific patterns in RTK-RAS dysregulation—especially in EOCRC and populations with disproportionate health burdens—underscores its utility in advancing equitable, personalized cancer care. This work demonstrates the translational potential of domain-optimized AI tools to accelerate biomarker discovery, support therapeutic stratification, and democratize access to multi-omic analysis. Full article

28 pages, 4702 KB  
Article
Clinical Failure of General-Purpose AI in Photographic Scoliosis Assessment: A Diagnostic Accuracy Study
by Cemre Aydin, Ozden Bedre Duygu, Asli Beril Karakas, Eda Er, Gokhan Gokmen, Anil Murat Ozturk and Figen Govsa
Medicina 2025, 61(8), 1342; https://doi.org/10.3390/medicina61081342 - 25 Jul 2025
Viewed by 764
Abstract
Background and Objectives: General-purpose multimodal large language models (LLMs) are increasingly used for medical image interpretation despite lacking clinical validation. This study evaluates the diagnostic reliability of ChatGPT-4o and Claude 2 in photographic assessment of adolescent idiopathic scoliosis (AIS) against radiological standards. This study examines two critical questions: whether families can derive reliable preliminary assessments from LLMs through analysis of clinical photographs and whether LLMs exhibit cognitive fidelity in their visuospatial reasoning capabilities for AIS assessment. Materials and Methods: A prospective diagnostic accuracy study (STARD-compliant) analyzed 97 adolescents (74 with AIS and 23 with postural asymmetry). Standardized clinical photographs (nine views/patient) were assessed by two LLMs and two orthopedic residents against reference radiological measurements. Primary outcomes included diagnostic accuracy (sensitivity/specificity), Cobb angle concordance (Lin’s CCC), inter-rater reliability (Cohen’s κ), and measurement agreement (Bland–Altman LoA). Results: The LLMs exhibited hazardous diagnostic inaccuracy: ChatGPT misclassified all non-AIS cases (specificity 0% [95% CI: 0.0–14.8]), while Claude 2 generated 78.3% false positives. Systematic measurement errors exceeded clinical tolerance: ChatGPT overestimated thoracic curves by +10.74° (LoA: −21.45° to +42.92°), exceeding tolerance by >800%. Both LLMs showed inverse biomechanical concordance in thoracolumbar curves (CCC ≤ −0.106). Inter-rater reliability fell below random chance (ChatGPT κ = −0.039). Universal proportional bias (slopes ≈ −1.0) caused severe curve underestimation (e.g., 10–15° error for 50° deformities). Human evaluators demonstrated superior bias control (0.3–2.8° vs. 2.6–10.7°) but suboptimal specificity (21.7–26.1%) and hazardous lumbar concordance (CCC: −0.123). 
Conclusions: General-purpose LLMs demonstrate clinically unacceptable inaccuracy in photographic AIS assessment, contraindicating clinical deployment. Catastrophic false positives, systematic measurement errors exceeding tolerance by 480–1074%, and inverse diagnostic concordance necessitate urgent regulatory safeguards under frameworks like the EU AI Act. Neither LLMs nor photographic human assessment achieve reliability thresholds for standalone screening, mandating domain-specific algorithm development and integration of 3D modalities. Full article
(This article belongs to the Special Issue Diagnosis and Treatment of Adolescent Idiopathic Scoliosis)
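Lin's concordance correlation coefficient (CCC), the agreement statistic reported in this abstract, can be computed directly from its definition. A minimal sketch with hypothetical Cobb-angle values (not the study's data), showing how a constant measurement bias lowers CCC even when correlation is perfect:

```python
# Sketch of Lin's CCC: penalizes both poor correlation and systematic bias.
import numpy as np

def lins_ccc(x, y):
    """Lin's concordance correlation coefficient for paired measurements."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    sxy = np.mean((x - x.mean()) * (y - y.mean()))  # population covariance
    return 2 * sxy / (x.var() + y.var() + (x.mean() - y.mean()) ** 2)

truth = np.array([12.0, 25.0, 40.0, 55.0, 30.0])   # hypothetical Cobb angles
pred  = truth + 10.0                               # constant +10 deg bias
print(lins_ccc(truth, truth))  # ~1.0: perfect concordance
print(lins_ccc(truth, pred))   # well below 1 despite Pearson r = 1
```

This is why the abstract's negative CCC values indicate inverse concordance: the predictions move opposite to the reference measurements.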

9 pages, 509 KB  
Article
Evaluation of Multiple-Choice Tests in Head and Neck Ultrasound Created by Physicians and Large Language Models
by Jacob P. S. Nielsen, August Krogh Mikkelsen, Julian Kuenzel, Merry E. Sebelik, Gitta Madani, Tsung-Lin Yang and Tobias Todsen
Diagnostics 2025, 15(15), 1848; https://doi.org/10.3390/diagnostics15151848 - 22 Jul 2025
Viewed by 506
Abstract
Background/Objectives: Otolaryngologists are increasingly using head and neck ultrasound (HNUS). Determining whether a practitioner of HNUS has achieved adequate theoretical knowledge remains a challenge. This study assesses the performance of two large language models (LLMs) in generating multiple-choice questions (MCQs) for head and neck ultrasound, compared with MCQs generated by physicians. Methods: Physicians and LLMs (ChatGPT, using GPT-4o, and Google Gemini, using Gemini Advanced) created a total of 90 MCQs that covered the topics of lymph nodes, thyroid, and salivary glands. Experts in HNUS additionally evaluated all physician-drafted MCQs using a Delphi-like process. The MCQs were assessed by an international panel of experts in HNUS, who were blinded to the source of the questions. The evaluation used a Likert scale and was based on an overall assessment covering six criteria: clarity, relevance, suitability, quality of distractors, adequate rationale of the answer, and an assessment of the level of difficulty. Results: Four experts in the clinical field of HNUS assessed the 90 MCQs. No significant differences were observed between the two LLMs. Physician-drafted questions (n = 30) differed significantly from Google Gemini's in terms of relevance, suitability, and adequate rationale of the answer, but only in terms of suitability compared with ChatGPT. Compared to MCQ items (n = 16) validated by medical experts, LLM-constructed MCQ items scored significantly lower across all criteria. The difficulty level of the MCQs was the same. Conclusions: Our study demonstrates that both LLMs could be used to generate MCQ items with a quality comparable to drafts from physicians. However, the quality of LLM-generated MCQ items was still significantly lower than MCQs validated by ultrasound experts.
LLMs are therefore a cost-effective way to generate quick drafts of MCQ items, which should then be validated by experts before being used for assessment purposes. In this way, the value of LLMs lies not in eliminating human experts but in saving them substantial drafting time. Full article
(This article belongs to the Special Issue Advances in Head and Neck Ultrasound)

18 pages, 1554 KB  
Article
ChatCVD: A Retrieval-Augmented Chatbot for Personalized Cardiovascular Risk Assessment with a Comparison of Medical-Specific and General-Purpose LLMs
by Wafa Lakhdhar, Maryam Arabi, Ahmed Ibrahim, Abdulrahman Arabi and Ahmed Serag
AI 2025, 6(8), 163; https://doi.org/10.3390/ai6080163 - 22 Jul 2025
Viewed by 696
Abstract
Large language models (LLMs) are increasingly being applied to clinical tasks, but it remains unclear whether medical-specific models consistently outperform smaller, general-purpose ones. This study investigates that assumption in the context of cardiovascular disease (CVD) risk assessment. We fine-tuned eight LLMs—both general-purpose and medical-specific—using textualized data from the Behavioral Risk Factor Surveillance System (BRFSS) to classify individuals as “High Risk” or “Low Risk”. To provide actionable insights, we integrated a Retrieval-Augmented Generation (RAG) framework for personalized recommendation generation and deployed the system within an interactive chatbot interface. Notably, Gemma2, a compact 2B-parameter general-purpose model, achieved a high recall (0.907) and F1-score (0.770), performing on par with larger or medical-specialized models such as Med42 and BioBERT. These findings challenge the common assumption that larger or specialized models always yield superior results, and highlight the potential of lightweight, efficiently fine-tuned LLMs for clinical decision support—especially in resource-constrained settings. Overall, our results demonstrate that general-purpose models, when fine-tuned appropriately, can offer interpretable, high-performing, and accessible solutions for CVD risk assessment and personalized healthcare delivery. Full article
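The abstract describes converting tabular BRFSS records into text before fine-tuning. One plausible reading of that "textualization" step, sketched with hypothetical field names and prompt wording that are not the authors' actual template:

```python
# Hypothetical sketch of textualizing a tabular risk-factor record into a
# classification prompt; field names and phrasing are invented, not BRFSS's
# actual schema or the paper's template.
def textualize(record: dict) -> str:
    parts = [f"{k.replace('_', ' ')}: {v}" for k, v in record.items()]
    return ("Patient profile - " + "; ".join(parts) +
            ". Classify cardiovascular risk as High Risk or Low Risk.")

row = {"age group": "55-59", "smoker": "yes", "bmi": 31.2,
       "physical activity": "low", "diabetes": "no"}
print(textualize(row))
```

The resulting string would then be paired with the ground-truth label for supervised fine-tuning.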

14 pages, 320 KB  
Article
Evaluating Large Language Models in Cardiology: A Comparative Study of ChatGPT, Claude, and Gemini
by Michele Danilo Pierri, Michele Galeazzi, Simone D’Alessio, Melissa Dottori, Irene Capodaglio, Christian Corinaldesi, Marco Marini and Marco Di Eusanio
Hearts 2025, 6(3), 19; https://doi.org/10.3390/hearts6030019 - 19 Jul 2025
Viewed by 2591
Abstract
Background: Large Language Models (LLMs) such as ChatGPT, Claude, and Gemini are being increasingly adopted in medicine; however, their reliability in cardiology remains underexplored. Purpose of the study: To compare the performance of three general-purpose LLMs in response to cardiology-related clinical queries. Study design: Seventy clinical prompts stratified by diagnostic phase (pre or post) and user profile (patient or physician) were submitted to ChatGPT, Claude, and Gemini. Three expert cardiologists, who were blinded to the model’s identity, rated each response on scientific accuracy, completeness, clarity, and coherence using a 5-point Likert scale. Statistical analysis included Kruskal–Wallis tests, Dunn’s post hoc comparisons, Kendall’s W, weighted kappa, and sensitivity analyses. Results: ChatGPT outperformed both Claude and Gemini across all criteria (mean scores: 3.7–4.2 vs. 3.4–4.0 and 2.9–3.7, respectively; p < 0.001). The inter-rater agreement was substantial (Kendall’s W: 0.61–0.71). Pre-diagnostic and patient-framed prompts received higher scores than post-diagnostic and physician-framed ones. Results remained robust across sensitivity analyses. Conclusions: Among the evaluated LLMs, ChatGPT demonstrated superior performance in generating clinically relevant cardiology responses. However, none of the models achieved maximal ratings, and the performance varied by context. These findings highlight the need for domain-specific fine-tuning and human oversight to ensure a safe clinical deployment. Full article
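Kendall's W, the inter-rater agreement statistic reported in this abstract, is not built into SciPy but follows directly from its definition. A minimal sketch (omitting the tie correction) with an invented three-rater rating matrix, not the study's data:

```python
# Sketch of Kendall's coefficient of concordance W for m raters x n items.
import numpy as np
from scipy import stats

def kendalls_w(ratings):
    """ratings: (m raters) x (n items). Returns W in [0, 1]; no tie correction."""
    ratings = np.asarray(ratings, float)
    m, n = ratings.shape
    # Rank each rater's scores across items (ties would get average ranks)
    ranks = np.apply_along_axis(stats.rankdata, 1, ratings)
    r_sums = ranks.sum(axis=0)
    s = ((r_sums - r_sums.mean()) ** 2).sum()   # spread of item rank sums
    return 12 * s / (m ** 2 * (n ** 3 - n))

scores = [[4, 3, 5, 2, 1],   # rater 1
          [4, 2, 5, 3, 1],   # rater 2
          [5, 3, 4, 2, 1]]   # rater 3
print(f"Kendall's W = {kendalls_w(scores):.2f}")
```

W near 1 means the raters rank the items almost identically; the study's 0.61 to 0.71 range corresponds to substantial agreement.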

39 pages, 4950 KB  
Systematic Review
Large Language Models’ Trustworthiness in the Light of the EU AI Act—A Systematic Mapping Study
by Md Masum Billah, Harry Setiawan Hamjaya, Hakima Shiralizade, Vandita Singh and Rafia Inam
Appl. Sci. 2025, 15(14), 7640; https://doi.org/10.3390/app15147640 - 8 Jul 2025
Viewed by 1025
Abstract
The recent advancements and emergence of rapidly evolving AI models, such as large language models (LLMs), have sparked interest among researchers and professionals. These models are ubiquitously being fine-tuned and applied across various fields such as healthcare, customer service and support, education, automated driving, and smart factories. This often leads to an increased level of complexity and challenges concerning the trustworthiness of these models, such as the generation of toxic content and high-confidence hallucinations that can lead to serious consequences. The European Union Artificial Intelligence Act (AI Act) is a regulation concerning artificial intelligence. The EU AI Act has proposed a comprehensive set of guidelines to ensure the responsible usage and development of general-purpose AI systems (such as LLMs) that may pose potential risks. The need arises for strengthened efforts to ensure that these high-performing LLMs adhere to the seven trustworthiness aspects (data governance, record-keeping, transparency, human oversight, accuracy, robustness, and cybersecurity) recommended by the AI Act. Our study systematically maps research, focusing on identifying the key trends in developing LLMs across different application domains to address the aspects of AI Act-based trustworthiness. Our study reveals recent trends indicating a growing interest in emerging models such as LLaMa and BARD, reflecting a shift in research priorities. GPT and BERT remain the most studied models, and newer alternatives like Mistral and Claude remain underexplored. Trustworthiness aspects like accuracy and transparency dominate the research landscape, while cybersecurity and record-keeping remain significantly underexamined. Our findings highlight the urgent need for a more balanced, interdisciplinary research approach to ensure LLM trustworthiness across diverse applications.
Expanding studies into underexplored, high-risk domains and fostering cross-sector collaboration can bridge existing gaps. Furthermore, this study also reveals domains (like telecommunication) which are underrepresented, presenting considerable research gaps and indicating a potential direction for the way forward. Full article

30 pages, 4736 KB  
Article
AutoGEEval: A Multimodal and Automated Evaluation Framework for Geospatial Code Generation on GEE with Large Language Models
by Huayi Wu, Zhangxiao Shen, Shuyang Hou, Jianyuan Liang, Haoyue Jiao, Yaxian Qing, Xiaopu Zhang, Xu Li, Zhipeng Gui, Xuefeng Guan and Longgang Xiang
ISPRS Int. J. Geo-Inf. 2025, 14(7), 256; https://doi.org/10.3390/ijgi14070256 - 30 Jun 2025
Cited by 2 | Viewed by 754
Abstract
Geospatial code generation is emerging as a key direction in the integration of artificial intelligence and geoscientific analysis. However, there remains a lack of standardized tools for automatic evaluation in this domain. To address this gap, we propose AutoGEEval, the first multimodal, unit-level automated evaluation framework for geospatial code generation tasks on the Google Earth Engine (GEE) platform powered by large language models (LLMs). Built upon the GEE Python API, AutoGEEval establishes a benchmark suite (AutoGEEval-Bench) comprising 1325 test cases that span 26 GEE data types. The framework integrates both question generation and answer verification components to enable an end-to-end automated evaluation pipeline—from function invocation to execution validation. AutoGEEval supports multidimensional quantitative analysis of model outputs in terms of accuracy, resource consumption, execution efficiency, and error types. We evaluate 18 state-of-the-art LLMs—including general-purpose, reasoning-augmented, code-centric, and geoscience-specialized models—revealing their performance characteristics and potential optimization pathways in GEE code generation. This work provides a unified protocol and foundational resource for the development and assessment of geospatial code generation models, advancing the frontier of automated natural language to domain-specific code translation. Full article

19 pages, 1273 KB  
Article
Beyond the Benchmark: A Customizable Platform for Real-Time, Preference-Driven LLM Evaluation
by George Zografos and Lefteris Moussiades
Electronics 2025, 14(13), 2577; https://doi.org/10.3390/electronics14132577 - 26 Jun 2025
Viewed by 1116
Abstract
The rapid progress of Large Language Models (LLMs) has intensified the demand for flexible evaluation frameworks capable of accommodating diverse user needs across a growing variety of applications. While numerous standardized benchmarks exist for evaluating general-purpose LLMs, they remain limited in both scope and adaptability, often failing to capture domain-specific quality criteria. In many specialized domains, suitable benchmarks are lacking, leaving practitioners without systematic tools to assess the suitability of LLMs for their specific tasks. This paper presents LLM PromptScope (LPS), a customizable, real-time evaluation framework that enables users to define qualitative evaluation criteria aligned with their domain-specific needs. LPS integrates a novel LLM-as-a-Judge mechanism that leverages multiple language models as evaluators, minimizing human involvement while incorporating subjective preferences into the evaluation process. We validate the proposed framework through experiments on widely used datasets (MMLU, Math, and HumanEval), comparing conventional benchmark rankings with preference-driven assessments across multiple state-of-the-art LLMs. Statistical analyses demonstrate that user-defined evaluation criteria can significantly impact model rankings, particularly in open-ended tasks where standard benchmarks offer limited guidance. The results highlight LPS’s potential as a practical decision-support tool, particularly valuable in domains lacking mature benchmarks, offering both flexibility and rigor in model selection for real-world deployment. Full article
(This article belongs to the Special Issue Advances in Algorithm Optimization and Computational Intelligence)

56 pages, 3118 KB  
Article
Semantic Reasoning Using Standard Attention-Based Models: An Application to Chronic Disease Literature
by Yalbi Itzel Balderas-Martínez, José Armando Sánchez-Rojas, Arturo Téllez-Velázquez, Flavio Juárez Martínez, Raúl Cruz-Barbosa, Enrique Guzmán-Ramírez, Iván García-Pacheco and Ignacio Arroyo-Fernández
Big Data Cogn. Comput. 2025, 9(6), 162; https://doi.org/10.3390/bdcc9060162 - 19 Jun 2025
Viewed by 1009
Abstract
Large-language-model (LLM) APIs demonstrate impressive reasoning capabilities, but their size, cost, and closed weights limit the deployment of knowledge-aware AI within biomedical research groups. At the other extreme, standard attention-based neural language models (SANLMs)—including encoder–decoder architectures such as Transformers, Gated Recurrent Units (GRUs), and Long Short-Term Memory (LSTM) networks—are computationally inexpensive. However, their capacity for semantic reasoning in noisy, open-vocabulary knowledge bases (KBs) remains unquantified. Therefore, we investigate whether compact SANLMs can (i) reason over hybrid OpenIE-derived KBs that integrate commonsense, general-purpose, and non-communicable-disease (NCD) literature; (ii) operate effectively on commodity GPUs; and (iii) exhibit semantic coherence as assessed through manual linguistic inspection. To this end, we constructed four training KBs by integrating ConceptNet (600k triples), a 39k-triple general-purpose OpenIE set, and an 18.6k-triple OpenNCDKB extracted from 1200 PubMed abstracts. Encoder–decoder GRU, LSTM, and Transformer models (1–2 blocks) were trained to predict the object phrase given the subject + predicate. Beyond token-level cross-entropy, we introduced the Meaning-based Selectional-Preference Test (MSPT): for each withheld triple, we masked the object, generated a candidate, and measured its surplus cosine similarity over a random baseline using word embeddings, with significance assessed via a one-sided t-test. Hyperparameter sensitivity (311 GRU/168 LSTM runs) was analyzed, and qualitative frame–role diagnostics completed the evaluation. Our results showed that all SANLMs learned effectively from the point of view of the cross entropy loss. In addition, our MSPT provided meaningful semantic insights: for the GRUs (256-dim, 2048-unit, 1-layer): mean similarity (μsts) of 0.641 to the ground truth vs. 0.542 to the random baseline (gap 12.1%; p<10180). 
For the 1-block Transformer, μ_STS = 0.551 vs. 0.511 (gap 4%; p < 10^−25). While Transformers minimized loss and accuracy variance, GRUs captured finer selectional preferences. Both architectures trained within 24 GB of GPU VRAM and produced linguistically acceptable, albeit over-generalized, biomedical assertions. Based on their observed performance, LSTM results were designated as baselines for comparison. Thus, properly tuned SANLMs can achieve statistically robust semantic reasoning over noisy, domain-specific KBs without reliance on massive LLMs. Their interpretability, minimal hardware footprint, and open weights promote equitable AI research, opening new avenues for automated NCD knowledge synthesis, surveillance, and decision support. Full article
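The MSPT procedure described above is easy to reproduce in outline: for each withheld triple, compare the generated candidate's cosine similarity to the true object against its similarity to a randomly drawn object, then test whether the mean gap is positive. A minimal sketch under stated assumptions (function names are ours, not the paper's; a normal approximation stands in for the t distribution, which is reasonable at the test-set sizes reported):

```python
import math
from statistics import NormalDist

def cosine(u, v):
    # cosine similarity between two embedding vectors
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def mspt_pvalue(gaps):
    """One-sided test that the mean similarity gap
    (candidate-vs-truth minus candidate-vs-random) exceeds zero."""
    n = len(gaps)
    mean = sum(gaps) / n
    var = sum((g - mean) ** 2 for g in gaps) / (n - 1)
    t = mean / math.sqrt(var / n)      # one-sample t statistic
    p = 1.0 - NormalDist().cdf(t)      # normal approximation for large n
    return mean, p

# One gap per withheld triple:
#   cosine(candidate, truth) - cosine(candidate, random_object)
mean_gap, p = mspt_pvalue([0.10, 0.12, 0.11, 0.09, 0.13])
```

With consistently positive gaps, the p-value collapses toward zero, which is the pattern the reported p < 10^−180 reflects at scale.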

15 pages, 1244 KB  
Article
Can AI-Based ChatGPT Models Accurately Analyze Hand–Wrist Radiographs? A Comparative Study
by Ahmet Yıldırım, Orhan Cicek and Yavuz Selim Genç
Diagnostics 2025, 15(12), 1513; https://doi.org/10.3390/diagnostics15121513 - 14 Jun 2025
Cited by 2 | Viewed by 946
Abstract
Background/Aims: The aim of this study was to evaluate the effectiveness of large language model (LLM)-based chatbot systems in predicting bone age and identifying growth stages, and to explore their potential as practical, infrastructure-independent alternatives to conventional methods and convolutional neural network (CNN)-based deep learning models. Methods: This study evaluated the performance of three ChatGPT-based models (GPT-4o, GPT-o4-mini-high, and GPT-o1-pro) in predicting bone age and growth stage using 90 anonymized hand–wrist radiographs (30 from each growth stage—pre-peak, peak, and post-peak—with equal male and female distribution). Reference standards were established by expert orthodontists using Fishman’s Skeletal Maturity Indicators (SMI) system and the Greulich–Pyle Atlas, with each radiograph analyzed by the three GPT models using standardized prompts. Model performances were evaluated through statistical analyses assessing agreement and prediction accuracy. Results: All models showed significant agreement with the reference values in bone age prediction (p < 0.001), with GPT-o1-pro having the highest concordance (Pearson r = 0.546). No statistically significant difference was observed in the mean absolute error (MAE) among the models (p > 0.05). The GPT-o4-mini-high model achieved an accuracy rate of 72.2% within a ±2-year deviation range for bone age prediction. The GPT-o1-pro and GPT-o4-mini-high models showed bias in the Bland–Altman analysis of bone age predictions; however, GPT-o1-pro yielded more reliable predictions with narrower limits of agreement. In terms of growth stage classification, the GPT-4o model achieved the highest agreement with the reference values (κ = 0.283, p < 0.001). Conclusions: This study shows that general-purpose GPT models can support bone age and growth stage prediction, with each model having distinct strengths.
While GPT models do not replace clinical examination, their contextual reasoning and ability to perform preliminary assessments without domain-specific training make them promising tools, though further development is needed. Full article
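The agreement statistics this study relies on (Pearson r, MAE, accuracy within a tolerance, and Cohen's κ) are standard and simple to compute. A minimal sketch, with illustrative function names and data rather than the study's own code:

```python
import math

def pearson_r(x, y):
    # linear correlation between predicted and reference bone ages
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def mae(pred, ref):
    # mean absolute error in years
    return sum(abs(p - r) for p, r in zip(pred, ref)) / len(pred)

def accuracy_within(pred, ref, tol=2.0):
    # share of predictions within +/- tol years of the reference
    return sum(abs(p - r) <= tol for p, r in zip(pred, ref)) / len(pred)

def cohen_kappa(a, b):
    # chance-corrected agreement for growth-stage labels
    n = len(a)
    labels = set(a) | set(b)
    po = sum(x == y for x, y in zip(a, b)) / n
    pe = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return (po - pe) / (1 - pe)
```

κ corrects raw label agreement for agreement expected by chance, which is why the reported κ = 0.283 is far lower than the raw percentage of matching stage labels would be.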

19 pages, 1492 KB  
Article
Metaverse and Digital Twins in the Age of AI and Extended Reality
by Ming Tang, Mikhail Nikolaenko, Ahmad Alrefai and Aayush Kumar
Architecture 2025, 5(2), 36; https://doi.org/10.3390/architecture5020036 - 30 May 2025
Viewed by 1384
Abstract
This paper explores the evolving relationship between Digital Twins (DTs) and the Metaverse, two foundational yet often conflated paradigms in digital architecture. While DTs function as mirrored models of real-world systems—integrating IoT, BIM, and real-time analytics to support decision-making—Metaverses are typically fictional, immersive, multi-user environments shaped by social, cultural, and speculative narratives. Through several research projects, the team investigates the divergence between DTs and Metaverses through the lens of their purpose, data structure, immersion, and interactivity, while highlighting areas of convergence driven by emerging technologies in Artificial Intelligence (AI) and Extended Reality (XR). This study examines how emerging technologies—such as AI, XR, and Large Language Models (LLMs)—are blurring the traditional boundaries between DTs and the Metaverse. By analyzing their divergent purposes, data structures, and interactivity modes, as well as hybrid applications (e.g., data-integrated virtual environments and AI-driven collaboration), it seeks to define the opportunities and challenges of this integration for architectural design, decision-making, and immersive user experiences. Our research spans multiple projects utilizing XR and AI to develop DTs and the Metaverse. The team assesses the capabilities of AI in DT environments, such as reality capture and smart building management. Concurrently, the team evaluates metaverse platforms for online collaboration and architectural education, focusing on features facilitating multi-user engagement. The paper presents evaluations of various virtual environment development pipelines, comparing traditional BIM+IoT workflows with novel approaches such as Gaussian Splatting and generative AI for content creation.
The team further explores the integration of LLMs in both domains, such as virtual agents or LLM-powered Non-Player Characters (NPCs), enabling autonomous interaction and enhancing user engagement within spatial environments. Finally, the paper argues that the once-distinct boundaries between DTs and the Metaverse are becoming increasingly porous. Hybrid digital spaces—such as virtual buildings with data-integrated twins and immersive, social metaverses—demonstrate this convergence. As digital environments mature, architects are uniquely positioned to shape these dual-purpose ecosystems, leveraging AI, XR, and spatial computing to fuse data-driven models with immersive, user-centered experiences. Full article
(This article belongs to the Special Issue Shaping Architecture with Computation)

42 pages, 551 KB  
Article
AI Reasoning in Deep Learning Era: From Symbolic AI to Neural–Symbolic AI
by Baoyu Liang, Yuchen Wang and Chao Tong
Mathematics 2025, 13(11), 1707; https://doi.org/10.3390/math13111707 - 23 May 2025
Cited by 1 | Viewed by 8342
Abstract
The pursuit of Artificial General Intelligence (AGI) demands AI systems that not only perceive but also reason in a human-like manner. While symbolic systems pioneered early breakthroughs in logic-based reasoning, such as MYCIN and DENDRAL, they suffered from brittleness and poor scalability. Conversely, modern deep learning architectures have achieved remarkable success in perception tasks, yet continue to fall short in interpretable and structured reasoning. This dichotomy has motivated growing interest in Neural–Symbolic AI, a paradigm that integrates symbolic logic with neural computation to unify reasoning and learning. This survey provides a comprehensive and technically grounded overview of AI reasoning in the deep learning era, with a particular focus on Neural–Symbolic AI. Beyond a historical narrative, we introduce a formal definition of AI reasoning and propose a novel three-dimensional taxonomy that organizes reasoning paradigms by representation form, task structure, and application context. We then systematically review recent advances—including Differentiable Logic Programming, abductive learning, program induction, logic-aware Transformers, and LLM-based symbolic planning—highlighting their technical mechanisms, capabilities, and limitations. In contrast to prior surveys, this work bridges symbolic logic, neural computation, and emergent generative reasoning, offering a unified framework to understand and compare diverse approaches. We conclude by identifying key open challenges such as symbolic–continuous alignment, dynamic rule learning, and unified architectures, and we aim to provide a conceptual foundation for future developments in general-purpose reasoning systems. Full article
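A recurring building block behind techniques this survey covers, such as Differentiable Logic Programming and logic-aware networks, is the relaxation of Boolean connectives to operations on truth values in [0, 1], so that rule satisfaction becomes a differentiable quantity. A generic sketch using the product t-norm, one common choice; the individual systems surveyed each define their own relaxations:

```python
def soft_and(a, b):
    return a * b                 # product t-norm

def soft_or(a, b):
    return a + b - a * b         # probabilistic sum (dual t-conorm)

def soft_not(a):
    return 1.0 - a

def soft_implies(body, head):
    # material implication: body -> head == not(body) or head
    return soft_or(soft_not(body), head)

# A rule like mortal(X) :- human(X) can be penalized by
# 1 - soft_implies(truth_human_x, truth_mortal_x),
# a loss that is differentiable in both truth values.
rule_loss = 1.0 - soft_implies(0.9, 0.8)
```

At the boundary values 0 and 1 these operators reduce to classical Boolean logic, which is what lets gradient-trained models and symbolic rules share one objective.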