Search Results (254)

Search Parameters:
Keywords = inter-model agreement

15 pages, 1460 KB  
Article
Smartphone-Based 3D Surface Imaging: A New Frontier in Digital Breast Assessment? Smartphone-Based Breast Assessment
by Nikolas Chrobot, Philipp Unbehaun, Konstantin Frank, Michael Alfertshofer, Wenko Smolka, Tobias Ettl, Alexandra Anker, Lukas Prantl, Vanessa Brébant and Robin Hartmann
J. Clin. Med. 2025, 14(17), 6233; https://doi.org/10.3390/jcm14176233 - 3 Sep 2025
Viewed by 180
Abstract
Background: Three-dimensional surface imaging is widely used in breast surgery. Recently, smartphone-based approaches have emerged. This investigation examines whether smartphone-based three-dimensional surface imaging provides clinically acceptable data in terms of accuracy when compared to a validated reference tool. Methods: Three-dimensional surface models were generated for 40 patients who underwent breast reconstruction surgery using the Vectra H2 (Canfield Scientific, Fairfield, NJ, USA) and the LiDAR sensor of an iPhone 15 Pro in conjunction with photogrammetry. The generated surface models were superimposed using CloudCompare’s ICP algorithm, followed by 14 linear surface-to-surface measurements to assess agreement between the three-dimensional surface models. Statistical methods included absolute error calculation, paired t-test, Bland–Altman analysis, and Intra-Class Correlation Coefficients to evaluate intra- and inter-rater reliability. Results: The average landmark-to-landmark deviation between smartphone-based and Vectra-based surface models was M = 2.997 mm (SD = 1.897 mm). No statistical differences were found in 13 of the 14 measurements for intra-rater comparison and in 12 of the 14 for inter-rater comparison. Intra-Class Correlation Coefficient values indicated good reliability, ranging from 0.873 to 0.993 (intra-rater) and 0.845 to 0.992 (inter-rater). Bland–Altman analyses confirmed moderate to reliable agreement in 13 of 14 measurements. Conclusions: Smartphone-based three-dimensional surface imaging presents promising possibilities for breast assessment. However, it may not yet be suitable for highly detailed breast assessments requiring accuracy below the 3 mm threshold. Full article
(This article belongs to the Special Issue Current Opinion of Reconstructive and Aesthetic Breast Surgery)
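The abstract above names Bland–Altman analysis as one of its agreement measures. As a rough illustration only, the usual bias and 95% limits of agreement for paired measurements can be sketched as follows. The function name `bland_altman` and the input values are hypothetical and are not taken from the study.

```python
from statistics import mean, stdev


def bland_altman(method_a, method_b):
    """Return (bias, lower limit, upper limit) for paired measurements.

    bias is the mean of the pairwise differences (a - b); the limits of
    agreement are bias +/- 1.96 times the sample SD of those differences.
    """
    if len(method_a) != len(method_b) or len(method_a) < 2:
        raise ValueError("need two equal-length sequences with at least 2 pairs")
    diffs = [a - b for a, b in zip(method_a, method_b)]
    bias = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation (n - 1 in the denominator)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Hypothetical distances in mm from two imaging methods (illustrative only).
bias, lower, upper = bland_altman([101.0, 98.5, 120.2], [99.0, 97.0, 121.0])
print(f"bias={bias:.2f} mm, limits of agreement=({lower:.2f}, {upper:.2f}) mm")
```

Whether limits of this width are clinically acceptable is a separate judgment; the abstract discusses accuracy relative to a 3 mm threshold.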

14 pages, 2854 KB  
Article
AI-Assisted Evaluation of Colon Cleanliness in Capsule Endoscopy Videos
by Pere Gilabert, Carolina Malagelada, Hagen Wenzek, Angus Watson, Alexander R. Robertson, Ádám Finta, Jordi Vitrià and Santi Seguí
Diagnostics 2025, 15(17), 2228; https://doi.org/10.3390/diagnostics15172228 - 3 Sep 2025
Viewed by 239
Abstract
(1) Background: Accurate evaluation of colon capsule endoscopy videos plays a pivotal role in diagnosing gastrointestinal disorders. A primary step in this process is assessing the cleanliness of the area of interest to determine its admissibility. This study introduces a system designed to assist physicians in evaluating the colon cleanliness score of capsule endoscopy videos. (2) Methods: The system uses a TransUNet architecture, a customized loss function, and a low-effort labeling method to propose cleanliness scores for previously unseen videos. The proposed model was evaluated on a dataset of 52 capsule endoscopy videos. Agreement with physicians was measured using Cohen’s kappa statistic. (3) Results: The system achieved a Cohen’s kappa agreement of 0.586 with physicians, which is notably higher than the intra-observer variability observed, measured at 0.546. Additionally, the system can show the cleanliness evolution throughout the entire video, helping justify the proposed score. (4) Conclusions: The proposed system demonstrates improved agreement with physicians compared to inter-physician agreement, showing its potential to support the cleanliness evaluation process in colon capsule endoscopy. The ability to visualize the cleanliness evolution across the video enhances the transparency and interpretability of the suggested score. Full article
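Cohen's kappa, the statistic reported in the abstract above, corrects raw agreement between two raters for the agreement expected by chance. A minimal sketch of the unweighted form is given below. The function name `cohen_kappa` and the example labels are assumptions made for illustration and do not reproduce the study's data or its exact procedure.

```python
from collections import Counter


def cohen_kappa(rater_1, rater_2):
    """Unweighted Cohen's kappa for two equal-length label sequences."""
    if len(rater_1) != len(rater_2) or not rater_1:
        raise ValueError("need two non-empty, equal-length sequences")
    n = len(rater_1)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_1, rater_2)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    c1, c2 = Counter(rater_1), Counter(rater_2)
    expected = sum(c1[label] * c2[label] for label in c1) / (n * n)
    if expected == 1:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical cleanliness grades from a model and a physician.
model = ["good", "good", "poor", "poor"]
physician = ["good", "poor", "poor", "poor"]
print(cohen_kappa(model, physician))
```

In this toy case the raw agreement is 0.75, the chance agreement is 0.5, and kappa is therefore 0.5.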

19 pages, 321 KB  
Article
Parents’ and Teachers’ Perspectives on Children’s Socio-Emotional Well-Being During Transition from Home to Kindergarten
by Sanja Tatalović Vorkapić and Tamara Komadina
Children 2025, 12(9), 1145; https://doi.org/10.3390/children12091145 - 28 Aug 2025
Viewed by 381
Abstract
Background: As the social-emotional well-being of children as a whole and specifically during the transition to kindergarten is of paramount importance, it is important to continuously research this topic using a multi-informant approach. Moreover, a further contribution of this study lies in addressing the substantial gap in the existing literature within this important field. Objectives: Starting from the Ecological-Dynamic Transition Model and the Positive Development and Resilience in Kindergarten (PERIK) Model, the main aim of this research was to analyze parents’ and teachers’ perceptions of children’s social-emotional well-being during the transition and adjustment, and the quality of transition and adjustment. Methods: The study was conducted on a sample of parents (N = 154; 147 mothers) and teachers from 4 kindergartens (N = 12, all female) as raters of children’s (N = 202; 82 girls) social-emotional well-being, using the PERIK scale and four questions on the quality of transition. Results: All PERIK-dimensions were rated as elevated based on parents’ ratings and moderate based on teachers’ ratings. Ratings of difficulties during transition decreased, and satisfaction with transition and adjustment and cooperation between parents and caregivers during transition increased (teachers’ ratings were lower than parents’ ratings). The average duration of adjustment in kindergarten was three weeks. Correlation analyses showed the expected significant correlations between the PERIK dimensions and the quality of transitions and adjustment of children. Inter-rater agreement analyses showed that effect sizes were predominantly large and that agreement between parent and teacher ratings was poor to medium. Conclusions: Although the study found that there are significant differences in perceptions of the relationship between PERIK-dimensions and satisfaction with children’s transition between teachers and parents, which was expected due to the assessment of children in different contexts, it is important to consider them both in future research. Full article
(This article belongs to the Special Issue Children’s Well-Being and Mental Health in an Educational Context)
38 pages, 4944 KB  
Article
Integrated Survey Classification and Trend Analysis via LLMs: An Ensemble Approach for Robust Literature Synthesis
by Eleonora Bernasconi, Domenico Redavid and Stefano Ferilli
Electronics 2025, 14(17), 3404; https://doi.org/10.3390/electronics14173404 - 27 Aug 2025
Viewed by 401
Abstract
This study proposes a novel, scalable framework for the automated classification and synthesis of survey literature by integrating state-of-the-art Large Language Models (LLMs) with robust ensemble voting techniques. The framework consolidates predictions from three independent models—GPT-4, LLaMA 3.3, and Claude 3—to generate consensus-based classifications, thereby enhancing reliability and mitigating individual model biases. We demonstrate the generalizability of our approach through comprehensive evaluation on two distinct domains: Question Answering (QA) systems and Computer Vision (CV) survey literature, using a dataset of 1154 real papers extracted from arXiv. Comprehensive visual evaluation tools, including distribution charts, heatmaps, confusion matrices, and statistical validation metrics, are employed to rigorously assess model performance and inter-model agreement. The framework incorporates advanced statistical measures, including k-fold cross-validation, Fleiss’ kappa for inter-rater reliability, and chi-square tests for independence to validate classification robustness. Extensive experimental evaluations demonstrate that this ensemble approach achieves superior performance compared to individual models, with accuracy improvements of 10.0% over the best single model on QA literature and 10.9% on CV literature. Furthermore, comprehensive cost–benefit analysis reveals that our automated approach reduces manual literature synthesis time by 95% while maintaining high classification accuracy (F1-score: 0.89 for QA, 0.87 for CV), making it a practical solution for large-scale literature analysis. The methodology effectively uncovers emerging research trends and persistent challenges across domains, providing researchers with powerful tools for continuous literature monitoring and informed decision-making in rapidly evolving scientific fields. Full article
(This article belongs to the Special Issue Knowledge Engineering and Data Mining, 3rd Edition)
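The abstract above combines ensemble voting across three models with Fleiss' kappa as an inter-rater statistic. The sketch below shows one simple way these two pieces can fit together. It assumes plain majority voting with ties resolved in favour of the first-listed model, which the abstract does not specify, and the function names and labels are hypothetical.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most frequent label; ties go to the earliest label listed."""
    counts = Counter(labels)
    best = max(counts.values())
    return next(label for label in labels if counts[label] == best)


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list of one label per rater."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")
    totals = Counter()
    p_bar = 0.0
    for item in ratings:
        counts = Counter(item)
        totals.update(counts)
        # Share of agreeing rater pairs for this item.
        p_bar += (sum(c * c for c in counts.values()) - n_raters) / (
            n_raters * (n_raters - 1)
        )
    p_bar /= n_items
    # Chance agreement from the overall label proportions.
    p_e = sum((c / (n_items * n_raters)) ** 2 for c in totals.values())
    if p_e == 1:
        return 1.0  # only one label was ever used
    return (p_bar - p_e) / (1 - p_e)


# Hypothetical per-paper labels from three models (illustrative only).
votes = [["QA", "QA", "CV"], ["CV", "CV", "CV"]]
print([majority_vote(v) for v in votes], fleiss_kappa(votes))
```

Reporting kappa next to the voted labels helps distinguish a consensus reached by near-unanimous models from one reached by a narrow majority.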

33 pages, 1150 KB  
Article
Exploring the Conceptual Model and Instructional Design Principles of Intelligent Problem-Solving Learning
by Yuna Lee and Sang-Soo Lee
Sustainability 2025, 17(17), 7682; https://doi.org/10.3390/su17177682 - 26 Aug 2025
Viewed by 582
Abstract
The rapid advancement of artificial intelligence has fundamentally transformed how knowledge is created, disseminated, and applied in problem-solving, presenting new challenges for educational models. This study introduces Intelligent Problem-Solving Learning (IPSL)—a capability-based instructional design framework aimed at cultivating learners’ adaptability, creativity, and meta-learning in AI-enhanced environments. Grounded in connectivism, extended mind theory, and the concept of augmented intelligence, IPSL places human–AI collaboration at the core of instructional design. Using a design and development research (DDR) methodology, the study constructs a conceptual model comprising three main categories and eight subcategories, supported by eighteen instructional design principles. The model’s clarity, theoretical coherence, and educational relevance were validated through two rounds of expert review using the Content Validity Index (CVI) and Inter-Rater Agreement (IRA). IPSL emphasizes differentiated task roles—those exclusive to humans, suitable for human–AI collaboration, or fully delegable to AI—alongside meta-learning strategies that empower learners to navigate complex and unpredictable problems. This framework offers both theoretical and practical guidance for building future-oriented education systems, positioning AI as a learning partner while upholding essential human qualities such as ethical judgment, creativity, and agency. It equips educators with actionable principles to harmonize technological integration with human-centered learning in an age of rapid transformation. Full article
(This article belongs to the Special Issue Sustainable Digital Education: Innovations in Teaching and Learning)

15 pages, 3154 KB  
Article
Transformer-Based HER2 Scoring in Breast Cancer: Comparative Performance of a Foundation and a Lightweight Model
by Yeh-Han Wang, Min-Hsiang Chang, Hsin-Hsiu Tsai, Chun-Jui Chien and Jian-Chiao Wang
Diagnostics 2025, 15(17), 2131; https://doi.org/10.3390/diagnostics15172131 - 23 Aug 2025
Viewed by 402
Abstract
Background/Objectives: Human epidermal growth factor 2 (HER2) scoring is critical for modern breast cancer therapies, especially with emerging indications of antibody–drug conjugates for HER2-low tumors. However, inter-observer agreement remains limited in borderline cases. Automatic artificial intelligence-based scoring has the potential to improve diagnostic consistency and scalability. This study aimed to develop two transformer-based models for HER2 scoring of breast cancer whole-slide images (WSIs) and compare their performance. Methods: We adapted a large-scale foundation model (Virchow) and a lightweight model (TinyViT). Both were trained using patch-level annotations and integrated into a WSI scoring pipeline. Performance was evaluated on a clinical test set (n = 66), including clinical decision tasks and inference efficiency. Results: Both models achieved substantial agreement with pathologist reports (linear weighted kappa: 0.860 for Virchow, 0.825 for TinyViT). Virchow showed slightly higher WSI-level accuracy than TinyViT, whereas TinyViT reduced inference times by 60%. In three binary clinical tasks, both models demonstrated a diagnostic performance comparable to pathologists, particularly in identifying HER2-low tumors for antibody–drug conjugate (ADC) therapy. A continuous scoring framework demonstrated a strong correlation between the two models (Pearson’s r = 0.995) and aligned with human assessments. Conclusions: Both transformer-based artificial intelligence models achieved human-level accuracy for automated HER2 scoring with interpretable outputs. While the foundation model offers marginally higher accuracy, the lightweight model provides practical advantages for clinical deployment. In addition, continuous scoring may provide a more granular HER2 quantification, especially in borderline cases. This could support a new interpretive paradigm for HER2 assessment aligned with the evolving indications of ADC. Full article
(This article belongs to the Section Machine Learning and Artificial Intelligence in Diagnostics)
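The linear weighted kappa reported in the abstract above penalises disagreements in proportion to their distance on an ordinal scale, so a one-step miss costs less than a three-step miss. A minimal sketch follows. It assumes scores are encoded as integers from 0 to k - 1, and the function name and example scores are hypothetical rather than taken from the study.

```python
def linear_weighted_kappa(rater_1, rater_2, n_categories):
    """Linear weighted kappa for ordinal labels coded 0 .. n_categories - 1."""
    if len(rater_1) != len(rater_2) or not rater_1 or n_categories < 2:
        raise ValueError("need equal-length, non-empty inputs and >= 2 categories")
    n = len(rater_1)
    k = n_categories
    # Confusion matrix of observed label pairs.
    observed = [[0] * k for _ in range(k)]
    for a, b in zip(rater_1, rater_2):
        observed[a][b] += 1
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = abs(i - j) / (k - 1)  # linear disagreement weight
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * row[i] * col[j] / n
    if weighted_expected == 0:
        return 1.0  # both raters used a single identical category
    return 1 - weighted_observed / weighted_expected


# Hypothetical four-level scores coded 0 to 3 (illustrative only).
print(linear_weighted_kappa([0, 1, 2, 3], [0, 1, 2, 2], 4))
```

With only one adjacent-category disagreement in four cases, this example yields a kappa of about 0.78.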

27 pages, 1152 KB  
Article
Mapping the Cognitive Architecture of Health Beliefs: A Multivariate Conditional Network of Perceived Salt-Related Disease Risks
by Stanisław Surma, Łukasz Lewandowski, Karol Momot, Tomasz Sobierajski, Joanna Lewek, Bogusław Okopień and Maciej Banach
Nutrients 2025, 17(17), 2728; https://doi.org/10.3390/nu17172728 - 22 Aug 2025
Viewed by 497
Abstract
Background: Public beliefs about dietary risks, such as excessive salt intake, are often not isolated misconceptions but part of structured cognitive systems. This study aimed to explore how individuals organize their beliefs and misperceptions regarding salt-related health consequences. Material and Methods: Using data from an international online survey, we applied a system of multivariate proportional odds logistic regression (POLR) models to estimate conditional associations among beliefs about salt’s links to various diseases—including cardiovascular, metabolic, renal, neuropsychiatric, and mortality outcomes. In addition, exploratory and confirmatory factor analyses (EFA and CFA) were conducted to identify and validate latent constructs underlying the belief items. Beliefs were modeled as interdependent, controlling for latent constructs, sociodemographics, and self-reported health awareness. Statistically significant associations (p < 0.05) were visualized via a heatmap of beta coefficients. Results: Physicians showed almost universal agreement that salt contributes to hypertension (µ = 0.97), compared to non-medical respondents (µ = 0.85; p < 0.0001). Beliefs about mortality (µ = 1.55 for MDs vs. 0.99 for non-medical; p < 0.0001) emerged as central hubs in the belief network. Strong inter-item associations were observed, such as between hypertension and heart failure (β = −0.39), and between obesity and type 2 diabetes (β = −0.94). Notably, cognitive gaps were found, including a lack of association between atrial fibrillation and stroke, and non-reciprocal links between hypertension and heart failure. Conclusions: Beliefs about the health effects of salt are structured and sometimes asymmetrical, reflecting underlying reasoning patterns rather than isolated ignorance. Understanding these structures provides a systems-level view of health literacy and may inform more effective public health communication and education strategies. Full article
(This article belongs to the Special Issue Nutritional Aspects of Cardiovascular Disease Risk Factors)

20 pages, 2833 KB  
Article
A Multi-Level Annotation Model for Fake News Detection: Implementing Kazakh-Russian Corpus via Label Studio
by Madina Sambetbayeva, Anargul Nekessova, Aigerim Yerimbetova, Abdygalym Bayangali, Mira Kaldarova, Duman Telman and Nurzhigit Smailov
Big Data Cogn. Comput. 2025, 9(8), 215; https://doi.org/10.3390/bdcc9080215 - 20 Aug 2025
Viewed by 537
Abstract
This paper presents a multi-level annotation model for detecting fake news in Kazakh and Russian languages, aiming to enhance understanding of disinformation strategies in multilingual digital media environments. Unlike traditional binary models, our approach captures the complexity of disinformation by accounting for both linguistic and cultural factors. To support this, a corpus of over 5000 news texts was manually annotated using the Label Studio platform. The annotation scheme consists of seven interrelated categories: CLAIM, SOURCE, EVIDENCE, DISINFORMATION_TECHNIQUE, AUTHOR_INTENT, TARGET_AUDIENCE, and TIMESTAMP. Inter-annotator agreement, evaluated using Cohen’s Kappa, ranged from 0.72 to 0.81, indicating substantial consistency. The annotated data reveals recurring patterns of disinformation, such as emotional manipulation, targeting of vulnerable individuals, and the strategic concealment of intent. Semantic relations between entities, such as CLAIM → EVIDENCE and CLAIM → AUTHOR_INTENT were formalized to represent disinformation narratives as knowledge graphs. This study contributes the first linguistically and culturally adapted annotation model for Kazakh and Russian languages, providing a robust and empirical resource for building interpretable and context-aware fake news detection systems. The resulting annotated corpus and its semantic structure offer valuable empirical material for further research in natural language processing, computational linguistics, and media studies in low-resource language environments. Full article

13 pages, 385 KB  
Article
How Accurate Is AI? A Critical Evaluation of Commonly Used Large Language Models in Responding to Patient Concerns About Incidental Kidney Tumors
by Bernhard Ralla, Nadine Biernath, Isabel Lichy, Lukas Kurz, Frank Friedersdorff, Thorsten Schlomm, Jacob Schmidt, Henning Plage and Jonathan Jeutner
J. Clin. Med. 2025, 14(16), 5697; https://doi.org/10.3390/jcm14165697 - 12 Aug 2025
Viewed by 521
Abstract
Background: Large language models (LLMs) such as ChatGPT, Google Gemini, and Microsoft Copilot are increasingly used by patients seeking medical information online. While these tools provide accessible and conversational explanations, their accuracy and safety in emotionally sensitive scenarios—such as an incidental cancer diagnosis—remain uncertain. Objective: To evaluate the quality, completeness, readability, and safety of responses generated by three state-of-the-art LLMs to common patient questions following the incidental discovery of a kidney tumor. Methods: A standardized use-case scenario was developed: a patient learns of a suspicious renal mass following a computed tomography (CT) scan for back pain. Ten plain-language prompts reflecting typical patient concerns were submitted to ChatGPT-4o, Microsoft Copilot, and Google Gemini 2.5 Pro without additional context. Responses were independently assessed by five board-certified urologists using a validated six-domain rubric (accuracy, completeness, clarity, currency, risk of harm, hallucinations), scored on a 1–5 Likert scale. Two statistical approaches were applied to calculate descriptive scores and inter-rater reliability (Fleiss’ Kappa). Readability was analyzed using the Flesch Reading Ease (FRE) and Flesch–Kincaid Grade Level (FKGL) metrics. Results: Google Gemini 2.5 Pro achieved the highest mean ratings across most domains, notably in accuracy (4.3), completeness (4.3), and low hallucination rate (4.6). Microsoft Copilot was noted for empathetic language and consistent disclaimers but showed slightly lower clarity and currency scores. ChatGPT-4o demonstrated strengths in conversational flow but displayed more variability in clinical precision. Overall, 14% of responses were flagged as potentially misleading or incomplete. Inter-rater agreement was substantial across all domains (κ = 0.68). Readability varied between models: ChatGPT responses were easiest to understand (FRE = 48.5; FKGL = 11.94), while Gemini’s were the most complex (FRE = 29.9; FKGL = 13.3). Conclusions: LLMs show promise in patient-facing communication but currently fall short of providing consistently accurate, complete, and guideline-conform information in high-stakes contexts such as incidental cancer diagnoses. While their tone and structure may support patient engagement, they should not be used autonomously for counseling. Further fine-tuning, clinical validation, and supervision are essential for safe integration into patient care. Full article
(This article belongs to the Special Issue Clinical Advances in Artificial Intelligence in Urology)
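The two readability metrics named in the abstract above are closed-form functions of word, sentence, and syllable counts. The sketch below applies the standard published formulas to counts supplied by the caller. How words, sentences, and syllables are counted varies between tools and is left out here; the function names and example counts are hypothetical.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Flesch Reading Ease: higher scores indicate easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid Grade Level: approximate US school grade."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Hypothetical counts for one response (illustrative only).
words, sentences, syllables = 100, 5, 150
print(round(flesch_reading_ease(words, sentences, syllables), 3))
print(round(flesch_kincaid_grade(words, sentences, syllables), 2))
```

Both formulas depend on the same two ratios, words per sentence and syllables per word, so longer sentences and longer words lower FRE and raise FKGL together.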

25 pages, 5488 KB  
Article
Biased by Design? Evaluating Bias and Behavioral Diversity in LLM Annotation of Real-World and Synthetic Hotel Reviews
by Maria C. Voutsa, Nicolas Tsapatsoulis and Constantinos Djouvas
AI 2025, 6(8), 178; https://doi.org/10.3390/ai6080178 - 4 Aug 2025
Viewed by 994
Abstract
As large language models (LLMs) gain traction among researchers and practitioners, particularly in digital marketing for tasks such as customer feedback analysis and automated communication, concerns remain about the reliability and consistency of their outputs. This study investigates annotation bias in LLMs by comparing human and AI-generated annotation labels across sentiment, topic, and aspect dimensions in hotel booking reviews. Using the HRAST dataset, which includes 23,114 real user-generated review sentences and a synthetically generated corpus of 2000 LLM-authored sentences, we evaluate inter-annotator agreement between a human expert and three LLMs (ChatGPT-3.5, ChatGPT-4, and ChatGPT-4-mini) as a proxy for assessing annotation bias. Our findings show high agreement among LLMs, especially on synthetic data, but only moderate to fair alignment with human annotations, particularly in sentiment and aspect-based sentiment analysis. LLMs display a pronounced neutrality bias, often defaulting to neutral sentiment in ambiguous cases. Moreover, annotation behavior varies notably with task design, as manual, one-to-one prompting produces higher agreement with human labels than automated batch processing. The study identifies three distinct AI biases—repetition bias, behavioral bias, and neutrality bias—that shape annotation outcomes. These findings highlight how dataset complexity and annotation mode influence LLM behavior, offering important theoretical, methodological, and practical implications for AI-assisted annotation and synthetic content generation. Full article
(This article belongs to the Special Issue AI Bias in the Media and Beyond)

23 pages, 1192 KB  
Article
Multi-Model Dialectical Evaluation of LLM Reasoning Chains: A Structured Framework with Dual Scoring Agents
by Catalin Anghel, Andreea Alexandra Anghel, Emilia Pecheanu, Ioan Susnea, Adina Cocu and Adrian Istrate
Informatics 2025, 12(3), 76; https://doi.org/10.3390/informatics12030076 - 1 Aug 2025
Viewed by 895
Abstract
(1) Background and objectives: Large language models (LLMs) such as GPT, Mistral, and LLaMA exhibit strong capabilities in text generation, yet assessing the quality of their reasoning—particularly in open-ended and argumentative contexts—remains a persistent challenge. This study introduces Dialectical Agent, an internally developed modular framework designed to evaluate reasoning through a structured three-stage process: opinion, counterargument, and synthesis. The framework enables transparent and comparative analysis of how different LLMs handle dialectical reasoning. (2) Methods: Each stage is executed by a single model, and final syntheses are scored via two independent LLM evaluators (LLaMA 3.1 and GPT-4o) based on a rubric with four dimensions: clarity, coherence, originality, and dialecticality. In parallel, a rule-based semantic analyzer detects rhetorical anomalies and ethical values. All outputs and metadata are stored in a Neo4j graph database for structured exploration. (3) Results: The system was applied to four open-weight models (Gemma 7B, Mistral 7B, Dolphin-Mistral, Zephyr 7B) across ten open-ended prompts on ethical, political, and technological topics. The results show consistent stylistic and semantic variation across models, with moderate inter-rater agreement. Semantic diagnostics revealed differences in value expression and rhetorical flaws not captured by rubric scores. (4) Originality: The framework is, to our knowledge, the first to integrate multi-stage reasoning, rubric-based and semantic evaluation, and graph-based storage into a single system. It enables replicable, interpretable, and multidimensional assessment of generative reasoning—supporting researchers, developers, and educators working with LLMs in high-stakes contexts. Full article

24 pages, 9147 KB  
Article
Experimental and Numerical Study on the Seismic Performance of Base-Suspended Pendulum Isolation Structure
by Liang Lu, Lei Wang, Wanqiu Xia and Minghao Yin
Buildings 2025, 15(15), 2711; https://doi.org/10.3390/buildings15152711 - 31 Jul 2025
Viewed by 304
Abstract
This paper proposes a novel suspended seismic structure system called the Base-suspended Pendulum Isolation (BSPI) structure. The BSPI structure can isolate seismic action and reduce structural seismic response by hanging the structure with hanger rods set at the base. Viscous dampers are installed in the isolation layer to dissipate earthquake energy and control the displacement. Firstly, the configuration of the suspension isolation layer and the mechanical model of the BSPI structure are described. Then, an equivalent scaled physical model of the BSPI structure was tested on the shaking table. The test results demonstrate that the BSPI structure has a good isolation effect under earthquakes, and that the viscous dampers had an obvious control effect on the structure’s displacement and acceleration response. Finally, numerical simulation of the tests was carried out. The accuracy of the numerical models is confirmed by the good agreement between the simulation and test results. The numerical models for the BSPI structure and a conventional reinforced concrete (RC) frame structure are built and analyzed using the commercial software ABAQUS. Research results indicate that the lateral stiffness of the BSPI structure is reduced greatly by installing the suspension layer, and the acceleration response of the BSPI structure is significantly reduced under rare earthquakes, which is only 1/2 of that of the RC frame. The inter-story displacement of the BSPI structure is less than 1/100, which meets the seismic fortification goal and is reduced to 50% of that of the BSPI structure without dampers under rare earthquakes. Full article
(This article belongs to the Section Building Structures)

12 pages, 456 KB  
Article
From Variability to Standardization: The Impact of Breast Density on Background Parenchymal Enhancement in Contrast-Enhanced Mammography and the Need for a Structured Reporting System
by Graziella Di Grezia, Antonio Nazzaro, Luigi Schiavone, Cisternino Elisa, Alessandro Galiano, Gatta Gianluca, Cuccurullo Vincenzo and Mariano Scaglione
Cancers 2025, 17(15), 2523; https://doi.org/10.3390/cancers17152523 - 30 Jul 2025
Abstract
Introduction: Breast density is a well-recognized factor in breast cancer risk assessment, with higher density linked to increased malignancy risk and reduced sensitivity of conventional mammography. Background parenchymal enhancement (BPE), observed in contrast-enhanced imaging, reflects physiological contrast uptake in non-pathologic breast tissue. While extensively characterized in breast MRI, the role of BPE in contrast-enhanced mammography (CEM) remains uncertain due to inconsistent findings regarding its correlation with breast density and cancer risk. Unlike breast density—standardized through the ACR BI-RADS lexicon—BPE lacks a uniform classification system in CEM, leading to variability in clinical interpretation and research outcomes. To address this gap, we introduce the BPE-CEM Standard Scale (BCSS), a structured four-tiered classification system specifically tailored to the two-dimensional characteristics of CEM, aiming to improve consistency and diagnostic alignment in BPE evaluation. Materials and Methods: In this retrospective single-center study, 213 patients who underwent mammography (MG), ultrasound (US), and contrast-enhanced mammography (CEM) between May 2022 and June 2023 at the “A. Perrino” Hospital in Brindisi were included. Breast density was classified according to ACR BI-RADS (categories A–D). BPE was categorized into four levels: Minimal (< 10% enhancement), Light (10–25%), Moderate (25–50%), and Marked (> 50%). Three radiologists independently assessed BPE in a subset of 50 randomly selected cases to evaluate inter-observer agreement using Cohen’s kappa. Correlations between BPE, breast density, and age were examined through regression analysis. Results: BPE was Minimal in 57% of patients, Light in 31%, Moderate in 10%, and Marked in 2%. 
A significant positive association was found between higher breast density (BI-RADS C–D) and increased BPE (p < 0.05), whereas lower-density breasts (A–B) were predominantly associated with minimal or light BPE. Regression analysis confirmed a modest but statistically significant association between breast density and BPE (R² = 0.144), while age showed no significant effect. Inter-observer agreement for BPE categorization using the BCSS was excellent (κ = 0.85; 95% CI: 0.78–0.92), supporting its reproducibility. Conclusions: Our findings indicate that breast density is a key determinant of BPE in CEM. The proposed BCSS offers a reproducible, four-level framework for standardized BPE assessment tailored to the imaging characteristics of CEM. By reducing variability in interpretation, the BCSS has the potential to improve diagnostic consistency and facilitate integration of BPE into personalized breast cancer risk models. Further prospective multicenter studies are needed to validate this classification and assess its clinical impact.

32 pages, 465 KB  
Article
EsCorpiusBias: The Contextual Annotation and Transformer-Based Detection of Racism and Sexism in Spanish Dialogue
by Ksenia Kharitonova, David Pérez-Fernández, Javier Gutiérrez-Hernando, Asier Gutiérrez-Fandiño, Zoraida Callejas and David Griol
Future Internet 2025, 17(8), 340; https://doi.org/10.3390/fi17080340 - 28 Jul 2025
Abstract
The rise in online communication platforms has significantly increased exposure to harmful discourse, presenting ongoing challenges for digital moderation and user well-being. This paper introduces the EsCorpiusBias corpus, designed to enhance the automated detection of sexism and racism within Spanish-language online dialogue, specifically sourced from the Mediavida forum. By means of a systematic, context-sensitive annotation protocol, approximately 1000 three-turn dialogue units per bias category are annotated, ensuring the nuanced recognition of pragmatic and conversational subtleties. Here, annotation guidelines are meticulously developed, covering explicit and implicit manifestations of sexism and racism. Annotations are performed using the Prodigy tool (v1.16.0), resulting in moderate to substantial inter-annotator agreement (Cohen’s Kappa: 0.55 for sexism and 0.79 for racism). Models including logistic regression, SpaCy’s baseline n-gram bag-of-words model, and transformer-based BETO are trained and evaluated, demonstrating that contextualized transformer-based approaches significantly outperform baseline and general-purpose models. Notably, the single-turn BETO model achieves an ROC-AUC of 0.94 for racism detection, while the contextual BETO model reaches an ROC-AUC of 0.87 for sexism detection, highlighting BETO’s superior effectiveness in capturing nuanced bias in online dialogues. Additionally, lexical overlap analyses indicate a strong reliance on explicit lexical indicators, highlighting limitations in handling implicit biases. This research underscores the importance of contextually grounded, domain-specific fine-tuning for effective automated detection of toxicity, providing robust resources and methodologies to foster socially responsible NLP systems within Spanish-speaking online communities.
(This article belongs to the Special Issue Deep Learning and Natural Language Processing—3rd Edition)

28 pages, 4702 KB  
Article
Clinical Failure of General-Purpose AI in Photographic Scoliosis Assessment: A Diagnostic Accuracy Study
by Cemre Aydin, Ozden Bedre Duygu, Asli Beril Karakas, Eda Er, Gokhan Gokmen, Anil Murat Ozturk and Figen Govsa
Medicina 2025, 61(8), 1342; https://doi.org/10.3390/medicina61081342 - 25 Jul 2025
Abstract
Background and Objectives: General-purpose multimodal large language models (LLMs) are increasingly used for medical image interpretation despite lacking clinical validation. This study evaluates the diagnostic reliability of ChatGPT-4o and Claude 2 in photographic assessment of adolescent idiopathic scoliosis (AIS) against radiological standards. This study examines two critical questions: whether families can derive reliable preliminary assessments from LLMs through analysis of clinical photographs and whether LLMs exhibit cognitive fidelity in their visuospatial reasoning capabilities for AIS assessment. Materials and Methods: A prospective diagnostic accuracy study (STARD-compliant) analyzed 97 adolescents (74 with AIS and 23 with postural asymmetry). Standardized clinical photographs (nine views/patient) were assessed by two LLMs and two orthopedic residents against reference radiological measurements. Primary outcomes included diagnostic accuracy (sensitivity/specificity), Cobb angle concordance (Lin’s CCC), inter-rater reliability (Cohen’s κ), and measurement agreement (Bland–Altman LoA). Results: The LLMs exhibited hazardous diagnostic inaccuracy: ChatGPT misclassified all non-AIS cases (specificity 0% [95% CI: 0.0–14.8]), while Claude 2 generated 78.3% false positives. Systematic measurement errors exceeded clinical tolerance: ChatGPT overestimated thoracic curves by +10.74° (LoA: −21.45° to +42.92°), exceeding tolerance by >800%. Both LLMs showed inverse biomechanical concordance in thoracolumbar curves (CCC ≤ −0.106). Inter-rater reliability fell below random chance (ChatGPT κ = −0.039). Universal proportional bias (slopes ≈ −1.0) caused severe curve underestimation (e.g., 10–15° error for 50° deformities). Human evaluators demonstrated superior bias control (0.3–2.8° vs. 2.6–10.7°) but suboptimal specificity (21.7–26.1%) and hazardous lumbar concordance (CCC: −0.123). 
Conclusions: General-purpose LLMs demonstrate clinically unacceptable inaccuracy in photographic AIS assessment, contraindicating clinical deployment. Catastrophic false positives, systematic measurement errors exceeding tolerance by 480–1074%, and inverse diagnostic concordance necessitate urgent regulatory safeguards under frameworks like the EU AI Act. Neither LLMs nor photographic human assessment achieve reliability thresholds for standalone screening, mandating domain-specific algorithm development and integration of 3D modalities.
(This article belongs to the Special Issue Diagnosis and Treatment of Adolescent Idiopathic Scoliosis)