Search Results (586)

Search Parameters:
Keywords = large multimodal models

34 pages, 11508 KB  
Article
Explainable AI-Driven 1D-CNN with Efficient Wireless Communication System Integration for Multimodal Diabetes Prediction
by Radwa Ahmed Osman
AI 2025, 6(10), 243; https://doi.org/10.3390/ai6100243 - 25 Sep 2025
Abstract
The early detection of diabetes risk and effective management of patient data are critical for avoiding serious complications and improving treatment success. This research describes a two-part architecture that combines an energy-efficient wireless communication system with an interpretable deep learning model for diabetes classification. In Phase 1, a wireless communication model is created to ensure the accurate transfer of real-time patient data from wearable devices to medical centers. Using Lagrange optimization, the model identifies the optimal transmission distance and power requirements, lowering energy usage while preserving communication reliability. This contribution is particularly important, since effective data transport is a prerequisite for continuous monitoring in large-scale healthcare systems. In Phase 2, the transmitted multimodal clinical, genetic, and lifestyle data are evaluated using a one-dimensional Convolutional Neural Network (1D-CNN) with Bayesian hyperparameter tuning. The model outperformed traditional deep learning architectures such as LSTM and GRU. To improve interpretability and clinical acceptance, SHAP and LIME were used to identify global and patient-specific predictors. This approach addresses both technological and medical challenges by integrating energy-efficient wireless communication with interpretable predictive modeling. The system ensures dependable data transfer, strong predictive performance, and transparent decision support, boosting trust in AI-assisted healthcare and enabling individualized diabetes control. Full article
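
For readers who want a concrete picture of the Phase 2 classifier, below is a minimal sketch of a 1D-CNN risk predictor in PyTorch. The feature count, layer sizes, and the use of a flat multimodal feature vector are illustrative assumptions, not the architecture reported in the paper.

```python
# Hypothetical 1D-CNN over a flat vector of clinical/genetic/lifestyle
# features; all sizes are illustrative, not the paper's.
import torch
import torch.nn as nn

class Diabetes1DCNN(nn.Module):
    def __init__(self, n_features: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=3, padding=1),  # treat features as a 1-D signal
            nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),   # pool over the feature axis
            nn.Flatten(),
            nn.Linear(32, 1),          # single logit: diabetes risk
        )

    def forward(self, x):              # x: (batch, n_features)
        return self.net(x.unsqueeze(1))

model = Diabetes1DCNN(n_features=24)
probs = torch.sigmoid(model(torch.randn(8, 24)))  # 8 synthetic patients
```
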
29 pages, 509 KB  
Review
A Review of Automatic Fake News Detection: From Traditional Methods to Large Language Models
by Repede Ștefan Emil and Brad Remus
Future Internet 2025, 17(10), 435; https://doi.org/10.3390/fi17100435 - 25 Sep 2025
Abstract
In the current digital era, the spread of fake news presents serious difficulties. This study offers a thorough analysis of recent developments in automatic fake news detection techniques, from traditional methods to the most recently developed models, such as large language models. The review identifies four perspectives on the automatic detection of fake news, oriented towards the knowledge, style, propagation, and source of the misinformation. This paper describes how automatic detection methods use data science techniques such as deep learning, large language models, and traditional machine learning. In addition to discussing the shortcomings of existing approaches, such as the lack of adequate datasets, this paper emphasizes the multidimensional function of large language models in both creating and identifying fake news, while underlining the necessity for joint analysis of textual, visual, and audio content, multidisciplinary collaboration, and greater model transparency. Full article
(This article belongs to the Special Issue Generative Artificial Intelligence in Smart Societies)
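
As a concrete anchor for the "traditional machine learning" end of the spectrum the review covers, here is a minimal style-based baseline: TF-IDF features with logistic regression. The two-sentence dataset is purely illustrative.

```python
# Tiny style-based fake news baseline: TF-IDF + logistic regression.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["Shocking cure doctors don't want you to see!",
         "Central bank raises interest rates by 25 basis points."]
labels = [1, 0]  # 1 = fake, 0 = real

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(texts, labels)
print(clf.predict(["Miracle pill melts fat overnight!"]))
```
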
10 pages, 282 KB  
Article
ChatGPT in Oral Pathology: Bright Promise or Diagnostic Mirage
by Ana Suárez, Yolanda Freire, Víctor Díaz-Flores García, Andrea Santamaría Laorden, Jaime Orejas Pérez, María Suárez Ajuria, Juan Algar and Carmen Martín Carreras-Presas
Medicina 2025, 61(10), 1744; https://doi.org/10.3390/medicina61101744 - 25 Sep 2025
Abstract
Background and Objectives: The growing academic interest within the biomedical sciences in the diagnostic capabilities of multimodal language models, such as ChatGPT-4o, is clear. However, their ability to interpret oral clinical images remains insufficiently explored. This exploratory pilot study aimed to provide preliminary observations on the diagnostic validity of ChatGPT-4o in identifying oral squamous cell carcinoma (OSCC), oral leukoplakia (OL), and oral lichen planus (OLP) using only clinical photographs, without the inclusion of additional clinical data. Materials and Methods: Two general dentists selected 23 images of oral lesions suspected to be OSCC, OL, or OLP. ChatGPT-4o was asked to provide a probable diagnosis for each image on 30 occasions, generating a total of 690 responses. The responses were then evaluated against the reference diagnosis established by an expert to calculate sensitivity, specificity, predictive values, and the area under the ROC curve. Results: ChatGPT-4o demonstrated high specificity across the three conditions (97.1% for OSCC, 100% for OL, and 96.1% for OLP), correctly classifying 90% of OSCC cases (AUC = 0.81). However, this overall accuracy was largely driven by correct negative classifications, while the clinically relevant sensitivity for OSCC was only 65%. Sensitivity was also highly variable across conditions: 60% for OL and just 25% for OLP, which limits the model’s usefulness in a clinical setting for ruling out these conditions. The model achieved positive predictive values of 86.7% for OSCC and 100% for OL. Given the small dataset, these findings should be interpreted only as preliminary evidence. Conclusions: ChatGPT-4o demonstrates potential as a complementary tool for the screening of OSCC in clinical oral images. However, its sensitivity remains insufficient, as a significant proportion of true cases were missed, underscoring that the model cannot be relied upon as a standalone diagnostic tool. Moreover, the pilot nature of this study and the reduced sample size mean that larger, adequately powered studies (with several hundred cases per pathology) are needed to obtain robust and generalizable results. Full article
(This article belongs to the Section Dentistry and Oral Health)
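
The study’s headline numbers are per-condition diagnostic metrics computed one-vs-rest from repeated model answers. The sketch below shows that computation; the answer arrays are synthetic stand-ins for the 690 responses.

```python
# Sensitivity, specificity, and PPV, one-vs-rest per condition.
import numpy as np

def one_vs_rest_metrics(y_true, y_pred, positive_label):
    t = np.asarray(y_true) == positive_label
    p = np.asarray(y_pred) == positive_label
    tp, fp = np.sum(t & p), np.sum(~t & p)
    fn, tn = np.sum(t & ~p), np.sum(~t & ~p)
    sens = tp / (tp + fn) if tp + fn else float("nan")
    spec = tn / (tn + fp) if tn + fp else float("nan")
    ppv = tp / (tp + fp) if tp + fp else float("nan")
    return sens, spec, ppv

y_true = ["OSCC", "OL", "OLP", "OSCC", "OL"]  # expert reference diagnoses
y_pred = ["OSCC", "OL", "OL", "OSCC", "OLP"]  # model answers
print(one_vs_rest_metrics(y_true, y_pred, "OSCC"))
```
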
25 pages, 104808 KB  
Article
From the Moon to Mercury: Release of Global Crater Catalogs Using Multimodal Deep Learning for Crater Detection and Morphometric Analysis
by Riccardo La Grassa, Cristina Re, Elena Martellato, Adriano Tullo, Silvia Bertoli, Gabriele Cremonese, Natalia Amanda Vergara Sassarini, Maddalena Faletti, Valentina Galluzzi and Lorenza Giacomini
Remote Sens. 2025, 17(19), 3287; https://doi.org/10.3390/rs17193287 - 25 Sep 2025
Abstract
This study compiles the first impact-crater dataset for Mercury covering diameters greater than 400 m, produced by a multimodal deep-learning pipeline. We present an enhanced deep learning framework for large-scale planetary crater detection, extending the YOLOLens architecture through the integration of multimodal inputs: optical imagery, digital terrain models (DTMs), and hillshade derivatives. By incorporating morphometric data, the model achieves robust detection of impact craters that are often imperceptible in optical imagery alone, especially in regions affected by low contrast, degraded rims, or shadow-dominated illumination. The resulting catalogs, LU6M371TGT for the Moon and ME6M300TGT for Mercury, constitute the most comprehensive automated crater inventories to date, demonstrating the effectiveness of multimodal learning and cross-planet transfer. This work highlights the critical role of terrain information in planetary object detection and establishes a scalable, high-throughput pipeline for planetary surface analysis using modern deep learning tools. To validate the pipeline, we compare its predictions against manually annotated catalogs for the Moon and Mercury and several regional inventories, observing close agreement across the full diameter spectrum, which lends a high level of confidence to our approach. We also present a spatial density analysis, comparing the density maps of small and large craters and highlighting the uneven distribution of crater sizes across Mercury. Finally, we explore the prevalence of kilometer-scale (1–5 km) impact craters, demonstrating that they dominate the crater population in certain regions of Mercury’s surface. Full article
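
One of the multimodal inputs is a hillshade derivative of the DTM. A common way to compute it is sketched below; the illumination azimuth and altitude are conventional defaults, not necessarily the paper’s settings.

```python
# Hillshade from a DTM via slope/aspect and a standard illumination model.
import numpy as np

def hillshade(dtm, azimuth_deg=315.0, altitude_deg=45.0, cellsize=1.0):
    az, alt = np.radians(azimuth_deg), np.radians(altitude_deg)
    dz_dy, dz_dx = np.gradient(dtm, cellsize)
    slope = np.arctan(np.hypot(dz_dx, dz_dy))
    aspect = np.arctan2(dz_dy, -dz_dx)
    shaded = (np.sin(alt) * np.cos(slope)
              + np.cos(alt) * np.sin(slope) * np.cos(az - np.pi / 2.0 - aspect))
    return np.clip(shaded, 0.0, 1.0)

dtm = np.random.rand(64, 64) * 100.0  # synthetic terrain patch
print(hillshade(dtm).shape)
```
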
29 pages, 983 KB  
Article
Foundation Models for Cybersecurity: A Comprehensive Multi-Modal Evaluation of TabPFN and TabICL for Tabular Intrusion Detection
by Pablo García, J. de Curtò, I. de Zarzà, Juan Carlos Cano and Carlos T. Calafate
Electronics 2025, 14(19), 3792; https://doi.org/10.3390/electronics14193792 - 24 Sep 2025
Abstract
While traditional ensemble methods have dominated tabular intrusion detection systems (IDSs), recent advances in foundation models present new opportunities for enhanced cybersecurity applications. This paper presents a comprehensive multi-modal evaluation of foundation models—specifically TabPFN (Tabular Prior-Data Fitted Network), TabICL (Tabular In-Context Learning), and large language models—against traditional machine learning approaches across three cybersecurity datasets: CIC-IDS2017, N-BaIoT, and CIC-UNSW. Our rigorous experimental framework addresses critical methodological challenges through model-appropriate evaluation protocols and comprehensive assessment across multiple data variants. Results demonstrate that foundation models achieve superior and more consistent performance compared with traditional approaches, with TabPFN and TabICL establishing new state-of-the-art results across all datasets. Most significantly, these models uniquely achieve non-zero recall across all classes, including rare threats like Heartbleed and Infiltration, while traditional ensemble methods—despite achieving >99% overall accuracy—completely fail on several minority classes. TabICL demonstrates particularly strong performance on CIC-IDS2017 (99.59% accuracy), while TabPFN maintains consistent performance across all datasets, suggesting robust generalization capabilities. Both foundation models achieve these results using only fractions of the available training data and requiring no hyperparameter tuning, representing a paradigm shift toward training-light, hyperparameter-free adaptive IDS architectures, where TabPFN requires no task-specific fitting and TabICL leverages efficient in-context adaptation without retraining. Cross-dataset validation reveals that foundation models maintain performance advantages across diverse threat landscapes, while traditional methods exhibit significant dataset-specific variations. These findings challenge the cybersecurity community’s reliance on tree-based ensembles and demonstrate that foundation models offer superior capabilities for next-generation intrusion detection systems in IoT environments. Full article
(This article belongs to the Special Issue Wireless Sensor Network: Latest Advances and Prospects)
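
The "no hyperparameter tuning" workflow the paper credits to TabPFN looks roughly as follows, using the tabpfn package’s scikit-learn-style interface; the synthetic data stands in for CIC-IDS2017-style flow features.

```python
# TabPFN as a drop-in classifier: fit stores the data, prediction is
# a single in-context forward pass; no tuning loop.
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from tabpfn import TabPFNClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

clf = TabPFNClassifier()   # no hyperparameter search
clf.fit(X_tr, y_tr)
print(accuracy_score(y_te, clf.predict(X_te)))
```
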
22 pages, 1588 KB  
Article
Generative Sign-Description Prompts with Multi-Positive Contrastive Learning for Sign Language Recognition
by Siyu Liang, Yunan Li, Wentian Xin, Huizhou Chen, Xujie Liu, Kang Liu and Qiguang Miao
Sensors 2025, 25(19), 5957; https://doi.org/10.3390/s25195957 - 24 Sep 2025
Abstract
While sign language combines sequential hand motions with concurrent non-manual cues (e.g., mouth shapes and head tilts), current recognition systems lack multimodal annotation methods capable of capturing their hierarchical semantics. To bridge this gap, we propose GSP-MC, the first method integrating generative large language models into sign language recognition. It leverages retrieval-augmented generation with domain-specific large language models and expert-validated corpora to produce precise multipart descriptions. A dual-encoder architecture bidirectionally aligns hierarchical skeleton features with multi-level text descriptions (global, synonym, part) through probabilistic matching. The approach combines global and part-level losses with KL divergence optimization, ensuring robust alignment across relevant text-skeleton pairs while capturing sign semantics and detailed dynamics. Experiments demonstrate state-of-the-art performance, achieving 97.1% accuracy on the Chinese SLR500 (surpassing SSRL’s 96.9%) and 97.07% on the Turkish AUTSL (exceeding SML’s 96.85%), confirming cross-lingual potential for inclusive communication technologies. Full article
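
A minimal reading of the multi-positive objective: a softmax over skeleton-text similarities, matched via KL divergence (equivalently, soft-label cross-entropy up to a constant) to a target spread uniformly over all matching descriptions. Shapes and the temperature are assumptions.

```python
# Multi-positive contrastive alignment between skeleton and text features.
import torch
import torch.nn.functional as F

def multi_positive_loss(skel_emb, text_emb, pos_mask, tau=0.07):
    # skel_emb: (B, D); text_emb: (T, D); pos_mask: (B, T) bool, True = match
    sim = F.normalize(skel_emb, dim=-1) @ F.normalize(text_emb, dim=-1).T / tau
    target = pos_mask.float() / pos_mask.float().sum(dim=1, keepdim=True)
    log_probs = F.log_softmax(sim, dim=1)
    # KL(target || softmax(sim)) up to the constant entropy of the target:
    return -(target * log_probs).sum(dim=1).mean()

skel, texts = torch.randn(4, 256), torch.randn(10, 256)
mask = torch.zeros(4, 10, dtype=torch.bool)
mask[torch.arange(4), torch.randint(0, 10, (4,))] = True  # >=1 positive per sign
print(multi_positive_loss(skel, texts, mask))
```
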
26 pages, 6447 KB  
Article
Data-Driven Multi-Mode Adaptive Control for Distribution Networks with Multi-Region Coordination
by Youzhuo Zheng, Hengrong Zhang, Zhi Long, Shiyuan Gao, Qihang Yang and Haoran Ji
Processes 2025, 13(10), 3046; https://doi.org/10.3390/pr13103046 - 24 Sep 2025
Abstract
The high penetration of distributed generators (DGs) causes severe voltage fluctuations and voltage limit violations in distribution networks. Traditional control methods rely on precise line parameters, which are often unavailable or inaccurate, and therefore are limited in practical applications. This paper proposes a data-driven multi-mode adaptive control method with multi-region coordination to enhance the operational performance of distribution networks. First, the network is partitioned into multiple regions, each equipped with a local controller to formulate reactive power control strategies for DGs. Second, regions exchange voltage and current measurements to establish linear input–output relationships through dynamic linearization, thereby developing a multi-mode model for different control objectives. Finally, each region employs the gradient descent method to iteratively optimize its control strategy, enabling fast responses to changing operating conditions in distribution networks. Case studies on modified IEEE 33-node and 123-node test systems demonstrate that the proposed method reduces voltage deviation, load imbalance, and power loss by 31.25%, 19.17%, and 20.68%, respectively, and maintains strong scalability for application in large-scale distribution networks. Full article
(This article belongs to the Special Issue Distributed Intelligent Energy Systems)
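
The core loop, dynamic linearization plus a gradient step, can be sketched on a scalar toy system: estimate the local gain φ(k) from measured increments, then step the control toward a voltage reference. The toy plant and all gains are illustrative, not the paper’s IEEE test systems.

```python
# Data-driven control on a toy system: Δy(k) ≈ φ(k)·Δu(k); estimate φ by a
# projection step, then move u toward the voltage reference.
def plant(u):                      # stand-in for a region's voltage response
    return 1.0 + 0.05 * u - 0.002 * u ** 2

y_ref = 1.02                       # target voltage (p.u.)
u, phi = 0.0, 0.05                 # control input, local-gain estimate
u_prev, y_prev = u, plant(u)
rho, lam, mu = 0.5, 1e-4, 1e-4     # step size and regularizers

for k in range(60):
    y = plant(u)
    du = u - u_prev
    if abs(du) > 1e-9:             # projection update of phi(k)
        phi += (du / (mu + du ** 2)) * ((y - y_prev) - phi * du)
    u_prev, y_prev = u, y
    u = u + rho * phi * (y_ref - y) / (lam + phi ** 2)  # gradient step

print(f"u = {u:.3f}, voltage = {plant(u):.4f} (target {y_ref})")
```
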
72 pages, 4170 KB  
Systematic Review
Digital Twin Cognition: AI-Biomarker Integration in Biomimetic Neuropsychology
by Evgenia Gkintoni and Constantinos Halkiopoulos
Biomimetics 2025, 10(10), 640; https://doi.org/10.3390/biomimetics10100640 - 23 Sep 2025
Abstract
(1) Background: The convergence of digital twin technology, artificial intelligence, and multimodal biomarkers heralds a transformative era in neuropsychological assessment and intervention. Digital twin cognition represents an emerging paradigm that creates dynamic, personalized virtual models of individual cognitive systems, enabling continuous monitoring, predictive modeling, and precision interventions. This systematic review comprehensively examines the integration of AI-driven biomarkers within biomimetic neuropsychological frameworks to advance personalized cognitive health. (2) Methods: Following PRISMA 2020 guidelines, we conducted a systematic search across six major databases spanning medical, neuroscience, and computer science disciplines for literature published between 2014 and 2024. The review synthesized evidence addressing five research questions examining framework integration, predictive accuracy, clinical translation, algorithm effectiveness, and neuropsychological validity. (3) Results: Analysis revealed that multimodal integration approaches combining neuroimaging, physiological, behavioral, and digital phenotyping data substantially outperformed single-modality assessments. Deep learning architectures demonstrated superior pattern recognition capabilities, while traditional machine learning maintained advantages in interpretability and clinical implementation. Successful frameworks, particularly for neurodegenerative diseases and multiple sclerosis, achieved earlier detection, improved treatment personalization, and enhanced patient outcomes. However, significant challenges persist in algorithm interpretability, population generalizability, and integration into healthcare systems. Critical analysis reveals that high-accuracy claims (85–95%) predominantly derive from small, homogeneous cohorts with limited external validation. Real-world performance in diverse clinical settings is likely 10–15% lower, emphasizing the need for large-scale, multi-site validation studies before clinical deployment. (4) Conclusions: Digital twin cognition establishes a new frontier in personalized neuropsychology, offering unprecedented opportunities for early detection, continuous monitoring, and adaptive interventions, while requiring continued advancement in standardization, validation, and ethical frameworks. Full article
28 pages, 20824 KB  
Article
Towards Robust Chain-of-Thought Prompting with Self-Consistency for Remote Sensing VQA: An Empirical Study Across Large Multimodal Models
by Fatema Tuj Johora Faria, Laith H. Baniata, Ahyoung Choi and Sangwoo Kang
Mathematics 2025, 13(18), 3046; https://doi.org/10.3390/math13183046 - 22 Sep 2025
Abstract
Remote sensing visual question answering (RSVQA) involves interpreting complex geospatial information captured by satellite imagery to answer natural language questions, making it a vital tool for observing and analyzing Earth’s surface without direct contact. Although numerous studies have addressed RSVQA, most have focused primarily on answer accuracy, often overlooking the underlying reasoning capabilities required to interpret spatial and contextual cues in satellite imagery. To address this gap, this study presents a comprehensive evaluation of four large multimodal models (LMMs): GPT-4o, Grok 3, Gemini 2.5 Pro, and Claude 3.7 Sonnet. We used a curated subset of the EarthVQA dataset consisting of 100 rural images with 29 question–answer pairs each and 100 urban images with 42 pairs each. We developed the following three task-specific frameworks: (1) Zero-GeoVision, which employs zero-shot prompting with problem-specific prompts that elicit direct answers from the pretrained knowledge base without fine-tuning; (2) CoT-GeoReason, which enhances the knowledge base with chain-of-thought prompting, guiding it through explicit steps of feature detection, spatial analysis, and answer synthesis; and (3) Self-GeoSense, which extends this approach by stochastically decoding five independent reasoning chains for each remote sensing question. Rather than merging these chains, it counts the final answers, selects the majority choice, and returns a single complete reasoning chain whose conclusion aligns with that majority. Additionally, we designed the Geo-Judge framework to employ a two-stage evaluation process. In Stage 1, a GPT-4o-mini-based LMM judge assesses reasoning coherence and answer correctness using the input image, task type, reasoning steps, generated model answer, and ground truth. In Stage 2, blinded human experts independently review the LMM’s reasoning and answer, providing unbiased validation through careful reassessment. Self-GeoSense with Grok 3 achieves the best performance, with 94.69% accuracy in Basic Judging, 93.18% in Basic Counting, 89.42% in Reasoning-Based Judging, 83.29% in Reasoning-Based Counting, 77.64% in Object Situation Analysis, and 65.29% in Comprehensive Analysis, alongside RMSE values of 0.9102 in Basic Counting and 1.0551 in Reasoning-Based Counting. Full article
(This article belongs to the Special Issue Big Data Mining and Knowledge Graph with Application)
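
The Self-GeoSense voting step is easy to make concrete: sample several independent chains, tally the final answers, and return one complete chain that agrees with the majority. `generate` below is a hypothetical stand-in for any stochastic LMM call.

```python
# Self-consistency: majority vote over independently sampled reasoning chains.
import random
from collections import Counter

def generate(prompt):
    """Stand-in LMM call returning (reasoning_chain, final_answer)."""
    answer = random.choice(["3", "3", "3", "4"])  # synthetic answer spread
    return f"...steps leading to {answer}...", answer

def self_consistency(prompt, n_chains=5):
    chains = [generate(prompt) for _ in range(n_chains)]
    majority, _ = Counter(a for _, a in chains).most_common(1)[0]
    chain = next(c for c, a in chains if a == majority)  # chain matching majority
    return majority, chain

answer, chain = self_consistency("How many buildings border the pond?")
print(answer)
```
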
20 pages, 18992 KB  
Article
Application of LMM-Derived Prompt-Based AIGC in Low-Altitude Drone-Based Concrete Crack Monitoring
by Shijun Pan, Zhun Fan, Keisuke Yoshida, Shujia Qin, Takashi Kojima and Satoshi Nishiyama
Drones 2025, 9(9), 660; https://doi.org/10.3390/drones9090660 - 21 Sep 2025
Viewed by 112
Abstract
In recent years, large multimodal models (LMMs), such as ChatGPT 4o and DeepSeek R1—artificial intelligence systems capable of multimodal (e.g., image and text) human–computer interaction—have gained traction in industrial and civil engineering applications. Concurrently, insufficient real-world drone-view data (specifically close-distance, high-resolution imagery) for civil engineering scenarios has heightened the importance of artificial-intelligence-generated content (AIGC), or synthetic data, as a supplementary input. AIGC is typically produced via text-to-image generative models (e.g., Stable Diffusion, DALL-E) guided by user-defined prompts. This study leverages LMMs to interpret key parameters for drone-based image generation (e.g., color, texture, scene composition, photographic style) and applies prompt engineering to systematize these parameters. The resulting LMM-generated prompts were used to synthesize training data for a You Only Look Once version 8 segmentation model (YOLOv8-seg). To address the need for detailed crack-distribution mapping in low-altitude drone-based monitoring, the trained YOLOv8-seg model was evaluated on close-distance crack benchmark datasets. The experimental results confirm that LMM-prompted AIGC is a viable supplement for low-altitude drone crack monitoring, achieving >80% classification accuracy (images with/without cracks) at a confidence threshold of 0.5. Full article
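
With the AIGC staged as an ordinary segmentation dataset, the training step is a standard ultralytics call; the dataset YAML name below is a placeholder for wherever the synthetic images and masks live.

```python
# Train YOLOv8-seg on prompt-generated crack imagery, then predict at the
# 0.5 confidence threshold used in the study.
from ultralytics import YOLO

model = YOLO("yolov8n-seg.pt")            # pretrained segmentation checkpoint
model.train(data="crack_aigc.yaml",       # hypothetical dataset config
            epochs=100, imgsz=640)
results = model.predict("drone_view.jpg", conf=0.5)
```
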
21 pages, 5544 KB  
Article
Multimodal Large Language Model-Enabled Machine Intelligent Fault Diagnosis Method with Non-Contact Dynamic Vision Data
by Zihan Lu, Cuiying Sun and Xiang Li
Sensors 2025, 25(18), 5898; https://doi.org/10.3390/s25185898 - 20 Sep 2025
Viewed by 237
Abstract
Smart manufacturing demands ever-increasing equipment reliability and continuous availability. Traditional fault diagnosis relies on attached sensors and complex wiring to collect vibration signals. This approach suffers from poor environmental adaptability, difficult maintenance, and cumbersome preprocessing. This study pioneers the use of high-temporal-resolution dynamic visual information captured by an event camera to fine-tune a multimodal large model. Leveraging non-contact acquisition with an event camera, sparse pulse events are converted into event frames through time surface processing. These frames are then reconstructed into a high-temporal-resolution video using spatiotemporal denoising and region-of-interest definition. The study introduces the multimodal model Qwen2.5-VL-7B and employs two distinct LoRA fine-tuning strategies for bearing fault classification. Strategy A utilizes OpenCV to extract key video frames for lightweight parameter injection. In contrast, Strategy B invokes the model’s built-in video-processing pipeline to fully leverage rich temporal information and capture dynamic details of the bearing’s operation. Classification experiments were conducted under three operating conditions and four rotational speeds. Strategy A and Strategy B achieved classification accuracies of 0.9247 and 0.9540, respectively, successfully establishing a novel fault diagnosis paradigm that progresses from non-contact sensing to end-to-end intelligent analysis. Full article
(This article belongs to the Special Issue Applications of Sensors in Condition Monitoring and Fault Diagnosis)
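
The time-surface step can be pictured as follows: each pixel holds an exponentially decayed recency of its last event. The event layout and decay constant are assumptions, not the paper’s parameters.

```python
# Time surface: exponential decay of each pixel's most recent event time.
import numpy as np

def time_surface(events, shape, t_now, tau=0.03):
    """events: rows of (t, x, y, polarity); returns a frame in [0, 1]."""
    last_t = np.full(shape, -np.inf)
    for t, x, y, _ in events:
        last_t[int(y), int(x)] = t                   # most recent timestamp wins
    return np.where(np.isfinite(last_t),
                    np.exp(-(t_now - last_t) / tau), 0.0)

events = np.array([[0.010, 5, 3, 1], [0.025, 5, 3, -1], [0.028, 7, 2, 1]])
print(time_surface(events, shape=(8, 8), t_now=0.030).max())
```
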
22 pages, 14929 KB  
Article
Educational Evaluation with MLLMs: Framework, Dataset, and Comprehensive Assessment
by Yuqing Chen, Yixin Li, Yupei Ren, Yixin Liu and Yiping Ma
Electronics 2025, 14(18), 3713; https://doi.org/10.3390/electronics14183713 - 19 Sep 2025
Viewed by 246
Abstract
With the rapid development of Multimodal Large Language Models (MLLMs) in education, their applications have mainly focused on content generation tasks such as text writing and courseware production. However, automated assessment of non-exam learning outcomes remains underexplored. This study shifts the application of MLLMs from content generation to content evaluation and designs a lightweight and extensible framework to enable automated assessment of students’ multimodal work. We constructed a multimodal dataset comprising student essays, slide decks, and presentation videos from university students, which were annotated by experts across five educational dimensions. Based on horizontal educational evaluation dimensions (Format Compliance, Content Quality, Slide Design, Verbal Expression, and Nonverbal Performance) and vertical model capability dimensions (consistency, stability, and interpretability), we systematically evaluated four leading multimodal large models (GPT-4o, Gemini 2.5, Doubao1.6, and Kimi 1.5) in assessing non-exam learning outcomes. The results indicate that MLLMs demonstrate good consistency with human evaluations across various assessment dimensions, with each model exhibiting its own strengths. Additionally, they possess high explainability and perform better in text-based tasks than in visual tasks, but their scoring stability still requires improvement. This study demonstrates the potential of MLLMs for non-exam learning assessment and provides a reference for advancing their applications in education. Full article
(This article belongs to the Special Issue Techniques and Applications of Multimodal Data Fusion)
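
The horizontal/vertical evaluation reduces to familiar statistics: per-dimension agreement with expert scores (consistency) and variance across repeated scoring runs (stability). A sketch with synthetic scores follows; the five dimensions are the paper’s.

```python
# Consistency: correlation with expert scores per dimension.
# Stability: spread of scores across repeated scoring runs.
import numpy as np
from scipy.stats import pearsonr, spearmanr

dims = ["Format Compliance", "Content Quality", "Slide Design",
        "Verbal Expression", "Nonverbal Performance"]
rng = np.random.default_rng(0)
human = rng.uniform(60, 100, size=(30, 5))            # 30 submissions, 5 dims
model = human + rng.normal(0, 5, size=human.shape)    # synthetic model scores

for j, dim in enumerate(dims):
    r, _ = pearsonr(human[:, j], model[:, j])
    rho, _ = spearmanr(human[:, j], model[:, j])
    print(f"{dim}: Pearson r={r:.2f}, Spearman rho={rho:.2f}")

runs = model[None] + rng.normal(0, 2, size=(10, 30, 5))  # 10 repeated runs
print("stability (mean per-item std):", runs.std(axis=0).mean().round(2))
```
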
21 pages, 1694 KB  
Article
Integrating Temporal Interest Dynamics and Virality Factors for High-Precision Ranking in Big Data Recommendation
by Zhaoyang Ye, Jingyi Yang, Fanyu Meng, Manzhou Li and Yan Zhan
Electronics 2025, 14(18), 3687; https://doi.org/10.3390/electronics14183687 - 18 Sep 2025
Viewed by 280
Abstract
In large-scale recommendation scenarios, achieving high-precision ranking requires simultaneously modeling user interest dynamics and content propagation potential. In this work, we propose a unified framework that integrates a temporal interest modeling stream with a multimodal virality encoder. The temporal stream captures sequential user behavior through the self-attention-based modeling of long-term and short-term interests, while the virality encoder learns latent virality factors from heterogeneous modalities, including text, images, audio, and user comments. The two streams are fused in the ranking layer to form a joint representation that balances personalized preference with content dissemination potential. To further enhance efficiency, we design hierarchical cascade heads with gating recursion for progressive refinement, along with a multi-level pruning and cache management strategy that reduces redundancy during inference. Experiments on three real-world datasets (Douyin, Bilibili, and MIND) demonstrate that our method achieves significant improvements over state-of-the-art baselines across multiple metrics. Additional analyses confirm the interpretability of the virality factors and highlight their positive correlation with real-world popularity indicators. These results validate the effectiveness and practicality of our approach for high-precision recommendation in big data environments. Full article
(This article belongs to the Special Issue Advances in Data-Driven Artificial Intelligence)
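
The fusion in the ranking layer can be pictured with a small two-stream module: self-attention pools the behavior sequence into an interest vector, which is concatenated with a latent virality vector before scoring. All dimensions are illustrative.

```python
# Two-stream ranker: attention-pooled temporal interest + virality factors.
import torch
import torch.nn as nn

class TwoStreamRanker(nn.Module):
    def __init__(self, d=64, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.score = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU(), nn.Linear(d, 1))

    def forward(self, behavior_seq, virality_vec):
        # behavior_seq: (B, L, d) item embeddings; virality_vec: (B, d)
        h, _ = self.attn(behavior_seq, behavior_seq, behavior_seq)
        interest = h.mean(dim=1)              # pooled temporal interest
        return self.score(torch.cat([interest, virality_vec], dim=-1))

ranker = TwoStreamRanker()
print(ranker(torch.randn(2, 20, 64), torch.randn(2, 64)).shape)  # (2, 1)
```
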
29 pages, 7882 KB  
Article
From Concept to Representation: Modeling Driving Capability and Task Demand with a Multimodal Large Language Model
by Haoran Zhou, Alexander Carballo, Keisuke Fujii and Kazuya Takeda
Sensors 2025, 25(18), 5805; https://doi.org/10.3390/s25185805 - 17 Sep 2025
Viewed by 263
Abstract
Driving safety hinges on the dynamic interplay between task demand and driving capability, yet these concepts lack a unified, quantifiable formulation. In this work, we present a framework based on a multimodal large language model that transforms heterogeneous driving signals—scene images, maneuver descriptions, control inputs, and surrounding traffic states—into low-dimensional embeddings of task demand and driving capability. By projecting both embeddings into a shared latent space, the framework yields an interpretable measurement of task difficulty that alerts to capability shortfalls before unsafe behavior arises. Built upon a customized BLIP-2 backbone and fine-tuned on diverse simulated driving scenarios, the model respects consistency within tasks, captures impairment-related capability degradation, and can transfer to real-world motorway data without additional training. These findings endorse the framework as a concise yet effective step toward proactive, explainable risk assessment in intelligent vehicles. Full article
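
One plausible reading of the shared-latent-space measurement, sketched below: project both embeddings into a common space and treat low demand-capability similarity as high task difficulty. The projection heads and the score are assumptions, not the paper’s exact formulation.

```python
# Difficulty as dissimilarity between projected demand and capability.
import torch
import torch.nn.functional as F

proj_demand = torch.nn.Linear(512, 128)      # illustrative projection heads
proj_capability = torch.nn.Linear(512, 128)

def task_difficulty(demand_emb, capability_emb):
    d = F.normalize(proj_demand(demand_emb), dim=-1)
    c = F.normalize(proj_capability(capability_emb), dim=-1)
    # low similarity => demand outstrips capability => higher difficulty
    return 1.0 - F.cosine_similarity(d, c, dim=-1)

score = task_difficulty(torch.randn(1, 512), torch.randn(1, 512))
print(score.item())  # alert when this exceeds a calibrated threshold
```
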
25 pages, 541 KB  
Review
Augmented Decisions: AI-Enhanced Accuracy in Glaucoma Diagnosis and Treatment
by Marco Zeppieri, Caterina Gagliano, Daniele Tognetto, Mutali Musa, Alessandro Avitabile, Fabiana D’Esposito, Simonetta Gaia Nicolosi and Matteo Capobianco
J. Clin. Med. 2025, 14(18), 6519; https://doi.org/10.3390/jcm14186519 - 16 Sep 2025
Viewed by 279
Abstract
Glaucoma remains a leading cause of irreversible blindness. We reviewed more than 150 peer-reviewed studies (January 2019–July 2025) that applied artificial or augmented intelligence (AI/AuI) to glaucoma care. Deep learning systems analyzing fundus photographs or OCT volumes routinely achieved area-under-the-curve values around 0.95 and matched—or exceeded—subspecialists in prospective tests. Sequence-aware models detected visual field worsening up to 1.7 years earlier than conventional linear trends, while a baseline multimodal network integrating OCT, visual field, and clinical data predicted the need for incisional surgery with AUROC 0.92. Offline smartphone triage in community clinics reached sensitivities near 94% and specificities between 86% and 94%, illustrating feasibility in low-resource settings. Large language models answered glaucoma case questions with specialist-level accuracy but still require human oversight. Key obstacles include algorithmic bias, workflow integration, and compliance with emerging regulations, such as the EU AI Act and FDA GMLP. With rigorous validation, bias auditing, and transparent change control, AI/AuI can augment—rather than replace—clinician expertise, enabling earlier intervention, tailored therapy, and more equitable access to glaucoma care worldwide. Full article
(This article belongs to the Special Issue Augmented and Artificial Intelligence in Ophthalmology)