Search Results (526)

Search Parameters:
Keywords = vision language models

51 pages, 2099 KB  
Review
Secure and Intelligent Low-Altitude Infrastructures: Synergistic Integration of IoT Networks, AI Decision-Making and Blockchain Trust Mechanisms
by Yuwen Ye, Xirun Min, Xiangwen Liu, Xiangyi Chen, Kefan Cao, S. M. Ruhul Kabir Howlader and Xiao Chen
Sensors 2025, 25(21), 6751; https://doi.org/10.3390/s25216751 - 4 Nov 2025
Abstract
The low-altitude economy (LAE), encompassing urban air mobility, drone logistics and sub 3000 m aerial surveillance, demands secure, intelligent infrastructures to manage increasingly complex, multi-stakeholder operations. This survey evaluates the integration of Internet of Things (IoT) networks, artificial intelligence (AI) decision-making and blockchain trust mechanisms as foundational enablers for next-generation LAE ecosystems. IoT sensor arrays deployed at ground stations, unmanned aerial vehicles (UAVs) and vertiports form a real-time data fabric that records variables from air traffic density to environmental parameters. These continuous data streams empower AI models ranging from predictive analytics and computer vision (CV) to multi-agent reinforcement learning (MARL) and large language model (LLM) reasoning to optimize flight paths, identify anomalies and coordinate swarm behaviors autonomously. In parallel, blockchain architectures furnish immutable audit trails for regulatory compliance, support secure device authentication via decentralized identifiers (DIDs) and automate contractual exchanges for services such as airspace leasing or payload delivery. By examining current research and practical deployments, this review demonstrates how the synergistic application of IoT, AI and blockchain can bolster operational efficiency, resilience and trustworthiness across the LAE landscape. Full article

25 pages, 8227 KB  
Article
UniAD: A Real-World Multi-Category Industrial Anomaly Detection Dataset with a Unified CLIP-Based Framework
by Junyang Yang, Jiuxin Cao and Chengge Duan
Information 2025, 16(11), 956; https://doi.org/10.3390/info16110956 - 4 Nov 2025
Abstract
Industrial image anomaly detection is critical for automated manufacturing. However, most existing methods rely on single-category training paradigms, resulting in poor scalability and limited cross-category generalization. These approaches require separate models for each product type and fail to model the complex multi-modal distribution of normal samples in multi-category scenarios. To overcome these limitations, we propose UniCLIP-AD, a unified anomaly detection framework that leverages the general semantic knowledge of CLIP and adapts it to the industrial domain using Low-Rank Adaptation (LoRA). This design enables a single model to effectively handle diverse industrial parts. In addition, we introduce UniAD, a large-scale industrial anomaly detection dataset collected from real production lines. It contains over 25,000 high-resolution images across 7 categories of electronic components, with both pixel-level and image-level annotations. UniAD captures fine-grained, diverse, and realistic defects, making it a strong benchmark for unified anomaly detection. Experiments show that UniCLIP-AD achieves superior performance on UniAD, with an AU-ROC of 92.1% and F1-score of 89.8% in cross-category tasks, outperforming the strongest baselines (CFA and DSR) by 3% AU-ROC and 23.9% F1-score. Full article
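
The abstract above gives no implementation details, but the core adaptation step it names (injecting Low-Rank Adaptation into a frozen CLIP-style encoder so only small low-rank matrices are trained) can be sketched as follows. This is a minimal sketch, not the authors' code: the rank, scaling, and the choice of which linear layers to wrap are illustrative assumptions.

```python
# Minimal sketch: wrap the linear projections of a frozen CLIP-style encoder with
# LoRA adapters so that only the low-rank update matrices receive gradients.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen nn.Linear plus a trainable low-rank update: W x + (alpha/r) * B A x."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                  # keep pretrained weights frozen
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.lora_A.T @ self.lora_B.T)

def add_lora(module: nn.Module, r: int = 8):
    """Recursively replace nn.Linear layers (e.g., attention projections) with LoRALinear."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            setattr(module, name, LoRALinear(child, r=r))
        else:
            add_lora(child, r=r)

# Usage idea: add_lora(clip_vision_encoder); then optimize only parameters with
# requires_grad=True while the rest of the encoder stays frozen.
```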

22 pages, 9212 KB  
Article
Semantic-Aware Co-Parallel Network for Cross-Scene Hyperspectral Image Classification
by Xiaohui Li, Chenyang Jin, Yuntao Tang, Kai Xing and Xiaodong Yu
Sensors 2025, 25(21), 6688; https://doi.org/10.3390/s25216688 - 1 Nov 2025
Viewed by 207
Abstract
Cross-scene classification of hyperspectral images poses significant challenges due to the lack of a priori knowledge and the differences in data distribution across scenes. While traditional studies have had limited use of a priori knowledge from other modalities, recent advancements in pre-trained large-scale language-vision models have shown strong performance on various downstream tasks, highlighting the potential of cross-modal assisted learning. In this paper, we propose a Semantic-aware Collaborative Parallel Network (SCPNet) to mitigate the impact of data distribution differences by incorporating linguistic modalities to assist in learning cross-domain invariant representations of hyperspectral images. SCPNet uses a parallel architecture consisting of a spatial–spectral feature extraction module and a multiscale feature extraction module, designed to capture rich image information during the feature extraction phase. The extracted features are then mapped into an optimized semantic space, where improved supervised contrastive learning clusters image features from the same category together while separating those from different categories. Semantic space bridges the gap between visual and linguistic modalities, enabling the model to mine cross-domain invariant representations from the linguistic modality. Experimental results demonstrate that SCPNet significantly outperforms existing methods on three publicly available datasets, confirming its effectiveness for cross-scene hyperspectral image classification tasks. Full article
(This article belongs to the Special Issue Remote Sensing Image Processing, Analysis and Application)
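
For readers unfamiliar with the supervised contrastive objective mentioned above (clustering same-category features while separating different categories), a standard SupCon-style loss is sketched below. The paper describes an improved variant whose details are not in the abstract, so this is only the generic form it builds on.

```python
# Generic supervised contrastive (SupCon-style) loss over a batch of embeddings.
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(features, labels, temperature=0.07):
    """features: (N, D) embeddings; labels: (N,) integer class ids."""
    z = F.normalize(features, dim=1)
    sim = z @ z.T / temperature                              # pairwise similarities
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask

    # log-softmax over all other samples (exclude each anchor's self-similarity)
    sim = sim.masked_fill(self_mask, float('-inf'))
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)

    # average log-probability of positives, for anchors that have at least one positive
    pos_counts = pos_mask.sum(1)
    valid = pos_counts > 0
    per_anchor = -(log_prob.masked_fill(~pos_mask, 0.0)).sum(1)
    return (per_anchor[valid] / pos_counts[valid]).mean()
```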

21 pages, 1399 KB  
Review
Artificial Intelligence in Oncology: A 10-Year ClinicalTrials.gov-Based Analysis Across the Cancer Control Continuum
by Himanshi Verma, Shilpi Mistry, Krishna Vamsi Jayam, Pratibha Shrestha, Lauren Adkins, Muxuan Liang, Aline Fares, Ali Zarrinpar, Dejana Braithwaite and Shama D. Karanth
Cancers 2025, 17(21), 3537; https://doi.org/10.3390/cancers17213537 - 1 Nov 2025
Viewed by 201
Abstract
Background/Objectives: Artificial Intelligence (AI) is rapidly advancing in medicine, facilitating personalized care by leveraging complex clinical data, imaging, and patient monitoring. This study characterizes current practices in AI use within oncology clinical trials by analyzing completed U.S. trials within the Cancer Control Continuum (CCC), a framework that spans the stages of cancer etiology, prevention, detection, diagnosis, treatment, and survivorship. Methods: This cross-sectional study analyzed U.S.-based oncology trials registered on ClinicalTrials.gov between January 2015 and April 2025. Using AI-related MeSH terms, we identified trials addressing stages of the CCC. Results: Fifty completed oncology trials involving AI were identified; 66% were interventional and 34% observational. Machine Learning was the most common AI application, though specific algorithm details were often lacking. Other AI domains included Natural Language Processing, Computer Vision, and Integrated Systems. Most trials were single-center with limited participant enrollment. Few published results or reported outcomes, indicating notable reporting gaps. Conclusions: This analysis of ClinicalTrials.gov reveals a dynamic and innovative landscape of AI applications transforming oncology care, from cutting-edge Machine Learning models enhancing early cancer detection to intelligent chatbots supporting treatment adherence and personalized survivorship interventions. These trials highlight AI’s growing role in improving outcomes across the CCC in advancing personalized cancer care. Standardized reporting and enhanced data sharing will be important for facilitating the broader application of trial findings, accelerating the development and clinical integration of reliable AI tools to advance cancer care. Full article

17 pages, 1397 KB  
Article
A Novel Approach for Reliable Classification of Marine Low Cloud Morphologies with Vision–Language Models
by Ehsan Erfani and Farnaz Hosseinpour
Atmosphere 2025, 16(11), 1252; https://doi.org/10.3390/atmos16111252 - 31 Oct 2025
Viewed by 239
Abstract
Marine low clouds have a strong impact on Earth’s system but remain a major source of uncertainty in anthropogenic radiative forcing simulated by general circulation models. This uncertainty arises from incomplete understanding of the many processes controlling their evolution and interactions. A key feature of these clouds is their diverse mesoscale morphologies, which are closely tied to their microphysical and radiative properties but remain difficult to characterize with satellite retrievals and numerical models. Here, we develop and apply a vision–language model (VLM) to classify marine low cloud morphologies using two independent datasets based on Moderate Resolution Imaging Spectroradiometer (MODIS) satellite imagery: (1) mesoscale cellular convection types of sugar, gravel, fish, and flower (SGFF; 8800 total samples) and (2) marine stratocumulus (Sc) types of stratus, closed cells, open cells, and other cells (260 total samples). By conditioning frozen image encoders on descriptive prompts, the VLM leverages multimodal priors learned from large-scale image–text training, making it less sensitive to limited sample size. Results show that the k-fold cross-validation of VLM achieves an overall accuracy of 0.84 for SGFF, comparable to prior deep learning benchmarks for the same cloud types, and retains robust performance under the reduction in SGFF training size. For the Sc dataset, the VLM attains 0.86 accuracy, whereas the image-only model is unreliable under such a limited training set. These findings highlight the potential of VLMs as efficient and accurate tools for cloud classification under very low samples, offering new opportunities for satellite remote sensing and climate model evaluation. Full article
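
The prompt-conditioning idea described above (scoring imagery against descriptive class prompts with frozen encoders) can be illustrated with a generic CLIP-style zero-shot classifier. The checkpoint name and the SGFF prompt wording below are placeholders, not the configuration used in the study.

```python
# Generic CLIP-style zero-shot sketch: frozen encoders scored against descriptive
# class prompts. Checkpoint and prompt texts are illustrative placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

prompts = {
    "sugar":  "a satellite image of fine-grained, dusty marine low clouds",
    "gravel": "a satellite image of clumpy low clouds arranged along arcs",
    "fish":   "a satellite image of skeletal, fishbone-shaped cloud bands",
    "flower": "a satellite image of large flower-like stratocumulus patches",
}

def classify(image: Image.Image) -> str:
    inputs = processor(text=list(prompts.values()), images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    probs = out.logits_per_image.softmax(dim=-1)   # image-to-prompt similarities
    return list(prompts.keys())[int(probs.argmax())]
```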

16 pages, 579 KB  
Article
IGSMNet: Ingredient-Guided Semantic Modeling Network for Food Nutrition Estimation
by Donglin Zhang, Weixiang Shi, Boyuan Ma, Weiqing Min and Xiao-Jun Wu
Foods 2025, 14(21), 3697; https://doi.org/10.3390/foods14213697 - 30 Oct 2025
Viewed by 389
Abstract
In recent years, food nutrition estimation has received growing attention due to its critical role in dietary analysis and public health. Traditional nutrition assessment methods often rely on manual measurements and expert knowledge, which are time-consuming and not easily scalable. With the advancement of computer vision, RGB-based methods have been proposed, and more recently, RGB-D-based approaches have further improved performance by incorporating depth information to capture spatial cues. While these methods have shown promising results, they still face challenges in complex food scenes, such as limited ability to distinguish visually similar items with different ingredients and insufficient modeling of spatial or semantic relationships. To solve these issues, we propose an Ingredient-Guided Semantic Modeling Network (IGSMNet) for food nutrition estimation. The method introduces an ingredient-guided module that encodes ingredient information using a pre-trained language model and aligns it with visual features via cross-modal attention. At the same time, an internal semantic modeling component is designed to enhance structural understanding through dynamic positional encoding and localized attention, allowing for fine-grained relational reasoning. On the Nutrition5k dataset, our method achieves PMAE values of 12.2% for Calories, 9.4% for Mass, 19.1% for Fat, 18.3% for Carb, and 16.0% for Protein. These results demonstrate that our IGSMNet consistently outperforms existing baselines, validating its effectiveness. Full article
(This article belongs to the Section Food Nutrition)
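
A minimal sketch of the ingredient-guided idea above: visual tokens attend to ingredient-text embeddings produced by a language model through cross-modal attention with a residual connection. Dimensions and layer choices are illustrative, not those of IGSMNet.

```python
# Sketch of cross-modal attention fusing visual tokens with ingredient-text embeddings.
import torch
import torch.nn as nn

class IngredientCrossAttention(nn.Module):
    def __init__(self, vis_dim=768, txt_dim=768, n_heads=8):
        super().__init__()
        self.txt_proj = nn.Linear(txt_dim, vis_dim)          # align text dim to visual dim
        self.attn = nn.MultiheadAttention(vis_dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(vis_dim)

    def forward(self, vis_tokens, ing_tokens):
        """vis_tokens: (B, Nv, vis_dim); ing_tokens: (B, Nt, txt_dim)."""
        txt = self.txt_proj(ing_tokens)
        fused, _ = self.attn(query=vis_tokens, key=txt, value=txt)
        return self.norm(vis_tokens + fused)                 # residual fusion
```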

36 pages, 11240 KB  
Article
Public Perception of Urban Recreational Spaces Based on Large Vision–Language Models: A Case Study of Beijing’s Third Ring Area
by Yan Wang, Xin Hou, Xuan Wang and Wei Fan
Land 2025, 14(11), 2155; https://doi.org/10.3390/land14112155 - 29 Oct 2025
Viewed by 419
Abstract
Urban recreational spaces (URSs) are pivotal for enhancing resident well-being, making the accurate assessment of public perceptions crucial for quality optimization. Compared to traditional surveys, social media data provide a scalable means for multi-dimensional perception assessment. However, existing studies predominantly rely on single-modal data, which limits comprehensive capture of complex perceptions and lacks interpretability. To address these gaps, this study employs cutting-edge large vision–language models (LVLMs) and develops an interpretable model, Qwen2.5-VL-7B-SFT, through supervised fine-tuning on a manually annotated dataset. The model integrates visual-linguistic features to assess four perceptual dimensions of URSs: esthetics, attractiveness, cultural significance, and restorativeness. Crucially, we generate textual evidence for our judgments by identifying the key spatial elements and emotional characteristics associated with specific perceptions. By integrating multi-source built environment data with Optuna-optimized machine learning and SHAP analysis, we further decipher the nonlinear relationships between built environment variables and perceptual outcomes. The results are as follows: (1) Interpretable LVLMs are highly effective for urban spatial perception research. (2) URSs within Beijing’s Third Ring Road fall into four typologies: historical heritage, commercial entertainment, ecological-natural, and cultural spaces, with significant correlations observed between physical elements and emotional responses. (3) Historical heritage accessibility and POI density are identified as key predictors of public perception. Positive perception significantly improves when a block’s POI functional density exceeds 4000 units/km² or when its 500 m radius encompasses more than four historical heritage sites. Our methodology enables precise quantification of multidimensional URS perceptions, links built environment elements to perceptual mechanisms, and provides actionable insights for urban planning. Full article
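
The Optuna-plus-SHAP analysis step named above follows a common pattern: tune a tabular model on built-environment features, then attribute predictions to features. A generic sketch is given below; the model family, search space, and inputs are chosen for illustration rather than taken from the paper.

```python
# Generic Optuna hyperparameter search followed by SHAP attribution (illustrative).
import optuna
import shap
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

def tune_and_explain(X, y):
    def objective(trial):
        model = GradientBoostingRegressor(
            n_estimators=trial.suggest_int("n_estimators", 100, 600),
            max_depth=trial.suggest_int("max_depth", 2, 6),
            learning_rate=trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        )
        return cross_val_score(model, X, y, cv=5, scoring="r2").mean()

    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=50)

    best = GradientBoostingRegressor(**study.best_params).fit(X, y)
    shap_values = shap.TreeExplainer(best).shap_values(X)   # per-feature contributions
    return best, shap_values
```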

41 pages, 2786 KB  
Review
Research Status and Development Trends of Artificial Intelligence in Smart Agriculture
by Chuang Ge, Guangjian Zhang, Yijie Wang, Dandan Shao, Xiangjin Song and Zhaowei Wang
Agriculture 2025, 15(21), 2247; https://doi.org/10.3390/agriculture15212247 - 28 Oct 2025
Viewed by 584
Abstract
Artificial Intelligence (AI) is a key technological enabler for the transition of agricultural production and management from experience-driven to data-driven, continuously advancing modern agriculture toward smart agriculture. This evolution ultimately aims to achieve a precise agricultural production model characterized by low resource consumption, high safety, high quality, high yield, and stable, sustainable development. Although machine learning, deep learning, computer vision, Internet of Things, and other AI technologies have made significant progress in numerous agricultural production applications, most studies focus on singular agricultural scenarios or specific AI algorithm research, such as object detection, navigation, agricultural machinery maintenance, and food safety, resulting in relatively limited coverage. To comprehensively elucidate the applications of AI in agriculture and provide a valuable reference for practitioners and policymakers, this paper reviews relevant research by investigating the entire agricultural production process—including planting, management, and harvesting—covering application scenarios such as seed selection during the cultivation phase, pest and disease identification and intelligent management during the growth phase, and agricultural product grading during the harvest phase, as well as agricultural machinery and devices like fault diagnosis and predictive maintenance of agricultural equipment, agricultural robots, and the agricultural Internet of Things. It first analyzes the fundamental principles and potential advantages of typical AI technologies, followed by a systematic and in-depth review of the latest progress in applying these core technologies to smart agriculture. The challenges faced by existing technologies are also explored, such as the inherent limitations of AI models—including poor generalization capability, low interpretability, and insufficient real-time performance—as well as the complex agricultural operating environments that result in multi-source, heterogeneous, and low-quality, unevenly annotated data. Furthermore, future research directions are discussed, such as lightweight network models, transfer learning, embodied intelligent agricultural robots, multimodal perception technologies, and large language models for agriculture. The aim is to provide meaningful insights for both theoretical research and practical applications of AI technologies in agriculture. Full article
(This article belongs to the Special Issue Perception, Decision-Making, and Control of Agricultural Robots)

20 pages, 690 KB  
Article
VLM-as-a-Judge Approaches for Evaluating Visual Narrative Coherence in Historical Photographical Records
by Brian Keith, Claudio Meneses, Mauricio Matus, María Constanza Castro and Diego Urrutia
Electronics 2025, 14(21), 4199; https://doi.org/10.3390/electronics14214199 - 27 Oct 2025
Viewed by 375
Abstract
Evaluating the coherence of visual narrative sequences extracted from image collections remains a challenge in digital humanities and computational journalism. While mathematical coherence metrics based on visual embeddings provide objective measures, they require computational resources and technical expertise to interpret. We propose using vision-language models (VLMs) as judges to evaluate visual narrative coherence, comparing two approaches: caption-based evaluation that converts images to text descriptions and direct vision evaluation that processes images without intermediate text generation. Through experiments on 126 narratives from historical photographs, we show that both approaches achieve weak-to-moderate correlations with mathematical coherence metrics (r = 0.28–0.36) while differing in reliability and efficiency. Direct VLM evaluation achieves higher inter-rater reliability (ICC = 0.718 vs. 0.339) but requires 10.8× more computation time after initial caption generation. Both methods successfully discriminate between human-curated, algorithmically extracted, and random narratives, with all pairwise comparisons achieving statistical significance (p < 0.05, with five of six comparisons at p < 0.001). Human sequences consistently score highest, followed by algorithmic extractions, then random sequences. Our findings indicate that the choice between approaches depends on application requirements: caption-based for efficient large-scale screening versus direct vision for consistent curatorial assessment. Full article
(This article belongs to the Special Issue Artificial Intelligence-Driven Emerging Applications)
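
A minimal sketch of the kind of comparison reported above: correlating VLM-judge scores with an embedding-based coherence metric, and testing whether human-curated sequences outscore random ones. The specific statistical tests here are generic stand-ins; the paper's exact procedure, including its ICC variant, is not given in the abstract.

```python
# Generic stand-in comparisons (not necessarily the paper's exact tests).
from scipy.stats import pearsonr, mannwhitneyu

def compare_judge_to_metric(judge_scores, coherence_metric,
                            human_scores, random_scores):
    """All inputs are 1-D sequences of per-narrative scores."""
    r, p_r = pearsonr(judge_scores, coherence_metric)        # agreement with the metric
    u, p_u = mannwhitneyu(human_scores, random_scores,
                          alternative="greater")              # human-curated > random?
    return {"pearson_r": r, "pearson_p": p_r,
            "mannwhitney_U": u, "mannwhitney_p": p_u}
```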

20 pages, 13884 KB  
Article
Prototype-Guided Zero-Shot Medical Image Segmentation with Large Vision-Language Models
by Huong Pham and Samuel Cheng
Appl. Sci. 2025, 15(21), 11441; https://doi.org/10.3390/app152111441 - 26 Oct 2025
Viewed by 500
Abstract
Building on advances in promptable segmentation models, this work introduces a framework that integrates Large Vision-Language Model (LVLM) bounding box priors with prototype-based region of interest (ROI) selection to improve zero-shot medical image segmentation. Unlike prior methods such as SaLIP, which often misidentify regions due to reliance on text–image CLIP similarity, the proposed approach leverages visual prototypes to mitigate language bias and enhance ROI ranking, resulting in more accurate segmentation. Bounding box estimation is further strengthened through systematic prompt engineering to optimize LVLM performance across diverse datasets and imaging modalities. Evaluation was conducted on three publicly available benchmark datasets—CC359 (brain MRI), HC18 (fetal head ultrasound), and CXRMAL (chest X-ray)—without any task-specific fine-tuning. The proposed method achieved substantial improvements over prior approaches. On CC359, it reached a Dice score of 0.95 ± 0.06 and a mean Intersection-over-Union (mIoU) of 0.91 ± 0.10. On HC18, it attained a Dice score of 0.82 ± 0.20 and mIoU of 0.74 ± 0.22. On CXRMAL, the model achieved a Dice score of 0.90 ± 0.08 and mIoU of 0.83 ± 0.12. These standard deviations reflect variability across test images within each dataset, indicating the robustness of the proposed zero-shot framework. These results demonstrate that integrating LVLM-derived bounding box priors with prototype-based selection substantially advances zero-shot medical image segmentation. Full article
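
The prototype-based ROI selection step described above can be sketched independently of any particular segmenter: embed the LVLM-proposed boxes with a frozen image encoder and rank them against a mean prototype built from a few reference crops. The function names below are illustrative, not the paper's API.

```python
# Minimal sketch of prototype-based ROI ranking (embedding extraction abstracted away).
import torch
import torch.nn.functional as F

def build_prototype(reference_embeddings: torch.Tensor) -> torch.Tensor:
    """Average a few embeddings of known target-structure crops into one prototype."""
    return F.normalize(reference_embeddings.mean(dim=0), dim=-1)

def rank_rois(roi_embeddings: torch.Tensor, prototype: torch.Tensor) -> torch.Tensor:
    """Return ROI indices sorted by cosine similarity to the visual prototype."""
    sims = F.normalize(roi_embeddings, dim=-1) @ prototype
    return torch.argsort(sims, descending=True)

# Usage idea: embed each LVLM-proposed bounding box with a frozen image encoder,
# rank the boxes against the prototype, and pass the top box to a promptable segmenter.
```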

37 pages, 10732 KB  
Review
Advances on Multimodal Remote Sensing Foundation Models for Earth Observation Downstream Tasks: A Survey
by Guoqing Zhou, Lihuang Qian and Paolo Gamba
Remote Sens. 2025, 17(21), 3532; https://doi.org/10.3390/rs17213532 - 24 Oct 2025
Viewed by 577
Abstract
Remote sensing foundation models (RSFMs) have demonstrated excellent feature extraction and reasoning capabilities under the self-supervised learning paradigm of “unlabeled datasets—model pre-training—downstream tasks”. These models achieve superior accuracy and performance compared to existing models across numerous open benchmark datasets. However, when confronted with multimodal data, such as optical, LiDAR, SAR, text, video, and audio, RSFMs exhibit limitations in cross-modal generalization and multi-task learning. Although several reviews have addressed RSFMs, there is currently no comprehensive survey dedicated to vision–X (vision, language, audio, position) multimodal RSFMs (MM-RSFMs). To tackle this gap, this article provides a systematic review of MM-RSFMs from a novel perspective. Firstly, the key technologies underlying MM-RSFMs are reviewed and analyzed, and the available multimodal RS pre-training datasets are summarized. Then, recent advances in MM-RSFMs are classified according to the development of backbone networks and cross-modal interaction methods of vision–X, such as vision–vision, vision–language, vision–audio, vision–position, and vision–language–audio. Finally, potential challenges are analyzed, and perspectives for MM-RSFMs are outlined. This survey reveals that current MM-RSFMs face the following key challenges: (1) a scarcity of high-quality multimodal datasets, (2) limited capability for multimodal feature extraction, (3) weak cross-task generalization, (4) absence of unified evaluation criteria, and (5) insufficient security measures. Full article
(This article belongs to the Section AI Remote Sensing)

29 pages, 549 KB  
Article
Catch Me If You Can: Rogue AI Detection and Correction at Scale
by Fatemeh Stodt, Jan Stodt, Mohammed Alshawki, Javad Salimi Sratakhti and Christoph Reich
Electronics 2025, 14(20), 4122; https://doi.org/10.3390/electronics14204122 - 21 Oct 2025
Viewed by 409
Abstract
Modern AI systems can strategically misreport information when incentives diverge from truthfulness, posing risks for oversight and deployment. Prior studies often examine this behavior within a single paradigm; systematic, cross-architecture evidence under a unified protocol has been limited. We introduce the Strategy Elicitation Battery (SEB), a standardized probe suite for measuring deceptive reporting across large language models (LLMs), reinforcement-learning agents, vision-only classifiers, multimodal encoders, state-space models, and diffusion models. SEB uses Bayesian inference tasks with persona-controlled instructions, schema-constrained outputs, deterministic decoding where supported, and a probe mix (near-threshold, repeats, neutralized, cross-checks). Estimates use clustered bootstrap intervals, and significance is assessed with a logistic regression by architecture; a mixed-effects analysis is planned once the per-round agent/episode traces are exported. On the latest pre-correction runs, SEB shows a consistent cross-architecture pattern in deception rates: ViT 80.0%, CLIP 15.0%, Mamba 10.0%, RL agents 10.0%, Stable Diffusion 10.0%, and LLMs 5.0% (20 scenarios/architecture). A logistic regression on per-scenario flags finds a significant overall architecture effect (likelihood-ratio test vs. intercept-only: χ²(5) = 41.56, p = 7.22 × 10⁻⁸). Holm-adjusted contrasts indicate ViT is significantly higher than all other architectures in this snapshot; the remaining pairs are not significant. Post-correction acceptance decisions are evaluated separately using residual deception and override rates under SEB-Correct. Latency varies by architecture (sub-second to minutes), enabling pre-deployment screening broadly and real-time auditing for low-latency classes. Results indicate that SEB-Detect deception flags are not confined to any one paradigm, that distinct architectures can converge to similar levels under a common interface, and that reporting interfaces and incentive framing are central levers for mitigation. We operationalize “deception” as reward-sensitive misreport flags, and we separate detection from intervention via a correction wrapper (SEB-Correct), supporting principled acceptance decisions for deployment. Full article
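
The reported likelihood-ratio test of the architecture effect corresponds to the standard comparison of a logistic regression with and without the architecture factor. A generic sketch follows; the column names are placeholders, not the authors' data schema.

```python
# Generic likelihood-ratio test for an architecture effect on per-scenario deception flags.
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import chi2

def architecture_lr_test(df: pd.DataFrame):
    """df needs a binary 'deceptive' column and a categorical 'architecture' column."""
    full = smf.logit("deceptive ~ C(architecture)", data=df).fit(disp=0)
    null = smf.logit("deceptive ~ 1", data=df).fit(disp=0)
    lr_stat = 2 * (full.llf - null.llf)          # likelihood-ratio statistic
    dof = full.df_model - null.df_model          # extra parameters in the full model
    p_value = chi2.sf(lr_stat, dof)
    return lr_stat, dof, p_value
```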

33 pages, 4831 KB  
Article
A General-Purpose Knowledge Retention Metric for Evaluating Distillation Models Across Architectures and Tasks
by Arjay Alba and Jocelyn Villaverde
AI 2025, 6(10), 273; https://doi.org/10.3390/ai6100273 - 21 Oct 2025
Viewed by 478
Abstract
Background: Knowledge distillation (KD) compresses deep neural networks by transferring knowledge from a high-capacity teacher model to a lightweight student model. However, conventional evaluation metrics such as accuracy, mAP, IoU, or RMSE focus mainly on task performance and overlook how effectively the student internalizes the teacher’s knowledge. Methods: This study introduces the Knowledge Retention Score (KRS), a composite metric that integrates intermediate feature similarity and output agreement into a single interpretable score to quantify knowledge retention. KRS was primarily validated in computer vision (CV) through 36 experiments covering image classification, object detection, and semantic segmentation using diverse datasets and eight representative KD methods. Supplementary experiments were conducted in natural language processing (NLP) using transformer-based models on SST-2, and in time series regression with convolutional teacher–student pairs. Results: Across all domains, KRS correlated strongly with standard performance metrics while revealing internal retention dynamics that conventional evaluations often overlook. By reporting feature similarity and output agreement separately alongside the composite score, KRS provides transparent and interpretable insights into knowledge transfer. Conclusions: KRS offers a stable diagnostic tool and a complementary evaluation metric for KD research. Its generality across domains demonstrates its potential as a standardized framework for assessing knowledge retention beyond task-specific performance measures. Full article
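
The abstract defines KRS as a composite of intermediate feature similarity and output agreement but does not give the exact measures or weighting, so the sketch below assumes cosine feature similarity, top-1 prediction agreement, and an equal-weight combination purely for illustration.

```python
# Hedged sketch of a KRS-style composite score; the similarity measure, agreement
# definition, and weighting are assumptions, not the paper's specification.
import torch
import torch.nn.functional as F

def knowledge_retention_score(teacher_feats, student_feats,
                              teacher_logits, student_logits, w_feat=0.5):
    """Features: (N, D) pooled activations already projected to a shared dimension;
    logits: (N, C) task outputs for the same N samples."""
    feat_sim = F.cosine_similarity(teacher_feats, student_feats, dim=1).mean()
    agreement = (teacher_logits.argmax(1) == student_logits.argmax(1)).float().mean()
    krs = w_feat * feat_sim + (1.0 - w_feat) * agreement
    return {"feature_similarity": feat_sim.item(),
            "output_agreement": agreement.item(),
            "KRS": krs.item()}
```

Reporting the two components alongside the composite mirrors the abstract's emphasis on keeping the score interpretable.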

27 pages, 4945 KB  
Article
A Robust Framework for Coffee Bean Package Label Recognition: Integrating Image Enhancement with Vision–Language OCR Models
by Thi-Thu-Huong Le, Yeonjeong Hwang, Ahmada Yusril Kadiptya, JunYoung Son and Howon Kim
Sensors 2025, 25(20), 6484; https://doi.org/10.3390/s25206484 - 20 Oct 2025
Viewed by 725
Abstract
Text recognition on coffee bean package labels is of great importance for product tracking and brand verification, but it poses a challenge due to variations in image quality, packaging materials, and environmental conditions. In this paper, we propose a pipeline that combines several image enhancement techniques with an Optical Character Recognition (OCR) stage based on vision–language (VL) Qwen-VL variants conditioned on structured prompts. To facilitate the evaluation, we construct a coffee bean package image set containing two subsets, namely a low-resolution (LRCB) and a high-resolution (HRCB) coffee bean image set, covering multiple real-world challenges. These cases involve various packaging types (bottles and bags), label sides (front and back), rotation, and different illumination. To address the image quality problem, we design a dedicated preprocessing pipeline for package label scenarios. We develop and evaluate four Qwen-VL OCR variants with prompt engineering, which are compared against four baselines: DocTR, PaddleOCR, EasyOCR, and Tesseract. Extensive comparison using various metrics, including the Levenshtein distance, Cosine similarity, Jaccard index, Exact Match, BLEU score, and ROUGE scores (ROUGE-1, ROUGE-2, and ROUGE-L), shows significant improvements over the baselines. In addition, validation on the public POIE dataset demonstrates how well the framework generalizes, confirming its practicality and reliability for label recognition. Full article
(This article belongs to the Special Issue Digital Imaging Processing, Sensing, and Object Recognition)
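
Three of the listed text metrics are easy to state exactly; the sketch below implements Levenshtein distance, a token-level Jaccard index, and exact match. The normalization choices are mine, not necessarily the paper's.

```python
# Self-contained sketch of three of the reported OCR evaluation metrics.
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard index between two label transcriptions."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 1.0

def exact_match(a: str, b: str) -> bool:
    return a.strip() == b.strip()

print(levenshtein("arabica 250g", "arabika 250g"))  # 1
```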

25 pages, 2968 KB  
Article
ECSA: Mitigating Catastrophic Forgetting and Few-Shot Generalization in Medical Visual Question Answering
by Qinhao Jia, Shuxian Liu, Mingliang Chen, Tianyi Li and Jing Yang
Tomography 2025, 11(10), 115; https://doi.org/10.3390/tomography11100115 - 20 Oct 2025
Viewed by 287
Abstract
Objective: Medical Visual Question Answering (Med-VQA), a key technology that integrates computer vision and natural language processing to assist in clinical diagnosis, possesses significant potential for enhancing diagnostic efficiency and accuracy. However, its development is constrained by two major bottlenecks: weak few-shot generalization capability stemming from the scarcity of high-quality annotated data and the problem of catastrophic forgetting when continually learning new knowledge. Existing research has largely addressed these two challenges in isolation, lacking a unified framework. Methods: To bridge this gap, this paper proposes a novel Evolvable Clinical-Semantic Alignment (ECSA) framework, designed to synergistically solve these two challenges within a single architecture. ECSA is built upon powerful pre-trained vision (BiomedCLIP) and language (Flan-T5) models, with two innovative modules at its core. First, we design a Clinical-Semantic Disambiguation Module (CSDM), which employs a novel debiased hard negative mining strategy for contrastive learning. This enables the precise discrimination of “hard negatives” that are visually similar but clinically distinct, thereby significantly enhancing the model’s representation ability in few-shot and long-tail scenarios. Second, we introduce a Prompt-based Knowledge Consolidation Module (PKC), which acts as a rehearsal-free non-parametric knowledge store. It consolidates historical knowledge by dynamically accumulating and retrieving task-specific “soft prompts,” thus effectively circumventing catastrophic forgetting without relying on past data. Results: Extensive experimental results on four public benchmark datasets, VQA-RAD, SLAKE, PathVQA, and VQA-Med-2019, demonstrate ECSA’s state-of-the-art or highly competitive performance. Specifically, ECSA achieves excellent overall accuracies of 80.15% on VQA-RAD and 85.10% on SLAKE, while also showing strong generalization with 64.57% on PathVQA and 82.23% on VQA-Med-2019. More critically, in continual learning scenarios, the framework achieves a low forgetting rate of just 13.50%, showcasing its significant advantages in knowledge retention. Conclusions: These findings validate the framework’s substantial potential for building robust and evolvable clinical decision support systems. Full article
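
The prompt-based consolidation idea described above (accumulate task-specific soft prompts, retrieve them by query similarity, and store no past data) can be sketched as a small prompt pool. How keys are built and matched is an assumption made here for illustration, not taken from the paper.

```python
# Minimal sketch of a rehearsal-free prompt pool with accumulate/retrieve operations.
import torch
import torch.nn.functional as F

class PromptPool:
    def __init__(self):
        self.keys = []      # one representative query embedding per task
        self.prompts = []   # one trainable soft-prompt tensor per task

    def add_task(self, task_key: torch.Tensor, prompt_len: int = 10, dim: int = 768):
        """Register a new task: store its key and a fresh trainable soft prompt."""
        self.keys.append(F.normalize(task_key, dim=-1))
        self.prompts.append(torch.nn.Parameter(torch.zeros(prompt_len, dim)))

    def retrieve(self, query_embedding: torch.Tensor) -> torch.Tensor:
        """Pick the stored soft prompt whose task key best matches the query."""
        q = F.normalize(query_embedding, dim=-1)
        sims = torch.stack([k @ q for k in self.keys])
        return self.prompts[int(sims.argmax())]
```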
