Search Results (115)

Search Parameters:
Keywords = visual transformers (ViTs)

35 pages, 7343 KB  
Article
A Hybrid Deep Learning and Knowledge Graph Approach for Intelligent Image Indexing and Retrieval
by Mohamed Hamroun and Damien Sauveron
Appl. Sci. 2025, 15(19), 10591; https://doi.org/10.3390/app151910591 - 30 Sep 2025
Abstract
Technological advancements have enabled users to digitize and store an unlimited number of multimedia documents, including images and videos. However, the heterogeneous nature of multimedia content poses significant challenges in efficient indexing and retrieval. Traditional approaches primarily focus on visual features, often neglecting the semantic context, which limits retrieval efficiency. This paper proposes a hybrid deep learning and knowledge graph approach for intelligent image indexing and retrieval. By integrating deep learning models such as EfficientNet and Vision Transformer (ViT) with structured knowledge graphs, the proposed framework enhances semantic understanding and retrieval performance. The methodology incorporates feature extraction, concept classification, and hierarchical knowledge graph structuring to facilitate effective multimedia retrieval. Experimental results on benchmark datasets, including TRECVID, Corel, and MSCOCO, demonstrate significant improvements in precision, robustness, and query expansion techniques. The findings highlight the potential of combining deep learning with knowledge graphs to bridge the semantic gap and optimize multimedia indexing and retrieval.
(This article belongs to the Special Issue Application of Deep Learning and Big Data Processing)
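
A minimal sketch of the kind of pipeline this abstract describes: pooled features from a pretrained backbone (e.g., a timm ViT) feed a concept classifier whose predictions are linked into a small knowledge graph that supports retrieval. The checkpoint name, concept list, classifier head, and graph schema below are illustrative assumptions, not the authors' implementation.

```python
# Sketch: deep features -> concept labels -> knowledge-graph indexing (illustrative only).
import timm, torch, networkx as nx

backbone = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=0)  # pooled features
backbone.eval()

def extract_features(batch: torch.Tensor) -> torch.Tensor:
    """batch: (N, 3, 224, 224) normalized images -> (N, D) embeddings."""
    with torch.no_grad():
        return backbone(batch)

# Hypothetical multi-label concept head, assumed to be trained separately on labelled data.
concepts = ["person", "vehicle", "building", "outdoor"]
head = torch.nn.Linear(backbone.num_features, len(concepts))

def index_image(graph: nx.DiGraph, image_id: str, batch: torch.Tensor, thr: float = 0.5):
    probs = torch.sigmoid(head(extract_features(batch)))[0]
    graph.add_node(image_id, kind="image")
    for concept, p in zip(concepts, probs.tolist()):
        if p >= thr:                       # link the image to every concept it likely depicts
            graph.add_node(concept, kind="concept")
            graph.add_edge(image_id, concept, weight=p)

def retrieve(graph: nx.DiGraph, concept: str):
    """Images attached to a concept, ranked by edge weight; query expansion could walk further edges."""
    hits = [(u, d["weight"]) for u, _, d in graph.in_edges(concept, data=True)]
    return sorted(hits, key=lambda x: -x[1])
```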

23 pages, 1668 KB  
Article
Brain Stroke Classification Using CT Scans with Transformer-Based Models and Explainable AI
by Shomukh Qari and Maha A. Thafar
Diagnostics 2025, 15(19), 2486; https://doi.org/10.3390/diagnostics15192486 - 29 Sep 2025
Abstract
Background & Objective: Stroke remains a leading cause of mortality and long-term disability worldwide, demanding rapid and accurate diagnosis to improve patient outcomes. Computed tomography (CT) scans are widely used in emergency settings due to their speed, availability, and cost-effectiveness. This study proposes an artificial intelligence (AI)-based framework for multiclass stroke classification (ischemic, hemorrhagic, and no stroke) using CT scan images from the Ministry of Health of the Republic of Turkey. Methods: We adopted MaxViT, a state-of-the-art Vision Transformer (ViT)-based architecture, as the primary deep learning model for stroke classification. Additional transformer variants, including Vision Transformer (ViT), Transformer-in-Transformer (TNT), and ConvNeXt, were evaluated for comparison. To improve model generalization and handle class imbalance, classical data augmentation techniques were applied. Furthermore, explainable AI (XAI) was integrated using Grad-CAM++ to provide visual insights into model decisions. Results: The MaxViT model with augmentation achieved the highest performance, reaching an accuracy and F1-score of 98.00%, outperforming the baseline Vision Transformer and other evaluated models. Grad-CAM++ visualizations confirmed that the proposed framework effectively identified stroke-related regions, enhancing transparency and clinical trust. Conclusions: This research contributes to the development of a trustworthy AI-assisted diagnostic tool for stroke, facilitating its integration into clinical practice and improving access to timely and optimal stroke diagnosis in emergency departments.
(This article belongs to the Special Issue 3rd Edition: AI/ML-Based Medical Image Processing and Analysis)
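
A sketch of the two main ingredients named above: fine-tuning a timm MaxViT variant for three-class stroke classification and inspecting it with Grad-CAM++ from the pytorch-grad-cam package. The checkpoint name, class weights, and target-layer choice are placeholders, not the paper's configuration.

```python
import timm, torch
from pytorch_grad_cam import GradCAMPlusPlus
from pytorch_grad_cam.utils.model_targets import ClassifierOutputTarget

NUM_CLASSES = 3  # ischemic, hemorrhagic, no stroke
model = timm.create_model("maxvit_tiny_tf_224", pretrained=True, num_classes=NUM_CLASSES)

# Class-weighted cross-entropy as one simple way to counter class imbalance (weights are placeholders).
criterion = torch.nn.CrossEntropyLoss(weight=torch.tensor([1.0, 1.3, 0.8]))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)

def train_step(images, labels):
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()

# Grad-CAM++ over a late spatial stage of the backbone (layer choice is a heuristic, not the paper's).
model.eval()
target_layers = [list(model.children())[-2]]
cam = GradCAMPlusPlus(model=model, target_layers=target_layers)
heatmap = cam(input_tensor=torch.randn(1, 3, 224, 224),
              targets=[ClassifierOutputTarget(0)])  # (1, 224, 224) saliency for class 0
```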

34 pages, 8775 KB  
Review
Towards Fault-Aware Image Captioning: A Review on Integrating Facial Expression Recognition (FER) and Object Detection
by Abdul Saboor Khan, Muhammad Jamshed Abbass and Abdul Haseeb Khan
Sensors 2025, 25(19), 5992; https://doi.org/10.3390/s25195992 - 28 Sep 2025
Abstract
The term “image captioning” refers to the process of converting an image into text using computer vision and natural language processing algorithms. Despite the abundance of visual data, most of it images, available today, and despite recent developments in computer vision such as Vision Transformers (ViT) and language models such as BERT and GPT, image captioning is still considered an open problem. This review provides an overview of the present status of the field, with a specific emphasis on the use of facial expression recognition (FER) and object detection for image captioning, particularly in the context of fault-aware systems and Prognostics and Health Management (PHM) applications within Industry 4.0 environments. To the best of our knowledge, no review has examined the significance of facial expressions for image captioning, especially in industrial settings where operator facial expressions can provide valuable insights for fault detection and system health monitoring; this gap in the existing literature is the primary motivation for this study. We discuss the most important approaches and procedures that have been utilized for this task, including fault-aware methodologies that leverage visual data for PHM in smart manufacturing contexts, and highlight the advantages and disadvantages of each strategy. The review concludes with a comprehensive assessment of the current state of the field and recommends topics for future research toward more detailed and accurate machine-generated captions, particularly for Industry 4.0 applications where visual monitoring plays a crucial role in system diagnostics and maintenance.
(This article belongs to the Special Issue Sensors and IoT Technologies for the Smart Industry)
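
As a concrete baseline for the encoder-decoder captioning pipelines this review surveys, the snippet below runs a public ViT-GPT2 captioner through the transformers image-to-text pipeline; the checkpoint name is one openly available example and the image path is a placeholder, and FER or object-detection cues would be fused on top of such a baseline.

```python
from transformers import pipeline

# Pretrained ViT encoder + GPT-2 decoder captioner (public checkpoint, used here only as an example).
captioner = pipeline("image-to-text", model="nlpconnect/vit-gpt2-image-captioning")

# Any local image path or PIL.Image works; the filename below is a placeholder.
captions = captioner("factory_floor_operator.jpg")
print(captions[0]["generated_text"])
```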

24 pages, 4296 KB  
Article
VST-YOLOv8: A Trustworthy and Secure Defect Detection Framework for Industrial Gaskets
by Lei Liang and Junming Chen
Electronics 2025, 14(19), 3760; https://doi.org/10.3390/electronics14193760 - 23 Sep 2025
Abstract
The surface quality of industrial gaskets directly impacts sealing performance, operational reliability, and market competitiveness. Inadequate or unreliable defect detection in silicone gaskets can lead to frequent maintenance, undetected faults, and security risks in downstream systems. This paper presents VST-YOLOv8, a trustworthy and secure defect detection framework built upon an enhanced YOLOv8 architecture. To address the limitations of C2F feature extraction in the traditional YOLOv8 backbone, we integrate the lightweight Mobile Vision Transformer v2 (ViT v2) to improve global feature representation while maintaining interpretability. For real-time industrial deployment, we incorporate the Gating-Structured Convolution (GSConv) module, which adaptively adjusts convolution kernels to emphasize features of different shapes, ensuring stable detection under varying production conditions. A Slim-neck structure reduces parameter count and computational complexity without sacrificing accuracy, contributing to robustness against performance degradation. Additionally, the Triplet Attention mechanism combines channel, spatial, and fine-grained attention to enhance feature discrimination, improving reliability in challenging visual environments. Experimental results show that VST-YOLOv8 achieves higher accuracy and recall compared to the baseline YOLOv8, while maintaining low latency suitable for edge deployment. When integrated with secure industrial control systems, the proposed framework supports authenticated, tamper-resistant detection pipelines, ensuring both operational efficiency and data integrity in real-world production. These contributions strengthen trust in AI-driven quality inspection, making the system suitable for safety-critical manufacturing processes.
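
The Triplet Attention mechanism mentioned above is well documented in the literature: three branches that each pool, convolve, and gate along a different pair of tensor dimensions, then average. The sketch below follows that published formulation; it is not the authors' exact module, and the 7x7 kernel is the commonly used default.

```python
import torch
import torch.nn as nn

class ZPool(nn.Module):
    """Concatenate max- and mean-pooling along the channel dimension."""
    def forward(self, x):
        return torch.cat([x.max(dim=1, keepdim=True).values,
                          x.mean(dim=1, keepdim=True)], dim=1)

class AttentionGate(nn.Module):
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.pool = ZPool()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)
        self.bn = nn.BatchNorm2d(1)
    def forward(self, x):
        return x * torch.sigmoid(self.bn(self.conv(self.pool(x))))

class TripletAttention(nn.Module):
    """Cross-dimension attention: (C,W), (C,H) and plain spatial branches, averaged."""
    def __init__(self):
        super().__init__()
        self.cw, self.ch, self.hw = AttentionGate(), AttentionGate(), AttentionGate()
    def forward(self, x):                                             # x: (N, C, H, W)
        x_cw = self.cw(x.permute(0, 2, 1, 3)).permute(0, 2, 1, 3)    # H plays the "channel" role
        x_ch = self.ch(x.permute(0, 3, 2, 1)).permute(0, 3, 2, 1)    # W plays the "channel" role
        x_hw = self.hw(x)                                             # standard spatial gate
        return (x_cw + x_ch + x_hw) / 3.0

feats = torch.randn(2, 64, 40, 40)
print(TripletAttention()(feats).shape)   # torch.Size([2, 64, 40, 40])
```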

29 pages, 13141 KB  
Article
Automatic Complexity Analysis of UML Class Diagrams Using Visual Question Answering (VQA) Techniques
by Nimra Shehzadi, Javed Ferzund, Rubia Fatima and Adnan Riaz
Software 2025, 4(4), 22; https://doi.org/10.3390/software4040022 - 23 Sep 2025
Abstract
Context: Modern software systems have become increasingly complex, making it difficult to interpret raw requirements and effectively utilize traditional tools for software design and analysis. Unified Modeling Language (UML) class diagrams are widely used to visualize and understand system architecture, but analyzing them manually, especially for large-scale systems, poses significant challenges. Objectives: This study aims to automate the analysis of UML class diagrams by assessing their complexity using a machine learning approach. The goal is to support software developers in identifying potential design issues early in the development process and to improve overall software quality. Methodology: To achieve this, this research introduces a Visual Question Answering (VQA)-based framework that integrates both computer vision and natural language processing. Vision Transformers (ViTs) are employed to extract global visual features from UML class diagrams, while the BERT language model processes natural language queries. By combining these two models, the system can accurately respond to questions related to software complexity, such as class coupling and inheritance depth. Results: The proposed method demonstrated strong performance in experimental trials. The ViT model achieved an accuracy of 0.8800, with both the F1 score and recall reaching 0.8985. These metrics highlight the effectiveness of the approach in automatically evaluating UML class diagrams. Conclusions: The findings confirm that advanced machine learning techniques can be successfully applied to automate software design analysis. This approach can help developers detect design flaws early and enhance software maintainability. Future work will explore advanced fusion strategies, novel data augmentation techniques, and lightweight model adaptations suitable for environments with limited computational resources.
(This article belongs to the Topic Applications of NLP, AI, and ML in Software Engineering)
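
A minimal sketch of the ViT-plus-BERT fusion described above: pooled diagram and question embeddings are concatenated and passed to a small answer classifier. Checkpoint names and the answer vocabulary size are illustrative; the authors' fusion details are not reproduced here.

```python
import torch
import torch.nn as nn
from transformers import ViTModel, BertModel, BertTokenizer

class UMLComplexityVQA(nn.Module):
    def __init__(self, num_answers: int = 16):
        super().__init__()
        self.vit = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.classifier = nn.Sequential(
            nn.Linear(self.vit.config.hidden_size + self.bert.config.hidden_size, 512),
            nn.ReLU(),
            nn.Linear(512, num_answers),
        )

    def forward(self, pixel_values, input_ids, attention_mask):
        img = self.vit(pixel_values=pixel_values).last_hidden_state[:, 0]   # [CLS] token
        txt = self.bert(input_ids=input_ids, attention_mask=attention_mask).pooler_output
        return self.classifier(torch.cat([img, txt], dim=-1))

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
q = tokenizer("How deep is the inheritance hierarchy?", return_tensors="pt")
model = UMLComplexityVQA()
logits = model(torch.randn(1, 3, 224, 224), q["input_ids"], q["attention_mask"])
```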

26 pages, 8224 KB  
Article
Enhancing Monkeypox Diagnosis with Transformers: Bridging Explainability and Performance with Quantitative Validation
by Delal Şeker and Abdulnasır Yıldız
Diagnostics 2025, 15(18), 2354; https://doi.org/10.3390/diagnostics15182354 - 16 Sep 2025
Abstract
Background/Objectives: Monkeypox is a zoonotic virus that presents with smallpox-like symptoms, making visual diagnosis challenging due to overlap with other dermatological conditions. Existing AI-based studies on monkeypox classification have largely relied on Convolutional Neural Networks (CNNs), with limited exploration of Transformer architectures or robust interpretability frameworks. Moreover, most explainability research still depends on conventional heatmap techniques without systematic evaluation. This study addresses these gaps by applying Transformer-based models and introducing a novel hybrid explainability approach. Methods: We fine-tuned Vision Transformer (ViT) and Data-Efficient Image Transformer (DeiT) models for both binary and multi-class classification of monkeypox and other skin lesions. To improve interpretability, we integrated multiple explainable AI techniques—Gradient-weighted Class Activation Mapping (Grad-CAM), Layer-wise Relevance Propagation (LRP), and Attention Rollout (AR)—and proposed a hybrid method that combines these heatmaps using Principal Component Analysis (PCA). The reliability of explanations was quantitatively assessed using deletion and insertion metrics. Results: ViT achieved superior performance with an AUC of 0.9192 in binary classification and 0.9784 in multi-class tasks, outperforming DeiT. The hybrid approach (Grad-CAM + LRP) produced the most informative explanations, achieving higher insertion scores and lower deletion scores than individual methods, thereby enhancing clinical reliability. Conclusions: This study is among the first to combine Transformer models with systematically evaluated hybrid explainability techniques for monkeypox classification. By improving both predictive performance and interpretability, our framework contributes to more transparent and clinically relevant AI applications in dermatology. Future work should expand datasets and integrate clinical metadata to further improve generalizability.
(This article belongs to the Special Issue Explainable Machine Learning in Clinical Diagnostics)
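
One plausible reading of the PCA-based hybrid explanation described above, sketched below: each pixel is treated as a sample with one value per heatmap (e.g., Grad-CAM and LRP), and the first principal component gives the combined map. The normalization and sign-alignment steps are assumptions, not the authors' exact procedure.

```python
import numpy as np
from sklearn.decomposition import PCA

def fuse_heatmaps(maps: list[np.ndarray]) -> np.ndarray:
    """Combine per-method saliency maps (same H x W) via the first principal component per pixel."""
    h, w = maps[0].shape
    # Min-max normalize each map so no single method dominates the variance (assumption).
    stack = np.stack([(m - m.min()) / (m.max() - m.min() + 1e-8) for m in maps], axis=-1)
    fused = PCA(n_components=1).fit_transform(stack.reshape(-1, len(maps))).reshape(h, w)
    if fused.mean() < 0:          # PCA component sign is arbitrary; flip so high values mean salient
        fused = -fused
    return (fused - fused.min()) / (fused.max() - fused.min() + 1e-8)

gradcam = np.random.rand(224, 224)
lrp = np.random.rand(224, 224)
hybrid = fuse_heatmaps([gradcam, lrp])   # (224, 224), values in [0, 1]
```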

26 pages, 6612 KB  
Article
A Comparative Survey of Vision Transformers for Feature Extraction in Texture Analysis
by Leonardo Scabini, Andre Sacilotti, Kallil M. Zielinski, Lucas C. Ribas, Bernard De Baets and Odemir M. Bruno
J. Imaging 2025, 11(9), 304; https://doi.org/10.3390/jimaging11090304 - 5 Sep 2025
Abstract
Texture, a significant visual attribute in images, plays an important role in many pattern recognition tasks. While Convolutional Neural Networks (CNNs) have been among the most effective methods for texture analysis, alternative architectures such as Vision Transformers (ViTs) have recently demonstrated superior performance on a range of visual recognition problems. However, the suitability of ViTs for texture recognition remains underexplored. In this work, we investigate the capabilities and limitations of ViTs for texture recognition by analyzing 25 different ViT variants as feature extractors and comparing them to CNN-based and hand-engineered approaches. Our evaluation encompasses both accuracy and efficiency, aiming to assess the trade-offs involved in applying ViTs to texture analysis. Our results indicate that ViTs generally outperform CNN-based and hand-engineered models, particularly when using strong pre-training and in-the-wild texture datasets. Notably, BeiTv2-B/16 achieves the highest average accuracy (85.7%), followed by ViT-B/16-DINO (84.1%) and Swin-B (80.8%), outperforming the ResNet50 baseline (75.5%) and the hand-engineered baseline (73.4%). As a lightweight alternative, EfficientFormer-L3 attains a competitive average accuracy of 78.9%. In terms of efficiency, although ViT-B and BeiT(v2) have a higher number of GFLOPs and parameters, they achieve significantly faster feature extraction on GPUs compared to ResNet50. These findings highlight the potential of ViTs as a powerful tool for texture analysis while also pointing to areas for future exploration, such as efficiency improvements and domain-specific adaptations.
(This article belongs to the Special Issue Celebrating the 10th Anniversary of the Journal of Imaging)
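
A sketch of the feature-extraction protocol the survey evaluates: a frozen pretrained backbone from timm produces one embedding per image, and a simple linear classifier is fit on top. The checkpoint name below is one example; exact timm identifiers for BeiTv2, DINO, or Swin variants vary and should be checked against the timm model list.

```python
import timm, torch
from sklearn.linear_model import LogisticRegression

backbone = timm.create_model("vit_base_patch16_224.dino", pretrained=True, num_classes=0)
backbone.eval()

@torch.no_grad()
def embed(images: torch.Tensor) -> torch.Tensor:
    """images: (N, 3, 224, 224), already resized/normalized -> (N, D) pooled features."""
    return backbone(images)

# Frozen features + linear probe, the usual protocol for comparing backbones on texture datasets.
train_feats = embed(torch.randn(32, 3, 224, 224)).numpy()
train_labels = torch.randint(0, 4, (32,)).numpy()
clf = LogisticRegression(max_iter=1000).fit(train_feats, train_labels)

test_feats = embed(torch.randn(8, 3, 224, 224)).numpy()
print(clf.predict(test_feats))
```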

22 pages, 2356 KB  
Article
Category-Aware Two-Stage Divide-and-Ensemble Framework for Sperm Morphology Classification
by Aydın Kağan Turkoglu, Gorkem Serbes, Hakkı Uzun, Abdulsamet Aktas, Merve Huner Yigit and Hamza Osman Ilhan
Diagnostics 2025, 15(17), 2234; https://doi.org/10.3390/diagnostics15172234 - 3 Sep 2025
Abstract
Introduction: Sperm morphology is a fundamental parameter in the evaluation of male infertility, offering critical insights into reproductive health. However, traditional manual assessments under microscopy are limited by operator dependency and subjective interpretation caused by biological variation. To overcome these limitations, there is a need for accurate and fully automated classification systems. Objectives: This study aims to develop a two-stage, fully automated sperm morphology classification framework that can accurately identify a wide spectrum of abnormalities. The framework is designed to reduce subjectivity, minimize misclassification between visually similar categories, and provide more reliable diagnostic support in reproductive healthcare. Methods: A novel two-stage deep learning-based framework is proposed utilizing images from three staining-specific versions of a comprehensive 18-class dataset. In the first stage, sperm images are categorized into two principal groups: (1) head and neck region abnormalities, and (2) normal morphology together with tail-related abnormalities. In the second stage, a customized ensemble model—integrating four distinct deep learning architectures, including DeepMind’s NFNet-F4 and vision transformer (ViT) variants—is employed for detailed abnormality classification. Unlike conventional majority voting, a structured multi-stage voting strategy is introduced to enhance decision reliability. Results: The proposed framework consistently outperforms single-model baselines, achieving accuracies of 69.43%, 71.34%, and 68.41% across the three staining protocols. These results correspond to a statistically significant 4.38% improvement over prior approaches in the literature. Moreover, the two-stage system substantially reduces misclassification among visually similar categories, demonstrating enhanced ability to detect subtle morphological variations. Conclusions: The proposed two-stage, ensemble-based framework provides a robust and accurate solution for automated sperm morphology classification. By combining hierarchical classification with structured decision fusion, the method advances beyond traditional and single-model approaches, offering a reliable and scalable tool for clinical decision-making in male fertility assessment.
(This article belongs to the Section Machine Learning and Artificial Intelligence in Diagnostics)
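
A schematic of the two-stage decision flow described above: a first-stage model routes an image to one of two groups, and a group-specific ensemble then averages softmax outputs over its members before the final argmax. The models below are stand-ins, and the paper's structured multi-stage voting rules are not reproduced.

```python
import torch
import torch.nn.functional as F

def soft_vote(models, image):
    """Average softmax probabilities over an ensemble (soft voting)."""
    probs = [F.softmax(m(image), dim=-1) for m in models]
    return torch.stack(probs).mean(dim=0)

def classify_two_stage(stage1, group_a_models, group_b_models, image):
    # Stage 1: route to "head/neck abnormalities" (0) or "normal + tail abnormalities" (1).
    route = soft_vote([stage1], image).argmax(dim=-1).item()
    ensemble = group_a_models if route == 0 else group_b_models
    # Stage 2: fine-grained class within the routed group.
    return route, soft_vote(ensemble, image).argmax(dim=-1).item()

# Stand-in models: any callables mapping (1, 3, H, W) -> logits would do.
stage1 = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 64 * 64, 2))
group_a = [torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 64 * 64, 9)) for _ in range(4)]
group_b = [torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 64 * 64, 9)) for _ in range(4)]
print(classify_two_stage(stage1, group_a, group_b, torch.randn(1, 3, 64, 64)))
```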

23 pages, 34310 KB  
Article
One-to-Many Retrieval Between UAV Images and Satellite Images for UAV Self-Localization in Real-World Scenarios
by Jiaqi Li, Yuli Sun, Yaobing Xiang and Lin Lei
Remote Sens. 2025, 17(17), 3045; https://doi.org/10.3390/rs17173045 - 1 Sep 2025
Abstract
Matching drone images to satellite reference images is a critical step for achieving UAV self-localization. Existing drone visual localization datasets mainly focus on target localization, where each drone image is paired with a corresponding satellite image slice, typically with identical coverage. However, this one-to-one approach does not reflect real-world UAV self-localization needs as it cannot guarantee exact matches between drone images and satellite tiles nor reliably identify the correct satellite slice. To bridge this gap, we propose a one-to-many matching method between drone images and satellite reference tiles. First, we enhance the UAV-VisLoc dataset, making it the first in the field tailored for one-to-many imperfect matching in UAV self-localization. Second, we introduce a novel loss function, Incomp-NPair Loss, which better reflects real-world imperfect matching scenarios than traditional methods. Finally, to address challenges such as limited dataset size, training instability, and large-scale differences between drone images and satellite tiles, we adopt a Vision Transformer (ViT) baseline and integrate CNN-extracted features into its patch embedding layer.
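
For context on the retrieval objective, below is the standard symmetric InfoNCE / N-pair-style loss that cross-view matching methods typically start from; the paper's Incomp-NPair Loss additionally models drone images whose true satellite tile may be missing or only partially overlapping, and is not reproduced here.

```python
import torch
import torch.nn.functional as F

def symmetric_infonce(drone_emb, sat_emb, temperature: float = 0.07):
    """drone_emb, sat_emb: (B, D) embeddings where row i of each view forms a positive pair."""
    drone = F.normalize(drone_emb, dim=-1)
    sat = F.normalize(sat_emb, dim=-1)
    logits = drone @ sat.t() / temperature          # (B, B) cosine-similarity logits
    targets = torch.arange(drone.size(0), device=drone.device)
    # Every other tile in the batch acts as a negative for the paired drone image, and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

loss = symmetric_infonce(torch.randn(16, 256), torch.randn(16, 256))
```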

36 pages, 10790 KB  
Article
Analysis of Modern Landscape Architecture Evolution Using Image-Based Computational Methods
by Junlei Zhang and Chi Gao
Mathematics 2025, 13(17), 2806; https://doi.org/10.3390/math13172806 - 1 Sep 2025
Abstract
We present a novel deep learning framework for high-resolution semantic segmentation, designed to interpret complex visual environments such as cities, rural areas, and natural landscapes. Our method integrates conic geometric embeddings, a mathematical approach for capturing spatial relationships, with belief-aware learning, a strategy that adapts model predictions when input data are uncertain or change over time. A multi-scale refinement process further improves boundary accuracy and detail preservation. The proposed model, built on a hybrid Vision Transformer (ViT) backbone and trained end-to-end using adaptive optimization, is evaluated on four benchmark datasets including EDEN, OpenEarthMap, Cityscapes, and iSAID. It achieves 88.94% Accuracy and an R2 of 0.859 on EDEN, while surpassing 85.3% Accuracy on Cityscapes. Ablation studies demonstrate that removing Conic Output Embeddings causes drops in Accuracy of up to 2.77% and increases in RMSE, emphasizing their contribution to frequency-aware generalization across diverse conditions. On OpenEarthMap, our model achieves a mean IoU of 73.21%, outperforming previous baselines by 2.9%, and on iSAID, it reaches 80.75% mIoU with improved boundary adherence. Beyond technical performance, the framework enables practical applications such as automated landscape analysis, urban growth monitoring, and sustainable environmental planning. Its consistent results across three independent runs demonstrate both robustness and reproducibility, offering a reliable tool for large-scale geospatial and environmental modeling.
(This article belongs to the Section E1: Mathematics and Computer Science)
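
Since the results above are reported as mean IoU and pixel accuracy, the following sketch shows how those segmentation metrics are typically accumulated from a confusion matrix; it is a generic evaluation utility, not the paper's code.

```python
import numpy as np

def confusion_matrix(pred: np.ndarray, target: np.ndarray, num_classes: int) -> np.ndarray:
    """pred, target: integer label maps of the same shape."""
    mask = (target >= 0) & (target < num_classes)
    idx = num_classes * target[mask].astype(int) + pred[mask].astype(int)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def miou_and_accuracy(conf: np.ndarray):
    tp = np.diag(conf).astype(float)
    union = conf.sum(0) + conf.sum(1) - tp
    iou = tp / np.maximum(union, 1)        # per-class IoU; classes absent from both maps score 0
    accuracy = tp.sum() / conf.sum()
    return iou.mean(), accuracy

pred = np.random.randint(0, 5, (512, 512))
target = np.random.randint(0, 5, (512, 512))
print(miou_and_accuracy(confusion_matrix(pred, target, num_classes=5)))
```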

16 pages, 1458 KB  
Article
Deep Ensemble Learning for Multiclass Skin Lesion Classification
by Tsu-Man Chiu, I-Chun Chi, Yun-Chang Li and Ming-Hseng Tseng
Bioengineering 2025, 12(9), 934; https://doi.org/10.3390/bioengineering12090934 - 29 Aug 2025
Abstract
The skin, the largest organ of the body, acts as a protective shield against external stimuli. Skin lesions, which can be the result of inflammation, infection, tumors, or autoimmune conditions, can appear as rashes, spots, lumps, or scales, or remain asymptomatic until they become severe. Conventional diagnostic approaches such as visual inspection and palpation often lack accuracy. Artificial intelligence (AI) improves diagnostic precision by analyzing large volumes of skin images to detect subtle patterns that clinicians may not recognize. This study presents a multiclass skin lesion diagnostic model developed using the CSMUH dataset, which focuses on the Eastern population. The dataset was categorized into seven disease classes for model training. A total of 25 pre-trained models, including convolutional neural networks (CNNs) and vision transformers (ViTs), were fine-tuned. The top three models were combined into an ensemble using the hard and soft voting methods. To ensure reliability, the model was tested through five randomized experiments and validated using the holdout technique. The proposed ensemble model, Swin-ViT-EfficientNetB4, achieved the highest test accuracy of 98.5%, demonstrating strong potential for accurate and early skin lesion diagnosis.
(This article belongs to the Special Issue Mathematical Models for Medical Diagnosis and Testing)
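
The hard and soft voting schemes named above are illustrated below for three classifiers over a seven-class problem: hard voting takes the majority of per-model argmax labels, while soft voting averages softmax probabilities first. The models are stand-ins, not the fine-tuned Swin, ViT, and EfficientNetB4 backbones.

```python
import torch
import torch.nn.functional as F

def hard_vote(models, images):
    """Majority vote over each model's argmax label (torch.mode picks one label when ties occur)."""
    votes = torch.stack([m(images).argmax(dim=-1) for m in models])      # (num_models, batch)
    return votes.mode(dim=0).values

def soft_vote(models, images):
    """Argmax of the averaged softmax probabilities."""
    probs = torch.stack([F.softmax(m(images), dim=-1) for m in models]).mean(dim=0)
    return probs.argmax(dim=-1)

# Stand-ins for the three fine-tuned backbones.
models = [torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 7)) for _ in range(3)]
x = torch.randn(4, 3, 32, 32)
print(hard_vote(models, x), soft_vote(models, x))
```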

13 pages, 2141 KB  
Article
Transformer-Based Semantic Segmentation of Japanese Knotweed in High-Resolution UAV Imagery Using Twins-SVT
by Sruthi Keerthi Valicharla, Roghaiyeh Karimzadeh, Xin Li and Yong-Lak Park
Information 2025, 16(9), 741; https://doi.org/10.3390/info16090741 - 28 Aug 2025
Abstract
Japanese knotweed (Fallopia japonica) is a noxious invasive plant species that requires scalable and precise monitoring methods. Current visually based ground surveys are resource-intensive and inefficient for detecting Japanese knotweed in landscapes. This study presents a transformer-based semantic segmentation framework for the automated detection of Japanese knotweed patches using high-resolution RGB imagery acquired with unmanned aerial vehicles (UAVs). We used the Twins Spatially Separable Vision Transformer (Twins-SVT), which utilizes a hierarchical architecture with spatially separable self-attention to effectively model long-range dependencies and multiscale contextual features. The model was trained on 6945 annotated aerial images collected in three sites infested with Japanese knotweed in West Virginia, USA. The results of this study showed that the proposed framework achieved superior performance compared to other transformer-based baselines. The Twins-SVT model achieved a mean Intersection over Union (mIoU) of 94.94% and an Average Accuracy (AAcc) of 97.50%, outperforming SegFormer, Swin-T, and ViT. These findings highlight the model’s ability to accurately distinguish Japanese knotweed patches from surrounding vegetation. The method and protocol presented in this research provide a robust, scalable solution for mapping Japanese knotweed through aerial imagery and highlight the successful use of advanced vision transformers in ecological and geospatial information analysis.
(This article belongs to the Special Issue Machine Learning and Artificial Intelligence with Applications)
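
High-resolution UAV imagery rarely fits a transformer's input size in one pass, so segmentation models like the one above are usually applied tile by tile and the per-tile predictions stitched back together. The sketch below shows that generic pattern with overlapping tiles and probability averaging; tile size and overlap are arbitrary choices, and `model` stands in for any network returning per-pixel class logits.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def tiled_segmentation(model, image: torch.Tensor, tile: int = 512, stride: int = 384,
                       num_classes: int = 2) -> torch.Tensor:
    """image: (3, H, W). Returns an (H, W) label map by averaging overlapping tile probabilities."""
    _, H, W = image.shape
    probs = torch.zeros(num_classes, H, W)
    counts = torch.zeros(1, H, W)
    ys = sorted({*range(0, max(H - tile, 0) + 1, stride), max(H - tile, 0)})
    xs = sorted({*range(0, max(W - tile, 0) + 1, stride), max(W - tile, 0)})
    for y in ys:
        for x in xs:
            patch = image[:, y:y + tile, x:x + tile].unsqueeze(0)
            p = F.softmax(model(patch), dim=1)[0]                 # (C, h, w) tile probabilities
            probs[:, y:y + p.shape[1], x:x + p.shape[2]] += p
            counts[:, y:y + p.shape[1], x:x + p.shape[2]] += 1
    return (probs / counts.clamp(min=1)).argmax(dim=0)

# Dummy callable standing in for a Twins-SVT segmentation network.
dummy_model = lambda x: torch.randn(x.shape[0], 2, x.shape[2], x.shape[3])
label_map = tiled_segmentation(dummy_model, torch.randn(3, 1500, 2200))
```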

31 pages, 3129 KB  
Review
A Review on Gas Pipeline Leak Detection: Acoustic-Based, OGI-Based, and Multimodal Fusion Methods
by Yankun Gong, Chao Bao, Zhengxi He, Yifan Jian, Xiaoye Wang, Haineng Huang and Xintai Song
Information 2025, 16(9), 731; https://doi.org/10.3390/info16090731 - 25 Aug 2025
Abstract
Pipelines play a vital role in material transportation within industrial settings. This review synthesizes detection technologies for early-stage small gas leaks from pipelines in the industrial sector, with a focus on acoustic-based methods, optical gas imaging (OGI), and multimodal fusion approaches. It encompasses detection principles, inherent challenges, mitigation strategies, and the state of the art (SOTA). Small leaks refer to low flow leakage originating from defects with apertures at millimeter or submillimeter scales, posing significant detection difficulties. Acoustic detection leverages the acoustic wave signals generated by gas leaks for non-contact monitoring, offering advantages such as rapid response and broad coverage. However, its susceptibility to environmental noise interference often triggers false alarms. This limitation can be mitigated through time-frequency analysis, multi-sensor fusion, and deep-learning algorithms—effectively enhancing leak signals, suppressing background noise, and thereby improving the system’s detection robustness and accuracy. OGI utilizes infrared imaging technology to visualize leakage gas and is applicable to the detection of various polar gases. Its primary limitations include low image resolution, low contrast, and interference from complex backgrounds. Mitigation techniques involve background subtraction, optical flow estimation, fully convolutional neural networks (FCNNs), and vision transformers (ViTs), which enhance image contrast and extract multi-scale features to boost detection precision. Multimodal fusion technology integrates data from diverse sensors, such as acoustic and optical devices. Key challenges lie in achieving spatiotemporal synchronization across multiple sensors and effectively fusing heterogeneous data streams. Current methodologies primarily utilize decision-level fusion and feature-level fusion techniques. Decision-level fusion offers high flexibility and ease of implementation but lacks inter-feature interaction; it is less effective than feature-level fusion when correlations exist between heterogeneous features. Feature-level fusion amalgamates data from different modalities during the feature extraction phase, generating a unified cross-modal representation that effectively resolves inter-modal heterogeneity. In conclusion, we posit that multimodal fusion holds significant potential for further enhancing detection accuracy beyond the capabilities of existing single-modality technologies and is poised to become a major focus of future research in this domain.
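
To make the fusion distinction above concrete, the sketch below contrasts decision-level fusion (average each modality's class probabilities) with feature-level fusion (concatenate embeddings before a shared classifier). The acoustic and imaging encoders are placeholder networks, not any specific system from the review.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

acoustic_encoder = nn.Sequential(nn.Linear(1024, 128), nn.ReLU())                   # acoustic features -> 128-d
ogi_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 128), nn.ReLU())   # infrared frame -> 128-d

# Decision-level fusion: each modality keeps its own classifier; probabilities are averaged.
acoustic_head, ogi_head = nn.Linear(128, 2), nn.Linear(128, 2)
def decision_level(acoustic_x, ogi_x):
    p_a = F.softmax(acoustic_head(acoustic_encoder(acoustic_x)), dim=-1)
    p_o = F.softmax(ogi_head(ogi_encoder(ogi_x)), dim=-1)
    return (p_a + p_o) / 2

# Feature-level fusion: one classifier sees the concatenated cross-modal representation.
fusion_head = nn.Linear(128 + 128, 2)
def feature_level(acoustic_x, ogi_x):
    fused = torch.cat([acoustic_encoder(acoustic_x), ogi_encoder(ogi_x)], dim=-1)
    return F.softmax(fusion_head(fused), dim=-1)

a, o = torch.randn(4, 1024), torch.randn(4, 3, 64, 64)
print(decision_level(a, o).shape, feature_level(a, o).shape)   # both (4, 2): leak vs. no leak
```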

26 pages, 30652 KB  
Article
Hybrid ViT-RetinaNet with Explainable Ensemble Learning for Fine-Grained Vehicle Damage Classification
by Ananya Saha, Mahir Afser Pavel, Md Fahim Shahoriar Titu, Afifa Zain Apurba and Riasat Khan
Vehicles 2025, 7(3), 89; https://doi.org/10.3390/vehicles7030089 - 25 Aug 2025
Abstract
Efficient and explainable vehicle damage inspection is essential due to the increasing complexity and volume of vehicular incidents. Traditional manual inspection approaches are not time-effective, prone to human error, and lead to inefficiencies in insurance claims and repair workflows. Existing deep learning methods, such as CNNs, often struggle with generalization, require large annotated datasets, and lack interpretability. This study presents a robust and interpretable deep learning framework for vehicle damage classification, integrating Vision Transformers (ViTs) and ensemble detection strategies. The proposed architecture employs a RetinaNet backbone with a ViT-enhanced detection head, implemented in PyTorch using the Detectron2 object detection technique. It is pretrained on COCO weights and fine-tuned through focal loss and aggressive augmentation techniques to improve generalization under real-world damage variability. The proposed system applies the Weighted Box Fusion (WBF) ensemble strategy to refine detection outputs from multiple models, offering improved spatial precision. To ensure interpretability and transparency, we adopt numerous explainability techniques—Grad-CAM, Grad-CAM++, and SHAP—offering semantic and visual insights into model decisions. A custom vehicle damage dataset with 4500 images has been built, consisting of approximately 60% curated images collected through targeted web scraping and crawling covering various damage types (such as bumper dents, panel scratches, and frontal impacts), along with 40% COCO dataset images to support model generalization. Comparative evaluations show that Hybrid ViT-RetinaNet achieves superior performance with an F1-score of 84.6%, mAP of 87.2%, and 22 FPS inference speed. In an ablation analysis, WBF, augmentation, transfer learning, and focal loss significantly improve performance, with focal loss increasing F1 by 6.3% for underrepresented classes and COCO pretraining boosting mAP by 8.7%. Additional architectural comparisons demonstrate that our full hybrid configuration not only maintains competitive accuracy but also achieves up to 150 FPS, making it well suited for real-time use cases. Robustness tests under challenging conditions, including real-world visual disturbances (smoke, fire, motion blur, varying lighting, and occlusions) and artificial noise (Gaussian; salt-and-pepper), confirm the model’s generalization ability. This work contributes a scalable, explainable, and high-performance solution for real-world vehicle damage diagnostics.
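
A sketch of the Weighted Box Fusion step described above, using the open-source ensemble-boxes package (the usual reference implementation); box coordinates must be normalized to [0, 1], and the model weights, IoU threshold, and class labels shown are illustrative.

```python
# pip install ensemble-boxes  (assumed dependency)
from ensemble_boxes import weighted_boxes_fusion

# Detections from two detectors on the same image, boxes as [x1, y1, x2, y2] normalized to [0, 1].
boxes_list = [
    [[0.10, 0.20, 0.45, 0.60], [0.50, 0.50, 0.90, 0.95]],   # model A (e.g., baseline RetinaNet head)
    [[0.12, 0.22, 0.47, 0.62], [0.52, 0.48, 0.88, 0.93]],   # model B (e.g., ViT-enhanced head)
]
scores_list = [[0.90, 0.75], [0.85, 0.80]]
labels_list = [[0, 1], [0, 1]]                               # e.g., 0 = bumper dent, 1 = panel scratch

boxes, scores, labels = weighted_boxes_fusion(
    boxes_list, scores_list, labels_list,
    weights=[2, 1],      # trust model A slightly more (illustrative)
    iou_thr=0.55,        # boxes overlapping above this IoU are merged
    skip_box_thr=0.10,   # drop very low-confidence boxes before fusion
)
print(boxes, scores, labels)
```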

15 pages, 2220 KB  
Article
Reproducing the Few-Shot Learning Capabilities of the Visual Ventral Pathway Using Vision Transformers and Neural Fields
by Jiayi Su, Lifeng Xing, Tao Li, Nan Xiang, Jiacheng Shi and Dequan Jin
Brain Sci. 2025, 15(8), 882; https://doi.org/10.3390/brainsci15080882 - 19 Aug 2025
Abstract
Background: Studies have shown that humans can rapidly learn the shape of new objects or adjust their behavior when encountering novel situations. Research on visual cognition in the brain further indicates that the ventral visual pathway plays a critical role in core object recognition. While existing studies often focus on microscopic simulations of individual neural structures, few adopt a holistic, system-level perspective, making it difficult to achieve robust few-shot learning capabilities. Method: Inspired by the mechanisms and processes of the ventral visual stream, this paper proposes a computational model with a macroscopic neural architecture for few-shot learning. We reproduce the feature extraction functions of V1 and V2 using a well-trained Vision Transformer (ViT) and model the neuronal activity in V4 and IT using two neural fields. By connecting these neurons based on Hebbian learning rules, the proposed model stores the feature and category information of the input samples during support training. Results: By employing a scale adaptation strategy, the proposed model emulates visual neural mechanisms, enables efficient learning, and outperforms state-of-the-art few-shot learning algorithms in comparative experiments on real-world image datasets, demonstrating human-like learning capabilities. Conclusion: Experimental results demonstrate that our ventral-stream-inspired machine-learning model achieves effective few-shot learning on real-world datasets.
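
A schematic of the Hebbian association idea described above: ViT features of support examples are bound to class units by an additive Hebbian update, and query images are classified by the strongest association. The neural-field dynamics of V4/IT are not modeled here; this is only the storage and readout step, with a frozen timm ViT standing in for V1/V2.

```python
import timm, torch
import torch.nn.functional as F

backbone = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=0)
backbone.eval()

class HebbianReadout:
    def __init__(self, num_classes: int, dim: int):
        self.W = torch.zeros(num_classes, dim)         # association weights (class x feature)

    @torch.no_grad()
    def store(self, images: torch.Tensor, labels: torch.Tensor, lr: float = 1.0):
        """Hebbian update: strengthen the connection between each feature and its class unit."""
        feats = F.normalize(backbone(images), dim=-1)
        for f, y in zip(feats, labels):
            self.W[y] += lr * f                        # outer-product rule collapsed per class

    @torch.no_grad()
    def predict(self, images: torch.Tensor) -> torch.Tensor:
        feats = F.normalize(backbone(images), dim=-1)
        return (feats @ self.W.t()).argmax(dim=-1)     # most strongly associated class wins

# 5-way 1-shot episode with random tensors as stand-ins for support and query images.
readout = HebbianReadout(num_classes=5, dim=backbone.num_features)
readout.store(torch.randn(5, 3, 224, 224), torch.arange(5))
print(readout.predict(torch.randn(3, 3, 224, 224)))
```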
