Search Results (1,178)

Search Parameters:
Keywords = ViTs

13 pages, 2141 KB  
Article
Transformer-Based Semantic Segmentation of Japanese Knotweed in High-Resolution UAV Imagery Using Twins-SVT
by Sruthi Keerthi Valicharla, Roghaiyeh Karimzadeh, Xin Li and Yong-Lak Park
Information 2025, 16(9), 741; https://doi.org/10.3390/info16090741 - 28 Aug 2025
Abstract
Japanese knotweed (Fallopia japonica) is a noxious invasive plant species that requires scalable and precise monitoring methods. Current visually based ground surveys are resource-intensive and inefficient for detecting Japanese knotweed in landscapes. This study presents a transformer-based semantic segmentation framework for the automated detection of Japanese knotweed patches using high-resolution RGB imagery acquired with unmanned aerial vehicles (UAVs). We used the Twins Spatially Separable Vision Transformer (Twins-SVT), which utilizes a hierarchical architecture with spatially separable self-attention to effectively model long-range dependencies and multiscale contextual features. The model was trained on 6945 annotated aerial images collected in three sites infested with Japanese knotweed in West Virginia, USA. The results of this study showed that the proposed framework achieved superior performance compared to other transformer-based baselines. The Twins-SVT model achieved a mean Intersection over Union (mIoU) of 94.94% and an Average Accuracy (AAcc) of 97.50%, outperforming SegFormer, Swin-T, and ViT. These findings highlight the model’s ability to accurately distinguish Japanese knotweed patches from surrounding vegetation. The method and protocol presented in this research provide a robust, scalable solution for mapping Japanese knotweed through aerial imagery and highlight the successful use of advanced vision transformers in ecological and geospatial information analysis. Full article
(This article belongs to the Special Issue Machine Learning and Artificial Intelligence with Applications)
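
For readers checking the reported numbers, mIoU and accuracy can be computed from a pixel-level confusion matrix. A minimal NumPy sketch, assuming binary knotweed/background label masks and taking "AAcc" as either overall or mean per-class pixel accuracy (the abstract does not define it):

import numpy as np

def seg_metrics(pred, target, num_classes=2):
    # Confusion matrix via bincount over flattened (target, pred) label pairs.
    idx = target.ravel() * num_classes + pred.ravel()
    cm = np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)
    tp = np.diag(cm).astype(float)
    iou = tp / (cm.sum(1) + cm.sum(0) - tp + 1e-9)      # per-class IoU
    overall_acc = tp.sum() / cm.sum()                    # overall pixel accuracy
    mean_class_acc = (tp / (cm.sum(1) + 1e-9)).mean()    # mean per-class accuracy
    return iou.mean(), overall_acc, mean_class_acc

# toy masks; in practice these come from the segmentation model and the annotations
pred = np.random.randint(0, 2, (512, 512))
gt = np.random.randint(0, 2, (512, 512))
print(seg_metrics(pred, gt))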

29 pages, 11689 KB  
Article
Enhanced Breast Cancer Diagnosis Using Multimodal Feature Fusion with Radiomics and Transfer Learning
by Nazmul Ahasan Maruf, Abdullah Basuhail and Muhammad Umair Ramzan
Diagnostics 2025, 15(17), 2170; https://doi.org/10.3390/diagnostics15172170 - 28 Aug 2025
Abstract
Background: Breast cancer remains a critical public health problem worldwide and is a leading cause of cancer-related mortality. Optimizing clinical outcomes is contingent upon the early and precise detection of malignancies. Advances in medical imaging and artificial intelligence (AI), particularly in the fields of radiomics and deep learning (DL), have contributed to improvements in early detection methodologies. Nonetheless, persistent challenges, including limited data availability, model overfitting, and restricted generalization, continue to hinder performance. Methods: This study aims to overcome existing challenges by improving model accuracy and robustness through enhanced data augmentation and the integration of radiomics and deep learning features from the CBIS-DDSM dataset. To mitigate overfitting and improve model generalization, data augmentation techniques were applied. The PyRadiomics library was used to extract radiomics features, while transfer learning models were employed to derive deep learning features from the augmented training dataset. For radiomics feature selection, we compared multiple supervised feature selection methods, including RFE with random forest and logistic regression, ANOVA F-test, LASSO, and mutual information. Embedded methods with XGBoost, LightGBM, and CatBoost for GPUs were also explored. Finally, we integrated radiomics and deep features to build a unified multimodal feature space for improved classification performance. Based on this integrated set of radiomics and deep learning features, 13 pre-trained transfer learning models were trained and evaluated, including various versions of ResNet (50, 50V2, 101, 101V2, 152, 152V2), DenseNet (121, 169, 201), InceptionV3, MobileNet, and VGG (16, 19). Results: Among the evaluated models, ResNet152 achieved the highest classification accuracy of 97%, demonstrating the potential of this approach to enhance diagnostic precision. Other models, including VGG19, ResNet101V2, and ResNet101, achieved 96% accuracy, emphasizing the importance of the selected feature set in achieving robust detection. Conclusions: Future research could build on this work by incorporating Vision Transformer (ViT) architectures and leveraging multimodal data (e.g., clinical data, genomic information, and patient history). This could improve predictive performance and make the model more robust and adaptable to diverse data types. Ultimately, this approach has the potential to transform breast cancer detection, making it more accurate and interpretable. Full article
(This article belongs to the Section Machine Learning and Artificial Intelligence in Diagnostics)
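
A minimal sketch of the multimodal feature-fusion idea described above: radiomics features reduced with RFE plus a random forest, then concatenated with deep embeddings into one feature space. The arrays are random placeholders, not the CBIS-DDSM pipeline, and the dimensions are illustrative assumptions:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_radiomics = rng.normal(size=(200, 100))   # stand-in for PyRadiomics output
X_deep = rng.normal(size=(200, 2048))       # stand-in for transfer-learning embeddings
y = rng.integers(0, 2, size=200)            # benign vs. malignant labels

# Select a compact radiomics subset, then fuse with the deep features.
selector = RFE(RandomForestClassifier(n_estimators=200, random_state=0),
               n_features_to_select=30, step=10)
X_rad_sel = selector.fit_transform(X_radiomics, y)
X_fused = np.hstack([X_rad_sel, X_deep])    # unified multimodal feature space

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000))
print(cross_val_score(clf, X_fused, y, cv=5).mean())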

31 pages, 3129 KB  
Review
A Review on Gas Pipeline Leak Detection: Acoustic-Based, OGI-Based, and Multimodal Fusion Methods
by Yankun Gong, Chao Bao, Zhengxi He, Yifan Jian, Xiaoye Wang, Haineng Huang and Xintai Song
Information 2025, 16(9), 731; https://doi.org/10.3390/info16090731 - 25 Aug 2025
Viewed by 202
Abstract
Pipelines play a vital role in material transportation within industrial settings. This review synthesizes detection technologies for early-stage small gas leaks from pipelines in the industrial sector, with a focus on acoustic-based methods, optical gas imaging (OGI), and multimodal fusion approaches. It encompasses detection principles, inherent challenges, mitigation strategies, and the state of the art (SOTA). Small leaks refer to low flow leakage originating from defects with apertures at millimeter or submillimeter scales, posing significant detection difficulties. Acoustic detection leverages the acoustic wave signals generated by gas leaks for non-contact monitoring, offering advantages such as rapid response and broad coverage. However, its susceptibility to environmental noise interference often triggers false alarms. This limitation can be mitigated through time-frequency analysis, multi-sensor fusion, and deep-learning algorithms—effectively enhancing leak signals, suppressing background noise, and thereby improving the system’s detection robustness and accuracy. OGI utilizes infrared imaging technology to visualize leakage gas and is applicable to the detection of various polar gases. Its primary limitations include low image resolution, low contrast, and interference from complex backgrounds. Mitigation techniques involve background subtraction, optical flow estimation, fully convolutional neural networks (FCNNs), and vision transformers (ViTs), which enhance image contrast and extract multi-scale features to boost detection precision. Multimodal fusion technology integrates data from diverse sensors, such as acoustic and optical devices. Key challenges lie in achieving spatiotemporal synchronization across multiple sensors and effectively fusing heterogeneous data streams. Current methodologies primarily utilize decision-level fusion and feature-level fusion techniques. Decision-level fusion offers high flexibility and ease of implementation but lacks inter-feature interaction; it is less effective than feature-level fusion when correlations exist between heterogeneous features. Feature-level fusion amalgamates data from different modalities during the feature extraction phase, generating a unified cross-modal representation that effectively resolves inter-modal heterogeneity. In conclusion, we posit that multimodal fusion holds significant potential for further enhancing detection accuracy beyond the capabilities of existing single-modality technologies and is poised to become a major focus of future research in this domain. Full article
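
An illustrative PyTorch contrast of the two fusion strategies discussed in the review, with placeholder feature dimensions that are not drawn from any surveyed system:

import torch
import torch.nn as nn

class FeatureLevelFusion(nn.Module):
    """Concatenate acoustic and OGI feature vectors, then classify leak / no-leak."""
    def __init__(self, d_acoustic=128, d_ogi=256, n_classes=2):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(d_acoustic + d_ogi, 128), nn.ReLU(),
            nn.Linear(128, n_classes))

    def forward(self, f_acoustic, f_ogi):
        return self.head(torch.cat([f_acoustic, f_ogi], dim=-1))

def decision_level_fusion(p_acoustic, p_ogi, w=0.5):
    # Weighted average of per-modality class probabilities; no feature interaction.
    return w * p_acoustic + (1.0 - w) * p_ogi

logits = FeatureLevelFusion()(torch.randn(4, 128), torch.randn(4, 256))
print(logits.shape)  # torch.Size([4, 2])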

26 pages, 30652 KB  
Article
Hybrid ViT-RetinaNet with Explainable Ensemble Learning for Fine-Grained Vehicle Damage Classification
by Ananya Saha, Mahir Afser Pavel, Md Fahim Shahoriar Titu, Afifa Zain Apurba and Riasat Khan
Vehicles 2025, 7(3), 89; https://doi.org/10.3390/vehicles7030089 - 25 Aug 2025
Viewed by 170
Abstract
Efficient and explainable vehicle damage inspection is essential due to the increasing complexity and volume of vehicular incidents. Traditional manual inspection approaches are not time-effective, prone to human error, and lead to inefficiencies in insurance claims and repair workflows. Existing deep learning methods, such as CNNs, often struggle with generalization, require large annotated datasets, and lack interpretability. This study presents a robust and interpretable deep learning framework for vehicle damage classification, integrating Vision Transformers (ViTs) and ensemble detection strategies. The proposed architecture employs a RetinaNet backbone with a ViT-enhanced detection head, implemented in PyTorch using the Detectron2 object detection technique. It is pretrained on COCO weights and fine-tuned through focal loss and aggressive augmentation techniques to improve generalization under real-world damage variability. The proposed system applies the Weighted Box Fusion (WBF) ensemble strategy to refine detection outputs from multiple models, offering improved spatial precision. To ensure interpretability and transparency, we adopt numerous explainability techniques—Grad-CAM, Grad-CAM++, and SHAP—offering semantic and visual insights into model decisions. A custom vehicle damage dataset with 4500 images has been built, consisting of approximately 60% curated images collected through targeted web scraping and crawling covering various damage types (such as bumper dents, panel scratches, and frontal impacts), along with 40% COCO dataset images to support model generalization. Comparative evaluations show that Hybrid ViT-RetinaNet achieves superior performance with an F1-score of 84.6%, mAP of 87.2%, and 22 FPS inference speed. In an ablation analysis, WBF, augmentation, transfer learning, and focal loss significantly improve performance, with focal loss increasing F1 by 6.3% for underrepresented classes and COCO pretraining boosting mAP by 8.7%. Additional architectural comparisons demonstrate that our full hybrid configuration not only maintains competitive accuracy but also achieves up to 150 FPS, making it well suited for real-time use cases. Robustness tests under challenging conditions, including real-world visual disturbances (smoke, fire, motion blur, varying lighting, and occlusions) and artificial noise (Gaussian; salt-and-pepper), confirm the model’s generalization ability. This work contributes a scalable, explainable, and high-performance solution for real-world vehicle damage diagnostics. Full article
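
The WBF step can be illustrated with the open-source ensemble-boxes package; a hedged sketch with made-up normalized boxes from two hypothetical detector variants (class indices and thresholds are assumptions, not the paper's settings):

# pip install ensemble-boxes
from ensemble_boxes import weighted_boxes_fusion

# Normalized [x1, y1, x2, y2] boxes from two detector variants.
boxes_list = [
    [[0.10, 0.10, 0.40, 0.40], [0.50, 0.50, 0.90, 0.90]],   # model A
    [[0.12, 0.11, 0.42, 0.41], [0.52, 0.48, 0.88, 0.91]],   # model B
]
scores_list = [[0.90, 0.75], [0.85, 0.80]]
labels_list = [[0, 1], [0, 1]]   # e.g., 0 = bumper dent, 1 = panel scratch

boxes, scores, labels = weighted_boxes_fusion(
    boxes_list, scores_list, labels_list,
    weights=[1, 1], iou_thr=0.55, skip_box_thr=0.1)
print(boxes, scores, labels)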

33 pages, 8494 KB  
Article
Enhanced Multi-Class Brain Tumor Classification in MRI Using Pre-Trained CNNs and Transformer Architectures
by Marco Antonio Gómez-Guzmán, Laura Jiménez-Beristain, Enrique Efren García-Guerrero, Oscar Adrian Aguirre-Castro, José Jaime Esqueda-Elizondo, Edgar Rene Ramos-Acosta, Gilberto Manuel Galindo-Aldana, Cynthia Torres-Gonzalez and Everardo Inzunza-Gonzalez
Technologies 2025, 13(9), 379; https://doi.org/10.3390/technologies13090379 - 22 Aug 2025
Viewed by 252
Abstract
Early and accurate identification of brain tumors is essential for determining effective treatment strategies and improving patient outcomes. Artificial intelligence (AI) and deep learning (DL) techniques have shown promise in automating diagnostic tasks based on magnetic resonance imaging (MRI). This study evaluates the performance of four pre-trained deep convolutional neural network (CNN) architectures for the automatic multi-class classification of brain tumors into four categories: Glioma, Meningioma, Pituitary, and No Tumor. The proposed approach utilizes the publicly accessible Brain Tumor MRI Msoud dataset, consisting of 7023 images, with 5712 provided for training and 1311 for testing. To assess the impact of data availability, subsets containing 25%, 50%, 75%, and 100% of the training data were used. A stratified five-fold cross-validation technique was applied. The CNN architectures evaluated include DeiT3_base_patch16_224, Xception41, Inception_v4, and Swin_Tiny_Patch4_Window7_224, all fine-tuned using transfer learning. The training pipeline incorporated advanced preprocessing and image data augmentation techniques to enhance robustness and mitigate overfitting. Among the models tested, Swin_Tiny_Patch4_Window7_224 achieved the highest classification Accuracy of 99.24% on the test set using 75% of the training data. This model demonstrated superior generalization across all tumor classes and effectively addressed class imbalance issues. Furthermore, we deployed and benchmarked the best-performing DL model on embedded AI platforms (Jetson AGX Xavier and Orin Nano), demonstrating their capability for real-time inference and highlighting their feasibility for edge-based clinical deployment. The results highlight the strong potential of pre-trained deep CNN and transformer-based architectures in medical image analysis. The proposed approach provides a scalable and energy-efficient solution for automated brain tumor diagnosis, facilitating the integration of AI into clinical workflows. Full article
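
A minimal sketch of the stratified five-fold fine-tuning setup, assuming the timm checkpoint name swin_tiny_patch4_window7_224 and placeholder labels; dataset loading, augmentation, and the training loop are omitted:

import numpy as np
import timm
import torch.nn as nn
import torch.optim as optim
from sklearn.model_selection import StratifiedKFold

CLASSES = ["glioma", "meningioma", "pituitary", "no_tumor"]
y = np.random.randint(0, len(CLASSES), size=5712)   # placeholder labels for the training split
X_idx = np.arange(len(y))                           # indices into the image list

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(skf.split(X_idx, y)):
    model = timm.create_model("swin_tiny_patch4_window7_224",
                              pretrained=True, num_classes=len(CLASSES))
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)
    # Build DataLoaders from train_idx / val_idx and run the usual
    # fine-tuning loop with augmentation (omitted here).
    print(f"fold {fold}: {len(train_idx)} train / {len(val_idx)} val images")
    break  # remove to run all five folds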

15 pages, 622 KB  
Review
Artificial Intelligence in the Diagnosis and Imaging-Based Assessment of Pelvic Organ Prolapse: A Scoping Review
by Marian Botoncea, Călin Molnar, Vlad Olimpiu Butiurca, Cosmin Lucian Nicolescu and Claudiu Molnar-Varlam
Medicina 2025, 61(8), 1497; https://doi.org/10.3390/medicina61081497 - 21 Aug 2025
Viewed by 235
Abstract
Background and Objectives: Pelvic organ prolapse (POP) is a complex condition affecting the pelvic floor, often requiring imaging for accurate diagnosis and treatment planning. Artificial intelligence (AI), particularly deep learning (DL), is emerging as a powerful tool in medical imaging. This scoping review aims to synthesize current evidence on the use of AI in the imaging-based diagnosis and anatomical evaluation of POP. Materials and Methods: Following the PRISMA-ScR guidelines, a comprehensive search was conducted in PubMed, Scopus, and Web of Science for studies published between January 2020 and April 2025. Studies were included if they applied AI methodologies, such as convolutional neural networks (CNNs), vision transformers (ViTs), or hybrid models, to diagnostic imaging modalities such as ultrasound and magnetic resonance imaging (MRI) to women with POP. Results: Eight studies met the inclusion criteria. In these studies, AI technologies were applied to 2D/3D ultrasound and static or stress MRI for segmentation, anatomical landmark localization, and prolapse classification. CNNs were the most commonly used models, often combined with transfer learning. Some studies used hybrid models of ViTs, demonstrating high diagnostic accuracy. However, all studies relied on internal datasets, with limited model interpretability and no external validation. Moreover, clinical deployment and outcome assessments remain underexplored. Conclusions: AI shows promise in enhancing POP diagnosis through improved image analysis, but current applications are largely exploratory. Future work should prioritize external validation, standardization, explainable AI, and real-world implementation to bridge the gap between experimental models and clinical utility. Full article
(This article belongs to the Section Obstetrics and Gynecology)

20 pages, 1818 KB  
Article
Image Captioning Model Based on Multi-Step Cross-Attention Cross-Modal Alignment and External Commonsense Knowledge Augmentation
by Liang Wang, Meiqing Jiao, Zhihai Li, Mengxue Zhang, Haiyan Wei, Yuru Ma, Honghui An, Jiaqi Lin and Jun Wang
Electronics 2025, 14(16), 3325; https://doi.org/10.3390/electronics14163325 - 21 Aug 2025
Viewed by 413
Abstract
To address the semantic mismatch between limited textual descriptions in image captioning training datasets and the multi-semantic nature of images, as well as the underutilized external commonsense knowledge, this article proposes a novel image captioning model based on multi-step cross-attention cross-modal alignment and external commonsense knowledge enhancement. The model employs a backbone architecture comprising CLIP’s ViT visual encoder, Faster R-CNN, BERT text encoder, and GPT-2 text decoder. It incorporates two core mechanisms: a multi-step cross-attention mechanism that iteratively aligns image and text features across multiple rounds, progressively enhancing inter-modal semantic consistency for more accurate cross-modal representation fusion. Moreover, the model employs Faster R-CNN to extract region-based object features. These features are mapped to corresponding entities within the dataset through entity probability calculation and entity linking. External commonsense knowledge associated with these entities is then retrieved from the ConceptNet knowledge graph, followed by knowledge embedding via TransE and multi-hop reasoning. Finally, the fused multimodal features are fed into the GPT-2 decoder to steer caption generation, enhancing the lexical richness, factual accuracy, and cognitive plausibility of the generated descriptions. In the experiments, the model achieves CIDEr scores of 142.6 on MSCOCO and 78.4 on Flickr30k. Ablations confirm both modules enhance caption quality. Full article
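
A minimal sketch of wiring a CLIP ViT encoder to a GPT-2 decoder through cross-attention with Hugging Face Transformers, assuming the checkpoints openai/clip-vit-base-patch32 and gpt2; the paper's Faster R-CNN region features, ConceptNet retrieval, and multi-step alignment are not reproduced here:

import torch
from transformers import CLIPVisionModel, GPT2Config, GPT2LMHeadModel, GPT2Tokenizer

vision = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
cfg = GPT2Config.from_pretrained("gpt2", add_cross_attention=True)
decoder = GPT2LMHeadModel.from_pretrained("gpt2", config=cfg)  # cross-attn layers newly initialized
tok = GPT2Tokenizer.from_pretrained("gpt2")

pixel_values = torch.randn(1, 3, 224, 224)                       # stand-in for a preprocessed image
img_feats = vision(pixel_values=pixel_values).last_hidden_state  # (1, 50, 768) patch tokens

ids = tok("a photo of", return_tensors="pt").input_ids
out = decoder(input_ids=ids, encoder_hidden_states=img_feats, labels=ids)
print(out.loss)  # caption LM loss conditioned on image features via cross-attention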

15 pages, 2220 KB  
Article
Reproducing the Few-Shot Learning Capabilities of the Visual Ventral Pathway Using Vision Transformers and Neural Fields
by Jiayi Su, Lifeng Xing, Tao Li, Nan Xiang, Jiacheng Shi and Dequan Jin
Brain Sci. 2025, 15(8), 882; https://doi.org/10.3390/brainsci15080882 - 19 Aug 2025
Viewed by 359
Abstract
Background: Studies have shown that humans can rapidly learn the shape of new objects or adjust their behavior when encountering novel situations. Research on visual cognition in the brain further indicates that the ventral visual pathway plays a critical role in core object recognition. While existing studies often focus on microscopic simulations of individual neural structures, few adopt a holistic, system-level perspective, making it difficult to achieve robust few-shot learning capabilities. Method: Inspired by the mechanisms and processes of the ventral visual stream, this paper proposes a computational model with a macroscopic neural architecture for few-shot learning. We reproduce the feature extraction functions of V1 and V2 using a well-trained Vision Transformer (ViT) and model the neuronal activity in V4 and IT using two neural fields. By connecting these neurons based on Hebbian learning rules, the proposed model stores the feature and category information of the input samples during support training. Results: By employing a scale adaptation strategy, the proposed model emulates visual neural mechanisms, enables efficient learning, and outperforms state-of-the-art few-shot learning algorithms in comparative experiments on real-world image datasets, demonstrating human-like learning capabilities. Conclusion: Experimental results demonstrate that our ventral-stream-inspired machine-learning model achieves effective few-shot learning on real-world datasets. Full article
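
A toy illustration of Hebbian support-set learning over ViT-style embeddings, as a simplification of the idea above; it omits the neural-field dynamics and scale adaptation described in the paper:

import numpy as np

class HebbianFewShot:
    """Hebbian-style association between feature vectors and class units."""
    def __init__(self, feat_dim, n_classes, lr=1.0):
        self.W = np.zeros((n_classes, feat_dim))
        self.lr = lr

    def support(self, feats, labels):
        # Hebbian update: strengthen weights between active features and the class unit.
        for f, c in zip(feats, labels):
            self.W[c] += self.lr * f / (np.linalg.norm(f) + 1e-9)

    def query(self, feats):
        # Cosine response of each class unit; the most active unit wins.
        Wn = self.W / (np.linalg.norm(self.W, axis=1, keepdims=True) + 1e-9)
        Fn = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-9)
        return (Fn @ Wn.T).argmax(axis=1)

rng = np.random.default_rng(0)
support_feats = rng.normal(size=(5, 768))   # e.g., ViT [CLS] embeddings, 5-way 1-shot
model = HebbianFewShot(feat_dim=768, n_classes=5)
model.support(support_feats, labels=np.arange(5))
print(model.query(support_feats + 0.1 * rng.normal(size=(5, 768))))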

21 pages, 4332 KB  
Article
A Comparative Study of Time–Frequency Representations for Bearing and Rotating Fault Diagnosis Using Vision Transformer
by Ahmet Orhan, Nikolay Yordanov, Merve Ertarğın, Marin Zhilevski and Mikho Mikhov
Machines 2025, 13(8), 737; https://doi.org/10.3390/machines13080737 - 19 Aug 2025
Viewed by 429
Abstract
This paper presents a comparative analysis of bearing and rotating component fault classification based on different time–frequency representations using vision transformer (ViT). Four different time–frequency transformation techniques—short-time Fourier transform (STFT), continuous wavelet transform (CWT), Hilbert–Huang transform (HHT), and Wigner–Ville distribution (WVD)—were applied to convert the signals into 2D images. A pretrained ViT-Base architecture was fine-tuned on the resulting images for classification tasks. The model was evaluated on two separate scenarios: (i) eight-class rotating component fault classification and (ii) four-class bearing fault classification. Importantly, in each task, the samples were collected under varying conditions of the other component (i.e., different rotating conditions in bearing classification and vice versa). This design allowed for an independent assessment of the model’s ability to generalize across fault domains. The experimental results demonstrate that the ViT-based approach achieves high classification performance across various time–frequency representations, highlighting its potential for mechanical fault diagnosis in rotating machinery. Notably, the model achieved higher accuracy in bearing fault classification compared to rotating component faults, suggesting higher sensitivity to bearing-related anomalies. Full article
(This article belongs to the Section Machines Testing and Maintenance)
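
A minimal sketch of one of the four representations: converting a synthetic vibration signal into an STFT image suitable for a pretrained ViT. The sampling rate, window parameters, and signal are illustrative assumptions, not the study's settings:

import numpy as np
from scipy import signal

fs = 12_000                                  # assumed sampling rate
t = np.arange(0, 1.0, 1 / fs)
x = np.sin(2 * np.pi * 1200 * t) + 0.3 * np.random.randn(t.size)  # stand-in vibration signal

f, tt, Zxx = signal.stft(x, fs=fs, nperseg=256, noverlap=192)
spec_db = 20 * np.log10(np.abs(Zxx) + 1e-12)  # time-frequency image in dB

# Normalize to 0..255 and replicate to 3 channels so a pretrained ViT
# (resized to 224x224 downstream) can consume it.
img = (spec_db - spec_db.min()) / (spec_db.max() - spec_db.min() + 1e-12)
img3 = np.repeat((img * 255).astype(np.uint8)[..., None], 3, axis=-1)
print(img3.shape)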

30 pages, 4741 KB  
Article
TriViT-Lite: A Compact Vision Transformer–MobileNet Model with Texture-Aware Attention for Real-Time Facial Emotion Recognition in Healthcare
by Waqar Riaz, Jiancheng (Charles) Ji and Asif Ullah
Electronics 2025, 14(16), 3256; https://doi.org/10.3390/electronics14163256 - 16 Aug 2025
Viewed by 292
Abstract
Facial emotion recognition has become increasingly important in healthcare, where understanding delicate cues like pain, discomfort, or unconsciousness can support more timely and responsive care. Yet, recognizing facial expressions in real-world settings remains challenging due to varying lighting, facial occlusions, and hardware limitations in clinical environments. To address this, we propose TriViT-Lite, a lightweight yet powerful model that blends three complementary components: MobileNet, for capturing fine-grained local features efficiently; Vision Transformers (ViT), for modeling global facial patterns; and handcrafted texture descriptors, such as Local Binary Patterns (LBP) and Histograms of Oriented Gradients (HOG), for added robustness. These multi-scale features are brought together through a texture-aware cross-attention fusion mechanism that helps the model focus on the most relevant facial regions dynamically. TriViT-Lite is evaluated on both benchmark datasets (FER2013, AffectNet) and a custom healthcare-oriented dataset covering seven critical emotional states, including pain and unconsciousness. It achieves a competitive accuracy of 91.8% on FER2013 and of 87.5% on the custom dataset while maintaining real-time performance (~15 FPS) on resource-constrained edge devices. Our results show that TriViT-Lite offers a practical and accurate solution for real-time emotion recognition, particularly in healthcare settings. It strikes a balance between performance, interpretability, and efficiency, making it a strong candidate for machine-learning-driven pattern recognition in patient-monitoring applications. Full article
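
A short sketch of the handcrafted texture branch (an LBP histogram plus a HOG vector) using scikit-image on a stand-in grayscale face crop; the parameter choices are illustrative, not the paper's:

import numpy as np
from skimage.feature import hog, local_binary_pattern

face = (np.random.rand(224, 224) * 255).astype(np.uint8)   # stand-in grayscale face crop

# Handcrafted texture descriptors used alongside the MobileNet and ViT branches.
lbp = local_binary_pattern(face, P=8, R=1, method="uniform")
lbp_hist, _ = np.histogram(lbp, bins=10, range=(0, 10), density=True)
hog_vec = hog(face, orientations=9, pixels_per_cell=(8, 8),
              cells_per_block=(2, 2), feature_vector=True)

texture_feats = np.concatenate([lbp_hist, hog_vec])
print(texture_feats.shape)   # would feed the texture-aware cross-attention fusion stage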

15 pages, 6562 KB  
Article
Smart City Infrastructure Monitoring with a Hybrid Vision Transformer for Micro-Crack Detection
by Rashid Nasimov and Young Im Cho
Sensors 2025, 25(16), 5079; https://doi.org/10.3390/s25165079 - 15 Aug 2025
Viewed by 425
Abstract
Innovative and reliable structural health monitoring (SHM) is indispensable for ensuring the safety, dependability, and longevity of urban infrastructure. However, conventional methods lack full efficiency, remain labor-intensive, and are susceptible to errors, particularly in detecting subtle structural anomalies such as micro-cracks. To address this issue, this study proposes a novel deep-learning framework based on a modified Detection Transformer (DETR) architecture. The framework is enhanced by integrating a Vision Transformer (ViT) backbone and a specially designed Local Feature Extractor (LFE) module. The proposed ViT-based DETR model leverages ViT’s capability to capture global contextual information through its self-attention mechanism. The introduced LFE module significantly enhances the extraction and clarification of complex local spatial features in images. The LFE employs convolutional layers with residual connections and non-linear activations, facilitating efficient gradient propagation and reliable identification of micro-level defects. Thorough experimental validation conducted on the benchmark SDNET2018 dataset and a custom dataset of damaged bridge images demonstrates that the proposed Vision-Local Feature Detector (ViLFD) model outperforms existing approaches, including DETR variants and YOLO-based models (versions 5–9), thereby establishing a new state-of-the-art performance. The proposed model achieves superior accuracy (95.0%), precision (0.94), recall (0.93), F1-score (0.93), and mean Average Precision (mAP@0.5 = 0.89), confirming its capability to accurately and reliably detect subtle structural defects. The introduced architecture represents a significant advancement toward automated, precise, and reliable SHM solutions applicable in complex urban environments. Full article
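
A guess at what a residual Local Feature Extractor block could look like in PyTorch, based only on the description above (convolutional layers with residual connections and non-linear activations); this is not the authors' implementation:

import torch
import torch.nn as nn

class LocalFeatureExtractor(nn.Module):
    """Residual convolutional block for sharpening local micro-crack features."""
    def __init__(self, channels=256):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.GELU(),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels))
        self.act = nn.GELU()

    def forward(self, x):
        return self.act(x + self.block(x))   # residual connection aids gradient propagation

feats = torch.randn(1, 256, 64, 64)          # e.g., a ViT backbone feature map
print(LocalFeatureExtractor()(feats).shape)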

22 pages, 4009 KB  
Article
A Multi-Dimensional Feature Extraction Model Fusing Fractional-Order Fourier Transform and Convolutional Information
by Haijing Sun, Wen Zhou, Jiapeng Yang, Yichuan Shao, Le Zhang and Zhiqiang Mao
Fractal Fract. 2025, 9(8), 533; https://doi.org/10.3390/fractalfract9080533 - 14 Aug 2025
Viewed by 368
Abstract
In the field of deep learning, the traditional Vision Transformer (ViT) model has some limitations when dealing with local details and long-range dependencies; especially in the absence of sufficient training data, it is prone to overfitting. Structures such as retinal blood vessels and lesion boundaries have distinct fractal properties in medical images. The Fractional Convolution Vision Transformer (FCViT) model is proposed in this paper, which effectively compensates for the deficiency of ViT in local feature capture by fusing convolutional information. The ability to classify medical images is enhanced by analyzing frequency domain features using fractional-order Fourier transform and capturing global information through a self-attention mechanism. The three-branch architecture enables the model to fully understand the data from multiple perspectives, capturing both local details and global context, which in turn improves classification performance and generalization. The experimental results showed that the FCViT model achieved 93.52% accuracy, 93.32% precision, 92.79% recall, and a 93.04% F1-score on the standardized fundus glaucoma dataset. The accuracy on the Harvard Dataverse-V1 dataset reached 94.21%, with a precision of 93.73%, recall of 93.67%, and F1-score of 93.68%. The FCViT model achieves significant performance gains on a variety of neural network architectures and tasks with different source datasets, demonstrating its effectiveness and utility in the field of deep learning. Full article

25 pages, 1734 KB  
Article
A Multimodal Affective Interaction Architecture Integrating BERT-Based Semantic Understanding and VITS-Based Emotional Speech Synthesis
by Yanhong Yuan, Shuangsheng Duo, Xuming Tong and Yapeng Wang
Algorithms 2025, 18(8), 513; https://doi.org/10.3390/a18080513 - 14 Aug 2025
Viewed by 493
Abstract
Addressing the issues of coarse emotional representation, low cross-modal alignment efficiency, and insufficient real-time response capabilities in current human–computer emotional language interaction, this paper proposes an affective interaction framework integrating BERT-based semantic understanding with VITS-based speech synthesis. The framework aims to enhance the naturalness, expressiveness, and response efficiency of human–computer emotional interaction. By introducing a modular layered design, a six-dimensional emotional space, a gated attention mechanism, and a dynamic model scheduling strategy, the system overcomes challenges such as limited emotional representation, modality misalignment, and high-latency responses. Experimental results demonstrate that the framework achieves superior performance in speech synthesis quality (MOS: 4.35), emotion recognition accuracy (91.6%), and response latency (<1.2 s), outperforming baseline models like Tacotron2 and FastSpeech2. Through model lightweighting, GPU parallel inference, and load balancing optimization, the system validates its robustness and generalizability across English and Chinese corpora in cross-linguistic tests. The modular architecture and dynamic scheduling ensure scalability and efficiency, enabling a more humanized and immersive interaction experience in typical application scenarios such as psychological companionship, intelligent education, and high-concurrency customer service. This study provides an effective technical pathway for developing the next generation of personalized and immersive affective intelligent interaction systems. Full article
(This article belongs to the Section Algorithms for Multidisciplinary Applications)
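
A hedged PyTorch sketch of one plausible reading of the gated attention mechanism described above: a gate that blends BERT semantic features with a six-dimensional emotion vector before conditioning the synthesizer. All dimensions are assumptions, not the paper's configuration:

import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Gated blend of text semantics and a 6-D emotion vector (illustrative only)."""
    def __init__(self, d_text=768, d_emotion=6, d_out=256):
        super().__init__()
        self.proj_text = nn.Linear(d_text, d_out)
        self.proj_emo = nn.Linear(d_emotion, d_out)
        self.gate = nn.Sequential(nn.Linear(2 * d_out, d_out), nn.Sigmoid())

    def forward(self, h_text, e):
        t, m = self.proj_text(h_text), self.proj_emo(e)
        g = self.gate(torch.cat([t, m], dim=-1))
        return g * t + (1 - g) * m   # gate decides how much each stream contributes

fused = GatedFusion()(torch.randn(2, 768), torch.rand(2, 6))
print(fused.shape)   # torch.Size([2, 256]); would condition the VITS-style synthesizer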

21 pages, 3126 KB  
Article
WMSA–WBS: Efficient Wave Multi-Head Self-Attention with Wavelet Bottleneck
by Xiangyang Li, Yafeng Li, Pan Fan and Xueya Zhang
Sensors 2025, 25(16), 5046; https://doi.org/10.3390/s25165046 - 14 Aug 2025
Viewed by 312
Abstract
The critical component of the vision transformer (ViT) architecture is multi-head self-attention (MSA), which enables the encoding of long-range dependencies and heterogeneous interactions. However, MSA has two significant limitations: its limited ability to capture local features and its high computational costs. To address these challenges, this paper proposes an integrated multi-head self-attention approach with a bottleneck enhancement structure, named WMSA–WBS, which mitigates the aforementioned shortcomings of conventional MSA. Different from existing wavelet-enhanced ViT variants that mainly focus on the isolated wavelet decomposition in the attention layer, WMSA–WBS introduces a co-design of wavelet-based frequency processing and bottleneck optimization, achieving more efficient and comprehensive feature learning. Within WMSA–WBS, the proposed wavelet multi-head self-attention (WMSA) approach is combined with a novel wavelet bottleneck structure to capture both global and local information across the spatial, frequency, and channel domains. Specifically, this module achieves these capabilities while maintaining low computational complexity and memory consumption. Extensive experiments demonstrate that ViT models equipped with WMSA–WBS achieve superior trade-offs between accuracy and model complexity across various vision tasks, including image classification, object detection, and semantic segmentation. Full article
(This article belongs to the Section Sensor Networks)
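
One way to picture wavelet-assisted attention: decompose a feature map with a 2-D Haar DWT (PyWavelets) and attend over the smaller low-frequency band, which shrinks the token count that self-attention must process. This is an illustrative sketch under those assumptions, not the WMSA–WBS module itself:

import pywt
import torch
import torch.nn as nn

x = torch.randn(1, 64, 56, 56)                        # feature map (B, C, H, W)

# Per-channel 2-D Haar DWT: attention then runs on the LL band, which has
# one quarter of the spatial positions per decomposition level.
cA, (cH, cV, cD) = pywt.dwt2(x.numpy(), "haar", axes=(-2, -1))
ll = torch.from_numpy(cA).float()                     # (1, 64, 28, 28)

tokens = ll.flatten(2).transpose(1, 2)                # (1, 784, 64) token sequence
attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
out, _ = attn(tokens, tokens, tokens)
print(out.shape)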

18 pages, 1914 KB  
Article
Hybrid of VGG-16 and FTVT-b16 Models to Enhance Brain Tumors Classification Using MRI Images
by Eman M. Younis, Ibrahim A. Ibrahim, Mahmoud N. Mahmoud and Abdullah M. Albarrak
Diagnostics 2025, 15(16), 2014; https://doi.org/10.3390/diagnostics15162014 - 12 Aug 2025
Viewed by 387
Abstract
Background: The accurate classification of brain tumors from magnetic resonance imaging (MRI) scans is pivotal for timely clinical intervention, yet remains challenged by tumor heterogeneity, morphological variability, and imaging artifacts. Methods: This paper proposes a novel hybrid deep learning framework that amalgamates the hierarchical feature extraction capabilities of VGG-16, a convolutional neural network (CNN), with the global contextual modeling of FTVT-b16, a fine-tuned vision transformer (ViT), to advance the precision of brain tumor classification. To evaluate the proposed method’s efficacy, two widely known MRI datasets were utilized in the experiments. The first dataset consisted of 7023 MRI scans categorized into four classes: gliomas, meningiomas, pituitary tumors, and no tumor. The second dataset, obtained from Kaggle, consisted of 3000 scans categorized into two classes: healthy brains and brain tumors. Results: The proposed framework addresses critical limitations of conventional CNNs (local receptive fields) and pure ViTs (data inefficiency), offering a robust, interpretable solution aligned with clinical workflows. Run on these two datasets, the framework demonstrated outstanding performance, with accuracies of 99.46% and 99.90%, respectively. These findings underscore the transformative potential of hybrid architectures in neuro-oncology, paving the way for AI-assisted precision diagnostics. Conclusions: Future work will focus on multi-institutional validation and computational optimization to ensure scalability in diverse clinical settings. Full article
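
A compact sketch of the general VGG-16 plus ViT-B/16 feature-concatenation idea using torchvision and timm weights; FTVT-b16's fine-tuning details are not reproduced, and the head dimension is an assumption:

import timm
import torch
import torch.nn as nn
from torchvision import models

class HybridVGGViT(nn.Module):
    """Concatenate VGG-16 convolutional features with ViT-B/16 embeddings,
    then classify four tumor types (a sketch of the hybrid idea, not FTVT-b16 itself)."""
    def __init__(self, n_classes=4):
        super().__init__()
        vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT)
        self.cnn = nn.Sequential(vgg.features, nn.AdaptiveAvgPool2d(1), nn.Flatten())     # 512-D
        self.vit = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=0)  # 768-D
        self.head = nn.Linear(512 + 768, n_classes)

    def forward(self, x):
        return self.head(torch.cat([self.cnn(x), self.vit(x)], dim=1))

logits = HybridVGGViT()(torch.randn(1, 3, 224, 224))
print(logits.shape)   # torch.Size([1, 4])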
