Search Results (136)

Search Parameters:
Keywords = multi-granularity attention

20 pages, 3303 KB  
Article
Multi-Granularity Mask-Guided Network: An Integrated AI Framework for Region-Level Segmentation and Grading of Cataract Subtypes on AS-OCT Images
by Yiwen Hu, Bingyan Hao, Yilin Sun, Yitian Zhao, Yuanyuan Gu and Fang Liu
J. Clin. Med. 2026, 15(7), 2798; https://doi.org/10.3390/jcm15072798 - 7 Apr 2026
Abstract
Objective: To develop and validate an artificial intelligence (AI) system for automated Lens Opacities Classification System III (LOCS III)-based grading of all three major cataract subtypes using anterior segment optical coherence tomography (AS-OCT). Methods: In this single-center cross-sectional study, AS-OCT images were collected and manually graded by ophthalmologists according to LOCS III. The dataset was randomly split into training, validation, and test sets. We propose a novel multi-granularity mask-guided network (MMNet) that jointly performs lens substructure segmentation and severity grading. Performance was assessed on an independent test set for automatic grading of cortical cataract (CC), nuclear cataract (NC), and posterior subcapsular cataract (PSC), and the grading performance of the proposed method was also compared against that of ophthalmologists. Interpretability was assessed via attention heatmaps and feature visualization. Results: MMNet showed high agreement with the gold-standard ground truth. The proportion of predictions with an absolute error < 1.0 ranged from 83.02% to 89.94% across the three subtypes. Grading accuracy ranged from 82.20 ± 1.41% to 89.76 ± 1.31%, and the Area Under the Curve (AUC) ranged from 0.954 (95% CI, 0.952–0.969; p < 0.001) to 0.973 (95% CI, 0.964–0.985; p < 0.001). MMNet achieved a satisfactory mean absolute error (MAE) of 0.14 ± 0.35 in CC, 0.10 ± 0.30 in NC, and 0.17 ± 0.38 in PSC grading, at a grading speed of 0.0178 s per image, substantially faster than manual grading. Conclusions: The proposed AI model achieved strong performance in automated LOCS III-based grading of CC and NC on AS-OCT images and also showed feasibility for PSC assessment.
(This article belongs to the Special Issue Artificial Intelligence and Eye Disease)
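
The abstract describes mask-guided grading only at a high level. As a hedged sketch (not the authors' released code), the core idea of letting predicted substructure masks gate the features used for severity grading can be written in PyTorch as follows; the layer sizes, the soft-mask pooling, and the region/grade counts are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MaskGuidedGrader(nn.Module):
    """Sketch of a mask-guided multi-task head: segment lens substructures,
    then grade severity from features pooled inside each predicted region."""
    def __init__(self, in_ch=1, feat_ch=64, n_regions=3, n_grades=6):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, feat_ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_ch, feat_ch, 3, padding=1), nn.ReLU(),
        )
        # One mask channel per substructure (cortex, nucleus, posterior
        # capsule in the paper's setting) -- an assumption for this sketch.
        self.seg_head = nn.Conv2d(feat_ch, n_regions, 1)
        self.grade_head = nn.Linear(feat_ch, n_grades)

    def forward(self, x):
        f = self.encoder(x)                        # (B, C, H, W)
        masks = self.seg_head(f).softmax(dim=1)    # (B, R, H, W) soft masks
        # Mask-guided pooling: average features inside each soft region.
        pooled = torch.einsum("brhw,bchw->brc", masks, f)
        pooled = pooled / masks.sum(dim=(2, 3)).clamp(min=1e-6).unsqueeze(-1)
        return masks, self.grade_head(pooled)      # per-region grade logits

masks, grades = MaskGuidedGrader()(torch.randn(2, 1, 64, 64))
print(masks.shape, grades.shape)  # (2, 3, 64, 64) (2, 3, 6)
```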

14 pages, 5017 KB  
Article
Calibrated Feature Fusion: Enhancing Few-Shot Industrial Anomaly Detection via Cross-Stage Representation Alignment
by Shuangjun Zheng, Songtao Zhang, Zhihuan Huang, Kuoteng Sun, Yuzhong Gong, Jiayan Wen and Eryun Liu
Sensors 2026, 26(7), 2164; https://doi.org/10.3390/s26072164 - 31 Mar 2026
Viewed by 331
Abstract
Few-shot industrial anomaly detection has attracted increasing attention because it does not require large numbers of abnormal training samples. Recent few-shot methods commonly fuse multi-stage features from frozen vision transformers for anomaly scoring. However, we find that such direct fusion suffers from cross-stage representation misalignment: shallow and deep features differ significantly in scale and semantic granularity, leading to inconsistent anomaly maps and degraded localization. To address this problem, we propose Calibrated Feature Fusion (CFF), a lightweight adapter that enhances feature fusion via cross-stage representation alignment. The CFF module can be integrated into existing state-of-the-art frameworks and operates effectively in few-shot settings. Experiments on MVTec AD and VisA show that CFF consistently improves the state of the art across 1/2/4-shot settings, achieving gains of up to +1.6% AUROC and +4.1% AP in pixel-level segmentation. Notably, CFF enhances both precision and recall in four-shot scenarios. Ablation studies confirm that cross-stage alignment is key to stable multi-stage fusion.
(This article belongs to the Section Fault Diagnosis & Sensors)
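
The paper's central claim is that shallow and deep ViT stages must be brought to a common scale and width before fusion. A minimal sketch of such cross-stage alignment, assuming per-stage linear projections plus LayerNorm (the actual CFF adapter is not reproduced here):

```python
import torch
import torch.nn as nn

class CrossStageAligner(nn.Module):
    """Project each transformer stage to a shared width, normalize away
    scale differences, then fuse by averaging. Dims are assumptions."""
    def __init__(self, stage_dims=(384, 768), out_dim=256):
        super().__init__()
        self.proj = nn.ModuleList(nn.Linear(d, out_dim) for d in stage_dims)
        self.norm = nn.ModuleList(nn.LayerNorm(out_dim) for _ in stage_dims)

    def forward(self, stages):  # list of (B, N, D_i) token maps
        aligned = [n(p(s)) for p, n, s in zip(self.proj, self.norm, stages)]
        return torch.stack(aligned).mean(dim=0)

fused = CrossStageAligner()([torch.randn(1, 196, 384), torch.randn(1, 196, 768)])
print(fused.shape)  # (1, 196, 256)
```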

23 pages, 1395 KB  
Article
A Mask-Guided Multigranular Mamba Network for Remote Sensing Change Captioning
by Yifan Qu and Huaidong Zhang
Remote Sens. 2026, 18(7), 1048; https://doi.org/10.3390/rs18071048 - 31 Mar 2026
Viewed by 286
Abstract
Remote sensing image change captioning (RSICC) aims to generate semantic textual descriptions characterizing changes between bi-temporal remote sensing images, with wide applications in disaster assessment and urban planning. However, existing methods face specific drawbacks: CNN-based models have limited ability to capture long-range spatial correlations due to local receptive fields, and Transformer-based models suffer from quadratic complexity while distributing attention uniformly across all spatial positions, resulting in weak perception of salient changes in background-dominated scenes. In this paper, we present PM3Net (Progressive Mask-guided Multigranular Mamba Network), which leverages Mamba state space models with linear complexity for efficient spatiotemporal change modeling. The Progressive Mask-guided Encoder (PME) creates dual-source change masks, combining L2-norm spatial differences with cosine-distance semantic differences, for progressive change feature extraction from detailed structures to high-level semantics. The Mask-guided Feature Enhancement (MFE) module applies mask-weighted refinement and cross-layer fusion to emphasize salient change regions while suppressing background interference, producing multigranular visual representations. Experiments on the LEVIR-MCI and WHU-CDC datasets show that PM3Net achieves superior results compared to existing methods, with BLEU-4 scores of 66.89 and 73.05, respectively. The results confirm PM3Net's ability to solve the RSICC task and demonstrate that Mamba models can succeed in this domain.
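
The dual-source mask described above combines a magnitude cue (L2 norm) with a direction cue (cosine distance). A hedged sketch, with the weighting `alpha` and the normalization scheme as assumptions rather than the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def change_mask(f1, f2, alpha=0.5):
    """f1, f2: (B, C, H, W) bi-temporal feature maps. Blend an L2-norm
    spatial difference with a cosine-distance semantic difference into
    one soft change mask in [0, 1]."""
    l2 = (f1 - f2).norm(dim=1)                       # (B, H, W) magnitude cue
    cos = 1 - F.cosine_similarity(f1, f2, dim=1)     # (B, H, W) in [0, 2]
    l2 = l2 / l2.amax(dim=(1, 2), keepdim=True).clamp(min=1e-6)
    return alpha * l2 + (1 - alpha) * cos / 2

mask = change_mask(torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32))
print(mask.shape)  # (2, 32, 32)
```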

24 pages, 4909 KB  
Article
UniTriM: Unified Text–Image–Video Retrieval via Multi-Granular Alignment and Feature Disentanglement
by Yangchen Wang, Yan Hua, Yingyun Yang and Wenhui Zhang
Electronics 2026, 15(7), 1424; https://doi.org/10.3390/electronics15071424 - 30 Mar 2026
Viewed by 244
Abstract
With the proliferation of multimodal content on social media, creators increasingly require tools that can retrieve both images and videos relevant to a single textual query. However, existing cross-modal retrieval methods are typically confined to binary (text–image or text–video) settings and struggle with fine-grained semantic alignment and spatiotemporal information imbalance. To address these issues, we propose UniTriM, a unified framework for text–image–video joint retrieval. First, UniTriM supports concurrent retrieval of semantically relevant images and videos given one textual input. To overcome the scarcity of text–image–video triplet data, we introduce a self-attention-based keyframe selection strategy that converts existing text–video datasets into triplet format. Second, we design a multi-granularity similarity alignment module that captures hierarchical semantics by modeling patch–frame–video and word–triple–sentence structures and jointly optimizes intra- and cross-granularity alignments to enhance fine-grained cross-modal correspondence. Third, to alleviate the inherent spatiotemporal information imbalance between static images and video-aligned text descriptions, we introduce a feature disentanglement module that disentangles spatial-related features from text and aligns them explicitly with image representations. Experiments on three benchmark datasets (MSR-VTT, MSVD, and DiDeMo) demonstrate that UniTriM achieves state-of-the-art performance on joint retrieval tasks.
(This article belongs to the Section Artificial Intelligence)
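
For the keyframe-selection step, one plausible reading (an assumption for illustration, not the paper's exact procedure) is to score frames by how much self-attention they receive from the other frames and keep the top scorer as the image of the triplet:

```python
import torch
import torch.nn.functional as F

def select_keyframe(frame_feats):
    """frame_feats: (T, D) per-frame embeddings. Score each frame by the
    mean self-attention it receives; return the top-scoring frame index."""
    q = k = F.normalize(frame_feats, dim=-1)
    attn = (q @ k.t()).softmax(dim=-1)   # (T, T) attention weights
    scores = attn.mean(dim=0)            # attention received per frame
    return int(scores.argmax())

idx = select_keyframe(torch.randn(16, 512))
```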

21 pages, 4785 KB  
Article
Fault Diagnosis of Wind Turbine Bearings Based on a Multi-Scale Residual Attention Graph Neural Network
by Yubo Liu, Xiaohui Zhang, Keliang Dong, Zhilei Xu, Fengjuan Zhang and Zhiwei Li
Electronics 2026, 15(7), 1422; https://doi.org/10.3390/electronics15071422 - 29 Mar 2026
Viewed by 224
Abstract
Fault diagnosis of rolling bearings in wind turbines is significantly challenged by strong noise, non-stationary signals, and multi-source interference. To address these issues, a Multi-Scale Attention Residual Graph Convolutional Network (MSAR-GCN) is proposed. First, a fully connected graph is constructed in the frequency domain using a temporal segmentation strategy, which preserves full spectral resolution and captures cross-frequency coupling features via node embeddings. Second, a multi-scale residual module with a cross-layer pyramid structure is designed to extract features at varying granularities, integrated with a dynamic multi-head attention mechanism to adaptively emphasize damage-sensitive frequency bands. Additionally, a hierarchical feature distillation mechanism is employed to compress high-dimensional features, keeping the model lightweight while retaining critical fault information. Experimental validation on the CWRU and JNU datasets demonstrates that MSAR-GCN achieves 97.02% and 92.5% accuracy, respectively, under −10 dB Gaussian noise, outperforming existing methods by over 4%. The model also exhibits exceptional robustness, maintaining 93.09% accuracy under severe non-Gaussian impulsive noise. With verified feature separability and high computational efficiency, the proposed method offers a promising solution for high-precision, real-time industrial fault diagnosis.
(This article belongs to the Special Issue Advances in Condition Monitoring and Fault Diagnosis)
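
The frequency-domain graph construction can be sketched as below; the segment count and the use of FFT magnitudes as node features are assumptions chosen to match the abstract's description of temporal segmentation with full spectral resolution:

```python
import torch

def build_frequency_graph(signal, n_segments=8):
    """signal: (L,) vibration signal. Split into equal segments, take the
    FFT magnitude of each segment as a node feature, and connect every
    node to every other node (fully connected graph)."""
    segs = signal.reshape(n_segments, -1)
    nodes = torch.fft.rfft(segs).abs()               # (N, F) node features
    src, dst = torch.meshgrid(torch.arange(n_segments),
                              torch.arange(n_segments), indexing="ij")
    edge_index = torch.stack([src.flatten(), dst.flatten()])
    return nodes, edge_index

nodes, edges = build_frequency_graph(torch.randn(1024))
print(nodes.shape, edges.shape)  # (8, 65) (2, 64)
```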

31 pages, 5285 KB  
Article
Research on Multi-Task Spatio-Temporal Learning Model with Dynamic Graph Attention for Joint Pedestrian Trajectory and Intention Prediction
by Guanchen Zhou, Yongqian Zhao and Zhaoyong Gu
Appl. Sci. 2026, 16(6), 2881; https://doi.org/10.3390/app16062881 - 17 Mar 2026
Viewed by 225
Abstract
Accurate pedestrian trajectory prediction and intention estimation are crucial for autonomous systems and intelligent transportation applications. However, existing methods often address these two highly correlated tasks in isolation and rely on static or heuristic interaction modeling, leading to insufficient adaptability and limited generalization capability in dynamic traffic scenarios. To this end, this paper proposes MTG-TPNet, a multi-task dynamic graph Transformer network for joint trajectory prediction and intention estimation. The framework integrates three key innovations. First, a dynamic graph neural network enhanced with motion features, whose graph topology can be adaptively learned end-to-end based on semantic and motion contexts to accurately capture evolving interactions. Second, a multi-granularity attention mechanism that collaboratively fuses geometric proximity, semantic similarity, and physical hard constraints to achieve fine-grained modeling of spatiotemporal dependencies. Third, a dynamic correlation loss based on Bayesian uncertainty, which balances multi-task learning in an adaptive manner and encourages beneficial interactions across tasks. Extensive experiments on the publicly available PIE and ETH/UCY datasets demonstrate that MTG-TPNet achieves state-of-the-art performance. On the PIE dataset, the proposed model significantly outperforms the best baseline model in trajectory prediction metrics, achieving an Average Displacement Error (ADE) of 0.21 and a Final Displacement Error (FDE) of 0.29. This represents a 27.6% reduction in ADE while maintaining stability in intention estimation. Systematic ablation studies validate the effectiveness of each proposed module, with the model retaining an average performance of 69.3%. Furthermore, cross-dataset evaluations confirm its superior generalization capability. This study provides a powerful unified framework for robust pedestrian behavior understanding in complex urban traffic scenarios.
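
The "dynamic correlation loss based on Bayesian uncertainty" is not specified in the abstract; a common starting point with the same flavor is Kendall-style homoscedastic uncertainty weighting, sketched below as a related (not identical) technique:

```python
import torch
import torch.nn as nn

class UncertaintyWeightedLoss(nn.Module):
    """Learns one log-variance s_i per task and balances the losses as
    sum_i exp(-s_i) * L_i + s_i, so harder (noisier) tasks are
    automatically down-weighted during joint training."""
    def __init__(self, n_tasks=2):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(n_tasks))

    def forward(self, losses):  # iterable of per-task scalar losses
        total = 0.0
        for s, task_loss in zip(self.log_vars, losses):
            total = total + torch.exp(-s) * task_loss + s
        return total

criterion = UncertaintyWeightedLoss(n_tasks=2)
loss = criterion([torch.tensor(1.3), torch.tensor(0.4)])
```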

28 pages, 5658 KB  
Article
A Multimodule Collaborative Framework for Unsupervised Visible–Infrared Person Re-Identification with Channel Enhancement Modality
by Baoshan Sun, Yi Du and Liqing Gao
Sensors 2026, 26(6), 1770; https://doi.org/10.3390/s26061770 - 11 Mar 2026
Viewed by 320
Abstract
Unsupervised visible–infrared person re-identification (USL-VI-ReID) plays a pivotal role in cross-modal computer vision applications for intelligent surveillance and public safety. However, the task remains hampered by large modality gaps and limited granularity in feature representations. In particular, channel augmentation (CA) is typically used only for data augmentation, and its potential as an independent input modality remains unexplored. To address these shortcomings, we present a multimodule collaborative USL-VI-ReID framework that explicitly treats CA as a separate input modality. The framework combines four complementary modules. The Person-ReID Adaptive Convolutional Block Attention Module (PA-CBAM) extracts discriminative features using a two-level attention mechanism that refines salient spatial and channel cues. The Varied Regional Alignment (VRA) module performs cross-modal regional alignment and leverages Multimodal Assisted Adversarial Learning (MAAL) to reinforce region-level correspondence. The Varied Regional Neighbor Learning (VRNL) module implements reliable neighborhood learning via multi-region association to stabilize pseudo-labels and capture local structure. Finally, the Uniform Merging (UM) module merges split clusters through alternating contrastive learning to improve cluster consistency. We evaluate the proposed method on SYSU-MM01 and RegDB. On RegDB's visible-to-infrared setting, the approach achieves Rank-1 = 93.34%, mean Average Precision (mAP) = 87.55%, and mean Inverse Negative Penalty (mINP) = 76.08%. These results indicate that our method effectively reduces modal discrepancies and increases feature discriminability. It outperforms most existing unsupervised baselines and several supervised approaches, thereby advancing the practical applicability of USL-VI-ReID.
(This article belongs to the Special Issue AI-Based Computer Vision Sensors & Systems—2nd Edition)
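
Treating channel augmentation (CA) as a third modality presupposes a concrete CA transform; a widely used one in visible–infrared ReID is random channel selection, sketched here as an assumed stand-in for the paper's CA:

```python
import torch

def channel_exchange(rgb):
    """rgb: (B, 3, H, W). For each sample, pick one channel at random and
    replicate it across all three channels, producing a grayscale-like
    'CA' image that sits between the visible and infrared modalities."""
    idx = torch.randint(0, 3, (rgb.size(0),))
    ca = rgb[torch.arange(rgb.size(0)), idx]      # (B, H, W) chosen channel
    return ca.unsqueeze(1).expand(-1, 3, -1, -1)  # (B, 3, H, W)

ca_batch = channel_exchange(torch.rand(4, 3, 128, 64))
```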

27 pages, 9169 KB  
Article
S2D-Net: A Synergistic Star-Attentive Network with Dynamic Feature Refinement for Robust Inshore SAR Ship Detection
by Shentao Wang, Byung-Won Min, Guoru Li, Depeng Gao, Jianlin Qiu and Yue Hong
Electronics 2026, 15(6), 1160; https://doi.org/10.3390/electronics15061160 - 11 Mar 2026
Viewed by 305
Abstract
Detecting ships with Synthetic Aperture Radar (SAR) in coastal areas remains difficult due to coherent speckle noise from the ocean surface, complex land clutter, and multi-scale target representations in the radar imagery. Most existing ship detection algorithms lose important target features during downsampling and have difficulty recovering those features through upsampling, resulting in a high number of false and missed detections. In this work, we present a new ship detection algorithm called Synergistic Star-Attentive Network with Dynamic Feature Refinement (S2D-Net). First, we create a new backbone, Multi-scale PCCA-StarNet, to generate robust feature representations. Within the backbone, we implement a Progressive Channel-Coordinate Attention (PCCA) mechanism that combines global channel filtering with adaptive coordinate locking to decouple ship textures from granular speckle noise. Second, we create a Dynamic Feature Refinement Neck. We develop a content-aware dynamic upsampler called DySample that replaces conventional interpolation to improve the fidelity of upsampled features for small targets, and we design a Star-PCCA Feature Aggregation module that fuses features across multiple scales, using star operations and the PCCA mechanism to refine semantic features and remove background clutter. Third, we develop a Lightweight Shared Convolutional Detection Head with Quality Estimation (LSCD-LQE). The LSCD-LQE reduces parameter redundancy through shared convolutional layers and adds a localization quality estimation branch, effectively reducing false positives in difficult coastal environments by aligning classification scores with Intersection over Union (IoU)-based localization quality. Our experimental results on the SSDD and HRSID datasets show that S2D-Net produces results comparable to representative ship detection algorithms. In particular, on the challenging HRSID inshore subset, the proposed method achieves a mean average precision (mAP) of 82.7%, 6.9% higher than the YOLOv11n baseline. These results demonstrate that S2D-Net is superior at detecting small coastal vessels and at mitigating the detrimental effects of complex nearshore environments on SAR ship detection.
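
The LSCD-LQE head aligns classification scores with localization quality. In the simplest reading (an assumption about the fusion rule, not the paper's exact head), the ranking score fed to NMS is the product of the two confidences:

```python
import torch

def quality_aware_score(cls_logits, iou_logits):
    """Fuse classification confidence with a predicted localization
    quality (IoU) so well-classified but poorly localized boxes are
    down-ranked before NMS. Both inputs are raw logits per box."""
    return cls_logits.sigmoid() * iou_logits.sigmoid()

scores = quality_aware_score(torch.randn(100), torch.randn(100))
```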

23 pages, 12547 KB  
Article
Data-Efficient Insulator Defect Detection in Power Transmission Systems via Multi-Granularity Feature Learning and Latent Context-Aware Fusion
by Xingxing Fan, Manxiang Gao, Yong Wang, Haining Tang, Fengyong Sun and Changpo Song
Electronics 2026, 15(5), 1081; https://doi.org/10.3390/electronics15051081 - 5 Mar 2026
Viewed by 362
Abstract
Real-world power transmission inspection faces acute data scarcity and severe class imbalance, as defective insulator instances are exceptionally rare compared to normal samples. To enable robust defect detection under such constraints, we present MS-LaT, a backbone network that fuses multi-granularity feature learning with latent context-aware fusion. The architecture processes visual inputs through a streamlined pipeline: an input stage employing AdaptTeLU-augmented inverted multi-scale separable-residual convolutions to discern subtle local anomalies; a contextual reasoning stage powered by a Latent Transformer encoder with Multi-Head Latent Attention (MLA) for holistic scene understanding; and an output stage utilising AdaptTeLU-refined inverted multi-scale convolutions to produce precise diagnostic decisions. Domain-adaptive batch normalization (AdaBN) is embedded to minimize cross-domain feature divergence, substantially boosting generalization across diverse operational environments. Experiments on real-world engineering datasets demonstrate the proposed method's robust insulator defect detection capability in complex environments.
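
AdaBN itself is a published recipe: re-estimate BatchNorm running statistics on unlabeled target-domain data while freezing all learned weights. A minimal sketch (the loader and device handling are assumptions):

```python
import torch
import torch.nn as nn

@torch.no_grad()
def adapt_bn(model, target_loader, device="cpu"):
    """AdaBN-style adaptation: reset BatchNorm running statistics, then
    re-estimate them with forward passes over unlabeled target-domain
    images. No gradients and no weight updates are involved."""
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            m.reset_running_stats()
    model.train()  # BN layers only update running stats in train mode
    for images in target_loader:
        model(images.to(device))
    model.eval()
    return model
```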

31 pages, 1466 KB  
Article
Fusing Geometric and Semantic Features via Cosine Similarity Cross-Attention for Remote Sensing Scene Classification
by Xuefei Xu and Chengjun Xu
Sensors 2026, 26(5), 1613; https://doi.org/10.3390/s26051613 - 4 Mar 2026
Viewed by 321
Abstract
High-resolution remote sensing image scene classification (HRRSI-SC) is crucial for obtaining accurate Earth surface information. However, the task remains challenging due to significant background interference, high intra-class variation, and subtle inter-class similarities. Convolutional neural networks (CNNs) are constrained by their local receptive fields, which limits their ability to capture long-range spatial dependencies. Vision Transformers (e.g., ViT-B-16), on the other hand, excel at global feature extraction but often suffer from high computational complexity and may lack the inherent inductive biases for local feature modeling that CNNs possess. To address these limitations, this paper proposes a cross-level feature complementary classification framework based on Lie Group manifold space, termed CBCAM-LGM. Within the proposed framework, multi-granularity features are first distilled via a global average pooling layer to suppress redundant information. The core of our approach, the cross-level bidirectional complementary attention module (CBCAM), then enables adaptive fusion of features from both branches through a cross-query attention mechanism. Furthermore, by employing parallel dilated convolutions and a parameter-sharing strategy, the model captures multi-scale contextual information with a single set of shared convolutional weights, reducing the computational complexity to merely 1.21 GMACs while preserving multi-scale representation with minimal parameter overhead. Extensive experiments on challenging benchmarks demonstrate the model's efficacy: it achieves a state-of-the-art classification accuracy of 97.81% on the AID dataset, surpassing the ViT-B-16 baseline by 1.63%, while containing only 11.237 million parameters (an 87% reduction). These results collectively affirm that our model offers an efficient solution with high accuracy and low complexity.
(This article belongs to the Section Remote Sensors)
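
Per the title, the cross-query attention presumably replaces raw dot products with cosine similarity, which makes the affinity scale-invariant. A hedged single-head sketch, with the temperature `tau` as an assumed hyperparameter:

```python
import torch
import torch.nn.functional as F

def cosine_cross_attention(q, k, v, tau=0.07):
    """Cross-attention where the query/key affinity is cosine similarity
    instead of a scaled dot product. q: (B, Nq, D); k, v: (B, Nk, D)."""
    sim = F.normalize(q, dim=-1) @ F.normalize(k, dim=-1).transpose(1, 2)
    return (sim / tau).softmax(dim=-1) @ v  # (B, Nq, D)

out = cosine_cross_attention(torch.randn(1, 49, 256),
                             torch.randn(1, 196, 256),
                             torch.randn(1, 196, 256))
```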

21 pages, 14880 KB  
Article
Beyond the Black Box: Interpretable Multi-Trait Essay Scoring with Trait-Aware Transformer
by Xiaoyi Tang
Electronics 2026, 15(5), 1066; https://doi.org/10.3390/electronics15051066 - 4 Mar 2026
Viewed by 341
Abstract
The rapid advancement of automated essay scoring (AES) has been constrained by a representation bottleneck, where monolithic models collapse diverse facets of writing constructs into a single, uninterpretable signal, undermining the pedagogical value of multi-dimensional rating traits. To address this limitation, the RoBERTa-based Trait-Aware Transformer (RoBERTa-TAT) is introduced. This architectural reframing replaces unified pooling with parallel, trait-specific attention streams, preserving and disentangling critical features such as conceptual depth and mechanical precision. Tested on the ASAP Dataset-7, RoBERTa-TAT attains a new state-of-the-art Quadratic Weighted Kappa (QWK) of 0.936, outperforming sequential baselines and conventional Transformer variants. Beyond gains in accuracy, this trait-specialized architecture recasts scoring from a black-box prediction into a transparent diagnostic tool, enabling actionable, fine-grained feedback across rating traits. High-resolution inspection reveals that the model's internal representations correlate with specific linguistic markers (such as discourse connectives for organization), suggesting a degree of structural alignment with expert judgment. By aligning high-capacity representation learning with the granular demands of formative assessment, RoBERTa-TAT provides a practical, interpretable blueprint for deploying accountable AI in education and broadening access to expert diagnostic insight.
(This article belongs to the Section Artificial Intelligence)
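
The trait-specific attention streams can be approximated by one learned pooling query per trait over the encoder tokens, each feeding its own regression head. The sketch below is an assumption about the mechanism, not the published architecture:

```python
import torch
import torch.nn as nn

class TraitAttentionPooling(nn.Module):
    """One learned attention query per scoring trait: each trait pools
    the token sequence differently before its own score head."""
    def __init__(self, hidden=768, n_traits=4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_traits, hidden))
        self.heads = nn.ModuleList(nn.Linear(hidden, 1) for _ in range(n_traits))

    def forward(self, tokens):                      # (B, N, H) encoder output
        attn = (self.queries @ tokens.transpose(1, 2)).softmax(dim=-1)  # (B, T, N)
        pooled = attn @ tokens                      # (B, T, H)
        return torch.cat([h(pooled[:, i]) for i, h in enumerate(self.heads)],
                         dim=-1)                    # (B, T) trait scores

scores = TraitAttentionPooling()(torch.randn(2, 128, 768))
print(scores.shape)  # (2, 4)
```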

27 pages, 2099 KB  
Article
Brain Tumor Classification Using DINO Features and Lightweight Classifiers
by Rim Missaoui, Marco Del Coco, Wajdi Saadaoui, Wided Hechkel, Abdelhamid Helali, Pierluigi Carcagnì and Marco Leo
Electronics 2026, 15(5), 952; https://doi.org/10.3390/electronics15050952 - 26 Feb 2026
Viewed by 520
Abstract
The accurate detection and classification of brain tumors from magnetic resonance imaging (MRI) are critical for diagnosis and treatment planning. While deep learning has shown remarkable success in this domain, many state-of-the-art models rely on complex, end-to-end convolutional neural networks (CNNs) that require extensive computational resources and large, annotated datasets for training. This work proposes a novel and efficient methodology that, for the first time, leverages self-supervised DINO vision transformer backbones (DINOv1, DINOv2, and DINOv3), pretrained on a large corpus of natural images, as powerful feature extractors for brain tumor analysis. We utilize the rich, general-purpose features from DINO-family backbones without fine-tuning the core model. These extracted features are then fed into a simpler, task-specific classifier (such as a support vector machine or a multi-layer perceptron) for the final detection and multi-class classification (e.g., glioma, meningioma, and pituitary tumor). Our methodology is evaluated on two benchmark medical imaging datasets with different class granularities. The results demonstrate that the proposed method achieves competitive and, in some cases, superior classification accuracy compared to representative fine-tuned convolutional neural networks and attention-based architectures, while significantly reducing the number of trainable parameters and training time. In particular, the best configuration achieves up to 98.17% accuracy and an F1-score of 98.18% on the 15-class dataset, and 99.08% accuracy and an F1-score of 99.02% on the 4-class dataset. This study confirms the exceptional transfer learning capabilities of self-supervised vision transformers like DINO in the medical imaging domain, establishing them as highly effective and efficient backbones for robust brain tumor detection and classification systems.
(This article belongs to the Special Issue Assistive Technology: Advances, Applications and Challenges)
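
The frozen-backbone-plus-light-classifier pipeline is straightforward to reproduce in outline. The sketch below uses the public DINOv2 torch.hub entry point and an sklearn SVM; the input preprocessing and the placeholder data/labels are assumptions, not the paper's setup:

```python
import torch
from sklearn.svm import SVC

# Load a frozen DINOv2 ViT-S/14 backbone (weights download on first use).
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
backbone.eval()

@torch.no_grad()
def extract_features(images):
    """images: (B, 3, 224, 224), ImageNet-normalized (assumption).
    Returns (B, 384) CLS embeddings from the frozen backbone."""
    return backbone(images).numpy()

# Fit a lightweight classifier on the frozen features (placeholder data;
# in practice these would be MRI slices and tumor-class labels).
X_train = extract_features(torch.randn(8, 3, 224, 224))
clf = SVC(kernel="rbf").fit(X_train, [0, 1, 2, 3, 0, 1, 2, 3])
```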

40 pages, 12177 KB  
Article
Dynamic Multi-Relation Learning with Multi-Scale Hypergraph Transformer for Multi-Modal Traffic Forecasting
by Juan Chen and Meiqing Shan
Future Transp. 2026, 6(1), 51; https://doi.org/10.3390/futuretransp6010051 - 22 Feb 2026
Viewed by 393
Abstract
Accurate multi-modal traffic demand forecasting is key to optimizing intelligent transportation systems (ITSs). To overcome the shortcomings of existing methods in capturing dynamic high-order correlations between heterogeneous spatial units and in decoupling intra- and inter-mode dependencies at multiple time scales, this paper proposes a Dynamic Multi-Relation Learning with Multi-Scale Hypergraph Transformer method (MST-Hyper Trans). The model integrates three novel modules. First, the Multi-Scale Temporal Hypergraph Convolutional Network (MSTHCN) achieves collaborative decoupling and captures periodic and cross-modal temporal interactions of transportation demand at multiple granularities, such as time of day, day, and week, by constructing a multi-scale temporal hypergraph. Second, the Dynamic Multi-Relationship Spatial Hypergraph Network (DMRSHN) integrates geographic proximity, passenger flow similarity, and transportation connectivity to construct structural hyperedges, and combines KNN and K-means algorithms to generate dynamic hyperedges, thereby accurately modeling the dynamically evolving high-order spatial correlations between heterogeneous nodes. Finally, the Conditional Meta Attention Gated Fusion Network (CMAGFN), a lightweight meta network, introduces a gating mechanism based on multi-head cross-attention; it dynamically generates node features from real-time traffic context and adaptively calibrates the fusion weights of multi-source information for scene-aware prediction. Experiments on three real-world datasets (NYC-Taxi, NYC-Bike, and NYC-Subway) demonstrate that MST-Hyper Trans achieves an average reduction of 7.6% in RMSE and 9.2% in MAE across all modes compared to the strongest baseline, while maintaining interpretability of spatiotemporal interactions. This study not only provides good model interpretability but also offers a reliable solution for collaborative multi-modal traffic management.
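
The hypergraph convolutions underlying MSTHCN and DMRSHN typically follow the standard HGNN propagation rule X' = D_v^{-1/2} H D_e^{-1} H^T D_v^{-1/2} X W. A direct dense implementation (an illustration of that standard rule, not the paper's code):

```python
import torch

def hypergraph_conv(X, H, W):
    """One hypergraph convolution layer. X: (N, C) node features,
    H: (N, E) incidence matrix, W: (C, C') learnable weights.
    Computes D_v^-1/2 H D_e^-1 H^T D_v^-1/2 X W."""
    Dv = H.sum(dim=1).clamp(min=1).pow(-0.5)       # inverse sqrt node degrees
    De = H.sum(dim=0).clamp(min=1).reciprocal()    # inverse hyperedge degrees
    A = ((Dv[:, None] * H) * De[None, :]) @ (H.t() * Dv[None, :])
    return A @ X @ W

X2 = hypergraph_conv(torch.randn(5, 8),
                     torch.bernoulli(torch.rand(5, 3)),
                     torch.randn(8, 16))
print(X2.shape)  # (5, 16)
```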

25 pages, 3298 KB  
Article
FDE-YOLO: An Improved Algorithm for Small Target Detection in UAV Images
by Jialiang Li, Xu Guo, Xu Zhao and Jie Jin
Mathematics 2026, 14(4), 663; https://doi.org/10.3390/math14040663 - 13 Feb 2026
Viewed by 610
Abstract
Accurate small object detection in unmanned aerial vehicle (UAV) imagery is fundamental to numerous safety-critical applications, including intelligent transportation, urban surveillance, and disaster assessment. However, extreme scale compression, dense object distributions, and complex backgrounds severely constrain the feature representation capability of existing detectors, leading to degraded reliability in real-world deployments. To overcome these limitations, we propose FDE-YOLO, a lightweight yet high-performance detection framework built upon YOLOv11 with three complementary architectural innovations. The Fine-Grained Detection Pyramid (FGDP) integrates space-to-depth convolution with a CSP-MFE module that fuses multi-granularity features through parallel local, context, and global branches, capturing comprehensive small-target information while avoiding the computational overhead of layer stacking. The Dynamic Detection Fusion Head (DDFHead) unifies scale-aware, spatial-aware, and task-aware attention mechanisms via sequential refinement with DCNv4 and FReLU activation, adaptively enhancing discriminative capability for densely clustered targets in complex scenes. The EdgeSpaceNet module explicitly fuses Sobel-extracted boundary features with spatial convolution outputs through residual connections, recovering edge details typically lost in standard operations while reducing parameter count via depthwise separable convolutions. Extensive experiments on the VisDrone2019 dataset demonstrate that FDE-YOLO achieves 53.6% precision, 42.5% recall, 43.3% mAP50, and 26.3% mAP50:95, surpassing YOLOv11s by 2.8%, 4.4%, 4.1%, and 2.8%, respectively, with only 10.25 M parameters. The proposed approach outperforms UAV-specialized methods including Drone-YOLO and MASF-YOLO while using significantly fewer parameters (37.5% and 29.8% reductions, respectively), demonstrating superior efficiency. Cross-dataset evaluations on UAV-DT and NWPU VHR-10 further confirm strong generalization capability with 1.6% and 1.5% mAP50 improvements, respectively, validating FDE-YOLO as an effective and efficient solution for reliable UAV-based small object detection in real-world scenarios.
(This article belongs to the Special Issue New Advances in Image Processing and Computer Vision)
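
Space-to-depth convolution, used in the FGDP, has a compact PyTorch form: pixel_unshuffle folds 2x2 spatial blocks into channels losslessly (unlike a strided conv, which discards detail) before a 1x1 mixing conv. The channel sizes below are assumptions:

```python
import torch
import torch.nn as nn

class SPDConv(nn.Module):
    """Space-to-depth convolution: move each 2x2 spatial block into the
    channel dimension (no information loss), then mix with a 1x1 conv."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch * 4, out_ch, 1)

    def forward(self, x):
        return self.conv(nn.functional.pixel_unshuffle(x, 2))

y = SPDConv(3, 64)(torch.randn(1, 3, 64, 64))
print(y.shape)  # (1, 64, 32, 32)
```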

24 pages, 30825 KB  
Article
MA-Net: Multi-Granularity Attention Network for Fine-Grained Classification of Ship Targets in Remote Sensing Images
by Jiamin Qi, Peifeng Li, Guangyao Zhou, Ben Niu, Feng Wang, Qiantong Wang, Yuxin Hu and Xiantai Xiang
Remote Sens. 2026, 18(3), 462; https://doi.org/10.3390/rs18030462 - 1 Feb 2026
Viewed by 537
Abstract
The classification of ship targets in remote sensing images holds significant application value in fields such as marine monitoring and national defence. Although existing research has yielded considerable achievements in ship classification, current methods struggle to distinguish highly similar ship categories in fine-grained classification tasks due to a lack of targeted design. Specifically, they exhibit the following shortcomings: limited ability to extract locally discriminative features; inadequate fusion of features at high and low levels of representation granularity; and sensitivity of model performance to background noise. To address these issues, this paper proposes a fine-grained classification framework for ship targets in remote sensing images based on a Multi-Granularity Attention Network (MA-Net), designed specifically to tackle the three challenges above. The framework first performs multi-level feature extraction through a backbone network, then introduces an Adaptive Local Feature Attention (ALFA) module, which employs dynamic overlapping region segmentation to help the network learn spatial structural combinations and thereby optimise the representation of local features. Secondly, a Dynamic Multi-Granularity Feature Fusion (DMGFF) module is designed to dynamically fuse feature maps of varying representational granularities and select key attribute features. Finally, a Feature-Based Data Augmentation (FBDA) method is developed to effectively highlight target detail features, thereby enhancing feature expression capabilities. On the public FGSC-23 and FGSCR-42 datasets, MA-Net attains top accuracies of 93.12% and 98.40%, surpassing the previous best methods and establishing a new state of the art for fine-grained classification of ship targets in remote sensing images.
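
The "dynamic overlapping region segmentation" in ALFA at minimum requires extracting overlapping local regions from the feature map; F.unfold gives a simple fixed-grid baseline for that step (the dynamic adaptation is omitted in this assumed sketch):

```python
import torch
import torch.nn.functional as F

def overlapping_regions(feat, size=3, stride=2):
    """Split a feature map (B, C, H, W) into overlapping local regions
    with F.unfold, as a starting point for region-level attention.
    Returns (B, L, C, size, size) where L is the number of regions."""
    patches = F.unfold(feat, kernel_size=size, stride=stride)  # (B, C*s*s, L)
    B, _, L = patches.shape
    return patches.transpose(1, 2).reshape(B, L, feat.size(1), size, size)

regions = overlapping_regions(torch.randn(2, 64, 16, 16))
print(regions.shape)  # (2, 49, 64, 3, 3)
```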