Search Results (47)

Search Parameters:
Keywords = multiscale token

28 pages, 32815 KB  
Article
LiteSAM: Lightweight and Robust Feature Matching for Satellite and Aerial Imagery
by Boya Wang, Shuo Wang, Yibin Han, Linfeng Xu and Dong Ye
Remote Sens. 2025, 17(19), 3349; https://doi.org/10.3390/rs17193349 - 1 Oct 2025
Viewed by 182
Abstract
We present LiteSAM, a lightweight satellite–aerial feature matching framework for robust UAV absolute visual localization (AVL) in GPS-denied environments. Existing satellite–aerial matching methods struggle with large appearance variations, texture-scarce regions, and limited efficiency for real-time UAV applications. LiteSAM integrates three key components to address these issues. First, efficient multi-scale feature extraction optimizes representation, reducing inference latency on edge devices. Second, a Token Aggregation–Interaction Transformer (TAIFormer) with a convolutional token mixer (CTM) models inter- and intra-image correlations, enabling robust global–local feature fusion. Third, a MinGRU-based dynamic subpixel refinement module adaptively learns spatial offsets, enhancing subpixel-level matching accuracy and cross-scenario generalization. Experiments show that LiteSAM achieves competitive performance across multiple datasets. On UAV-VisLoc, LiteSAM attains an RMSE@30 of 17.86 m, outperforming state-of-the-art semi-dense methods such as EfficientLoFTR. Its optimized variant, LiteSAM (opt., without dual softmax), delivers inference times of 61.98 ms on standard GPUs and 497.49 ms on NVIDIA Jetson AGX Orin, 22.9% and 19.8% faster than EfficientLoFTR (opt.), respectively. With 6.31M parameters, 2.4× fewer than EfficientLoFTR’s 15.05M, LiteSAM is well suited to edge deployment. Extensive evaluations on natural image matching and downstream vision tasks confirm its superior accuracy and efficiency for general feature matching.
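The abstract names a convolutional token mixer (CTM) but does not detail it. As a generic illustration of the underlying idea only — mixing each token with its spatial neighbours through a shared depthwise kernel on a 2-D token grid — here is a minimal NumPy sketch; the function name, shapes, and kernel choice are our own assumptions, not LiteSAM's implementation:

```python
import numpy as np

def conv_token_mixer(tokens, kernel):
    """Mix each token on an H x W x C grid with its k x k neighbourhood
    using one shared spatial kernel (zero padding at the borders).
    A generic sketch of a convolutional token mixer, not the actual CTM."""
    H, W, C = tokens.shape
    k = kernel.shape[0]
    pad = k // 2
    padded = np.pad(tokens, ((pad, pad), (pad, pad), (0, 0)))
    out = np.zeros_like(tokens)
    for i in range(H):
        for j in range(W):
            patch = padded[i:i + k, j:j + k, :]           # k x k x C window
            out[i, j] = np.einsum('klc,kl->c', patch, kernel)
    return out

tokens = np.random.rand(8, 8, 16)       # toy 8 x 8 grid of 16-dim tokens
identity = np.zeros((3, 3))
identity[1, 1] = 1.0                    # identity kernel: no mixing
mixed = conv_token_mixer(tokens, identity)
```

With the identity kernel each token passes through unchanged; a learned kernel would blend neighbouring tokens, which is what gives the mixer its local receptive field at near-zero parameter cost.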

32 pages, 9638 KB  
Article
MSSA: A Multi-Scale Semantic-Aware Method for Remote Sensing Image–Text Retrieval
by Yun Liao, Zongxiao Hu, Fangwei Jin, Junhui Liu, Nan Chen, Jiayi Lv and Qing Duan
Remote Sens. 2025, 17(19), 3341; https://doi.org/10.3390/rs17193341 - 30 Sep 2025
Viewed by 310
Abstract
In recent years, the convenience and potential for information extraction offered by Remote Sensing Image–Text Retrieval (RSITR) have made it a significant focus of research in remote sensing (RS) knowledge services. Current mainstream methods for RSITR generally align fused image features at multiple scales with textual features, focusing primarily on the local information of RS images while neglecting potential semantic information, which results in insufficient alignment in the cross-modal semantic space. To overcome this limitation, we propose a Multi-Scale Semantic-Aware Remote Sensing Image–Text Retrieval method (MSSA). This method introduces Progressive Spatial Channel Joint Attention (PSCJA), which enhances the expressive capability of multi-scale image features through Window-Region-Global Progressive Attention (WRGPA) and Segmented Channel Attention (SCA). Additionally, the Image-Guided Text Attention (IGTA) mechanism dynamically adjusts textual attention weights based on visual context. Furthermore, the Cross-Modal Semantic Extraction Module (CMSE) incorporates learnable semantic tokens at each scale, enabling attention interaction between multi-scale features of different modalities and capturing hierarchical semantic associations. This multi-scale semantic-guided retrieval method ensures cross-modal semantic consistency, significantly improving the accuracy of cross-modal retrieval in RS. MSSA demonstrates superior retrieval accuracy in experiments across three baseline datasets, achieving new state-of-the-art performance.
(This article belongs to the Section Remote Sensing Image Processing)

21 pages, 37484 KB  
Article
Reconstructing Hyperspectral Images from RGB Images by Multi-Scale Spectral–Spatial Sequence Learning
by Wenjing Chen, Lang Liu and Rong Gao
Entropy 2025, 27(9), 959; https://doi.org/10.3390/e27090959 - 15 Sep 2025
Viewed by 646
Abstract
With rapid advancements in transformers, the reconstruction of hyperspectral images from RGB images, also known as spectral super-resolution (SSR), has made significant breakthroughs. However, existing transformer-based methods often struggle to balance computational efficiency with long-range receptive fields. Recently, Mamba has demonstrated linear complexity in modeling long-range dependencies and shown broad applicability in vision tasks. This paper proposes a multi-scale spectral–spatial sequence learning method, named MSS-Mamba, for reconstructing hyperspectral images from RGB images. First, we introduce a continuous spectral–spatial scan (CS3) mechanism to improve cross-dimensional feature extraction of the foundational Mamba model. Second, we propose a sequence tokenization strategy that generates multi-scale-aware sequences to overcome Mamba’s limitations in hierarchically learning multi-scale information. Specifically, we design the multi-scale information fusion (MIF) module, which tokenizes input sequences before feeding them into Mamba. The MIF employs a dual-branch architecture to process global and local information separately, dynamically fusing features through an adaptive router that generates weighting coefficients. This produces feature maps that contain both global contextual information and local details, ultimately reconstructing a high-fidelity hyperspectral image. Experimental results on the ARAD_1k, CAVE, and grss_dfc_2018 datasets demonstrate the performance of MSS-Mamba.
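The MIF module's "adaptive router that generates weighting coefficients" is described only at a high level. One plausible reading — a tiny linear router that produces per-token softmax weights for convexly combining the global and local branches — can be sketched as follows; all names, shapes, and the linear router itself are our assumptions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def adaptive_router(global_feat, local_feat, w_route):
    """Fuse a global and a local branch per token: a linear router maps the
    concatenated features to two logits, softmax turns them into convex
    mixing weights. Hypothetical reading of MIF's adaptive router."""
    logits = np.concatenate([global_feat, local_feat], axis=-1) @ w_route  # (N, 2)
    weights = softmax(logits)                                              # (N, 2)
    stacked = np.stack([global_feat, local_feat], axis=-1)                 # (N, C, 2)
    return np.einsum('ncb,nb->nc', stacked, weights)

rng = np.random.default_rng(0)
g = np.zeros((4, 3))            # stand-in global-branch features
l = np.ones((4, 3))             # stand-in local-branch features
fused = adaptive_router(g, l, rng.normal(size=(6, 2)))
```

Because the weights are a softmax, each fused token is a convex combination of the two branches, so the output always lies between the global and local feature values.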

21 pages, 4900 KB  
Article
RingFormer-Seg: A Scalable and Context-Preserving Vision Transformer Framework for Semantic Segmentation of Ultra-High-Resolution Remote Sensing Imagery
by Zhan Zhang, Daoyu Shu, Guihe Gu, Wenkai Hu, Ru Wang, Xiaoling Chen and Bingnan Yang
Remote Sens. 2025, 17(17), 3064; https://doi.org/10.3390/rs17173064 - 3 Sep 2025
Viewed by 957
Abstract
Semantic segmentation of ultra-high-resolution remote sensing (UHR-RS) imagery plays a critical role in land use and land cover analysis, yet it remains computationally intensive due to the enormous input size and high spatial complexity. Existing studies have commonly employed strategies such as patch-wise processing, multi-scale model architectures, lightweight networks, and representation sparsification to reduce resource demands, but they have often struggled to maintain long-range contextual awareness and scalability for inputs of arbitrary size. To address this, we propose RingFormer-Seg, a scalable Vision Transformer framework that enables long-range context learning through multi-device parallelism in UHR-RS image segmentation. RingFormer-Seg decomposes the input into spatial subregions and processes them through a distributed three-stage pipeline. First, the Saliency-Aware Token Filter (STF) selects informative tokens to reduce redundancy. Next, the Efficient Local Context Module (ELCM) enhances intra-region features via memory-efficient attention. Finally, the Cross-Device Context Router (CDCR) exchanges token-level information across devices to capture global dependencies. Fine-grained detail is preserved through the residual integration of unselected tokens, and a hierarchical decoder generates high-resolution segmentation outputs. We conducted extensive experiments on three benchmarks covering UHR-RS images from 2048 × 2048 to 8192 × 8192 pixels. Results show that our framework achieves top segmentation accuracy while significantly improving computational efficiency across the DeepGlobe, Wuhan, and Guangdong datasets. RingFormer-Seg offers a versatile solution for UHR-RS image segmentation and demonstrates potential for practical deployment in nationwide land cover mapping, supporting informed decision-making in land resource management, environmental policy planning, and sustainable development.
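Token filtering of the kind the STF performs is, at its core, a top-k selection by saliency score that also remembers the kept indices so the unselected tokens can be reintegrated residually later. A minimal sketch under those assumptions (STF's actual saliency scores are learned, not given):

```python
import numpy as np

def saliency_token_filter(tokens, scores, keep_ratio=0.25):
    """Keep the top-k most salient tokens and return their indices, so the
    unselected tokens can later be merged back residually.
    Generic top-k sketch; the STF's scoring function is learned."""
    k = max(1, int(len(tokens) * keep_ratio))
    idx = np.argsort(scores)[::-1][:k]   # indices of the k highest scores
    return tokens[idx], idx

tokens = np.arange(16, dtype=float).reshape(8, 2)   # 8 toy tokens
scores = np.array([0.1, 0.9, 0.3, 0.8, 0.2, 0.7, 0.0, 0.5])
kept, idx = saliency_token_filter(tokens, scores, keep_ratio=0.25)
```

Keeping a quarter of 8 tokens leaves the two highest-scoring ones; downstream attention then runs on 2 tokens instead of 8, which is where the quadratic-cost savings come from.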

18 pages, 2565 KB  
Article
Rock Joint Segmentation in Drill Core Images via a Boundary-Aware Token-Mixing Network
by Seungjoo Lee, Yongjin Kim, Yongseong Kim, Jongseol Park and Bongjun Ji
Buildings 2025, 15(17), 3022; https://doi.org/10.3390/buildings15173022 - 25 Aug 2025
Viewed by 503
Abstract
The precise mapping of rock joint traces is fundamental to the design and safety assessment of foundations, retaining structures, and underground cavities in building and civil engineering. Existing deep learning approaches either impose prohibitive computational demands for on-site deployment or disrupt the topological continuity of subpixel lineaments that govern rock mass behavior. This study presents BATNet-Lite, a lightweight encoder–decoder architecture optimized for joint segmentation on resource-constrained devices. The encoder introduces a Boundary-Aware Token-Mixing (BATM) block that separates feature maps into patch tokens and directionally pooled stripe tokens; a bidirectional attention mechanism then transfers global context to local descriptors while refining stripe features, thereby capturing long-range connectivity with negligible overhead. A complementary Multi-Scale Line Enhancement (MLE) module combines depth-wise dilated and deformable convolutions to yield scale-invariant responses to joints of varying apertures. In the decoder, a Skeletal-Contrastive Decoder (SCD) employs dual heads to predict segmentation and skeleton maps simultaneously, while an InfoNCE-based contrastive loss enforces their topological consistency without requiring explicit skeleton labels. Training leverages a composite focal Tversky and edge IoU loss under a curriculum-thinning schedule, improving edge adherence and continuity. Ablation experiments confirm that BATM, MLE, and SCD each contribute substantial gains in boundary accuracy and connectivity preservation. By delivering topology-preserving joint maps with a small parameter count, BATNet-Lite facilitates rapid geological data acquisition for tunnel face mapping, slope inspection, and subsurface digital twin development, thereby supporting safer and more efficient building and underground engineering practice.
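"Directionally pooled stripe tokens" are commonly formed by averaging a feature map along one spatial axis, producing one token per row and one per column; long thin structures such as joint traces then show up strongly in a single stripe. A sketch of that standard formulation (BATM's exact variant may differ):

```python
import numpy as np

def stripe_tokens(feat):
    """Directionally pooled stripe tokens from an H x W x C feature map:
    mean-pool along the width for row tokens and along the height for
    column tokens. Standard stripe pooling; BATM's variant may differ."""
    row_tokens = feat.mean(axis=1)   # (H, C): one token per row
    col_tokens = feat.mean(axis=0)   # (W, C): one token per column
    return row_tokens, col_tokens

feat = np.arange(2 * 3 * 4, dtype=float).reshape(2, 3, 4)  # toy 2 x 3 map, C=4
rows, cols = stripe_tokens(feat)
```

The H + W stripe tokens summarize the whole map with far fewer elements than the H × W patch tokens, which is why attending over them adds long-range context at negligible cost.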

22 pages, 7620 KB  
Article
DSTANet: A Lightweight and High-Precision Network for Fine-Grained and Early Identification of Maize Leaf Diseases in Field Environments
by Xinyue Gao, Lili He, Yinchuan Liu, Jiaxin Wu, Yuying Cao, Shoutian Dong and Yinjiang Jia
Sensors 2025, 25(16), 4954; https://doi.org/10.3390/s25164954 - 10 Aug 2025
Viewed by 630
Abstract
Early and accurate identification of maize diseases is crucial for ensuring sustainable agricultural development. However, existing maize disease identification models face challenges including high inter-class similarity, intra-class variability, and limited capability in identifying early-stage symptoms. To address these limitations, we propose DSTANet (decomposed spatial token aggregation network), a lightweight and high-performance model for maize leaf disease identification. In this study, we constructed a comprehensive maize leaf image dataset comprising six common disease types and healthy samples, with early and late stages of northern leaf blight and eyespot specifically differentiated. DSTANet employs MobileViT as the backbone architecture, combining the advantages of CNNs for local feature extraction with transformers for global feature modeling. To enhance lesion localization and mitigate interference from complex field backgrounds, a DSFM (decomposed spatial fusion module) is introduced. Additionally, an MSTA (multi-scale token aggregator) is designed to leverage hidden-layer feature channels more effectively, improving information flow and preventing gradient vanishing. Experimental results show that DSTANet achieves an accuracy of 96.11%, precision of 96.17%, recall of 96.11%, and F1-score of 96.14%. With only 1.9M parameters, 0.6 GFLOPs (floating point operations), and an inference speed of 170 images per second, the model meets real-time deployment requirements on edge devices. This study provides a novel and practical approach for fine-grained and early-stage maize disease identification, offering technical support for smart agriculture and precision crop management.
(This article belongs to the Section Smart Agriculture)

24 pages, 3937 KB  
Article
HyperTransXNet: Learning Both Global and Local Dynamics with a Dual Dynamic Token Mixer for Hyperspectral Image Classification
by Xin Dai, Zexi Li, Lin Li, Shuihua Xue, Xiaohui Huang and Xiaofei Yang
Remote Sens. 2025, 17(14), 2361; https://doi.org/10.3390/rs17142361 - 9 Jul 2025
Cited by 1 | Viewed by 637
Abstract
Recent advances in hyperspectral image (HSI) classification have demonstrated the effectiveness of hybrid architectures that integrate convolutional neural networks (CNNs) and Transformers, leveraging CNNs for local feature extraction and Transformers for global dependency modeling. However, existing fusion approaches face three critical challenges: (1) insufficient synergy between spectral and spatial feature learning due to rigid coupling mechanisms; (2) high computational complexity resulting from redundant attention calculations; and (3) limited adaptability to spectral redundancy and noise in small-sample scenarios. To address these limitations, we propose HyperTransXNet, a novel CNN-Transformer hybrid architecture that incorporates adaptive spectral-spatial fusion. Specifically, the proposed HyperTransXNet comprises three key modules: (1) a Hybrid Spatial-Spectral Module (HSSM) that captures the refined local spectral-spatial features and models global spectral correlations by combining depth-wise dynamic convolution with frequency-domain attention; (2) a Mixture-of-Experts Routing (MoE-R) module that adaptively fuses multi-scale features by dynamically selecting optimal experts via Top-K sparse weights; and (3) a Spatial-Spectral Tokens Enhancer (SSTE) module that ensures causality-preserving interactions between spectral bands and spatial contexts. Extensive experiments on the Indian Pines, Houston 2013, and WHU-Hi-LongKou datasets demonstrate the superiority of HyperTransXNet.
(This article belongs to the Special Issue AI-Driven Hyperspectral Remote Sensing of Atmosphere and Land)
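The "Top-K sparse weights" used by the MoE-R module refer to the standard Mixture-of-Experts gating pattern: keep only the k largest routing logits, renormalize them with a softmax, and zero out all other experts. A minimal sketch of that standard gate (the module's surrounding routing network is not specified in the abstract):

```python
import numpy as np

def topk_sparse_weights(logits, k=2):
    """Standard Top-K MoE gating: softmax over the k largest routing
    logits, all other expert weights set exactly to zero."""
    idx = np.argsort(logits)[::-1][:k]          # the k strongest experts
    w = np.zeros_like(logits, dtype=float)
    e = np.exp(logits[idx] - logits[idx].max()) # stable softmax over top-k
    w[idx] = e / e.sum()
    return w

w = topk_sparse_weights(np.array([0.1, 2.0, -1.0, 1.0]), k=2)
```

Only the selected experts receive gradient and compute, which is what makes the fusion both adaptive and cheap relative to dense expert averaging.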

16 pages, 2358 KB  
Article
A Hybrid Content-Aware Network for Single Image Deraining
by Guoqiang Chai, Rui Yang, Jin Ge and Yulei Chen
Computers 2025, 14(7), 262; https://doi.org/10.3390/computers14070262 - 4 Jul 2025
Viewed by 557
Abstract
Rain streaks degrade the quality of optical images and seriously affect the effectiveness of subsequent vision-based algorithms. Although applications of convolutional neural networks (CNNs) and the self-attention mechanism (SA) to single image deraining have shown great success, unresolved issues remain regarding deraining performance and computational load. This paper coordinates and exploits the complementary advantages of CNNs and SA, proposing a hybrid content-aware deraining network (CAD) to reduce complexity and generate high-quality results. Specifically, we construct the CADBlock, comprising a content-aware convolution and attention mixer module (CAMM) and a multi-scale double-gated feed-forward module (MDFM). In CAMM, the attention mechanism is used for intricate windows to generate abundant features, while simple convolution is used for plain windows to reduce computational costs. In MDFM, multi-scale spatial features are double-gated and fused to preserve local detail and enhance image restoration capability. Furthermore, a four-token contextual attention module (FTCA) is introduced to explore the content information among neighboring keys to improve representation ability. Both qualitative and quantitative validations on synthetic and real-world rain images demonstrate that the proposed CAD achieves competitive deraining performance.
(This article belongs to the Special Issue Machine Learning Applications in Pattern Recognition)

20 pages, 1166 KB  
Article
MSP-EDA: Multivariate Time Series Forecasting Based on Multiscale Patches and External Data Augmentation
by Shunhua Peng, Wu Sun, Panfeng Chen, Huarong Xu, Dan Ma, Mei Chen, Yanhao Wang and Hui Li
Electronics 2025, 14(13), 2618; https://doi.org/10.3390/electronics14132618 - 28 Jun 2025
Viewed by 661
Abstract
Accurate multivariate time series forecasting remains a major challenge in various real-world applications, primarily due to the limitations of existing models in capturing multiscale temporal dependencies and effectively integrating external data. To address these issues, we propose MSP-EDA, a novel multivariate time series forecasting framework that integrates multiscale patching and external data enhancement. Specifically, MSP-EDA utilizes the Discrete Fourier Transform (DFT) to extract dominant global periodic patterns and employs an adaptive Continuous Wavelet Transform (CWT) to capture scale-sensitive local variations. In addition, multiscale patches are constructed to capture temporal patterns at different resolutions, and a specialized encoder is designed for each scale. Each encoder incorporates temporal attention, channel correlation attention, and cross-attention with external data to capture intra-scale temporal dependencies, inter-variable correlations, and external influences, respectively. To fuse information from different temporal scales, we introduce a trainable global token that serves as a variable-wise aggregator across scales. Extensive experiments on four public benchmark datasets and three real-world vector database datasets that we collected demonstrate that MSP-EDA consistently outperforms state-of-the-art methods, achieving particularly notable improvements on vector database workloads. Ablation studies further confirm the effectiveness of each module and the adaptability of MSP-EDA to complex forecasting scenarios involving external dependencies.
(This article belongs to the Special Issue Machine Learning in Data Analytics and Prediction)
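Extracting "dominant global periodic patterns" with the DFT typically means taking the amplitude spectrum, picking the strongest non-DC frequency bins, and converting them to periods. A self-contained sketch of that step (the patching built on top of it in MSP-EDA is omitted; names are ours):

```python
import numpy as np

def dominant_periods(series, top_k=2):
    """Return the top-k periodicities of a 1-D series, found as the
    strongest bins of the real-FFT amplitude spectrum (DC bin skipped).
    Sketches only the DFT step described for MSP-EDA."""
    amp = np.abs(np.fft.rfft(series - series.mean()))
    freqs = np.argsort(amp[1:])[::-1][:top_k] + 1   # +1 restores bin index
    return [len(series) // f for f in freqs]

t = np.arange(240)
# toy series: a strong daily (period-24) cycle plus a weaker 8-step cycle
series = np.sin(2 * np.pi * t / 24) + 0.5 * np.sin(2 * np.pi * t / 8)
periods = dominant_periods(series)   # -> [24, 8]
```

The recovered periods can then drive the patch lengths, so each scale-specific encoder sees patches aligned with a genuine cycle of the data.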

24 pages, 2802 KB  
Article
MSDCA: A Multi-Scale Dual-Branch Network with Enhanced Cross-Attention for Hyperspectral Image Classification
by Ning Jiang, Shengling Geng, Yuhui Zheng and Le Sun
Remote Sens. 2025, 17(13), 2198; https://doi.org/10.3390/rs17132198 - 26 Jun 2025
Viewed by 636
Abstract
The high dimensionality of hyperspectral data, coupled with limited labeled samples and complex scene structures, makes spatial–spectral feature learning particularly challenging. To address these limitations, we propose a dual-branch deep learning framework named MSDCA, which performs spatial–spectral joint modeling under limited supervision. First, a multiscale 3D spatial–spectral feature extraction module (3D-SSF) employs parallel 3D convolutional branches with diverse kernel sizes and dilation rates, enabling hierarchical modeling of spatial–spectral representations from large-scale patches and effectively capturing both fine-grained textures and global context. Second, a multi-branch directional feature module (MBDFM) enhances the network’s sensitivity to directional patterns and long-range spatial relationships. It achieves this by applying axis-aware depthwise separable convolutions along both horizontal and vertical axes, thereby significantly improving the representation of spatial features. Finally, the enhanced cross-attention Transformer encoder (ECATE) integrates a dual-branch fusion strategy, where a cross-attention stream learns semantic dependencies across multi-scale tokens, and a residual path ensures the preservation of structural integrity. The fused features are further refined through lightweight channel and spatial attention modules. This adaptive alignment process enhances the discriminative power of heterogeneous spatial–spectral features. The experimental results on three widely used benchmark datasets demonstrate that the proposed method consistently outperforms state-of-the-art approaches in terms of classification accuracy and robustness. Notably, the framework is particularly effective for small-sample classes and complex boundary regions, while maintaining high computational efficiency.

29 pages, 5553 KB  
Article
Data-Driven Multi-Scale Channel-Aligned Transformer for Low-Carbon Autonomous Vessel Operations: Enhancing CO2 Emission Prediction and Green Autonomous Shipping Efficiency
by Jiahao Ni, Hongjun Tian, Kaijie Zhang, Yihong Xue and Yang Xiong
J. Mar. Sci. Eng. 2025, 13(6), 1143; https://doi.org/10.3390/jmse13061143 - 9 Jun 2025
Viewed by 762
Abstract
The accurate prediction of autonomous vessel CO2 emissions is critical for achieving IMO 2050 carbon neutrality and optimizing low-carbon maritime operations. Traditional models face limitations in real-time multi-source data analysis and dynamic cross-variable dependency modeling, hindering data-driven decision-making for sustainable autonomous shipping. This study proposes a Multi-scale Channel-aligned Transformer (MCAT) model, integrated with a 5G–satellite–IoT communication architecture, to address these challenges. The MCAT model employs multi-scale token reconstruction and a dual-level attention mechanism, effectively capturing spatiotemporal dependencies in heterogeneous data streams (AIS, sensors, weather) while suppressing high-frequency noise. To enable seamless data collaboration, a hybrid transmission framework combining satellite (Inmarsat/Iridium), 5G URLLC slicing, and industrial Ethernet is designed, achieving ultra-low latency (10 ms) and nanosecond-level synchronization via IEEE 1588v2. Validated on a 22-dimensional real autonomous vessel dataset, MCAT reduces prediction errors by 12.5% MAE and 24% MSE compared to state-of-the-art methods, demonstrating superior robustness under noisy scenarios. Furthermore, the proposed architecture supports smart autonomous shipping solutions by providing demonstrably interpretable emission insights through its dual-level attention mechanism (visualized via attention maps) for route optimization, fuel efficiency enhancement, and compliance with CII regulations. This research bridges AI-driven predictive analytics with green autonomous shipping technologies, offering a scalable framework for digitalized and sustainable maritime operations.
(This article belongs to the Special Issue Sustainable Maritime Transport and Port Intelligence)

29 pages, 44456 KB  
Article
AUHF-DETR: A Lightweight Transformer with Spatial Attention and Wavelet Convolution for Embedded UAV Small Object Detection
by Hengyu Guo, Qunyong Wu and Yuhang Wang
Remote Sens. 2025, 17(11), 1920; https://doi.org/10.3390/rs17111920 - 31 May 2025
Cited by 2 | Viewed by 1685
Abstract
Real-time object detection on embedded unmanned aerial vehicles (UAVs) is crucial for emergency rescue, autonomous driving, and target tracking applications. However, UAVs’ hardware limitations create conflicts between model size and detection accuracy. Moreover, challenges such as complex backgrounds from the UAV’s perspective, severe occlusion, densely packed small targets, and uneven lighting conditions complicate real-time detection for embedded UAVs. To tackle these challenges, we propose AUHF-DETR, an embedded detection model derived from RT-DETR. In the backbone, we introduce a novel WTC-AdaResNet paradigm that utilizes reversible connections to decouple small-object features. We further replace the original global attention mechanism with the PSA module to strengthen inter-feature relationships within each ROI, thereby resolving the embedded challenges posed by RT-DETR’s complex token computations. In the encoder, we introduce a BDFPN for multi-scale feature fusion, effectively mitigating the small-object detection difficulties caused by the baseline’s Hungarian assignment. Extensive experiments on the public VisDrone2019, HIT-UAV, and CARPK datasets demonstrate that compared with RT-DETR-r18, AUHF-DETR achieves a 2.1% increase in APs on VisDrone2019, reduces the parameter count by 49.0%, and attains 68 FPS (AGX Xavier), thus satisfying the real-time requirements for small-object detection in embedded UAVs.

23 pages, 12686 KB  
Article
A High-Precision Defect Detection Approach Based on BiFDRep-YOLOv8n for Small Target Defects in Photovoltaic Modules
by Yi Lu, Chunsong Du, Xu Li, Shaowei Liang, Qian Zhang and Zhenghui Zhao
Energies 2025, 18(9), 2299; https://doi.org/10.3390/en18092299 - 30 Apr 2025
Cited by 1 | Viewed by 787
Abstract
With the accelerated transition of the global energy structure towards decarbonization, the share of PV power generation in the power system continues to rise; the IEA predicts that PV will account for 80% of new global renewable installations during 2025–2030. However, latent faults emerging from the long-term operation of photovoltaic (PV) power plants significantly compromise their operational efficiency. Existing electroluminescence (EL) detection methods in PV plants face challenges including grain boundary interference, probe band artifacts, non-uniform luminescence, and complex backgrounds, which elevate the risk of missing small defects. In this paper, we propose a high-precision defect detection method based on BiFDRep-YOLOv8n for small target defects in PV power plants, aiming to improve detection accuracy and real-time performance and to provide an efficient solution for the intelligent inspection of PV power plants. Firstly, the vision transformer RepViT is constructed as the backbone network, based on the dual-path mechanism of Token Mixer and Channel Mixer, to achieve local feature extraction and global information modeling; combined with structural reparameterization, it enhances sensitivity to small defects. Secondly, to handle the multi-scale characteristics of defects, the neck network is optimized by introducing a bidirectional weighted feature pyramid network (BiFPN), which adopts an adaptive weight allocation strategy to enhance feature fusion and improve the characterization of defects at different scales. Finally, the detection head uses DyHead-DCNv3, which combines the triple attention mechanism of scale, space, and task awareness and introduces deformable convolution (DCNv3) to improve the modeling capability and detection accuracy for irregular defects.
(This article belongs to the Section A2: Solar Energy and Photovoltaic Systems)
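BiFPN's "adaptive weight allocation strategy" is usually its fast normalized fusion: learnable per-input scalars are clipped at zero and normalized to sum to (approximately) one before the feature maps are summed. A minimal sketch of that published fusion rule (the surrounding BiFPN topology is omitted):

```python
import numpy as np

def fast_normalized_fusion(features, weights, eps=1e-4):
    """BiFPN-style fast normalized fusion: ReLU the learnable weights,
    normalize them to sum to ~1, then take the weighted sum of the
    (already resized) input feature maps."""
    w = np.maximum(np.asarray(weights, dtype=float), 0.0)
    w = w / (w.sum() + eps)
    return sum(wi * f for wi, f in zip(w, features))

f1 = np.ones((2, 2))        # stand-in feature map from one pyramid level
f2 = 3.0 * np.ones((2, 2))  # stand-in feature map from another level
fused = fast_normalized_fusion([f1, f2], [1.0, 1.0])
```

Compared with a softmax gate, the ReLU-plus-normalize form is cheaper and keeps each contribution bounded in [0, 1], which stabilizes training of the repeated fusion nodes.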

16 pages, 716 KB  
Article
Efficient Graph Representation Learning by Non-Local Information Exchange
by Ziquan Wei, Tingting Dan, Jiaqi Ding and Guorong Wu
Electronics 2025, 14(5), 1047; https://doi.org/10.3390/electronics14051047 - 6 Mar 2025
Viewed by 993
Abstract
Graphs are an effective data structure for characterizing the ubiquitous connections and evolving behaviors that emerge in intertwined systems. Limited by the stereotype of node-to-node connections, learning node representations is often confined to a graph diffusion process in which local information is excessively aggregated as the random walk of graph neural networks (GNNs) explores far-reaching neighborhoods layer by layer. In this regard, tremendous effort has been devoted to alleviating feature over-smoothing so that current backbones lend themselves to deep network architectures. However, compared to designing new GNNs, less attention has been paid to the underlying topology via graph re-wiring, which mitigates not only the flaws of the random walk but also the over-smoothing risk, by reducing unnecessary diffusion in deep layers. Inspired by non-local mean techniques from image processing, we propose a non-local information exchange mechanism that establishes an express connection to a distant node instead of propagating information along the (possibly very long) original pathway node after node. Since seeking express connections throughout a graph can be computationally expensive in real-world applications, we propose a re-wiring framework (coined the express messenger wrapper) that progressively incorporates express links in a non-local manner, which allows us to capture multi-scale features without a very deep model; our approach is thus free of the over-smoothing challenge. We integrate the express messenger wrapper with existing GNN backbones (using either graph convolution or a tokenized transformer) and achieve a new record on the Roman-empire dataset as well as SOTA performance on both homophilous and heterophilous datasets. Full article
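The express-link idea can be illustrated with a toy re-wiring routine: for each node, add one shortcut edge to its most feature-similar node that is at least a few hops away, so distant information is exchanged in a single step instead of diffusing node after node. This is a simplified sketch of the general mechanism, not the authors' exact algorithm; the `min_hops` threshold and the dot-product similarity are assumptions for illustration:

```python
import numpy as np
from collections import deque

def hop_distances(adj, src):
    """BFS hop distances from src over an undirected adjacency matrix."""
    n = len(adj)
    dist = [-1] * n
    dist[src] = 0
    q = deque([src])
    while q:
        u = q.popleft()
        for v in range(n):
            if adj[u][v] and dist[v] == -1:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def add_express_links(adj, x, min_hops=3):
    """For each node, wire one 'express' edge to its most feature-similar
    node at least `min_hops` away (non-local exchange); links are added
    progressively, so earlier shortcuts shorten later searches."""
    adj = np.array(adj, dtype=int).copy()
    n = adj.shape[0]
    for u in range(n):
        dist = hop_distances(adj, u)
        far = [v for v in range(n) if dist[v] >= min_hops]
        if not far:
            continue
        sims = [x[u] @ x[v] for v in far]
        v = far[int(np.argmax(sims))]
        adj[u, v] = adj[v, u] = 1
    return adj

# 6-node path graph: nodes 0 and 5 start out 5 hops apart
path = np.diag(np.ones(5, int), 1)
path += path.T
x = np.eye(6)
x[5] = x[0]  # make node 5's features resemble node 0's
rewired = add_express_links(path, x, min_hops=3)
# node 0 now has a direct express edge to its distant look-alike, node 5
```

A learned similarity (or the non-local-mean weighting the abstract alludes to) would replace the raw dot product in practice.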
(This article belongs to the Special Issue Artificial Intelligence in Graphics and Images)

22 pages, 11312 KB  
Article
Multi-Scale Kolmogorov-Arnold Network (KAN)-Based Linear Attention Network: Multi-Scale Feature Fusion with KAN and Deformable Convolution for Urban Scene Image Semantic Segmentation
by Yuanhang Li, Shuo Liu, Jie Wu, Weichao Sun, Qingke Wen, Yibiao Wu, Xiujuan Qin and Yanyou Qiao
Remote Sens. 2025, 17(5), 802; https://doi.org/10.3390/rs17050802 - 25 Feb 2025
Cited by 4 | Viewed by 2272
Abstract
The introduction of an attention mechanism in remote sensing image segmentation improves segmentation accuracy. In this paper, a novel multi-scale KAN-based linear attention (MKLA) segmentation network, MKLANet, is developed to produce better segmentation results. A hybrid global–local attention mechanism in the feature decoder is designed to enhance the aggregation of global–local context and avoid potential blocking artifacts during feature extraction and segmentation. The local attention channel adopts the MKLA block, which brings the merits of KAN convolution into a Mamba-like linear attention block to improve the handling of linear and nonlinear features and complex function approximation with few extra computations. The global attention channel uses a long-range cascade encoder–decoder block, which mainly employs a 7 × 7 depth-wise convolution token mixer and a lightweight 7 × 7 dilated depth-wise convolution to capture the long-distance spatial feature field and retain key spatial information. In addition, to enrich the input of the attention block, a deformable convolution module is placed between the encoder output and the corresponding-scale decoder, which improves the expressive ability of the segmentation model without increasing the depth of the network. The experimental results on the Vaihingen dataset (83.68% mIoU, 92.98% OA, 91.08% mF1), the UAVid dataset (69.78% mIoU, 96.51% OA), the LoveDA dataset (51.53% mIoU, 86.42% OA, 67.19% mF1), and the Potsdam dataset (97.14% mIoU, 92.64% OA, 93.8% mF1) outperform other advanced attention-based approaches in segmenting small targets and edges. Full article
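Linear attention of the kind underlying the MKLA block generally replaces softmax attention with a positive feature map φ, so attention can be computed as φ(Q)(φ(K)ᵀV) in O(N) rather than O(N²) in the token count. A generic NumPy sketch (the ELU+1 feature map is a common choice in the linear-attention literature, not necessarily this article's; the tensor shapes are illustrative):

```python
import numpy as np

def linear_attention(q, k, v, eps=1e-6):
    """Softmax-free linear attention: with a positive feature map phi
    (ELU(t)+1 here), output = phi(Q) (phi(K)^T V) / normalizer. The
    (d, d_v) matrix phi(K)^T V is independent of the token count N,
    so the cost is linear in N."""
    phi = lambda t: np.where(t > 0, t + 1.0, np.exp(t))  # elu(t)+1 > 0
    q, k = phi(q), phi(k)
    kv = k.T @ v                  # (d, d_v), shared across all queries
    z = q @ k.sum(axis=0)         # per-query normalizer, shape (N,)
    return (q @ kv) / (z[:, None] + eps)

rng = np.random.default_rng(0)
n, d = 16, 8
q = rng.standard_normal((n, d))
k = rng.standard_normal((n, d))
v = rng.standard_normal((n, d))
out = linear_attention(q, k, v)   # shape (16, 8)
```

Because φ is strictly positive, each output row is a convex combination of the value rows, mirroring the averaging behavior of softmax attention at a fraction of the cost.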
