MDPI - Publisher of Open Access Journals

26 pages, 11619 KB

Open Access Article

Multi-Scale Gaussian Mixture Model-Gated Mixture of Experts for Fine-Grained Insect Pest Classification

by Nurullah Şahin, Nuh Alpaslan and Davut Hanbay

Electronics 2026, 15(11), 2268; https://doi.org/10.3390/electronics15112268 (registering DOI) - 23 May 2026

Fine-grained insect pest classification presents a particularly demanding visual recognition challenge due to severe class imbalance, pronounced intra-class morphological variability across developmental stages, and high inter-class visual similarity among taxonomically related species. Existing deep learning approaches typically rely on a single feature representation extracted from a single network depth, overlooking complementary discriminative cues distributed across multiple abstraction levels. Furthermore, classical attention mechanisms perform spatial weighting deterministically, without explicitly modeling the underlying statistical structure of the feature space, which is inherently multimodal on long-tailed benchmarks such as IP102. This study proposes a Multi-Scale Gaussian Mixture Model-Gated Mixture of Experts (GMM-MoE) architecture that operates as a plug-in module insertable into any convolutional backbone, evaluated here on DenseNet-121 at three distinct feature depths. The proposed module computes analytic GMM posterior responsibilities in closed form, softly assigning each spatial location to dedicated convolutional expert sub-networks. At the same time, a conditional prior mechanism π(x) adapts the routing strategy to individual image content rather than relying on fixed priors. The architecture is evaluated on the IP102 benchmark (102 pest classes, ~75,000 images) under a two-stage training protocol. Ablation experiments confirm that increasing the number of experts consistently improves accuracy across all three routing depths, and that multi-scale fusion surpasses any single-scale configuration. The proposed model achieves a mean top-1 accuracy of 74.12% (±0.25%, 95% CI) across three independent runs on the IP102 test set. 
To the best of our knowledge, this is the first work to employ GMM posterior responsibilities as a spatial routing mechanism within a multi-scale CNN feature hierarchy for fine-grained insect pest classification, establishing a principled probabilistic alternative to deterministic attention weighting in visual recognition systems.
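
The routing step described in this abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses diagonal-covariance Gaussians, treats one feature vector as a single spatial location, and substitutes two toy functions for the convolutional experts. All function names, parameter values, and the toy experts are hypothetical. The priors are passed in as an argument, so an image-dependent prior π(x) could be supplied by whatever network produces it.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def responsibilities(x, priors, means, variances):
    """Posterior p(k | x) for every mixture component, computed in log space."""
    log_joint = [
        math.log(p) + log_gaussian_diag(x, m, v)
        for p, m, v in zip(priors, means, variances)
    ]
    peak = max(log_joint)  # subtract the maximum for numerical stability
    weights = [math.exp(value - peak) for value in log_joint]
    total = sum(weights)
    return [w / total for w in weights]


def gated_output(x, gates, experts):
    """Blend the expert outputs, using the responsibilities as soft gates."""
    outputs = [expert(x) for expert in experts]
    return [
        sum(g * out[i] for g, out in zip(gates, outputs))
        for i in range(len(x))
    ]


# Toy setup (assumed values): two components, two stand-in "experts".
priors = [0.5, 0.5]
means = [[0.0, 0.0], [4.0, 4.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
experts = [
    lambda v: [2.0 * t for t in v],   # hypothetical expert 0
    lambda v: [-1.0 * t for t in v],  # hypothetical expert 1
]

x = [0.2, -0.1]
gates = responsibilities(x, priors, means, variances)
mixed = gated_output(x, gates, experts)
```

Because the gates are posterior probabilities, they sum to one at every location, and a feature vector that lies between two component means is shared between the corresponding experts rather than assigned to one of them.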

27 pages, 13198 KB

Open Access Article

Meteorology-Conditioned High-Resolution Vegetation Forecasting: A Hierarchical Multi-Modal Fusion Network

by Zhihang Yi, Jianling Yang, Hairong Wang, Xiong Kang, Suzhao Zhang, Xiaowei Zhu and Yingjuan Han

Remote Sens. 2026, 18(11), 1684; https://doi.org/10.3390/rs18111684 - 22 May 2026

Abstract

Predicting high-resolution Normalized Difference Vegetation Index (NDVI) in mountainous ecosystems is challenging due to topographic complexity and climate heterogeneity. Existing methods often struggle to balance fine-grained spatial patterns with multi-scale meteorological drivers. This paper introduces the Hierarchical Multi-Modal Fusion Network (HMMFN), which employs a conditioned reconstruction strategy to decouple spatial learning from environmental forcing. The architecture utilizes a dual-stream encoder to process NDVI imagery and meteorological data in parallel. A Transformer module captures long-term temporal dependencies, while a multi-level fusion decoder integrates climate semantics with local vegetation details. The model is optimized using a hybrid loss function that combines Mean Squared Error and Structural Similarity Index Measure to ensure both numerical precision and spatial fidelity. Evaluated in the Liupan Mountains, HMMFN consistently outperforms baseline models across multiple lead times. For prediction horizons ranging from one to five months, the model maintains high accuracy with R² values between 0.9123 (1-month horizon) and 0.8195 (5-month horizon), achieving a 10.1% and 3.6% reduction in RMSE compared to the optimal baseline model, respectively. These results demonstrate that HMMFN effectively preserves fine-scale spatial structures while maintaining accurate temporal trends across various time steps.
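
To make the hybrid loss mentioned in this abstract concrete, the sketch below combines Mean Squared Error with a structural similarity term in plain Python. It is an illustration under assumptions rather than the paper's loss: SSIM is computed from global statistics over one flattened patch instead of sliding windows, the inputs are assumed to be scaled to [0, 1], and the mixing weight `alpha` is a hypothetical parameter that the abstract does not specify. The stabilizing constants use the conventional SSIM values 0.01 and 0.03.

```python
def mse(pred, target):
    """Mean squared error between two equally sized flat sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def global_ssim(pred, target, data_range=1.0):
    """SSIM from global statistics of one patch (no sliding window)."""
    n = len(pred)
    mu_p = sum(pred) / n
    mu_t = sum(target) / n
    var_p = sum((p - mu_p) * (p - mu_p) for p in pred) / n
    var_t = sum((t - mu_t) * (t - mu_t) for t in target) / n
    cov = sum((p - mu_p) * (t - mu_t) for p, t in zip(pred, target)) / n
    c1 = (0.01 * data_range) ** 2  # stabilizes the luminance term
    c2 = (0.03 * data_range) ** 2  # stabilizes the contrast/structure term
    numerator = (2.0 * mu_p * mu_t + c1) * (2.0 * cov + c2)
    denominator = (mu_p * mu_p + mu_t * mu_t + c1) * (var_p + var_t + c2)
    return numerator / denominator


def hybrid_loss(pred, target, alpha=0.5):
    """alpha * MSE + (1 - alpha) * (1 - SSIM); alpha is an assumed weight."""
    return alpha * mse(pred, target) + (1.0 - alpha) * (1.0 - global_ssim(pred, target))


# Toy NDVI-like patch values, assumed to be scaled to [0, 1].
target_patch = [0.1, 0.4, 0.8, 0.3]
predicted_patch = [0.2, 0.5, 0.6, 0.3]
loss_value = hybrid_loss(predicted_patch, target_patch)
```

The MSE term penalizes per-pixel numerical error, while the `1 - SSIM` term penalizes differences in mean, contrast, and structure, which is one way to read the abstract's pairing of "numerical precision" with "spatial fidelity".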

(This article belongs to the Section AI Remote Sensing)

22 pages, 7712 KB

Open Access Article

CT-Net: A Hybrid ConvNeXt–Transformer Approach for ASL Alphabet Classification

by Zhuofan Yang, Houjin Lu and Samaneh Shamshiri

Appl. Sci. 2026, 16(10), 5168; https://doi.org/10.3390/app16105168 - 21 May 2026

Abstract

Recognition of the American Sign Language (ASL) alphabet is of utmost importance in bridging the communication gap between the hearing-impaired and the hearing. However, robust classification remains difficult because some hand gestures are morphologically very similar. To address this problem, this study presents CT-Net, a hybrid deep learning architecture that integrates ConvNeXt-Tiny with a lightweight Transformer encoder. CT-Net combines convolutional feature extraction and self-attention mechanisms, which enable it to capture fine-grained local patterns and long-range spatial dependencies effectively. The proposed model was extensively compared with various architectures including traditional CNNs, Transformer-based models, hybrid machine-learning approaches and recent lightweight hybrid networks. The experimental results show that CT-Net achieved the best overall performance with a peak accuracy of 95.67% on the enhanced ASL dataset. Ablation studies demonstrate the effectiveness of our design choices. CT-Net achieves a strong trade-off between recognition accuracy and computational efficiency with an inference rate of 163.55 Frames Per Second (FPS). These findings highlight the potential of hybrid frameworks as a powerful tool for fine-grained gesture recognition tasks.

26 pages, 4983 KB

Open Access Article

Closed-Set vs. Open-Vocabulary Object Detectors for Urban Architectural Typology Classification: A Comparative Study on Athenian Heritage Buildings

by Konstantinos Filippatos, Konstantina Siountri and Christos-Nikolaos Anagnostopoulos

Heritage 2026, 9(5), 206; https://doi.org/10.3390/heritage9050206 - 21 May 2026

Abstract

Architectural typology classification plays an important role in large-scale documentation and analysis of urban cultural heritage. Recent advances in computer vision enable automated approaches for detecting and categorizing buildings from street-level imagery, yet the suitability of different detection paradigms for architectural typology analysis remains insufficiently explored. Despite recent advances in computer vision for architectural analysis, no systematic comparative study has evaluated closed-set CNN-based detectors against open-vocabulary vision–language grounding models for urban architectural typology classification. This study presents a comparative evaluation of closed-set convolutional object detectors and open-vocabulary vision–language grounding models for the classification of Athenian architectural typologies. A dataset of 3349 street-view images containing 11,111 annotated building instances was compiled and organized into five typological categories: Neoclassical, Neoclassical-Eclectic, Interwar-Eclectic, Interwar, and Postwar. The experiments compare several YOLO-based detection configurations with Grounding DINO under zero-shot inference, parameter-efficient adaptation (e.g., Low-Rank Adaptation, LoRA), and full fine-tuning. Results show that supervised YOLO-based models achieve robust detection and classification performance with high localization accuracy and consistent typology discrimination in dense urban scenes. In contrast, open-vocabulary grounding models demonstrate limited reliability in zero-shot settings and require substantial adaptation to approach comparable performance levels. Analysis of confusion patterns further reveals that most classification errors originate from intrinsic architectural similarities between transitional styles rather than from model instability.
The findings highlight the advantages of supervised object detection frameworks for scalable urban heritage documentation and provide insights into the current limitations of vision–language models for fine-grained architectural typology classification.

(This article belongs to the Section Architectural Heritage)

36 pages, 9783 KB

Open Access Article

Spectral-YOLOv13: A Dual-Domain Vision-Mamba Sensing Framework for Fine-Grained Coral Health Assessment and Continuous Ecological Forecasting

by Litian Yang, Wenkun Chen, Zhuoyue Mo, Xin Gao, Minzhi Mo, Chunlei Xia and Liankuan Zhang

Sensors 2026, 26(10), 3265; https://doi.org/10.3390/s26103265 - 21 May 2026

Abstract

Coral reefs are among the most important and vulnerable marine ecosystems worldwide. AI-powered underwater visual monitoring has become essential for effective reef conservation, yet current methods still face severe limitations: spectral ambiguity caused by underwater turbidity, fine-grained confusion in early coral health assessment, and discrete forecasting models that cannot represent continuous ecological degradation dynamics. To address these issues, we propose Spectral-YOLOv13, a dual-domain vision-Mamba sensing framework for high-precision coral health evaluation and continuous ecological forecasting. The framework incorporates three novel components: a Wavelet-Integrated Omni-Neck (WIO-Neck) to perform multi-scale spectral filtering and suppress turbidity-induced noise; a Contrastive Prototype Head (CP-Head) to enhance discriminability between visually similar health states; and a Bio-Mamba Predictor based on state-space models to capture long-term continuous health trajectories. Extensive experiments on the CR-Mix++ dataset demonstrate that Spectral-YOLOv13 achieves 53.8% mAP with strong robustness in turbid underwater environments. It reduces four-week forecasting error by 26.8% and maintains real-time inference speed at 112 FPS. This work provides a reliable and high-performance vision framework for practical underwater coral reef monitoring and proactive conservation management.

(This article belongs to the Special Issue AI-Based Computer Vision Sensors & Systems—2nd Edition)

22 pages, 3271 KB

Open Access Article

TextureCLIP: Cross-Dataset Zero-Shot Texture Anomaly Segmentation with Triadic Descriptive Prompting

by Xin Peng Ooi and Seong G. Kong

Electronics 2026, 15(10), 2220; https://doi.org/10.3390/electronics15102220 - 21 May 2026

Abstract

Texture anomaly segmentation aims to localize irregularities on textured surfaces, a task critical for industrial quality control. Supervised methods require extensive labeled data, while unsupervised approaches often struggle to generalize to unseen target domains. Recent zero-shot methods based on vision-language models such as Contrastive Language-Image Pretraining (CLIP) enable anomaly detection through text prompts without target-domain training data. However, existing approaches typically rely on generic prompts and show limited sensitivity to fine-grained texture variations. To address these limitations, we propose TextureCLIP, a cross-dataset zero-shot framework with auxiliary training for texture anomaly segmentation. The framework is trained on source texture data from the MVTec AD texture subset using annotated source-domain samples and directly evaluated on six unseen target datasets without access to target-domain training images, annotations, or fine-tuning. The proposed Triadic Descriptive Prompting (TriDP) integrates normal prompts, generic anomaly prompts, and descriptive anomaly prompts to provide complementary semantic cues for improved cross-domain generalization. To enhance spatial sensitivity, Dual Attention Modules (DAMs) are incorporated into the CLIP image encoder to refine local feature representations. In addition, Softmax-Weighted Averaging (SMWA) aggregates multiple anomaly cues by emphasizing the prompt responses with higher similarity scores. Experimental results demonstrate that TextureCLIP achieves strong and consistent performance across diverse texture datasets, attaining 67.06% AP and 65.69% F1-max, with improvements of 5.17 and 2.66 percentage points over the competitive baselines, respectively.
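
The aggregation idea behind Softmax-Weighted Averaging can be illustrated with a short Python sketch. The abstract does not give the exact formulation, so this is one plausible reading under assumptions: each prompt yields a flattened anomaly map plus a scalar similarity score, the scores are turned into weights with a softmax, and the maps are averaged with those weights. The function names, the `temperature` parameter, and the toy values are all hypothetical.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax; temperature is an assumed knob."""
    peak = max(scores)
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_weighted_average(cue_maps, similarities, temperature=1.0):
    """Average per-prompt anomaly maps, weighting each by its softmaxed score."""
    weights = softmax(similarities, temperature)
    n_pixels = len(cue_maps[0])
    return [
        sum(w * cue[i] for w, cue in zip(weights, cue_maps))
        for i in range(n_pixels)
    ]


# Toy example: two prompts, each producing a two-pixel anomaly map.
cue_maps = [[0.0, 1.0], [1.0, 0.0]]
uniform = softmax_weighted_average(cue_maps, [1.0, 1.0])
skewed = softmax_weighted_average(cue_maps, [5.0, 0.0])
```

With equal similarity scores the result reduces to a plain mean of the maps; as one score grows, the fused map moves toward that prompt's response, which matches the abstract's description of emphasizing responses with higher similarity.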

(This article belongs to the Section Artificial Intelligence)

27 pages, 72468 KB

Open Access Article

Long-Tailed Remote Sensing Image Classification via Multi-Scale Data, Pre-Trained Model, and Efficient Inference Strategy

by Song Han, Xing Han, Yibo Xu, Yongqin Tian, Weidong Zhang and Wenyi Zhao

Remote Sens. 2026, 18(10), 1636; https://doi.org/10.3390/rs18101636 - 19 May 2026

Abstract

Remote sensing image classification is one of the fundamental tasks in the field of remote sensing and plays a critical role in Earth observation applications. However, the inherent multi-scale characteristics of this task pose significant challenges to scene classification. To address these issues, we propose a novel framework that integrates the Contrastive Language–Image Pre-training (CLIP) model, multi-scale data, and an efficient inference strategy. The proposed framework transfers general-purpose features learned from natural images to remote sensing image classification. Specifically, it adopts the CLIP model as the backbone network, leveraging the rich feature representations learned during contrastive pre-training to extract fine-grained and multi-scale features from remote sensing images. As a result, the model not only learns local fine-grained details but also encodes global contextual information useful for classifying visually similar scene categories. An AdapterFormer module is then inserted into a few selected layers of the CLIP model, which enhances model performance at low computational overhead. This supports efficient knowledge sharing and introduces new features at the model level. Furthermore, to alleviate the performance deterioration that multi-scale feature variation can cause, a multi-scale training set is constructed at the data level, providing complementary multi-scale information. Through the synergy of these strategies, the proposed method greatly improves classification performance on multi-scale remote sensing images. Extensive experiments on the MEET dataset (80 fine-grained categories and more than 800,000 samples) show that the proposed method substantially improves performance. Compared with general-purpose classification networks and remote sensing-related models, the proposed method consistently achieves state-of-the-art results.

(This article belongs to the Special Issue Hyperspectral Remote Sensing Image Analysis via Advanced Deep Learning and Computer Vision)

30 pages, 5569 KB

Open Access Article

GRCD-Net: Guided Global–Local Relational Learning for Few-Shot Fine-Grained and Remote Sensing Scene Classification

by Jianfeng Liu, Yibo Du, Lifan Sun, Xiaozheng Li, Yanna Si, Xiaoli Song and Ruijuan Zheng

Remote Sens. 2026, 18(10), 1632; https://doi.org/10.3390/rs18101632 - 19 May 2026

Abstract

Remote sensing scene classification (RSSC) faces severe challenges from data scarcity and complex background clutter. To overcome these limitations, this paper draws inspiration from few-shot fine-grained image classification (FSFGIC) to filter noise and capture subtle details. However, existing methods often process global context and local features separately, which limits their ability to suppress background noise in complex scenes. Consequently, the Guided Relational Cross-Attention Dual-branch Network (GRCD-Net) is proposed. Its core Guided Relational Cross-Attention (GRC) block leverages global semantics to filter local background noise prior to bidirectional feature interaction. Additionally, Iterative Global Relation (IGR) and Patch-level Dual-Metric (PDM) modules are integrated to robustly refine global relations and capture local similarities. Extensive experiments demonstrate that GRCD-Net consistently outperforms baselines by 2–4% on standard FSFGIC benchmarks. Notably, on the challenging NWPU-RESISC45 RSSC dataset, it achieves an 81.39% one-shot accuracy and exceeds current state-of-the-art methods by 7.55%, validating its efficacy for complex Earth observation.

(This article belongs to the Special Issue Advanced Applications of Artificial Intelligence in Remote Sensing Image Recognition (2nd Edition))

18 pages, 7647 KB

Open Access Article

WS-DINO: A DINOv2-Based Weed Segmentation Method with Feature Priors and Spatial Fusion

by Hongsheng Zhou, Jiangping Liu, Rigeng Wu and Baoping Zhao

Agriculture 2026, 16(10), 1105; https://doi.org/10.3390/agriculture16101105 - 18 May 2026

Abstract

Weed segmentation is a fundamental task in precision agriculture, essential for targeted intervention and sustainable farming. However, achieving accurate segmentation remains challenging due to the high visual similarity between weeds and crops, as well as the ambiguous, fine-grained boundaries often present in complex field environments. To address this, we present WS-DINO, a novel weed segmentation network built upon the DINOv2 vision foundation model. Our framework introduces two key innovations: (1) a Feature Prior Module that leverages a Canny-guided refinement process to extract and inject fine-grained cues related to weed texture, morphology, and boundaries into specific blocks of the Vision Transformer; and (2) a Spatial Feature Fusion Module that leverages convolutional layers to generate multi-scale spatial features, which are then fused with the semantically rich token features from DINOv2, effectively compensating for the Transformer’s limitations in capturing local spatial details. Comprehensive evaluation on the public PhenoBench dataset shows that WS-DINO achieves an mIoU of 88.67% and outperforms the evaluated benchmark methods. Moreover, on the challenging MotionBlurred dataset, WS-DINO reaches 88.75% mIoU, showing stable performance under motion blur and degraded visual conditions.

(This article belongs to the Topic Digital Agriculture, Smart Farming and Crop Monitoring)

25 pages, 14321 KB

Open Access Article

A Woodblock New Year Painting Style Classification Method Based on Structural-Aware Attention and Frequency-Domain Style Statistics

by Hua Wei, Zhihua Diao, Junxiang Diao, Liqin Wen, Binbin Sun, Xiaoxuan Chen and Luping Yin

Electronics 2026, 15(10), 2158; https://doi.org/10.3390/electronics15102158 - 18 May 2026

Abstract

To address the problems of subtle style differences, high inter-class similarity, and complex structural and texture features in woodblock New Year paintings, this paper proposes a style classification method for woodblock New Year paintings based on an improved ResNeXt-50. The method introduces SA-CBAM at the middle- and high-level feature stages. Through the synergistic effect of channel attention and edge-enhanced spatial attention, the model is guided to focus on key structural regions such as human contours. Furthermore, single-stage 2D-DWT is introduced to separate deep features into low-frequency global structural components and high-frequency local detail components, thereby enabling effective representation of overall composition information and fine-grained carving textures. The Gram matrix is introduced to conduct statistical modeling of the fusion features, so as to characterize the overall style distribution from the perspective of channel correlation. The model is trained and tested on a dataset of 4043 independent images across six categories, achieving an overall classification accuracy of 97.68%, which is significantly superior to mainstream models such as Vision Transformer. Ablation experiments further verify the complementary effects of each module in structural perception, frequency-domain feature representation, and style statistical modeling, demonstrating the effectiveness and application potential of the proposed method for digital preservation and fine-grained style recognition of woodblock New Year paintings.
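
The Gram-matrix step mentioned in this abstract has a compact definition: for a feature map with C channels, entry (i, j) is the inner product of channel i and channel j over all spatial positions. The Python sketch below shows that computation on a toy feature map. It is illustrative only; the normalization by the number of positions and the toy values are assumptions, and the abstract does not state how the paper scales or consumes the matrix.

```python
def gram_matrix(features):
    """Channel-correlation (Gram) matrix of a flattened feature map.

    features: list of C channels, each a list of N spatial activations.
    Returns a C x C matrix; dividing by N is an assumed normalization.
    """
    n_positions = len(features[0])
    return [
        [sum(a * b for a, b in zip(ch_i, ch_j)) / n_positions for ch_j in features]
        for ch_i in features
    ]


# Toy feature map: 3 channels, 4 spatial positions (values are made up).
features = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],    # exactly twice channel 0
    [1.0, -1.0, 1.0, -1.0],  # alternating pattern
]
gram = gram_matrix(features)
```

Because spatial positions are summed out, the matrix keeps which channels co-activate but discards where they do so, which is why it is commonly used as a style statistic rather than a layout descriptor.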

(This article belongs to the Section Artificial Intelligence)

15 pages, 1473 KB

Open Access Article

Size of Sand Grains Controls Pore Structure and Water Dynamics: Implications for Water Retention and Hydraulic Conductivity

by Jackson Adriano Albuquerque, André da Costa, Gustavo Henrique Merten, Ana Carolina De Mattos E Avila and Gunnar Kirchhof

Land 2026, 15(5), 864; https://doi.org/10.3390/land15050864 (registering DOI) - 17 May 2026

Abstract

Sand grain size strongly influences the physical and hydraulic behaviour of sandy soils, particularly water retention, pore distribution, and water movement under unsaturated conditions. This study evaluated the effect of five sand grain-size classes, ranging from very coarse to very fine, on pore distribution, aeration, water retention, and unsaturated hydraulic conductivity. Quartz sand samples with different particle sizes were saturated and subjected to matric tensions ranging from 10 to 15,000 hPa. Very fine sand (0.053–0.106 mm) showed the highest field capacity (0.38 m³ m⁻³) and available water content (0.30 m³ m⁻³), which were associated with a predominance of pores between 0.2 and 3 μm in diameter. In contrast, coarser sand fractions were dominated by macropores (>50 μm) and exhibited lower water retention. Permanent wilting point values remained low and similar among grain-size classes (≈0.02 m³ m⁻³). Under unsaturated conditions (matric tensions > 100 hPa), very fine sand exhibited hydraulic conductivity values up to ten times greater than those of coarser fractions. Overall, decreasing sand particle size increased water retention and plant-available water while reducing macroporosity and aeration capacity. These findings demonstrate that sand grain-size distribution plays a major role in regulating water dynamics in sandy soils and may support the development of more efficient irrigation and soil management strategies to improve water conservation and plant water availability in drought-prone environments.

(This article belongs to the Special Issue Sustainable Water and Soil Conservation and Management for Agriculture)

24 pages, 6147 KB

Open Access Article

Multi-Scale Transformer-Based Neural Architecture Search for Hyperspectral Image Classification

by Aili Wang, Xinyu Liu and Haisong Chen

Remote Sens. 2026, 18(10), 1586; https://doi.org/10.3390/rs18101586 - 15 May 2026

Abstract

Hyperspectral image classification (HSIC) is a crucial task for remote sensing applications, requiring accurate pixel-level labeling while effectively capturing both spectral and spatial information. Traditional convolutional neural network architectures often struggle to balance local texture detail and global contextual consistency, and existing neural architecture search (NAS) methods rarely incorporate attention mechanisms, limiting their performance. To address these challenges, this study proposes a multi-scale Transformer-based NAS framework (TR-NAS) for fine-grained hyperspectral image classification. The framework combines local cube sampling, shallow and deep multi-scale convolutions, and a searchable Transformer module that adaptively selects global, local window, and multi-scale attention operators. Lightweight enhanced convolution operators, including dual-gated (DG-Conv) and mixed depthwise (MixConv) convolutions, are incorporated to improve spectral discrimination and scale robustness. Extensive experiments on the PU and Hanchuan datasets demonstrate that TR-NAS achieves superior classification accuracy, stability, and boundary consistency compared to traditional methods and existing NAS architectures, showing improved robustness to spectral similarity and spatial heterogeneity in complex remote sensing scenes.

(This article belongs to the Special Issue Deep Learning for Multi-Sensor Remote Sensing: Advancements in Image Classification and Semantic Segmentation)

20 pages, 606 KB

Open Access Article

Retrieval-Guided and Semantically Grounded Image Captioning for Open-Domain Scenes

by Shanshan Lin, Xiaoxuan Xie, Zexian Yang and Chao Chen

Mathematics 2026, 14(10), 1667; https://doi.org/10.3390/math14101667 - 13 May 2026

Abstract

Recent image captioning methods based on pre-trained vision–language models can generate fluent and coherent descriptions, yet they still struggle in open-domain scenes that contain long-tail concepts, uncommon object combinations, and ambiguous visual evidence. Two limitations are especially important. First, the knowledge needed to recognize and name rare or domain-specific entities is only weakly represented in model parameters, causing captions to be generic, incomplete, or biased toward frequent concepts. Second, token generation is typically grounded mainly by local visual matching, making it sensitive to clutter, occlusion, and visually similar distractors, and therefore prone to attribute errors, relation confusion, and object hallucination. To address these issues, we propose R2G (retrieval- and grounding-guided captioning), a lightweight plug-in framework for frozen image captioning backbones. R2G consists of two complementary components. The first, retrieval-guided visual prompting, retrieves image-relevant concepts from an external visual concept memory, converts them into a continuous prompt representation, and injects this representation into selected layers of the visual encoder, so that external semantic information can influence visual feature formation before decoding begins. The second, global–local semantic grounding, derives a global semantic prior from an auxiliary vision–language encoder and adaptively fuses it with token-level local visual evidence through a decoder-state-dependent gating mechanism, thereby improving semantic stability while preserving fine-grained visual support. The resulting framework is lightweight, compatible with frozen pre-trained backbones, and designed to improve both concept coverage and semantic faithfulness. Experimental results on MS-COCO and NoCaps show that R2G consistently improves caption quality over the baseline and yields particularly clear gains in open-domain and out-of-domain settings.
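
The decoder-state-dependent gate described in this abstract can be pictured with a small Python sketch. It is a simplified illustration under assumptions, not the R2G implementation: the gate is a single scalar obtained from a linear projection of the decoder state followed by a sigmoid, and it interpolates between a global prior vector and a local evidence vector. The names `gate_weights` and `gate_bias`, and all toy values, are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def gated_fusion(decoder_state, global_prior, local_evidence, gate_weights, gate_bias=0.0):
    """Fuse global and local vectors with a scalar gate from the decoder state.

    gate = sigmoid(w . h + b); fused = gate * global + (1 - gate) * local.
    """
    logit = sum(w * h for w, h in zip(gate_weights, decoder_state)) + gate_bias
    gate = sigmoid(logit)
    fused = [
        gate * g + (1.0 - gate) * l
        for g, l in zip(global_prior, local_evidence)
    ]
    return gate, fused


# Toy vectors (assumed values) for one decoding step.
decoder_state = [1.0, -1.0]
global_prior = [1.0, 0.0]
local_evidence = [0.0, 1.0]

neutral_gate, neutral_fused = gated_fusion(
    decoder_state, global_prior, local_evidence, gate_weights=[0.0, 0.0]
)
global_gate, global_fused = gated_fusion(
    decoder_state, global_prior, local_evidence, gate_weights=[4.0, -4.0]
)
```

Since the gate is recomputed from the decoder state at every step, the balance between the global prior and local evidence can change from token to token, which is the adaptive behavior the abstract attributes to the grounding component.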

(This article belongs to the Section E1: Mathematics and Computer Science)

29 pages, 17443 KB

Open Access Article

Per-SAM-MCPA: A Lightweight Framework for Individual Tree Crown Segmentation from UAV Imagery

by Chuting Hu, Size Dai, Shifan Wu, Qiaolin Ye and He Yan

Remote Sens. 2026, 18(10), 1559; https://doi.org/10.3390/rs18101559 - 13 May 2026

Abstract

Accurate individual tree crown (ITC) segmentation from unmanned aerial vehicle (UAV) imagery is important for fine-scale forest inventory, plantation management, and ecological monitoring. However, delineating ITCs in dense plantation environments remains difficult because crowns are strongly adjacent, canopy structures are highly homogeneous, and crown boundaries are often blurred, making it hard for existing methods to preserve both regional integrity and boundary continuity. This study proposes the Perceptual Segment-Anything Model with Multi-head Cross-Parallel Attention (Per-SAM-MCPA), a lightweight and effective framework for fine-grained ITC segmentation in dense plantation scenes. Based on a compact ResNet-50 backbone, the framework integrates perceptual target-aware representation, multi-scale detail enhancement, global contextual modeling, and semantic-boundary collaborative refinement to improve crown discrimination and structural consistency. A perceptual relation module is used to strengthen pixel-level semantic dependency modeling, and a Multi-head Cross-Parallel Attention (MCPA) mechanism is designed to capture long-range contextual interactions through orthogonally decomposed spatial attention, improving global geometric consistency with limited computational overhead. A Composite Constraint Loss (CCL) that combines a weighted cross-entropy loss, a structural similarity loss, and a boundary term based on Hausdorff distance is introduced to jointly optimize region-level segmentation quality and boundary fidelity. Experiments on the Catalpa bungei UAV dataset show that the proposed method achieves an intersection over union (IoU) of 87.3% and an F1-score of 91.0%, outperforming representative baseline methods such as SAM and Mask R-CNN while maintaining an inference speed of 35.7 FPS on a single GPU. These results indicate that Per-SAM-MCPA offers an accurate, efficient, and practical solution for ITC segmentation in dense plantation environments. 
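For readers unfamiliar with the two reported metrics, the following sketch computes IoU and F1-score for a single pair of binary masks. The function name `iou_and_f1` and the nested-list mask format are assumptions; the abstract does not state how the authors aggregate scores across crowns or images, so this shows only the per-mask definitions.

```python
def iou_and_f1(pred, target):
    """Return (IoU, F1) for two binary masks given as nested lists of 0/1.

    IoU = TP / (TP + FP + FN)
    F1  = 2 * TP / (2 * TP + FP + FN)
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of rows")

    tp = fp = fn = 0
    for pred_row, target_row in zip(pred, target):
        if len(pred_row) != len(target_row):
            raise ValueError("mask rows must have the same length")
        for p, t in zip(pred_row, target_row):
            if p and t:
                tp += 1      # crown pixel predicted as crown
            elif p and not t:
                fp += 1      # background pixel predicted as crown
            elif t and not p:
                fn += 1      # crown pixel that was missed

    union = tp + fp + fn
    # Convention chosen here: two empty masks count as a perfect match.
    iou = tp / union if union else 1.0
    f1 = 2 * tp / (2 * tp + fp + fn) if union else 1.0
    return iou, f1
```

For a single mask pair the two quantities are tied by F1 = 2·IoU / (1 + IoU); scores averaged over many images need not satisfy that identity.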

(This article belongs to the Special Issue Deep Learning-Based Interpretation and Processing of Remote Sensing Images)

26 pages, 3544 KB

Open AccessFeature PaperArticle

Quick Response Code Verification Using Anti-Counterfeiting Pattern and Multi-Feature Fusion Network

by Ke Sun, Zhongyuan Guo and Hong Zheng

Sensors 2026, 26(10), 3067; https://doi.org/10.3390/s26103067 - 12 May 2026

Viewed by 438

Abstract

Quick response codes are widely used as anti-counterfeiting labels in product packaging, but they are easily copied illegally. This paper therefore introduces a quick response code verification method that combines an anti-counterfeiting pattern with a deep feature fusion network. First, a specialized anti-counterfeiting quick response code is designed, composed of a standard quick response code and an anti-counterfeiting pattern, which is essentially a fine-grained random texture distribution sensitive to copying. Next, the anti-counterfeiting patterns are divided into overlapping blocks during data processing, which effectively expands the data volume and avoids interference from pattern content in authenticity identification. Then, a convolutional self-learning preprocessing layer is employed to initially learn the feature information that represents the difference between authentic and forged patterns. Finally, a multi-feature fusion convolutional neural network is proposed to identify the authenticity of anti-counterfeiting patterns. The proposed network comprises two branches, facilitating multi-scale feature extraction and fusion. The effectiveness of the proposed approach is evaluated on a self-constructed quick response code dataset, and the experimental results demonstrate that the proposed approach outperforms traditional knowledge engineering methods and similar deep learning methods.
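The block-division step can be sketched as a sliding window whose stride is smaller than the block size, so that neighbouring blocks overlap and one pattern yields many training samples. The function name `overlapping_blocks`, the nested-list image format, and the block and stride values are assumptions; the abstract does not report the sizes the authors used.

```python
def overlapping_blocks(image, block, stride):
    """Cut a 2-D image (nested lists) into square blocks of side `block`.

    Windows advance by `stride` pixels; blocks overlap when stride < block.
    Only windows that fit entirely inside the image are returned.
    """
    if block <= 0 or stride <= 0:
        raise ValueError("block and stride must be positive")
    if not image or not image[0]:
        return []

    rows, cols = len(image), len(image[0])
    blocks = []
    for top in range(0, rows - block + 1, stride):
        for left in range(0, cols - block + 1, stride):
            # Copy the block x block window whose top-left corner is (top, left).
            window = [row[left:left + block] for row in image[top:top + block]]
            blocks.append(window)
    return blocks
```

On a 4 × 4 pattern with 2 × 2 blocks, a stride of 2 gives 4 non-overlapping blocks, while a stride of 1 gives 9 overlapping ones, which is the data-expansion effect the abstract refers to.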

(This article belongs to the Special Issue Computer Vision and Pattern Recognition Based on Sensing Technology)

Search Results (592)