Search Results (22)

Search Parameters:
Keywords = open-vocabulary object detection

19 pages, 30364 KB  
Article
CLIP-Mono3D: End-to-End Open-Vocabulary Monocular 3D Object Detection via Semantic–Geometric Similarity
by Zichong Gu, Shiyi Mu, Hanqi Lyu and Shugong Xu
Sensors 2026, 26(8), 2380; https://doi.org/10.3390/s26082380 - 13 Apr 2026
Viewed by 347
Abstract
Open-vocabulary 3D object detection (OV-3DOD) is crucial for real-world perception, yet existing monocular methods are often limited by predefined categories or heavy reliance on external 2D detectors. In this paper, we propose CLIP-Mono3D, an end-to-end one-stage transformer framework that directly integrates vision–language semantics into monocular 3D detection. By leveraging CLIP-derived semantic priors and grounding object queries in semantically salient regions, our model achieves robust zero-shot generalization to novel categories without requiring auxiliary 2D detectors. Furthermore, we introduce OV-KITTI, a large-scale benchmark extending KITTI with 40 new categories and over 7000 annotated 3D bounding boxes. Extensive experiments on OV-KITTI, KITTI, and Argoverse demonstrate that CLIP-Mono3D achieves competitive performance in open-vocabulary scenarios.
(This article belongs to the Section Sensing and Imaging)
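
The CLIP-based zero-shot scoring such frameworks rely on reduces to cosine similarity between region features and category text embeddings. A minimal sketch of that scoring step, using random placeholder tensors in place of real CLIP outputs (not the paper's actual model):

```python
import torch
import torch.nn.functional as F

# Placeholder embeddings standing in for CLIP outputs:
# 100 candidate region features and 5 category prompts, both 512-d.
region_feats = torch.randn(100, 512)
text_feats = torch.randn(5, 512)

# CLIP-style zero-shot scoring: cosine similarity via L2-normalized
# dot products, scaled by a temperature (fixed at 100 here for illustration).
region_feats = F.normalize(region_feats, dim=-1)
text_feats = F.normalize(text_feats, dim=-1)
logits = 100.0 * region_feats @ text_feats.t()   # (100, 5)

# Each region is assigned the category whose text embedding it matches best.
probs = logits.softmax(dim=-1)
labels = probs.argmax(dim=-1)
```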

21 pages, 12609 KB  
Article
A Vision Language-Based Framework for Detecting Industrial Mechanical, Electrical, and Plumbing Assets Using Unlabelled Data
by Masoud Kamali, Behnam Atazadeh, Abbas Rajabifard, Yiqun Chen and Ensiyeh Javaherian Pour
Sensors 2026, 26(8), 2379; https://doi.org/10.3390/s26082379 - 12 Apr 2026
Viewed by 306
Abstract
There have been significant advancements in object detection using extensive labelled datasets. However, existing learning-based approaches remain constrained in industrial environments, primarily due to the limited diversity of training datasets; the lack of generalisation of close-set detectors to unseen asset categories; and the inherent spatial and geometric complexity of mechanical, electrical, and plumbing (MEP) assets. To address this challenge, we propose a new approach that leverages pre-trained vision language models and close-set object detectors to detect unseen MEP assets using unlabelled data. Experimental results reveal the superior performance of Grounding DINO with the Swin B transformer in open-vocabulary MEP asset detection, achieving a mean intersection over union (mIoU) of 0.6586 for valve detection and 0.4883 for pump detection. In addition, the combination of Grounding DINO (Swin B) and YOLOv8 outperforms other configurations in MEP asset detection, attaining the highest performance for both valve detection, with a mean average precision at IoU = 0.5 (mAP50) of 0.928 and a mean average precision over IoU thresholds from 0.5 to 0.95 (mAP50:95) of 0.889, and pump detection, with corresponding values of 0.778 and 0.662, respectively. The quantitative and qualitative results of our approach were evaluated against fine-tuned Grounding DINO and fully supervised close-set object detectors.
(This article belongs to the Collection Sensors and Sensing Technology for Industry 4.0)
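
The mIoU and mAP figures above build on the standard box intersection-over-union; a minimal sketch of that computation, with illustrative example boxes:

```python
def box_iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Example: a predicted valve box vs. its ground-truth box.
print(box_iou((10, 10, 50, 50), (20, 20, 60, 60)))  # ~0.39
```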
25 pages, 11059 KB  
Article
Few-Shot Open-Set Object Detection with a Synthesized Monument Guided by Contrastive Distilled Prompts
by Hao Chen and Ying Chen
Appl. Sci. 2026, 16(7), 3474; https://doi.org/10.3390/app16073474 - 2 Apr 2026
Viewed by 275
Abstract
Few-shot open-set object detection (FS-OSOD) remains challenging in real-world scenarios, where detectors must accurately recognize known objects from few examples while reliably rejecting vast unknown categories. Under this setting, decision boundaries between known and unknown classes are easily distorted by data scarcity and background clutter, leading to severe overfitting on base classes and overconfident misclassification of unknowns. Recent research attempts to alleviate these issues by regularizing detection heads to suppress base-class bias, or by leveraging vision–language priors through open-vocabulary alignment and prompt tuning to enhance semantic transferability. However, these solutions often overlook explicit modeling of truly out-of-set unknowns and the instability of prompt adaptation in low-data regimes, which can cause boundary drift and lead unknown proposals to be absorbed by similar seen classes or even suppressed as background. To alleviate these issues, we propose a guided prompt–monument network (GPMN) that jointly enhances prompt learning and feature representation learning for FS-OSOD. First, the contrastive distilled prompts (CDP) module employs a teacher–student prompt framework to decouple optimization across base, novel, and unknown classes. This strategy preserves transferability between zero-shot and few-shot settings while enhancing discrimination on base categories. Second, a synthesized monument module (SMM) maintains class-centered memory with momentum-updated prototypes and a non-parametric classifier, which compresses the overlap between seen and unseen distributions and provides a stable rejection margin for unknowns with strong co-occurrence and background noise. Compared with existing head-regularization and open-vocabulary prompt-tuning pipelines, GPMN explicitly targets both base-class bias and seen–unseen overlap at the region level. Extensive experiments on VOC10-5-5 and VOC-COCO benchmarks demonstrate that GPMN consistently improves unknown recall and few-shot mAP over representative FS-OSOD baselines. These results suggest that prompt-level decoupling mitigates base-class bias, whereas memory-anchored regularization enlarges the seen–unseen margin, jointly supporting reliable unknown rejection in scarce-supervision regimes.
(This article belongs to the Special Issue Advances in Computer Vision and Digital Image Processing)
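
The momentum-updated prototypes and non-parametric classifier in the SMM can be illustrated generically; a minimal sketch under our own assumptions (names, dimensions, and the momentum value are illustrative, not the paper's):

```python
import torch
import torch.nn.functional as F

num_classes, dim, momentum = 20, 256, 0.9
prototypes = F.normalize(torch.randn(num_classes, dim), dim=-1)

def update_prototype(feat, cls_id):
    """EMA update of one class prototype from a new region feature."""
    feat = F.normalize(feat, dim=-1)
    prototypes[cls_id] = F.normalize(
        momentum * prototypes[cls_id] + (1 - momentum) * feat, dim=-1)

def nonparametric_classify(feat):
    """Nearest-prototype (cosine) classification; no learned weights."""
    sims = F.normalize(feat, dim=-1) @ prototypes.t()
    return sims.argmax().item(), sims.max().item()

update_prototype(torch.randn(dim), cls_id=3)
print(nonparametric_classify(torch.randn(dim)))
```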

16 pages, 26055 KB  
Article
AeroPinWorld: Revisiting Stride-2 Downsampling for Zero-Shot Transferable Open-Vocabulary UAV Detection
by Jie Li, Mingze Guan, Jincheng Xu, Xun Du, Haonan Chen and Yang Liu
Electronics 2026, 15(7), 1364; https://doi.org/10.3390/electronics15071364 - 25 Mar 2026
Viewed by 309
Abstract
Open-vocabulary object detection in unmanned aerial vehicle (UAV) imagery remains challenging under zero-shot cross-dataset transfer because tiny and cluttered targets are highly sensitive to early resolution reduction under domain shift. This study aims to improve transferable open-vocabulary UAV detection by revisiting stride-2 downsampling in YOLO-World v2 as a transfer-critical bottleneck. AeroPinWorld is introduced as a pinwheel-augmented YOLO-World v2 that selectively replaces four key stride-2 transitions with pinwheel-shaped convolution (PConv) so as to reduce aliasing, alleviate sampling-phase sensitivity, and preserve fine-grained local structures, while keeping the original detection head unchanged to ensure a fair and efficient comparison. The model is trained on COCO2017 for 24 epochs from official pretrained weights and directly evaluated, without target-domain fine-tuning, on VisDrone2019-DET and UAVDT using fixed offline prompt vocabularies. Compared with YOLO-World v2-S, AeroPinWorld improves zero-shot transfer performance on VisDrone from 0.112 to 0.135 mAP and from 0.054 to 0.063 APS, and it also yields consistent gains on UAVDT. Ablation studies show that both early backbone replacements and head bottom–up replacements contribute to the final gains, with their combination achieving the best accuracy–efficiency trade-off. These results indicate that selectively redesigning transfer-critical downsampling operators is an effective and lightweight way to improve zero-shot open-vocabulary UAV detection under aerial domain shift.
(This article belongs to the Section Electronic Multimedia)
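
The exact pinwheel-shaped convolution is not reproduced here, but the kind of swap the abstract describes, replacing a plain stride-2 convolution with a multi-branch, direction-aware downsampling block, can be sketched as follows (a simplified stand-in under our own assumptions, not the PConv design itself):

```python
import torch
import torch.nn as nn

class PlainDown(nn.Module):
    """Baseline stride-2 transition: a single 3x3 convolution."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, 3, stride=2, padding=1)

    def forward(self, x):
        return self.conv(x)

class BranchedDown(nn.Module):
    """Direction-aware stand-in: horizontal and vertical stride-2
    branches fused by a 1x1 conv, preserving more local structure."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.h = nn.Conv2d(c_in, c_out // 2, (1, 3), stride=2, padding=(0, 1))
        self.v = nn.Conv2d(c_in, c_out // 2, (3, 1), stride=2, padding=(1, 0))
        self.fuse = nn.Conv2d(c_out, c_out, 1)

    def forward(self, x):
        return self.fuse(torch.cat([self.h(x), self.v(x)], dim=1))

x = torch.randn(1, 64, 64, 64)
print(PlainDown(64, 128)(x).shape, BranchedDown(64, 128)(x).shape)
# Both produce (1, 128, 32, 32); only the sampling pattern differs.
```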

15 pages, 416 KB  
Review
Artificial Intelligence for the Early Detection of Patients with Cognitive Impairment: A Scoping Review
by María Moreno-Pineda, Víctor Ortiz-Mallasén and Águeda Cervera-Gasch
Healthcare 2026, 14(6), 768; https://doi.org/10.3390/healthcare14060768 - 18 Mar 2026
Viewed by 438
Abstract
Background/Objectives: Cognitive impairment affects multiple brain functions, and its early detection is essential to prevent progression to dementia; artificial intelligence has shown considerable potential in this field. This scoping review aims to map the impact of artificial intelligence–based tools for the early detection of cognitive impairment by identifying the main technologies used, examining their effectiveness, and exploring their ethical implications. Methods: A scoping review was conducted between April and May 2025 following the PRISMA-ScR methodological framework; the review protocol was previously registered on the Open Science Framework. PubMed, Scopus, and Cochrane databases were searched using natural language and controlled vocabulary terms via Medical Subject Headings. The search was limited to articles published between 2020 and 2025, in English or Spanish, with free full-text access. Methodological quality was assessed using CASPe, JBI, and MMAT. Results: A total of 14 studies were included after the selection and critical appraisal process. The findings show that artificial intelligence–based tools such as deep-learning models applied to neuroimaging, speech and gait analysis, electronic health record analysis, and mobile health applications demonstrate promising accuracy in detecting early cognitive changes. These technologies enable the identification of subtle patterns that may be difficult to detect using conventional clinical assessments. Conclusions: AI-based tools can provide substantial support for clinical decision-making by effectively identifying subtle changes that are imperceptible to human observers. However, their use also raises ethical issues related to patient privacy and data security.

22 pages, 11365 KB  
Article
Addressing Dense Small-Object Detection in Remote Sensing: An Open-Vocabulary Object Detection Framework
by Menghan Ju, Yingchao Feng, Wenhui Diao and Chunbo Liu
Remote Sens. 2026, 18(6), 851; https://doi.org/10.3390/rs18060851 - 10 Mar 2026
Viewed by 566
Abstract
Remote sensing open-vocabulary object detection focuses on identifying and localizing unseen categories within remote sensing imagery. However, constrained by characteristics such as dense target distribution, complex background interference, and drastic scale variations inherent to remote sensing scenarios, existing methods are prone to background noise interference when extracting features from dense, small target regions. This leads to weakened semantic representation and reduced localization accuracy. Therefore, we propose RS-DINO to address these challenges. First, to address the issue of small features being obscured by the background, the feature extraction module incorporates a multi-scale large-kernel attention mechanism. This expands the receptive field while enhancing local detail modelling, significantly improving the feature representation of minute targets. Second, a cross-modal feature fusion module employing bidirectional cross-attention achieves deep alignment between image and textual features. Subsequently, a language-guided query selection mechanism enhances detection accuracy through hybrid query strategies. Finally, to enhance the spatial sensitivity and channel adaptability of fusion features, the multimodal decoder integrates a convolutional gated feedforward network, significantly boosting the model’s robustness in dense, multi-scale scenes. Experiments on DIOR, DOTA v2.0, and NWPU-VHR10 demonstrate substantial gains, with fine-tuned RS-DINO surpassing existing methods by 3.5%, 3.7%, and 4.0% in accuracy, respectively.
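
A minimal sketch of the bidirectional cross-attention idea behind the cross-modal fusion module, using torch.nn.MultiheadAttention as a generic stand-in (token counts and dimensions are illustrative assumptions):

```python
import torch
import torch.nn as nn

dim, heads = 256, 8
img2txt = nn.MultiheadAttention(dim, heads, batch_first=True)
txt2img = nn.MultiheadAttention(dim, heads, batch_first=True)

img = torch.randn(2, 400, dim)   # 400 image tokens (flattened features)
txt = torch.randn(2, 20, dim)    # 20 text tokens (category prompts)

# Bidirectional fusion: each modality attends to the other, so image
# features become language-aware and text features become image-aware.
img_fused, _ = img2txt(query=img, key=txt, value=txt)
txt_fused, _ = txt2img(query=txt, key=img, value=img)
print(img_fused.shape, txt_fused.shape)  # (2, 400, 256) (2, 20, 256)
```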

23 pages, 15691 KB  
Article
ProM-Pose: Language-Guided Zero-Shot 9-DoF Object Pose Estimation from RGB-D with Generative 3D Priors
by Yuchen Li, Kai Qin, Haitao Wu and Xiangjun Qu
Electronics 2026, 15(5), 1111; https://doi.org/10.3390/electronics15051111 - 7 Mar 2026
Viewed by 452
Abstract
Object pose estimation is fundamental for robotic manipulation, autonomous driving, and augmented reality, yet recovering the full 9-DoF state (rotation, translation, and anisotropic 3D scale) from RGB-D observations remains challenging for previously unseen objects. Existing methods rely on instance-specific CAD models or predefined category boundaries, or suffer from scale ambiguity under sparse observations. We propose ProM-Pose, a unified cross-modal temporal perception framework for zero-shot 9-DoF object pose estimation. By integrating language-conditioned generative 3D shape priors as canonical geometric references, an asymmetric cross-modal attention mechanism for spatially aware fusion, and a decoupled pose decoding strategy with temporal refinement, ProM-Pose constructs metrically consistent and semantically grounded representations without relying on category-specific pose priors or instance-level CAD supervision. Extensive experiments on CAMERA25 and REAL275 benchmarks demonstrate that ProM-Pose achieves competitive or superior performance compared to category-level methods, with mAP of 75.0% at 5°, 2 cm and 90.5% at 10°, 5 cm on CAMERA25, and 42.2% at 5°, 2 cm and 76.0% at 10°, 5 cm on REAL275 under zero-shot cross-domain evaluation. Qualitative results on real-world logistics scenarios further validate temporal stability and robustness under occlusion and lighting variations. ProM-Pose effectively bridges semantic grounding and metric geometric reasoning within a unified formulation, enabling stable and scale-aware 9-DoF pose estimation for previously unseen objects under open-vocabulary conditions.
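
The paper's decoupled pose decoding is not specified here; one common ingredient in such pipelines is a continuous 6D rotation representation decoded into a rotation matrix by Gram-Schmidt orthonormalization, sketched below as a hedged illustration rather than ProM-Pose's actual decoder:

```python
import torch
import torch.nn.functional as F

def rot6d_to_matrix(x6):
    """Map a 6D rotation representation (two 3-vectors) to a valid
    3x3 rotation matrix via Gram-Schmidt orthonormalization."""
    a1, a2 = x6[..., :3], x6[..., 3:]
    b1 = F.normalize(a1, dim=-1)
    b2 = F.normalize(a2 - (b1 * a2).sum(-1, keepdim=True) * b1, dim=-1)
    b3 = torch.cross(b1, b2, dim=-1)
    return torch.stack([b1, b2, b3], dim=-2)

# A 9-DoF state then decouples into rotation (6D), translation (3),
# and anisotropic scale (3), each predicted by its own head.
R = rot6d_to_matrix(torch.randn(4, 6))
print(R.shape, torch.det(R))  # (4, 3, 3), determinants ~ +1
```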

25 pages, 15267 KB  
Article
3D Semantic Map Reconstruction for Orchard Environments Using Multi-Sensor Fusion
by Quanchao Wang, Yiheng Chen, Jiaxiang Li, Yongxing Chen and Hongjun Wang
Agriculture 2026, 16(4), 455; https://doi.org/10.3390/agriculture16040455 - 15 Feb 2026
Viewed by 778
Abstract
Semantic point cloud maps play a pivotal role in smart agriculture. They provide not only core three-dimensional data for orchard management but also empower robots with environmental perception, enabling safer and more efficient navigation and planning. However, traditional point cloud maps primarily model surrounding obstacles from a geometric perspective, failing to capture distinctions and characteristics between individual obstacles. In contrast, semantic maps encompass semantic information and even topological relationships among objects in the environment. Furthermore, existing semantic map construction methods are predominantly vision-based, making them ill-suited to handle rapid lighting changes in agricultural settings that can cause positioning failures. Therefore, this paper proposes a positioning and semantic map reconstruction method tailored for orchards. It integrates visual, LiDAR, and inertial sensors to obtain high-precision pose and point cloud maps. By combining open-vocabulary detection and semantic segmentation models, it projects two-dimensional detected semantic information onto the three-dimensional point cloud, ultimately generating a point cloud map enriched with semantic information. The resulting 2D occupancy grid map is utilized for robotic motion planning. Experimental results demonstrate that on a custom dataset, the proposed method achieves 74.33% mIoU for semantic segmentation accuracy, 12.4% relative error for fruit recall rate, and 0.038803 m mean translation error for localization. The deployed semantic segmentation network Fast-SAM achieves a processing time of 13.36 ms per frame. These results demonstrate that the proposed method combines high accuracy with real-time performance in semantic map reconstruction. This exploratory work provides theoretical and technical references for future research on more precise localization and more complete semantic mapping, offering broad application prospects and providing key technological support for intelligent agriculture.
(This article belongs to the Special Issue Advances in Robotic Systems for Precision Orchard Operations)
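
The projection of 2D semantic labels onto a 3D point cloud follows the standard pinhole-camera mapping; a minimal numpy sketch under that assumption (the intrinsics and label map are illustrative placeholders):

```python
import numpy as np

def project_labels(points_cam, label_map, K):
    """Assign each 3D point (camera frame) the semantic label of the
    pixel it projects to; points outside the image get label -1."""
    z = points_cam[:, 2]
    uv = (points_cam @ K.T)[:, :2] / z[:, None]   # pinhole projection
    u, v = uv[:, 0].astype(int), uv[:, 1].astype(int)
    h, w = label_map.shape
    valid = (z > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    labels = np.full(len(points_cam), -1, dtype=int)
    labels[valid] = label_map[v[valid], u[valid]]
    return labels

K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
pts = np.random.uniform([-2, -2, 1], [2, 2, 10], size=(1000, 3))
mask = np.random.randint(0, 5, size=(480, 640))   # fake segmentation
print(np.bincount(project_labels(pts, mask, K) + 1))
```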

21 pages, 3256 KB  
Article
Open-Vocabulary Segmentation of Aerial Point Clouds
by Ashkan Alami and Fabio Remondino
Remote Sens. 2026, 18(4), 572; https://doi.org/10.3390/rs18040572 - 12 Feb 2026
Viewed by 662
Abstract
The growing diversity and dynamics of urban environments demand 3D semantic segmentation methods that can recognize a wide range of objects without relying on predefined classes or time-consuming labelled training data. As urban scenes evolve and application requirements vary across locations, flexible, annotation-free 3D segmentation methods are becoming increasingly desirable for large-scale 3D analytics. This work presents the first training-free, open-vocabulary (OV) method for 3D aerial point cloud classification and benchmarks it against state-of-the-art supervised 3D neural networks for the semantic enrichment of these geospatial data. The proposed approach applies open-vocabulary object recognition across multiple 2D images and subsequently projects and refines these detections in 3D space, enabling semantic labelling without prior class definitions or annotated data. In contrast, the supervised baselines are trained on labelled datasets and restricted to a fixed set of object categories. We evaluate all methods with quantitative metrics and qualitative analysis, highlighting their respective strengths, limitations and suitability for scalable urban 3D mapping. By removing the dependency on annotated data and fixed taxonomies, this work represents a key step toward adaptive, scalable and semantic understanding of 3D urban environments.
(This article belongs to the Special Issue GeoAI for Urban Understanding: Fusing Multi-Source Geospatial Data)
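
When detections from multiple 2D images are projected onto the same points, the per-point labels are commonly refined by majority voting; a minimal sketch of such multi-view fusion (illustrative, not necessarily the paper's exact refinement step):

```python
import numpy as np

def majority_vote(labels_per_view):
    """Fuse per-view point labels (V, N), -1 = unlabeled, by majority
    vote; points never seen by any view stay -1."""
    v, n = labels_per_view.shape
    fused = np.full(n, -1, dtype=int)
    for i in range(n):
        votes = labels_per_view[:, i]
        votes = votes[votes >= 0]
        if votes.size:
            fused[i] = np.bincount(votes).argmax()
    return fused

# Three views disagree on some of five points; the vote resolves them.
views = np.array([[0, 1, -1, 2, 2],
                  [0, 1,  1, 2, -1],
                  [0, 3,  1, 2, -1]])
print(majority_vote(views))  # [0 1 1 2 2]
```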

19 pages, 8340 KB  
Article
Open-Vocabulary Multi-Object Tracking Based on Multi-Cue Fusion
by Liangfeng Xu, Jinqi Bai, Lin Nai and Chang Liu
Appl. Sci. 2025, 15(24), 13151; https://doi.org/10.3390/app152413151 - 15 Dec 2025
Viewed by 818
Abstract
Multi-object tracking (MOT) technology integrates multiple fields such as pattern recognition, machine learning, and object detection, demonstrating broad application potential in scenarios like low-altitude logistics delivery, urban security, autonomous driving, and intelligent navigation. However, in open-world scenarios, existing MOT methods often face challenges of imprecise target category identification and insufficient tracking accuracy, especially when dealing with numerous target types affected by occlusion and deformation. To address this, we propose a multi-object tracking strategy based on multi-cue fusion. This strategy combines appearance features and spatial feature information, employing BYTE and weighted Intersection over Union (IoU) modules to handle target association, thereby improving tracking accuracy. Furthermore, to tackle the challenge of large vocabularies in open-world scenarios, we introduce an open-vocabulary prompting strategy. By incorporating diverse sentence structures, emotional elements, and image quality descriptions, the expressiveness of text descriptions is enhanced. Combined with the CLIP model, this strategy significantly improves the recognition capability for novel category targets without requiring model retraining. Experimental results on the public TAO benchmark show that our method yields consistent TETA improvements over existing open-vocabulary trackers, with gains of 10% and 16% on base and novel categories, respectively. The results demonstrate that the proposed framework offers a more robust solution for open-vocabulary multi-object tracking in complex environments.
(This article belongs to the Special Issue AI for Sustainability and Innovation—2nd Edition)
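
A minimal sketch of the multi-cue association idea: an appearance (cosine) similarity and an IoU term are blended into a single score matrix solved by Hungarian matching. The 0.5/0.5 blend and the threshold are illustrative assumptions, not the paper's tuning:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(iou, app_sim, w_iou=0.5, w_app=0.5, min_score=0.3):
    """Match tracks to detections from fused IoU and appearance cues.
    iou, app_sim: (num_tracks, num_dets) matrices in [0, 1]."""
    score = w_iou * iou + w_app * app_sim
    rows, cols = linear_sum_assignment(-score)  # maximize fused score
    return [(r, c) for r, c in zip(rows, cols) if score[r, c] >= min_score]

iou = np.array([[0.7, 0.1], [0.0, 0.6]])
app = np.array([[0.9, 0.2], [0.1, 0.8]])
print(associate(iou, app))  # [(0, 0), (1, 1)]
```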

28 pages, 583 KB  
Article
Multiple Large AI Models’ Consensus for Object Detection—A Survey
by Marcin Iwanowski and Marcin Gahbler
Appl. Sci. 2025, 15(24), 12961; https://doi.org/10.3390/app152412961 - 9 Dec 2025
Viewed by 2330
Abstract
The rapid development of large artificial intelligence (AI) models—large language models (LLMs), multimodal large language models (MLLMs) and vision–language models (VLMs)—has enabled instruction-driven visual understanding, where a single foundation model can recognize and localize arbitrary objects from natural-language prompts. However, predictions from individual models remain inconsistent—LLMs hallucinate nonexistent entities, while VLMs exhibit limited recall and unstable calibration compared to purpose-trained detectors. To address these limitations, a new paradigm termed “multiple large AI models’ consensus” has emerged. In this approach, multiple heterogeneous LLMs, MLLMs or VLMs process a shared visual–textual instruction and generate independent structured outputs (bounding boxes and categories). Next, their results are merged through consensus mechanisms. This cooperative inference improves spatial accuracy and semantic correctness, making it particularly suitable for generating high-quality training datasets for fast real-time object detectors. This survey provides a comprehensive overview of multiple large AI models’ consensus for object detection. We formalize the concept, review related literature on ensemble reasoning and multimodal perception, and categorize existing methods into four frameworks: prompt-level, reasoning-to-detection, box-level, and hybrid consensus. We further analyze fusion algorithms, evaluation metrics, and benchmark datasets, highlighting their strengths and limitations. Finally, we discuss open challenges—vocabulary alignment, uncertainty calibration, computational efficiency, and bias propagation—and identify emerging trends such as consensus-aware training, structured reasoning, and collaborative perception ecosystems.
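
A minimal sketch of box-level consensus: boxes from different models that overlap above an IoU threshold are clustered and fused by confidence-weighted averaging, a simplified stand-in for the fusion algorithms the survey categorizes:

```python
import numpy as np

def iou(a, b):
    """IoU of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    ua = (a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter
    return inter / ua if ua > 0 else 0.0

def consensus(boxes, scores, thr=0.5):
    """Cluster overlapping boxes from different models and fuse each
    cluster by confidence-weighted coordinate averaging."""
    order = np.argsort(scores)[::-1]
    fused, used = [], set()
    for i in order:
        if i in used:
            continue
        group = [j for j in order
                 if j not in used and iou(boxes[i], boxes[j]) >= thr]
        used.update(group)
        w = scores[group][:, None]
        fused.append(((boxes[group] * w).sum(0) / w.sum(), scores[group].mean()))
    return fused

boxes = np.array([[10, 10, 50, 50], [12, 11, 52, 49], [200, 200, 240, 240]], float)
scores = np.array([0.9, 0.8, 0.7])
for b, s in consensus(boxes, scores):
    print(np.round(b, 1), round(s, 2))
```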

19 pages, 2418 KB  
Article
D-Know: Disentangled Domain Knowledge-Aided Learning for Open-Domain Continual Object Detection
by Bintao He, Caixia Yan, Yan Kou, Yinghao Wang, Xin Lv, Haipeng Du and Yugui Xie
Appl. Sci. 2025, 15(23), 12723; https://doi.org/10.3390/app152312723 - 1 Dec 2025
Viewed by 641
Abstract
Continual learning for open-vocabulary object detection aims to enable pretrained vision–language detectors to adapt to diverse specialized domains while preserving their zero-shot generalization capabilities. However, existing methods primarily focus on mitigating catastrophic forgetting, often neglecting the substantial domain shifts commonly encountered in real-world applications. To address this critical oversight, we pioneer Open-Domain Continual Object Detection (OD-COD), a new paradigm that requires detectors to continually adapt across domains with significant stylistic gaps. We propose Disentangled Domain Knowledge-Aided Learning (D-Know) to tackle this challenge. This framework explicitly disentangles domain-general priors from category-specific adaptation, managing them dynamically in a scalable domain knowledge base. Specifically, D-Know first learns domain priors in a self-supervised manner and then leverages these priors to facilitate category-specific adaptation within each domain. To rigorously evaluate this task, we construct OD-CODB, the first dedicated benchmark spanning six domains with substantial visual variations. Extensive experiments demonstrate that D-Know achieves superior performance, surpassing current state-of-the-art methods by an average of 4.2% mAP under open-domain continual settings while maintaining strong zero-shot generalization. Furthermore, experiments under the few-shot setting confirm D-Know's superior data efficiency.
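
The scalable domain knowledge base is described only at a high level; one plausible minimal reading, per-domain adapters retrieved by domain key and kept separate from a shared prior, is sketched below with entirely illustrative names and structure:

```python
import torch

class DomainKnowledgeBase:
    """Toy store that keeps a domain-general prior separate from
    per-domain, category-specific adapters (illustrative only)."""
    def __init__(self, dim):
        self.general_prior = torch.zeros(dim)   # shared across domains
        self.domain_entries = {}                # domain name -> adapter

    def add_domain(self, name, dim):
        # Each domain gets its own lightweight adaptation module.
        self.domain_entries[name] = torch.nn.Linear(dim, dim)

    def adapt(self, feat, domain):
        # Category-specific adaptation on top of the shared prior.
        return self.domain_entries[domain](feat + self.general_prior)

kb = DomainKnowledgeBase(dim=256)
kb.add_domain("sketch", 256)
kb.add_domain("infrared", 256)
print(kb.adapt(torch.randn(256), "infrared").shape)  # torch.Size([256])
```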

24 pages, 4364 KB  
Article
Determining the Optimal T-Value for the Temperature Scaling Calibration Method Using the Open-Vocabulary Detection Model YOLO-World
by Max Andreas Ingrisch, Rani Marcel Schilling, Ingo Chmielewski and Stefan Twieg
Appl. Sci. 2025, 15(22), 12062; https://doi.org/10.3390/app152212062 - 13 Nov 2025
Cited by 1 | Viewed by 1753
Abstract
Object detection is an important tool in many areas, such as robotics or autonomous driving. Especially in these areas, a wide variety of object classes must be detected or interacted with. Models from the field of Open-Vocabulary Detection (OVD) provide a solution here, as they can detect not only base classes but also novel object classes, i.e., classes that were not seen during training. However, one problem with OVD models is their poor calibration, meaning that predictions are often over- or under-confident. To improve calibration, Temperature Scaling is used in this study. Using YOLO-World, one of the best-performing OVD models, the aim is to determine the optimal T-value for this calibration method. For this reason, it is investigated whether there is a correlation between the logit distribution and the optimal T-value and how this correlation can be modeled. Finally, the influence of Temperature Scaling on the Expected Calibration Error (ECE) and the mAP (Mean Average Precision) is analyzed. The results of this study show that similar logit distributions across different datasets result in the same optimal T-values. This correlation could be best modeled using Kernel Ridge Regression (KRR) and a Support Vector Machine (SVM). In all cases, the ECE could be improved by Temperature Scaling without significantly reducing the mAP.
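
Temperature Scaling itself is a one-parameter post-hoc calibration: logits are divided by T before the softmax, and the ECE measures the gap between confidence and accuracy over bins. A minimal sketch with illustrative values (a larger T softens over-confident predictions):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: weighted average |accuracy - confidence| over confidence bins."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return ece

logits = np.random.randn(1000, 5) * 4      # over-confident raw logits
labels = np.random.randint(0, 5, 1000)
for T in (1.0, 2.0, 4.0):                  # T > 1 softens the distribution
    probs = softmax(logits / T)            # temperature-scaled softmax
    conf, pred = probs.max(1), probs.argmax(1)
    print(T, round(expected_calibration_error(conf, pred == labels), 3))
```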

24 pages, 4764 KB  
Article
Mask-Guided Teacher–Student Learning for Open-Vocabulary Object Detection in Remote Sensing Images
by Shuojie Wang, Yu Song, Jiajun Xiang, Yanyan Chen, Ping Zhong and Ruigang Fu
Remote Sens. 2025, 17(19), 3385; https://doi.org/10.3390/rs17193385 - 9 Oct 2025
Viewed by 1853
Abstract
Open-vocabulary object detection in remote sensing aims to detect novel categories not seen during training, which is crucial for practical aerial image analysis applications. While some approaches accomplish this task through large-scale data construction, such methods incur substantial annotation and computational costs. In contrast, we focus on efficient utilization of limited datasets. However, existing methods such as CastDet struggle with inefficient data utilization and class imbalance issues in pseudo-label generation for novel categories. We propose an enhanced open-vocabulary detection framework that addresses these limitations through two key innovations. First, we introduce a selective masking strategy that enables direct utilization of partially annotated images by masking base category regions in teacher model inputs. This approach eliminates the need for strict data separation and significantly improves data efficiency. Second, we develop a dynamic frequency-based class weighting that automatically adjusts category weights based on real-time pseudo-label statistics to mitigate class imbalance issues. Our approach integrates these components into a student–teacher learning framework with RemoteCLIP for novel category classification. Comprehensive experiments demonstrate significant improvements on both datasets: on VisDroneZSD, we achieve 42.7% overall mAP and 41.4% harmonic mean, substantially outperforming existing methods. On the DIOR dataset, our method achieves 63.7% overall mAP with 49.5% harmonic mean. Our framework achieves more balanced performance between base and novel categories, providing a practical and data-efficient solution for open-vocabulary aerial object detection.
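
The dynamic frequency-based class weighting can be read as inverse-frequency weights recomputed from running pseudo-label counts; a minimal sketch of that mechanism (the smoothing exponent and normalization are illustrative assumptions):

```python
import numpy as np

class DynamicClassWeights:
    """Keep running pseudo-label counts per novel class and derive
    inverse-frequency loss weights, renormalized to mean 1."""
    def __init__(self, num_classes, power=0.5):
        self.counts = np.ones(num_classes)   # start uniform (add-one)
        self.power = power

    def update(self, pseudo_labels):
        for c in pseudo_labels:
            self.counts[c] += 1

    def weights(self):
        w = 1.0 / self.counts ** self.power  # rare classes weighted up
        return w / w.mean()

dw = DynamicClassWeights(num_classes=4)
dw.update([0, 0, 0, 0, 0, 1, 2])            # class 0 dominates this batch
print(np.round(dw.weights(), 2))            # class 3 weighted highest
```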

21 pages, 3747 KB  
Article
Open-Vocabulary Crack Object Detection Through Attribute-Guided Similarity Probing
by Hyemin Yoon and Sangjin Kim
Appl. Sci. 2025, 15(19), 10350; https://doi.org/10.3390/app151910350 - 24 Sep 2025
Viewed by 2304
Abstract
Timely detection of road surface defects such as cracks and potholes is critical for ensuring traffic safety and reducing infrastructure maintenance costs. While recent advances in image-based deep learning techniques have shown promise for automated road defect detection, existing models remain limited to closed-set detection settings, making it difficult to recognize newly emerging or fine-grained defect types. To address this limitation, we propose an attribute-aware open-vocabulary crack detection (AOVCD) framework, which leverages the alignment capability of pretrained vision–language models to generalize beyond fixed class labels. In this framework, crack types are represented as combinations of visual attributes, enabling semantic grounding between image regions and natural language descriptions. To support this, we extend the existing PPDD dataset with attribute-level annotations and incorporate a multi-label attribute recognition task as an auxiliary objective. Experimental results demonstrate that the proposed AOVCD model outperforms existing baselines. In particular, compared to CLIP-based zero-shot inference, the proposed model achieves approximately a 10-fold improvement in average precision (AP) for novel crack categories. Attribute classification performance—covering geometric, spatial, and textural features—also increases by 40% in balanced accuracy (BACC) and 23% in AP. These results indicate that integrating structured attribute information enhances generalization to previously unseen defect types, especially those involving subtle visual cues. Our study suggests that incorporating attribute-level alignment within a vision–language framework can lead to more adaptive and semantically grounded defect recognition systems.
(This article belongs to the Section Computing and Artificial Intelligence)
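
The attribute-guided scoring can be sketched as scoring each region against attribute text embeddings and then combining attribute scores into class scores through a class-attribute membership matrix (all tensors below are random placeholders; the mean aggregation is an illustrative assumption):

```python
import torch
import torch.nn.functional as F

regions = F.normalize(torch.randn(50, 512), dim=-1)   # region embeddings
attr_txt = F.normalize(torch.randn(8, 512), dim=-1)   # 8 attribute prompts

# Binary class-attribute matrix: which attributes describe each crack type,
# e.g. a "longitudinal" crack might be {linear, vertical}.
class_attr = torch.tensor([[1, 1, 0, 0, 0, 0, 0, 0],
                           [0, 0, 1, 1, 0, 0, 0, 0],
                           [0, 0, 0, 0, 1, 1, 1, 0]], dtype=torch.float)

attr_scores = regions @ attr_txt.t()                   # (50, 8)
# Class score = mean similarity over that class's attributes.
class_scores = attr_scores @ class_attr.t() / class_attr.sum(1)
print(class_scores.argmax(dim=-1)[:10])                # predicted crack types
```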
