Search Results (680)

Search Parameters:
Keywords = high-level semantic information

22 pages, 4198 KB  
Article
CGHP: Component-Guided Hierarchical Progressive Point Cloud Unsupervised Segmentation Framework
by Shuo Shi, Haifeng Zhao, Wei Gong and Sifu Bi
Remote Sens. 2025, 17(21), 3589; https://doi.org/10.3390/rs17213589 - 30 Oct 2025
Abstract
With the rapid development of airborne LiDAR and photogrammetric techniques, massive amounts of high-resolution 3D point cloud data have become increasingly available. However, extracting meaningful semantic information from such unstructured and noisy point clouds remains a challenging task, particularly in the absence of manually annotated labels. We present CGHP, a novel component-guided hierarchical progressive framework that addresses this challenge through a two-stage learning approach. Our method first decomposes point clouds into components using geometric and appearance consistency, constructing comprehensive geometric-appearance descriptors that capture shape, scale, and gravity-aligned distribution information to guide initial feature learning. These component-level features then undergo progressive growth through an adjacency-constrained clustering algorithm that gradually merges components into object-level semantic clusters. Extensive experiments on the publicly available S3DIS and ScanNet++ point cloud datasets demonstrate the effectiveness of the proposed method. On the S3DIS dataset, our method achieves state-of-the-art performance, with 48.69% mIoU and 79.68% OA, without using any annotations, closely approaching the results of fully supervised PointNet++ (50.1% mIoU, 77.5% OA). On the more challenging ScanNet++ benchmark, our approach also demonstrates competitive performance in terms of both mAcc and mIoU. Full article
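As a rough illustration of the adjacency-constrained progressive merging described above, here is a minimal Python sketch (our own reconstruction, not the authors' code; the cosine features, similarity threshold, and greedy schedule are all assumptions):

```python
# Hypothetical sketch: greedy, adjacency-constrained merging of component
# features into object-level clusters, in the spirit of CGHP's second stage.
import numpy as np

def merge_components(feats, adjacency, sim_thresh=0.9):
    """feats: (n, d) component descriptors; adjacency: set of (i, j) pairs.
    Only spatially adjacent components are allowed to merge."""
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    parent = list(range(len(feats)))

    def find(i):                                  # union-find root lookup
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Most-similar pairs first, as a stand-in for progressive growth.
    for i, j in sorted(adjacency, key=lambda p: -(feats[p[0]] @ feats[p[1]])):
        if feats[i] @ feats[j] > sim_thresh and find(i) != find(j):
            parent[find(j)] = find(i)             # merge the two clusters
    return np.array([find(i) for i in range(len(feats))])

# toy usage: four components in a chain; 0-1 and 2-3 are near-duplicates
f = np.array([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 0.99]])
print(merge_components(f, {(0, 1), (1, 2), (2, 3)}))   # -> [0 0 2 2]
```

A full implementation would compare cluster-level rather than raw component features and iterate over several growth rounds, but the adjacency constraint above is the essential restriction.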

23 pages, 1456 KB  
Article
Progressive Prompt Generative Graph Convolutional Network for Aspect-Based Sentiment Quadruple Prediction
by Yun Feng and Mingwei Tang
Electronics 2025, 14(21), 4229; https://doi.org/10.3390/electronics14214229 - 29 Oct 2025
Abstract
Aspect-based sentiment quadruple prediction has important application value in the current information age. Sentences often contain implicit expressions and multi-level semantic relationships, making accurate prediction still a complex and challenging task for existing methods. To address the above problems, this paper proposes the Progressive Prompt-Driven Generative Graph Convolutional Network for Aspect-Based Sentiment Quadruple Prediction (ProPGCN). Firstly, a progressive prompt module is proposed. The module uses progressive prompt templates to generate paradigm expressions of corresponding orders and introduces third-order element prompt templates to associate high-order semantics in sentences, providing a bridge for modeling the final global semantics. Secondly, a graph convolutional relation-enhanced reasoning module is designed, which makes full use of contextual dependency information to enhance the recognition of implicit aspects and implicit opinions. In addition, a graph convolutional aggregation strategy is constructed. The strategy uses graph convolutional networks to aggregate adjacent node information and correct conflicting implicit logical relationships. Finally, experimental results show that the ProPGCN model achieves state-of-the-art performance. Specifically, our ProPGCN model achieves overall F1 scores of 65.04% and 47.89% on the Restaurant and Laptop datasets, respectively, representing improvements of +0.83% and +0.61% over the previous strongest generative baseline. Full article
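The graph convolutional aggregation step can be pictured with a short sketch (assumed form, ours; ProPGCN's actual layers are more elaborate): each token embedding is updated with the degree-normalized sum of its dependency-graph neighbours.

```python
# Minimal GCN layer over a dependency graph (illustrative, not the paper's code).
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, adj):
        # x: (batch, tokens, dim); adj: (batch, tokens, tokens) 0/1 dependency arcs
        adj = adj + torch.eye(adj.size(-1), device=adj.device)   # add self-loops
        deg = adj.sum(-1, keepdim=True).clamp(min=1)
        h = adj @ x / deg                    # mean-aggregate neighbour information
        return torch.relu(self.proj(h))

# toy usage: 5 tokens, 16-dim embeddings, one dependency arc between tokens 0 and 1
x, adj = torch.randn(1, 5, 16), torch.zeros(1, 5, 5)
adj[0, 0, 1] = adj[0, 1, 0] = 1.0
print(GCNLayer(16)(x, adj).shape)            # torch.Size([1, 5, 16])
```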

25 pages, 1179 KB  
Article
Quantifying Fire Risk Index in Chemical Industry Using Statistical Modeling Procedure
by Hyewon Jung, Seungil Ahn, Seungho Choi and Yeseul Jeon
Appl. Sci. 2025, 15(21), 11508; https://doi.org/10.3390/app152111508 - 28 Oct 2025
Viewed by 74
Abstract
Fire incident reports contain detailed textual narratives that capture causal factors often overlooked in structured records, while financial damage amounts provide measurable outcomes of these events. Integrating these two sources of information is essential for uncovering interpretable links between descriptive causes and their economic consequences. To this end, we develop a data-driven framework that constructs a composite Risk Index, enabling systematic quantification of how specific keywords relate to property damage amounts. This index facilitates both the identification of high-impact terms and the aggregation of risks across semantically related clusters, thereby offering a principled measure of fire-related financial risk. Using more than a decade of Korean fire investigation reports on the chemical industry classified as Special Buildings (2013–2024), we employ topic modeling and network-based embedding to estimate semantic similarities from interactions among words, and subsequently apply Lasso regression to quantify their associations with property damage amounts, thereby estimating the fire risk index. This approach enables us to assess fire risk not only at the level of individual terms, but also within their broader textual context, where highly interactive related words provide insights into collective patterns of hazard representation and their potential impact on expected losses. The analysis highlights several domains of risk, including hazardous chemical leakage, unsafe storage practices, equipment and facility malfunctions, and environmentally induced ignition. The results demonstrate that text-derived indices provide interpretable and practically relevant insights, bridging unstructured narratives with structured loss information and offering a basis for evidence-based fire risk assessment and management. The derived Risk Index provides practical reference data for both safety management and insurance underwriting by enabling the prioritization of preventive measures within industrial sites and offering quantitative guidance for assessing facility-specific risk levels in insurance decisions. An R implementation of the proposed framework is openly available for public use. Full article
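To make the keyword-to-damage mapping concrete, here is a toy Python analogue of the Lasso stage (the authors ship an R implementation; the data, vocabulary, and alpha below are placeholders of ours):

```python
# Illustrative sketch: keyword indicators -> Lasso -> per-keyword risk scores.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Lasso

reports = [
    "hazardous chemical leakage near storage tank",
    "equipment malfunction caused ignition",
    "unsafe storage of solvents and chemical leakage",
]
log_damage = np.log1p(np.array([5_000_000, 200_000, 3_000_000]))  # toy KRW amounts

vec = CountVectorizer(binary=True)
X = vec.fit_transform(reports)
model = Lasso(alpha=0.01).fit(X.toarray(), log_damage)

# Non-zero coefficients act as per-keyword risk scores; the paper then
# aggregates them over semantically related clusters into a Risk Index.
for word, coef in zip(vec.get_feature_names_out(), model.coef_):
    if abs(coef) > 1e-6:
        print(f"{word}: {coef:+.3f}")
```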
(This article belongs to the Special Issue Advanced Methodology and Analysis in Fire Protection Science)

14 pages, 555 KB  
Article
A Symmetric Multiscale Detail-Guided Attention Network for Cardiac MR Image Semantic Segmentation
by Hengqi Hu, Bin Fang, Bin Duo, Xuekai Wei, Jielu Yan, Weizhi Xian and Dongfen Li
Symmetry 2025, 17(11), 1807; https://doi.org/10.3390/sym17111807 - 27 Oct 2025
Viewed by 191
Abstract
Cardiac medical image segmentation can advance healthcare and embedded vision systems. In this paper, a symmetric semantic segmentation architecture for cardiac magnetic resonance (MR) images based on a symmetric multiscale detail-guided attention network is presented. Detailed information and multiscale attention maps can be exploited more efficiently in this model. A symmetric encoder and decoder are used to generate high-dimensional semantic feature maps and segmentation masks, respectively. First, a series of densely connected residual blocks is introduced for extracting high-dimensional semantic features. Second, an asymmetric detail-guided module is proposed. In this module, a feature pyramid extracts detailed information and generates detailed feature maps that guide the model during the training phase; these maps are used to extract deep multiscale features and to compute a detail loss against specific encoder semantic features. Third, a series of multiscale upsampling attention blocks symmetrical to the encoder is introduced in the decoder of the model. In each upsampling attention block, feature fusion is first performed on the previous-level low-resolution features and the symmetric skip connections of the same layer, and then spatial and channel attention are used to enhance the features. Image gradients of the input images are also introduced at the end of the decoder. Finally, the predicted segmentation masks are obtained by computing a detail loss and a segmentation loss. Our method demonstrates outstanding performance on a public cardiac MR image dataset, achieving accurate endocardial and epicardial segmentation of the left ventricle (LV). Full article
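The attention-based enhancement in each upsampling block can be sketched roughly as follows (our assumed form, in the CBAM family; the paper's exact block may differ):

```python
# Illustrative channel + spatial attention applied after feature fusion.
import torch
import torch.nn as nn

class ChannelSpatialAttention(nn.Module):
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels))
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):                           # x: (B, C, H, W)
        # channel attention from globally pooled descriptors
        w = torch.sigmoid(self.mlp(x.mean(dim=(2, 3))))[..., None, None]
        x = x * w
        # spatial attention from channel-wise mean and max maps
        s = torch.cat([x.mean(1, keepdim=True), x.max(1, keepdim=True).values], 1)
        return x * torch.sigmoid(self.spatial(s))

print(ChannelSpatialAttention(32)(torch.randn(2, 32, 48, 48)).shape)
```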
(This article belongs to the Special Issue Symmetry and Asymmetry in Embedded Systems)

40 pages, 33004 KB  
Article
Sampling-Based Path Planning and Semantic Navigation for Complex Large-Scale Environments
by Shakeeb Ahmad and James Sean Humbert
Robotics 2025, 14(11), 149; https://doi.org/10.3390/robotics14110149 - 24 Oct 2025
Viewed by 191
Abstract
This article proposes a multi-agent path planning and decision-making solution for high-tempo field robotic operations, such as search-and-rescue, in large-scale unstructured environments. As a representative example, subterranean environments can span many kilometers and pose challenges such as limited to no communication, hazardous terrain, blocked passages due to collapses, and vertical structures. The time-sensitive nature of these operations inherently requires solutions that are reliably deployable in practice. Moreover, a human-supervised multi-robot team is required to ensure that the mobility and cognitive capabilities of the various agents are leveraged for mission efficiency. Therefore, this article proposes a solution that is suited to both air and ground vehicles and is well adapted to information sharing between different agents. The article first details a sampling-based autonomous exploration solution that brings significant improvements over the current state of the art. These improvements include relying on an occupancy grid-based sample-and-project solution for terrain assessment and formulating the solution-search problem as a constraint-satisfaction problem to further enhance the computational efficiency of the planner. In addition, the demonstration of the exploration planner by team MARBLE at the DARPA Subterranean Challenge finals is presented. The inevitable interaction of heterogeneous autonomous robots with human operators demands common semantics for reasoning across robot and human teams that use different geometric map capabilities suited to their mobility and computational resources. To this end, the path planner is further extended to include semantic mapping and decision-making in the framework. First, the proposed solution generates a semantic map of the exploration environment by labeling the position history of a robot in the form of probability distributions of observations. The semantic reasoning solution then uses higher-level cues from the semantic map to bias exploration behaviors toward a semantic of interest. This objective is achieved by using a particle filter to localize a robot on a given semantic map, followed by a Partially Observable Markov Decision Process (POMDP)-based controller that guides the exploration direction of the sampling-based exploration planner. Hence, this article aims to bridge the understanding gap between humans and a heterogeneous robotic team, not only through common-sense semantic map transfer among the agents but also by enabling a robot to use such abstract information to guide its lower-level reasoning. Full article
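The semantic-localization step can be illustrated with a toy particle filter (a sketch under our own assumptions; the map size, label set, and resampling scheme are placeholders):

```python
# Particles over grid cells are re-weighted by how well each cell's semantic
# label distribution explains the robot's current observation.
import numpy as np

rng = np.random.default_rng(0)
semantic_map = rng.dirichlet(np.ones(3), size=(20, 20))  # P(label | cell), 3 labels

particles = rng.integers(0, 20, size=(500, 2))           # (row, col) hypotheses
weights = np.full(500, 1 / 500)

def update(observed_label):
    global particles, weights
    like = semantic_map[particles[:, 0], particles[:, 1], observed_label]
    weights = weights * like
    weights /= weights.sum()
    idx = rng.choice(len(weights), size=len(weights), p=weights)  # resample
    particles, weights = particles[idx], np.full(500, 1 / 500)

update(observed_label=1)          # e.g. the robot currently observes "corridor"
print(particles[:5])              # surviving location hypotheses
```

The POMDP controller would then score candidate exploration directions against the posterior these particles represent.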
(This article belongs to the Special Issue Autonomous Robotics for Exploration)

22 pages, 1087 KB  
Article
Modeling the Internal and Contextual Attention for Self-Supervised Skeleton-Based Action Recognition
by Wentian Xin, Yue Teng, Jikang Zhang, Yi Liu, Ruyi Liu, Yuzhi Hu and Qiguang Miao
Sensors 2025, 25(21), 6532; https://doi.org/10.3390/s25216532 - 23 Oct 2025
Viewed by 234
Abstract
Multimodal contrastive learning has achieved significant performance advantages in self-supervised skeleton-based action recognition. Previous methods are limited by modality imbalance, which reduces alignment accuracy and makes it difficult to combine important spatial–temporal frequency patterns, leading to confusion between modalities and weaker feature representations. To overcome these problems, we explore intra-modality feature-wise self-similarity and inter-modality instance-wise cross-consistency, and discover two inherent correlations that benefit recognition: (i) Global Perspective expresses how action semantics carry a broad and high-level understanding, which supports the use of globally discriminative feature representations. (ii) Focus Adaptation refers to the role of the frequency spectrum in guiding attention toward key joints by emphasizing compact and salient signal patterns. Building upon these insights, we propose MICA, a novel language–skeleton contrastive learning framework comprising two key components: (a) Feature Modulation, which constructs a skeleton–language action conceptual domain to minimize the expected information gain between vision and language modalities. (b) Frequency Feature Learning, which introduces a Frequency-domain Spatial–Temporal block (FreST) that focuses on sparse key human joints in the frequency domain with compact signal energy. Extensive experiments demonstrate that our method achieves remarkable action recognition performance on widely used benchmark datasets, including NTU RGB+D 60 and NTU RGB+D 120. Especially on the challenging PKU-MMD dataset, MICA achieves at least a 4.6% improvement over classical methods such as CrosSCLR and AimCLR, effectively demonstrating its ability to capture internal and contextual attention information. Full article
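The frequency-domain intuition behind FreST can be sketched as follows (assumed form, ours: keep only the highest-energy temporal frequency components of the skeleton sequence):

```python
# Illustrative frequency compression of skeleton sequences (not the paper's code).
import torch

def frequency_compress(x, keep=8):
    # x: (batch, time, joints, channels) skeleton sequence
    spec = torch.fft.rfft(x, dim=1)                      # per-joint temporal spectrum
    energy = spec.abs().mean(dim=(0, 2, 3))              # energy per frequency bin
    top = torch.topk(energy, k=min(keep, energy.numel())).indices
    mask = torch.zeros_like(energy, dtype=torch.bool)
    mask[top] = True
    spec = spec * mask.view(1, -1, 1, 1).to(spec.dtype)  # zero out weak bins
    return torch.fft.irfft(spec, n=x.size(1), dim=1)     # back to the time domain

x = torch.randn(4, 64, 25, 3)                            # NTU-style: 25 joints, xyz
print(frequency_compress(x).shape)                       # torch.Size([4, 64, 25, 3])
```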
(This article belongs to the Special Issue Deep Learning for Perception and Recognition: Method and Applications)

25 pages, 18442 KB  
Article
Exploring the Spatial Coupling Between Visual and Ecological Sensitivity: A Cross-Modal Approach Using Deep Learning in Tianjin’s Central Urban Area
by Zhihao Kang, Chenfeng Xu, Yang Gu, Lunsai Wu, Zhiqiu He, Xiaoxu Heng, Xiaofei Wang and Yike Hu
Land 2025, 14(11), 2104; https://doi.org/10.3390/land14112104 - 23 Oct 2025
Viewed by 372
Abstract
Amid rapid urbanization, Chinese cities face mounting ecological pressure, making it critical to balance environmental protection with public well-being. As visual perception accounts for over 80% of environmental information acquisition, it plays a key role in shaping experiences and evaluations of ecological space. However, current ecological planning often overlooks public perception, leading to increasing mismatches between ecological conditions and spatial experiences. While previous studies have attempted to introduce public perspectives, a systematic framework for analyzing the spatial relationship between ecological and visual sensitivity remains lacking. This study uses 56,210 street-level points in Tianjin's central urban area to construct a coordinated analysis framework of ecological and perceptual sensitivity. Visual sensitivity is derived from social media sentiment analysis (via GPT-4o) and street-view image semantic features extracted using the ADE20K semantic segmentation model, and subsequently processed through a Multilayer Perceptron (MLP) model. Ecological sensitivity is calculated using an Analytic Hierarchy Process (AHP)-based model integrating elevation, slope, normalized difference vegetation index (NDVI), land use, and nighttime light data. A coupling coordination model and bivariate Moran's I are employed to examine spatial synergy and mismatches between the two dimensions. Results indicate that while 72.82% of points show good coupling, spatial mismatches are widespread. The dominant types include "HL" (high visual–low ecological) areas (e.g., Wudadao) with high visual attention but low ecological resilience, and "LH" (low visual–high ecological) areas (e.g., Huaiyuanli) with strong ecological value but low public perception. This study provides a systematic path for analyzing the spatial divergence between ecological and perceptual sensitivity, offering insights into ecological landscape optimization and perception-driven street design. Full article
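The coupling coordination model referenced above is commonly computed as follows (this is the standard form from the land-use literature; the equal weights are our assumption):

```python
# Coupling degree C, composite index T, and coordination degree D for a pair
# of normalised sensitivities (visual u1, ecological u2) in [0, 1].
import numpy as np

def coupling_coordination(u1, u2, a=0.5, b=0.5):
    c = 2 * np.sqrt(u1 * u2) / (u1 + u2 + 1e-12)   # coupling degree C
    t = a * u1 + b * u2                            # composite evaluation T
    return np.sqrt(c * t)                          # coordination degree D

print(coupling_coordination(0.8, 0.3))   # mismatched "HL"-like pair -> ~0.70
print(coupling_coordination(0.7, 0.7))   # balanced pair             -> ~0.84
```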

34 pages, 8070 KB  
Article
AI-Enhanced Rescue Drone with Multi-Modal Vision and Cognitive Agentic Architecture
by Nicoleta Cristina Gaitan, Bianca Ioana Batinas and Calin Ursu
AI 2025, 6(10), 272; https://doi.org/10.3390/ai6100272 - 20 Oct 2025
Viewed by 731
Abstract
In post-disaster search and rescue (SAR) operations, unmanned aerial vehicles (UAVs) are essential tools, yet the large volume of raw visual data often overwhelms human operators by providing isolated, context-free information. This paper presents an innovative system with a novel cognitive–agentic architecture that transforms the UAV from an intelligent tool into a proactive reasoning partner. The core innovation lies in the LLM’s ability to perform high-level semantic reasoning, logical validation, and robust self-correction through internal feedback loops. A visual perception module based on a custom-trained YOLO11 model feeds the cognitive core, which performs contextual analysis and hazard assessment, enabling a complete perception–reasoning–action cycle. The system also incorporates a physical payload delivery module for first-aid supplies, which acts on prioritized, actionable recommendations to reduce operator cognitive load and accelerate victim assistance. This work, therefore, presents the first developed LLM-driven architecture of its kind, transforming a drone from a mere data-gathering tool into a proactive reasoning partner and demonstrating a viable path toward reducing operator cognitive load in critical missions. Full article

23 pages, 3132 KB  
Article
Symmetry-Aware Superpixel-Enhanced Few-Shot Semantic Segmentation
by Lan Guo, Xuyang Li, Jinqiang Wang, Yuqi Tong, Jie Xiao, Rui Zhou, Ling-Huey Li, Qingguo Zhou and Kuan-Ching Li
Symmetry 2025, 17(10), 1726; https://doi.org/10.3390/sym17101726 - 14 Oct 2025
Viewed by 370
Abstract
Few-Shot Semantic Segmentation (FSS) faces significant challenges in modeling complex backgrounds and maintaining prediction consistency due to limited training samples. Existing methods oversimplify backgrounds as single negative classes and rely solely on pixel-level alignments. To address these issues, we propose a symmetry-aware superpixel-enhanced FSS framework with a symmetric dual-branch architecture that explicitly models the superpixel region-graph in both the support and query branches. First, top–down cross-layer fusion injects low-level edge and texture cues into high-level semantics to build a more complete representation of complex backgrounds, improving foreground–background separability and boundary quality. Second, images are partitioned into superpixels and aggregated into “superpixel tokens” to construct a Region Adjacency Graph (RAG). Support-set prototypes are used to initialize query-pixel predictions, which are then projected into the superpixel space for cross-image prototype alignment with support superpixels. We further perform message passing/energy minimization on the RAG to enhance intra-region consistency and boundary adherence, and finally back-project the predictions to the pixel space. Lastly, by aggregating homogeneous semantic information, we construct robust foreground and background prototype representations, enhancing the model’s ability to perceive both seen and novel targets. Extensive experiments on the PASCAL-5i and COCO-20i benchmarks demonstrate that our proposed model achieves superior segmentation performance over the baseline and remains competitive with existing FSS methods. Full article
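The superpixel-token and Region Adjacency Graph construction can be sketched with scikit-image (SLIC and mean-colour edges are our stand-ins for the paper's exact pipeline; requires scikit-image >= 0.20):

```python
# Superpixels -> "superpixel tokens" -> Region Adjacency Graph (RAG).
import numpy as np
from skimage import data, segmentation, graph

img = data.astronaut()                               # stand-in query image
labels = segmentation.slic(img, n_segments=200, compactness=10, start_label=0)
rag = graph.rag_mean_color(img, labels)              # RAG for message passing

# one token per region: here simply the mean colour of its pixels
tokens = np.stack([img[labels == r].mean(axis=0) for r in np.unique(labels)])
print(labels.max() + 1, "superpixels;", rag.number_of_edges(), "RAG edges")
print(tokens.shape)                                  # (n_regions, 3)
```

In the paper's setting the tokens would carry learned features rather than mean colours, and message passing on the RAG enforces intra-region consistency and boundary adherence.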
(This article belongs to the Special Issue Symmetry in Process Optimization)

27 pages, 6909 KB  
Article
Comparative Analysis of Deep Learning and Traditional Methods for High-Resolution Cropland Extraction with Different Training Data Characteristics
by Dujuan Zhang, Xiufang Zhu, Yaozhong Pan, Hengliang Guo, Qiannan Li and Haitao Wei
Land 2025, 14(10), 2038; https://doi.org/10.3390/land14102038 - 13 Oct 2025
Viewed by 358
Abstract
High-resolution remote sensing (HRRS) imagery enables the extraction of cropland information with high levels of detail, especially when combined with the impressive performance of deep convolutional neural networks (DCNNs) in understanding these images. Comprehending the factors influencing DCNNs’ performance in HRRS cropland extraction is of considerable importance for practical agricultural monitoring applications. This study investigates the impact of classifier selection and different training data characteristics on the HRRS cropland classification outcomes. Specifically, Gaofen-1 composite images with 2 m spatial resolution are employed for HRRS cropland extraction, and two county-wide regions with distinct agricultural landscapes in Shandong Province, China, are selected as the study areas. The performance of two deep learning (DL) algorithms (UNet and DeepLabv3+) and a traditional classification algorithm, Object-Based Image Analysis with Random Forest (OBIA-RF), is compared. Additionally, the effects of different band combinations, crop growth stages, and class mislabeling on the classification accuracy are evaluated. The results demonstrated that the UNet and DeepLabv3+ models outperformed OBIA-RF in both simple and complex agricultural landscapes, and were insensitive to the changes in band combinations, indicating their ability to learn abstract features and contextual semantic information for HRRS cropland extraction. Moreover, compared with the DL models, OBIA-RF was more sensitive to changes in the temporal characteristics. The performance of all three models was unaffected when the mislabeling error ratio remained below 5%. Beyond this threshold, the performance of all models decreased, with UNet and DeepLabv3+ showing similar performance decline trends and OBIA-RF suffering a more drastic reduction. Furthermore, the DL models exhibited relatively low sensitivity to the patch size of sample blocks and data augmentation. These findings can facilitate the design of operational implementations for practical applications. Full article
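The class-mislabeling experiment can be reconstructed as a simple label-flipping procedure (our sketch; the flip ratios mirror the ~5% threshold reported above):

```python
# Inject a controlled ratio of label noise into a binary cropland mask.
import numpy as np

rng = np.random.default_rng(42)

def inject_label_noise(labels, ratio):
    """Randomly flip `ratio` of binary labels (1 = cropland, 0 = other)."""
    noisy = labels.copy()
    idx = rng.choice(labels.size, size=int(ratio * labels.size), replace=False)
    noisy.flat[idx] = 1 - noisy.flat[idx]
    return noisy

labels = rng.integers(0, 2, size=(512, 512))          # toy label mask
for ratio in (0.01, 0.05, 0.10):
    noisy = inject_label_noise(labels, ratio)
    print(ratio, round((noisy != labels).mean(), 3))  # realised flip rate
```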

23 pages, 23535 KB  
Article
FANT-Det: Flow-Aligned Nested Transformer for SAR Small Ship Detection
by Hanfu Li, Dawei Wang, Jianming Hu, Xiyang Zhi and Dong Yang
Remote Sens. 2025, 17(20), 3416; https://doi.org/10.3390/rs17203416 - 12 Oct 2025
Viewed by 439
Abstract
Ship detection in synthetic aperture radar (SAR) remote sensing imagery is of great significance in military and civilian applications. However, two factors limit detection performance: (1) a high prevalence of small-scale ship targets with limited information content and (2) interference from speckle noise and land–sea clutter. To address these challenges, we propose a novel end-to-end (E2E) transformer-based SAR ship detection framework, called Flow-Aligned Nested Transformer for SAR Small Ship Detection (FANT-Det). Specifically, in the feature extraction stage, we introduce a Nested Swin Transformer Block (NSTB). The NSTB employs a two-level local self-attention mechanism to enhance fine-grained target representation, thereby enriching features of small ships. For multi-scale feature fusion, we design a Flow-Aligned Depthwise Efficient Channel Attention Network (FADEN). FADEN achieves precise alignment of features across different resolutions via semantic flow and filters background clutter through lightweight channel attention, further enhancing small-target feature quality. Moreover, we propose an Adaptive Multi-scale Contrastive Denoising (AM-CDN) training paradigm. AM-CDN constructs adaptive perturbation thresholds jointly determined by a target scale factor and a clutter factor, generating contrastive denoising samples that better match the physical characteristics of SAR ships. Finally, extensive experiments on three widely used open SAR ship datasets demonstrate that the proposed method achieves superior detection performance, outperforming current state-of-the-art (SOTA) benchmarks. Full article
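FADEN's flow-based alignment can be sketched in the style of semantic-flow networks such as SFNet (this reconstruction, including the flow scaling, is our assumption):

```python
# Predict a 2-channel semantic flow and warp the upsampled coarse feature
# onto the fine feature's grid before fusion.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FlowAlign(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.flow = nn.Conv2d(channels * 2, 2, kernel_size=3, padding=1)

    def forward(self, fine, coarse):
        # fine: (B, C, H, W); coarse: (B, C, H/2, W/2)
        b, _, h, w = fine.shape
        up = F.interpolate(coarse, size=(h, w), mode="bilinear", align_corners=False)
        flow = self.flow(torch.cat([fine, up], dim=1))           # (B, 2, H, W)
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, h),
                                torch.linspace(-1, 1, w), indexing="ij")
        grid = torch.stack((xs, ys), dim=-1).to(fine).expand(b, h, w, 2)
        grid = grid + flow.permute(0, 2, 3, 1) / torch.tensor([w, h]).to(fine)
        return fine + F.grid_sample(up, grid, align_corners=False)

out = FlowAlign(16)(torch.randn(1, 16, 32, 32), torch.randn(1, 16, 16, 16))
print(out.shape)                                     # torch.Size([1, 16, 32, 32])
```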

19 pages, 8850 KB  
Article
Intelligent Defect Recognition of Glazed Components in Ancient Buildings Based on Binocular Vision
by Youshan Zhao, Xiaolan Zhang, Ming Guo, Haoyu Han, Jiayi Wang, Yaofeng Wang, Xiaoxu Li and Ming Huang
Buildings 2025, 15(20), 3641; https://doi.org/10.3390/buildings15203641 - 10 Oct 2025
Viewed by 206
Abstract
Glazed components in ancient Chinese architecture hold profound historical and cultural value. However, over time, environmental erosion, physical impacts, and human disturbances gradually lead to various forms of damage, severely impacting the durability and stability of the buildings. Preventive protection of glazed components is therefore crucial; its key lies in the early detection and repair of damage, which extends a component's service life and prevents significant structural damage. To address this challenge, this study proposes a Restoration-Scale Identification (RSI) method that integrates depth information. By combining RGB-D images acquired from a depth camera with intrinsic camera parameters, and embedding a Convolutional Block Attention Module (CBAM) into the backbone network, the method dynamically enhances critical feature regions. It then employs a scale restoration strategy to accurately identify damage areas and recover the physical dimensions of glazed components from a global perspective. In addition, we constructed a dedicated semantic segmentation dataset for glazed tile damage, focusing on cracks and spalling. Both qualitative and quantitative evaluation results demonstrate that, compared with various high-performance semantic segmentation methods, our approach significantly improves the accuracy and robustness of damage detection in glazed components. The achieved accuracy deviates by only ±10 mm from high-precision laser scanning, a level of precision essential for reliably identifying and assessing subtle damage in complex glazed architectural elements. By integrating depth information, real-scale information can be recovered during intelligent recognition, allowing the damage type and physical size of glazed components to be identified efficiently and accurately, and enabling the conversion from two-dimensional (2D) pixel coordinates to local three-dimensional (3D) coordinates. This provides a scientific basis for the protection and restoration of ancient buildings and supports the long-term stability of cultural heritage and the transmission of its historical value. Full article
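The scale-restoration step rests on standard pinhole back-projection, which converts a pixel with metric depth into camera-frame 3D coordinates (the formula is standard; the intrinsics below are placeholders):

```python
# Back-project a pixel (u, v) with depth Z to camera coordinates:
#   X = (u - cx) * Z / fx,  Y = (v - cy) * Z / fy
import numpy as np

def pixel_to_camera_xyz(u, v, depth_m, fx, fy, cx, cy):
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return np.array([x, y, depth_m])

# toy intrinsics for a 640x480 RGB-D sensor
p = pixel_to_camera_xyz(u=400, v=260, depth_m=2.5, fx=600, fy=600, cx=320, cy=240)
print(p)    # [0.333 0.083 2.5] metres
```

The physical size of a detected crack then follows from the distance between two back-projected boundary points.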
(This article belongs to the Section Building Materials, and Repair & Renovation)

16 pages, 4268 KB  
Article
Research on the Detection Method of Flight Trainees’ Attention State Based on Multi-Modal Dynamic Depth Network
by Gongpu Wu, Changyuan Wang, Zehui Chen and Guangyi Jiang
Multimodal Technol. Interact. 2025, 9(10), 105; https://doi.org/10.3390/mti9100105 - 10 Oct 2025
Viewed by 339
Abstract
In aviation safety, pilots must efficiently process dynamic visual information and maintain a high level of attention. Any missed judgment of critical information or delay in decision-making may lead to mission failure or catastrophic consequences. Therefore, accurately detecting pilots' attention states is a primary prerequisite for improving flight safety and performance. To better detect the attention state of pilots, this paper takes flight trainees as the research object and a simulated flight environment as the experimental setting. It proposes a method for detecting the attention state of flight trainees based on a multi-modal dynamic depth network (M3D-Net). The M3D-Net architecture is a lightweight neural network that integrates temporal image features, visual information features, and flight operation data features. It aligns image and text features through an attention mechanism to enhance the semantic association between modalities, and it uses the Depth-wise Separable Convolution and LSTM (DSC-LSTM) module to model temporal information, dynamically capturing the contextual dependencies within the sequence and achieving six-level attention-state classification. Ablation experiments compare the classification performance of the model's components, and standard evaluation metrics confirm the effectiveness of the proposed method. Experiments show that the proposed architecture reaches a classification accuracy of 97.56% with a model size of 18.6 M. Compared with traditional algorithms, the M3D-Net architecture offers better prospects for practical application. Full article
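A DSC-LSTM-style temporal module could be composed roughly as below (our assumed composition; the six-class head follows the abstract, everything else is illustrative):

```python
# Depthwise-separable conv per frame, LSTM across the sequence, 6-way head.
import torch
import torch.nn as nn

class DSCLSTM(nn.Module):
    def __init__(self, in_ch=3, feat=64, hidden=128, n_classes=6):
        super().__init__()
        self.dsc = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, stride=2, padding=1, groups=in_ch),  # depthwise
            nn.Conv2d(in_ch, feat, 1),                                      # pointwise
            nn.ReLU(), nn.AdaptiveAvgPool2d(1))
        self.lstm = nn.LSTM(feat, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)       # six attention states

    def forward(self, clip):                           # clip: (B, T, C, H, W)
        b, t = clip.shape[:2]
        f = self.dsc(clip.flatten(0, 1)).flatten(1)    # per-frame features (B*T, feat)
        out, _ = self.lstm(f.view(b, t, -1))           # temporal context
        return self.head(out[:, -1])                   # classify from the last step

print(DSCLSTM()(torch.randn(2, 16, 3, 64, 64)).shape)  # torch.Size([2, 6])
```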

21 pages, 3712 KB  
Article
CISC-YOLO: A Lightweight Network for Micron-Level Defect Detection on Wafers via Efficient Cross-Scale Feature Fusion
by Yulun Chi, Xingyu Gong, Bing Zhao and Lei Yao
Electronics 2025, 14(19), 3960; https://doi.org/10.3390/electronics14193960 - 9 Oct 2025
Viewed by 466
Abstract
With the development of the semiconductor manufacturing process towards miniaturization and high integration, the detection of microscopic defects on wafer surfaces faces the challenge of balancing precision and efficiency. Therefore, this study proposes a lightweight inspection model based on the YOLOv8 framework, aiming to achieve an optimal balance between inspection accuracy, model complexity, and inference speed. First, we design a novel lightweight module called IRB-GhostConv-C2f (IGC) to replace the C2f module in the backbone, significantly reducing redundant feature computation. Second, a CNN-based cross-scale feature fusion neck network, the CCFF-ISC neck, is proposed to reduce the redundant computation of low-level features and enhance the expression of multi-scale semantic information. Meanwhile, the novel IRB-SCSA-C2f (ISC) module replaces the C2f in the neck to further improve the efficiency of feature fusion. In addition, a novel dynamic head network, DyHeadv3, is integrated into the head structure, aiming to improve small-scale target detection performance by dynamically adjusting the feature interaction mechanism. Finally, to comprehensively assess the proposed algorithm's performance, an industrial wafer-defect dataset, WSDD, is constructed, covering "broken edges", "scratches", "oil pollution", and "minor defects". The experimental results demonstrate that the CISC-YOLO model attains an mAP50 of 93.7% with the parameter count reduced to 1.92 M, outperforming other mainstream algorithms in the field. The proposed approach provides a high-precision, low-latency real-time defect detection solution for semiconductor industry scenarios. Full article
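The GhostConv idea inside the IGC module can be sketched in GhostNet style (this reconstruction is ours, not the paper's exact module): half of the output channels come from a regular convolution, the rest from cheap depthwise "ghost" features.

```python
# GhostNet-style convolution: primary features + cheap depthwise ghosts.
import torch
import torch.nn as nn

class GhostConv(nn.Module):
    def __init__(self, in_ch, out_ch, k=1):
        super().__init__()
        half = out_ch // 2
        self.primary = nn.Conv2d(in_ch, half, k, padding=k // 2)
        self.cheap = nn.Conv2d(half, half, 5, padding=2, groups=half)  # depthwise

    def forward(self, x):
        p = self.primary(x)
        return torch.cat([p, self.cheap(p)], dim=1)    # out_ch = 2 * half

print(GhostConv(32, 64)(torch.randn(1, 32, 40, 40)).shape)   # (1, 64, 40, 40)
```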

22 pages, 6212 KB  
Article
VLA-MP: A Vision-Language-Action Framework for Multimodal Perception and Physics-Constrained Action Generation in Autonomous Driving
by Maoning Ge, Kento Ohtani, Yingjie Niu, Yuxiao Zhang and Kazuya Takeda
Sensors 2025, 25(19), 6163; https://doi.org/10.3390/s25196163 - 5 Oct 2025
Viewed by 1754
Abstract
Autonomous driving in complex real-world environments requires robust perception, reasoning, and physically feasible planning, which remain challenging for current end-to-end approaches. This paper introduces VLA-MP, a unified vision-language-action framework that integrates multimodal Bird's-Eye View (BEV) perception, vision-language alignment, and a GRU-bicycle dynamics cascade adapter for physics-informed action generation. The system constructs structured environmental representations from RGB images and LiDAR, aligns scene features with natural language instructions through a cross-modal projector and large language model, and converts high-level semantic hidden-state outputs into executable, physically consistent trajectories. Experiments on the LMDrive dataset and CARLA simulator demonstrate that VLA-MP achieves high performance across the LangAuto benchmark series, with best driving scores of 44.3, 63.5, and 78.4 on LangAuto, LangAuto-Short, and LangAuto-Tiny, respectively, while maintaining high infraction scores of 0.89–0.95, outperforming recent VLA methods such as LMDrive and AD-H. Visualization and video results further validate the framework's ability to follow complex language-conditioned instructions, adapt to dynamic environments, and prioritize safety. These findings highlight the potential of combining multimodal perception, language reasoning, and physics-aware adapters for robust and interpretable autonomous driving. Full article
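The physics constraint comes from the standard kinematic bicycle model: whatever controls the network emits, integrating them through the model yields a dynamically feasible trajectory (the model is textbook; the wheelbase, time step, and initial speed below are our placeholders):

```python
# Roll out (acceleration, steering) controls through a kinematic bicycle model.
import numpy as np

def rollout_bicycle(controls, wheelbase=2.7, dt=0.1):
    """controls: (T, 2) array of (acceleration m/s^2, steering angle rad)."""
    x = y = yaw = 0.0
    v = 5.0                                    # initial speed, m/s
    traj = []
    for acc, steer in controls:
        x += v * np.cos(yaw) * dt
        y += v * np.sin(yaw) * dt
        yaw += v / wheelbase * np.tan(steer) * dt
        v = max(0.0, v + acc * dt)             # no reversing in this sketch
        traj.append((x, y, yaw, v))
    return np.array(traj)

traj = rollout_bicycle(np.tile([0.5, 0.05], (20, 1)))   # gentle accelerating left turn
print(traj[-1])                                          # final (x, y, yaw, v) after 2 s
```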
(This article belongs to the Special Issue Large AI Models for Positioning and Perception in Autonomous Driving)