MDPI - Publisher of Open Access Journals

27 pages, 66966 KB

Open Access Article

Physics-Driven Deep Feature Fusion: A Lightweight CSAKansformer Architecture for Tool Wear Diagnosis in P25 Turning

by Shuqiang Wang, Tianyue Zhang, Ximin Liu, Wei Liu, Huanqi Zhang and Feng Chang

Sensors 2026, 26(10), 2937; https://doi.org/10.3390/s26102937 - 7 May 2026

Viewed by 663

Accurate tool wear identification is essential for ensuring the continuity of intelligent machining and workpiece quality. To address the challenges of multi-source fusion inefficiency and inadequate feature extraction, this study proposes a novel identification architecture combining physics-guided multi-channel Gramian angular field (PG-MGAF) with a minimalist 14-layer CSA-Kansformer network. Multi-source signals are preprocessed via PG-MGAF to convert 1D time-series into 2D RGB images, effectively characterizing spatial coupling and interactive energy across three channels. Subsequently, the minimalist network maps these composite features to tool states, significantly reducing computational overhead. Experimental results demonstrate that the proposed model achieves an average accuracy of 93.6% with a single-step inference latency of only 5.90 ms, significantly outperforming mainstream methods such as MobileNet-V2 and ConvNeXt. This architecture provides a high-efficiency, low-latency solution for real-time tool condition monitoring under complex industrial conditions. Full article
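As an illustration of the Gramian angular field step described in this abstract, the following minimal sketch encodes three 1D signals as the three channels of an image using the standard Gramian angular summation field. It is not the authors' PG-MGAF: the abstract does not specify the physics-guided channel assignment, and the function names `gasf` and `three_channel_gaf` and the sample signals are hypothetical.

```python
import math

def gasf(series):
    """Standard Gramian angular summation field of a 1D series."""
    lo, hi = min(series), max(series)
    span = (hi - lo) or 1.0
    # Rescale to [-1, 1] so that arccos is defined for every sample.
    scaled = [max(-1.0, min(1.0, 2.0 * (x - lo) / span - 1.0)) for x in series]
    phi = [math.acos(x) for x in scaled]
    # G[i][j] = cos(phi_i + phi_j)
    return [[math.cos(a + b) for b in phi] for a in phi]

def three_channel_gaf(sig_r, sig_g, sig_b):
    """Stack one GASF per signal into an N x N x 3 nested list."""
    fields = [gasf(s) for s in (sig_r, sig_g, sig_b)]
    n = len(sig_r)
    return [[[fields[c][i][j] for c in range(3)] for j in range(n)]
            for i in range(n)]

# Hypothetical stand-ins for three equally long sensor channels.
force = [0.0, 0.5, 1.0, 0.5]
vibration = [1.0, 0.0, -1.0, 0.0]
current = [0.2, 0.4, 0.6, 0.8]
image = three_channel_gaf(force, vibration, current)
```

Each channel is symmetric by construction, so a downstream network sees pairwise temporal correlations of one signal per colour plane.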

(This article belongs to the Section Industrial Sensors)

47 pages, 5495 KB

Open Access Article

STAC: A Spatio-Temporal Transformer with Adaptive Context for Video Compression

by Reka Sandaruwan Gallena Watthage and Anil Fernando

Appl. Sci. 2026, 16(9), 4568; https://doi.org/10.3390/app16094568 - 6 May 2026

Viewed by 348

Abstract

The rapid growth of video content calls for more effective compression than traditional codecs provide. Although neural video compression has advanced considerably, current methods struggle to model long-range temporal dependencies and to adapt to differing content properties. We introduce STAC (Spatio-Temporal Adaptive Context), a transformer-based neural video compression scheme that addresses these limitations and makes three contributions. First, the Adaptive Context Selector (ACS) dynamically evaluates and selects the most informative reference frames using learned relevance scoring, in place of the traditional predetermined sets of adjacent frames. Second, Enhanced Sliding Window Attention (ESWA) models spatio-temporal correlations efficiently by integrating learnable local bias and temporal gating information into a computationally adjustable attention model. Third, a dual-path entropy model uses an adaptively learned fusion gate to combine channel-wise autoregressive prediction with spatio-temporal prediction, producing better probability estimates for entropy coding. STAC is trained on the Vimeo-90k dataset using a four-phase curriculum with the Adam optimiser over approximately 2.2 M total steps. We tested STAC on six benchmarks (UVG, MCL-JCV, and HEVC Classes B, C, D, and E) at varying test settings. The experimental findings show that STAC achieves an average BD-rate saving of 32.20% in the YUV colourspace with an intra-period of −1. The consistent improvement across both PSNR and MS-SSIM metrics confirms that STAC’s coding gains arise from genuinely improved probability modelling rather than metric-specific optimisation.
Evaluations were performed on six standard benchmarks (UVG, MCL-JCV, and HEVC Classes B, C, D, and E) under 24 experimental configurations (six datasets × 2 colourspaces × 2 intra-period settings), with all methods tested under identical conditions using the same sequences, frames (96 per sequence), and VTM-17.0 anchor codec. STAC achieves 32.20% average BD-rate savings over VTM under YUV IP = −1, outperforming the prior state-of-the-art DCMVC by 2.70 percentage points. Under IP = 32, STAC achieves −27.01%, with only 5.19 pp degradation versus 6.42 pp for DCMVC. The results generalise to the RGB colourspace (−31.23%) and scale from 240p (−35.19%) to 4K (−36.35%).
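The figures quoted in this abstract can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers stated above; the two DCMVC values are derived from the stated margins and are not reported in the abstract itself, and all variable names are hypothetical.

```python
# Experimental grid stated in the abstract.
datasets, colourspaces, intra_periods = 6, 2, 2
configurations = datasets * colourspaces * intra_periods

# Average BD-rate savings over VTM in percent (YUV colourspace).
stac_ip_minus1 = 32.20   # IP = -1
stac_ip_32 = 27.01       # IP = 32
stac_degradation_pp = round(stac_ip_minus1 - stac_ip_32, 2)

# Derived, not reported: DCMVC savings implied by the stated margins.
dcmvc_ip_minus1 = round(stac_ip_minus1 - 2.70, 2)   # 2.70 pp behind STAC
dcmvc_ip_32 = round(dcmvc_ip_minus1 - 6.42, 2)      # 6.42 pp degradation
```

The product of the grid sizes matches the 24 configurations, and the difference between the two STAC figures matches the quoted 5.19 pp degradation.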

(This article belongs to the Section Computing and Artificial Intelligence)

22 pages, 3386 KB

Open Access Article

UAV Visual Localization via Multimodal Fusion and Multi-Scale Attention Enhancement

by Yiheng Wang, Yushuai Zhang, Zhenyu Wang, Jianxin Guo, Feng Wang, Rui Zhu and Dejing Lin

Sustainability 2026, 18(9), 4277; https://doi.org/10.3390/su18094277 - 25 Apr 2026

Viewed by 1081

Abstract

For power-grid applications such as transmission corridor inspection, substation asset inspection, and post-disaster emergency repair, reliable UAV self-localization under GNSS-degraded or GNSS-denied conditions is critical to ensuring operational safety and accurate defect geotagging. Due to substantial discrepancies in viewpoint, scale, and geometric structure between oblique UAV images and nadir satellite images, conventional RGB-based cross-view retrieval methods often suffer from unstable alignment and insufficient geometric modeling, particularly in scenarios with repetitive textures and partial overlap. To address these challenges, we propose a cross-view visual geo-localization model that integrates RGBD multimodal inputs with multi-scale attention enhancement. Specifically, MiDaS is used to estimate relative depth from UAV imagery, which is concatenated with RGB to form a four-channel input, while satellite images are padded with an additional zero channel to maintain dimensional consistency. A shared-weight ViTAdapter is adopted to learn joint semantic–geometric representations, and a lightweight Efficient Multi-scale Attention (EMA) module is adopted on spatial feature maps to strengthen multi-scale spatial consistency. In addition, an IoU-weighted InfoNCE loss is employed to accommodate partial matching during training, thereby improving the robustness of feature alignment. Experiments on the GTA-UAV dataset under the cross-area protocol show stable performance across both retrieval and localization metrics. Specifically, Recall@1, Recall@5, and Recall@10 reach 18.12%, 38.83%, and 49.47%, respectively; AP is 28.01 and SDM@3 is 0.53; meanwhile, the top-1 geodesic distance error Dis@1 is 1052.73 m. These results indicate that explicit geometric priors combined with multi-scale spatial enhancement can effectively improve cross-view feature alignment, leading to enhanced robustness and accuracy for localization in challenging power inspection scenarios. 
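The four-channel input construction described in this abstract can be sketched as follows. This is a minimal illustration on nested lists, assuming the relative depth map has already been estimated (the abstract names MiDaS for that step, which is not called here); the function names and pixel values are hypothetical.

```python
def make_uav_input(rgb, depth):
    """Concatenate an H x W x 3 RGB image with an H x W relative-depth map."""
    return [[pixel + [d] for pixel, d in zip(rgb_row, depth_row)]
            for rgb_row, depth_row in zip(rgb, depth)]

def make_satellite_input(rgb):
    """Pad an H x W x 3 satellite image with an all-zero fourth channel."""
    return [[pixel + [0.0] for pixel in row] for row in rgb]

# Hypothetical 2 x 2 images with values in [0, 1].
uav_rgb = [[[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
           [[0.7, 0.8, 0.9], [1.0, 0.0, 0.5]]]
relative_depth = [[0.25, 0.5], [0.75, 1.0]]   # placeholder for a depth estimate
sat_rgb = [[[0.3, 0.3, 0.3], [0.6, 0.6, 0.6]],
           [[0.9, 0.9, 0.9], [0.2, 0.2, 0.2]]]

uav_input = make_uav_input(uav_rgb, relative_depth)
sat_input = make_satellite_input(sat_rgb)
```

Both inputs end up with the same channel count, which is what allows a single shared-weight backbone to process the two views.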

(This article belongs to the Special Issue Planning, Operation, and Energy Efficiency of Sustainable Electric Power Systems)

21 pages, 3375 KB

Open Access Article

Deep6DHead: A 6D Head Pose Estimation Method Based on Deep Feature Enhancement

by Fake Jiang, Shucheng Huang and Mingxing Li

Symmetry 2026, 18(5), 705; https://doi.org/10.3390/sym18050705 - 22 Apr 2026

Cited by 1 | Viewed by 267

Abstract

To address the bottlenecks of accuracy in head pose estimation caused by occlusion and rotational representation ambiguities, we propose Deep6DHead, a 6-degree-of-freedom (6DoF) head pose estimation method based on deep feature enhancement. This method innovatively integrates RGB and depth information to construct a four-channel input and achieves feature fusion of RGB-D through a dual-branch network. First, a Squeeze-and-Excitation (SE) module adaptively weights the depth geometric features of key anatomical regions to achieve channel recalibration. Second, based on the 6DoF rotation representation framework, we introduce an anatomical constraint loss using the nasal bridge normal. This constraint corrects rotation deviations caused by noise by enforcing consistency in local geometric orientation. Finally, the model outputs the rotation matrix end-to-end for final pose estimation. Experiments on the 300W-LP, BIWI, and AFLW2000 datasets demonstrate that our method significantly improves robustness and accuracy, particularly under extreme head poses. Notably, it achieves state-of-the-art performance on the roll axis (lowest error: 2.05) and a competitive overall MAE of 3.45, providing an effective solution for head pose estimation in complex real-world scenarios including extreme viewing angles. Full article
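To make the end-to-end rotation-matrix output concrete, the sketch below maps a six-number rotation encoding to a 3 × 3 rotation matrix by Gram-Schmidt orthogonalisation. It assumes the rotation representation is the commonly used continuous 6D one (two 3D vectors), which the abstract does not state explicitly; the helper names are hypothetical and this is not the authors' network code.

```python
import math

def _normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def rotation_from_6d(a1, a2):
    """Build a rotation matrix from two non-parallel 3D vectors."""
    b1 = _normalize(a1)
    # Remove the component of a2 along b1, then normalise.
    proj = _dot(b1, a2)
    b2 = _normalize([y - proj * x for x, y in zip(b1, a2)])
    # The third axis completes a right-handed orthonormal frame.
    b3 = _cross(b1, b2)
    # Columns of the matrix are b1, b2, b3.
    return [[b1[i], b2[i], b3[i]] for i in range(3)]

R = rotation_from_6d([2.0, 0.0, 0.0], [1.0, 3.0, 0.0])
```

Because any pair of non-parallel vectors yields a valid rotation, a network can regress the six numbers without constraints, which avoids the discontinuities of Euler angles.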

(This article belongs to the Section Computer)

28 pages, 5786 KB

Open Access Article

Multi-Wavelet Fusion Transformer with Token-to-Spectrum Traceback for Physically Interpretable Bearing Fault Diagnosis

by Hongzhi Fan, Chao Zhang, Mingyu Sun, Kexi Xu, Wenyang Zhang and Ximing Zhang

Vibration 2026, 9(2), 28; https://doi.org/10.3390/vibration9020028 - 15 Apr 2026

Viewed by 444

Abstract

Rolling bearing fault diagnosis under complex and noisy operating conditions requires not only high diagnostic accuracy but also interpretability that can be quantitatively verified against physically meaningful excitation structures. However, many existing deep learning approaches rely on a single time–frequency (TF) representation and provide limited, non-verifiable links between model decisions and the original vibration patterns. To address this issue, we propose MBT-XAI, a multi-wavelet TF fusion network with a Token-to-Spectrum Traceback (TST) mechanism for structure-preserving, physics-consistent interpretability. Three complementary wavelets, namely Morlet, Mexican Hat, and Complex Morlet, are used to construct multi-view TF representations, which are encoded into RGB channels and adaptively fused via cross-channel attention within a Transformer backbone. TST maps patch-token attributions back to the TF domain, enabling quantitative evaluation of physics consistency through overlap-based metrics. Experiments on the public CWRU dataset and an industrial IMUST dataset show that MBT-XAI achieves 98.13 ± 0.24% and 96.23 ± 0.31% accuracy at SNR = 0 dB, outperforming the strongest baseline by 2.83% and 2.43%, respectively. Under AWGN contamination, MBT-XAI maintains 95.44 ± 0.38%/93.45 ± 0.47% accuracy on CWRU and 95.80 ± 0.33%/92.91 ± 0.51% accuracy on IMUST at SNR = −2/−4 dB. Under colored-noise contamination, the proposed method also preserves robust performance under pink and brown noise at the same SNR levels. Quantitative interpretability evaluation further indicates high alignment between salient frequency regions and theoretical fault-characteristic bands, with IoU = 80.21 ± 0.86% and Coverage = 91.70 ± 0.63%. In addition, MBT-XAI requires 10.393 M parameters and 10.678 GFLOPs, with an inference latency of 14.7 ms per sample (batch size = 1) on an NVIDIA GeForce RTX 3060 GPU. 
These results suggest that multi-wavelet TF modeling with attention-based fusion and TF-level traceback provides an accurate, robust, and physics-consistent framework for intelligent bearing fault diagnosis.
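The overlap-based interpretability metrics mentioned in this abstract can be illustrated with a small sketch over sets of frequency-bin indices. The definitions used here are assumptions, since the abstract does not give formulas: IoU is intersection over union, and coverage is the fraction of the theoretical fault band that the salient region overlaps. The function name and bin ranges are hypothetical.

```python
def iou_and_coverage(salient_bins, fault_bins):
    """Overlap between salient bins and theoretical fault-band bins."""
    salient, fault = set(salient_bins), set(fault_bins)
    inter = len(salient & fault)
    union = len(salient | fault)
    iou = inter / union if union else 0.0
    # Assumed definition: share of the fault band covered by saliency.
    coverage = inter / len(fault) if fault else 0.0
    return iou, coverage

# Hypothetical example: saliency spans bins 10-19, fault band spans 12-21.
iou, coverage = iou_and_coverage(range(10, 20), range(12, 22))
```

Under these definitions coverage is never lower than IoU, which is consistent with the ordering of the two values reported above.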

17 pages, 1167 KB

Open Access Article

HOIMamba: Bidirectional State-Space Modeling for Monocular 3D Human–Object Interaction Reconstruction

by Jinsong Zhang and Yuqin Lin

Biomimetics 2026, 11(3), 214; https://doi.org/10.3390/biomimetics11030214 - 17 Mar 2026

Viewed by 700

Abstract

Monocular 3D human–object interaction (HOI) reconstruction requires jointly recovering articulated human geometry, object pose, and physically plausible contact from a single RGB image. While recent token-based methods commonly employ dense self-attention to capture global dependencies, isotropic all-to-all mixing tends to entangle spatial-geometric cues (e.g., contact locality) with channel-wise semantic cues (e.g., action/affordance), and provides limited control for representing directional and asymmetric physical influence between humans and objects. This paper presents HOIMamba, a state-space sequence modeling framework that reformulates HOI reconstruction as bidirectional, multi-scale interaction state inference. Instead of relying on symmetric correlation aggregation, HOIMamba uses structured state evolution to propagate interaction evidence. We introduce a multi-scale state-space module (MSSM) to capture interaction dependencies spanning local contact details and global body–object coordination. Building on MSSM, we propose a spatial-channel grouped SSM (SCSSM) block that factorizes interaction modeling into a spatial pathway for geometric/contact dependencies and a channel pathway for semantic/functional correlations, followed by gated fusion. HOIMamba further performs explicit bidirectional propagation between human and object states to better reflect asymmetric reciprocity in physical interactions. We evaluate HOIMamba on two public benchmarks, BEHAVE and InterCap, using Chamfer distance for human/object meshes and contact precision/recall induced by reconstructed geometry. HOIMamba achieves consistent improvements over representative prior methods. On the BEHAVE dataset, it reduces human Chamfer distance by 8.6% and improves contact recall by 13.5% compared to the strongest Transformer-based baseline, with similar gains observed on the InterCap dataset. 
Ablation studies on BEHAVE verify the contributions of state-space modeling, multi-scale inference, spatial-channel factorization, and bidirectional interaction reasoning.
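For readers unfamiliar with the Chamfer distance used to score the reconstructed meshes, here is a minimal brute-force sketch on small point sets. Several conventions exist (squared versus unsquared distances, sum versus mean of the two directions); this version averages unsquared nearest-neighbour distances in both directions, which is an assumption and may differ from the benchmark protocol. The function name is hypothetical.

```python
import math

def chamfer_distance(points_a, points_b):
    """Symmetric mean nearest-neighbour distance between two point sets."""
    def one_way(src, dst):
        # For every source point, distance to its closest destination point.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)
    return 0.5 * (one_way(points_a, points_b) + one_way(points_b, points_a))

# Hypothetical vertices sampled from a predicted and a ground-truth surface.
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
distance = chamfer_distance(predicted, reference)
```

The brute-force search is quadratic in the number of points, so practical evaluations on full meshes typically use a spatial index instead.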

(This article belongs to the Special Issue Human–Robot Interaction and Collaboration: Advances in Sensing, Control, and Learning)

18 pages, 2199 KB

Open Access Article

Brain-Oct-Pvt: A Physics-Guided Transformer with Radial Prior and Deformable Alignment for Neurovascular Segmentation

by Quan Lan, Jianuo Huang, Chenxi Huang, Songyuan Song, Yuhao Shi, Zijun Zhao, Wenwen Wu, Hongbin Chen and Nan Liu

Bioengineering 2026, 13(3), 332; https://doi.org/10.3390/bioengineering13030332 - 13 Mar 2026

Viewed by 603

Abstract

The primary objective of this study is to develop a deep learning framework adapted to the physical characteristics of neurovascular Optical Coherence Tomography (OCT) imaging. Although Polyp-PVT, originally designed for polyp segmentation, shows promise for OCT analysis, it faces limitations in neurovascular applications. The default RGB input wastes resources on duplicated grayscale data, while its fixed-scale fusion struggles with vascular curvature variations. Furthermore, the attention mechanism fails to capture radial vessel patterns, and geometric constraints limit thin boundary detection. To address these challenges, we propose Brain-OCT-PVT with key innovations: a single-channel input stem reducing parameters by two-thirds; a Radial Intensity Module (RIM) using polar transforms and angular convolution to model annular structures; and a Deformable Cross-scale Fusion Module (D-CFM) with learnable offsets. The Boundary-aware Attention Module (BAM) combines Laplace edge detection with Swin-Transformer for sub-pixel consistency. A specialized loss function combines Dice Similarity Coefficient (Dice), BoundaryIoU on 2-pixel dilated edges, and Focal Tversky to handle extreme class imbalance. Evaluation on 13 clinical cases achieves a Dice score of 95.06% and a 95% Hausdorff Distance (HD95) of 0.269 mm, demonstrating superior performance compared to existing approaches.
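Two of the loss components named in this abstract, Dice and Focal Tversky, can be sketched on binary masks as follows. The weights `alpha`, `beta`, and `gamma` are hypothetical defaults chosen for illustration, the exponent convention for Focal Tversky varies between publications, and the BoundaryIoU term and the way the authors combine the terms are not reproduced here.

```python
def confusion_counts(pred, target):
    """True positives, false positives, false negatives for flat binary masks."""
    tp = sum(1 for p, t in zip(pred, target) if p and t)
    fp = sum(1 for p, t in zip(pred, target) if p and not t)
    fn = sum(1 for p, t in zip(pred, target) if not p and t)
    return tp, fp, fn

def dice_score(pred, target):
    tp, fp, fn = confusion_counts(pred, target)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0

def focal_tversky_loss(pred, target, alpha=0.7, beta=0.3, gamma=0.75):
    """(1 - Tversky index) ** gamma; alpha weights misses, beta false alarms."""
    tp, fp, fn = confusion_counts(pred, target)
    denom = tp + alpha * fn + beta * fp
    tversky = tp / denom if denom else 1.0
    return (1.0 - tversky) ** gamma

# Hypothetical flattened masks (1 = vessel pixel).
pred_mask = [1, 1, 0, 0, 1, 0]
true_mask = [1, 0, 0, 0, 1, 1]
dice = dice_score(pred_mask, true_mask)
loss = focal_tversky_loss(pred_mask, true_mask)
```

Setting `alpha` above `beta` penalises missed foreground pixels more than false alarms, which is the usual motivation for Tversky-style losses under heavy class imbalance.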

(This article belongs to the Special Issue AI-Driven Imaging and Analysis for Biomedical Applications)

15 pages, 551 KB

Open Access Article

Query-Side Adversarial Attacks on Event-Based Person Re-Identification: A First-Order Robustness Analysis

by Jung Heum Woo and Eun-Kyu Lee

Appl. Sci. 2026, 16(5), 2430; https://doi.org/10.3390/app16052430 - 3 Mar 2026

Viewed by 384

Abstract

Event-based person re-identification (Re-ID) has recently emerged as a privacy-friendly alternative to conventional RGB-based surveillance. However, the security and adversarial robustness of these systems remain largely understudied. This paper presents a systematic investigation into the vulnerabilities of event-based person Re-ID models operating on 5-channel event voxels. We evaluate the impact of a one-step FGSM attack on query-side event voxel inputs and measure the resulting retrieval performance. Our experiments demonstrate a significant susceptibility: under subtle perturbations, the Top-1 accuracy drops drastically from 0.462 to 0.154. Critically, these adversarial inputs maintain high perceptual similarity to the original data, with an average SSIM of approximately 0.99 and an average PSNR of 45 dB, rendering the modifications nearly imperceptible. These findings suggest that the sparse and asynchronous nature of event-based person Re-ID, despite its potential privacy advantages, is highly susceptible to gradient-based exploits. This study highlights the need for robustness-aware design and defense mechanisms in event-based surveillance systems. Full article
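The one-step FGSM perturbation and the PSNR check described in this abstract can be sketched on a toy model. The example below uses a three-feature logistic scorer with an analytic input gradient in place of the Re-ID network, so the weights, inputs, and step size `epsilon` are all hypothetical and only the attack rule itself (input plus epsilon times the sign of the loss gradient) corresponds to the method named above.

```python
import math

def sign(v):
    return (v > 0) - (v < 0)

def fgsm_step(x, grad, epsilon):
    """One-step FGSM: move each input by epsilon along the gradient sign."""
    return [xi + epsilon * sign(gi) for xi, gi in zip(x, grad)]

def psnr(original, perturbed, peak=1.0):
    """Peak signal-to-noise ratio in dB for signals with the given peak value."""
    mse = sum((a - b) ** 2 for a, b in zip(original, perturbed)) / len(original)
    return float("inf") if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)

# Toy stand-in model: logistic loss -log(sigmoid(w . x)) for the true class.
w = [0.5, -1.0, 2.0]
x = [0.2, 0.4, 0.6]
score = sum(wi * xi for wi, xi in zip(w, x))
sigma = 1.0 / (1.0 + math.exp(-score))
# Analytic gradient of the loss with respect to the input.
grad = [-(1.0 - sigma) * wi for wi in w]

x_adv = fgsm_step(x, grad, epsilon=0.01)
adv_score = sum(wi * xi for wi, xi in zip(w, x_adv))
quality_db = psnr(x, x_adv)
```

Even in this toy setting the perturbation lowers the true-class score while keeping the PSNR high, which mirrors the trade-off the abstract reports between retrieval accuracy and perceptual similarity.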

(This article belongs to the Special Issue Advanced Cybersecurity Applications: Solutions to Counteract Cyber Threats)

21 pages, 1284 KB

Open Access Article

Probabilistic Indoor 3D Object Detection from RGB-D via Gaussian Distribution Estimation

by Hyeong-Geun Kim

Mathematics 2026, 14(3), 421; https://doi.org/10.3390/math14030421 - 26 Jan 2026

Viewed by 614

Abstract

Conventional object detectors represent each object by a deterministic bounding box, regressing its center and size from RGB images. However, such discrete parameterization ignores the inherent uncertainty in object appearance and geometric projection, which can be more naturally modeled as a probabilistic density field. Recent works have introduced Gaussian-based formulations that treat objects as distributions rather than boxes, yet they remain limited to 2D images or require late fusion between image and depth modalities. In this paper, we propose a unified Gaussian-based framework for direct 3D object detection from RGB-D inputs. Our method is built upon a vision transformer backbone to effectively capture global context. Instead of separately embedding RGB and depth features or refining depth within region proposals, our method takes a full four-channel RGB-D tensor and predicts the mean and covariance of a 3D Gaussian distribution for each object in a single forward pass. We extend a pretrained vision transformer to accept four-channel inputs by augmenting the patch embedding layer while preserving ImageNet-learned representations. This formulation allows the detector to represent both object location and geometric uncertainty in 3D space. By optimizing divergence metrics such as the Kullback–Leibler or Bhattacharyya distances between predicted and target distributions, the network learns a physically consistent probabilistic representation of objects. Experimental results on the SUN RGB-D benchmark demonstrate that our approach achieves competitive performance compared to state-of-the-art point-cloud-based methods while offering uncertainty-aware and geometrically interpretable 3D detections. Full article
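The Kullback–Leibler objective mentioned in this abstract has a closed form for Gaussians. The sketch below evaluates it for 3D Gaussians with diagonal covariance, a simplifying assumption made here for brevity; the abstract does not say whether the predicted covariances are diagonal or full, and the function name and example values are hypothetical.

```python
import math

def kl_diag_gaussians(mu_p, var_p, mu_q, var_q):
    """KL(P || Q) for Gaussians with diagonal covariance (per-axis variances)."""
    total = 0.0
    for mp, vp, mq, vq in zip(mu_p, var_p, mu_q, var_q):
        total += vp / vq + (mq - mp) ** 2 / vq - 1.0 + math.log(vq / vp)
    return 0.5 * total

# Hypothetical predicted and target object distributions (metres).
pred_mean, pred_var = [0.0, 0.0, 0.0], [1.0, 1.0, 1.0]
target_mean, target_var = [1.0, 0.0, 0.0], [1.0, 1.0, 1.0]
divergence = kl_diag_gaussians(pred_mean, pred_var, target_mean, target_var)
```

The divergence penalises both a misplaced centre and a mismatched spread, so a single scalar loss supervises object location and extent together.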

25 pages, 2891 KB

Open Access Article

Automated Measurement of Sheep Body Dimensions via Fusion of YOLOv12n-Seg-SSM and 3D Point Clouds

by Xiaona Zhao, Xifeng Liu, Zihao Gao, Xinran Liang, Yanjun Yuan, Yangfan Bai, Zhimin Zhang, Fuzhong Li and Wuping Zhang

Agriculture 2026, 16(2), 272; https://doi.org/10.3390/agriculture16020272 - 21 Jan 2026

Viewed by 692

Abstract

Accurate measurement of sheep body dimensions is fundamental for growth monitoring and breeding management. To address the limited segmentation accuracy and the trade-off between lightweight design and precision in existing non-contact measurement methods, this study proposes an improved model, YOLOv12n-Seg-SSM, for the automatic measurement of body height, body length, and chest circumference from side-view images of sheep. The model employs a synergistic strategy that combines semantic segmentation with 3D point cloud geometric fitting. It incorporates the SegLinearSimAM feature enhancement module, the SEAttention channel optimization module, and the ENMPDIoU loss function to improve measurement robustness under complex backgrounds and occlusions. After segmentation, valid RGB-D point clouds are generated through depth completion and point cloud filtering, enabling 3D computation of key body measurements. Experimental results demonstrate that the improved model outperforms the baseline YOLOv12n-Seg: the mAP@0.5 for segmentation reaches 94.20%, the mAP@0.5 for detection reaches 95.00% (improvements of 0.5 and 1.3 percentage points, respectively), and the recall increases to 99.00%. In validation tests on 43 Hu sheep, the R² values for chest circumference, body height, and body length were 0.925, 0.888 and 0.819, respectively, with measurement errors within 5%. The model requires only 10.71 MB of memory and 9.9 GFLOPs of computation, enabling real-time operation on edge devices. This study demonstrates that the proposed method achieves non-contact automatic measurement of sheep body dimensions, providing a practical solution for on-site growth monitoring and intelligent management in livestock farms. Full article
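The step from segmented RGB-D pixels to 3D body measurements can be illustrated with a standard pinhole back-projection. This is a minimal sketch and not the authors' geometric fitting procedure: the camera intrinsics, the keypoint pixel coordinates, and the function names are hypothetical, and body length is approximated here as the straight-line distance between two back-projected keypoints.

```python
import math

def back_project(u, v, depth_m, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) at a given depth into 3D."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)

def euclidean_length(p, q):
    return math.dist(p, q)

# Hypothetical intrinsics of a 640 x 480 depth camera.
fx, fy, cx, cy = 600.0, 600.0, 320.0, 240.0

# Hypothetical keypoints on the segmented side view, both 2.0 m away.
shoulder = back_project(200, 240, 2.0, fx, fy, cx, cy)
rump = back_project(440, 240, 2.0, fx, fy, cx, cy)
body_length_m = euclidean_length(shoulder, rump)
```

Measuring in 3D rather than in pixels removes the dependence on camera distance, which is why depth completion and point cloud filtering precede the measurement step.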

(This article belongs to the Special Issue Computer Vision Analysis Applied to Farm Animals)

27 pages, 4064 KB

Open Access Article

RDINet: A Deep Learning Model Integrating RGB-D and Ingredient Features for Food Nutrition Estimation

by Zhejun Kuang, Haobo Gao, Jiaxuan Yu, Dawen Sun, Jian Zhao and Lei Sun

Appl. Sci. 2026, 16(1), 454; https://doi.org/10.3390/app16010454 - 1 Jan 2026

Viewed by 919

Abstract

With growing public health awareness, accurate food nutrition estimation plays an increasingly important role in dietary management and disease prevention. The main bottleneck lies in how to effectively integrate multi-source heterogeneous information. We propose RDINet, a multimodal network that fuses RGB appearance, depth geometry, and ingredient semantics for food nutrition estimation. It comprises two core modules. The RGB-D fusion module integrates the textural appearance of RGB images and the 3D shape information conveyed by depth images through a channel–spatial attention mechanism, achieving a joint understanding of food appearance and geometric morphology without explicit 3D reconstruction. The ingredient fusion module embeds ingredient information into visual features via attention mechanisms, enabling the model to fully leverage components that are visually difficult to discern or prone to confusion, thereby activating corresponding nutritional reasoning pathways and achieving cross-modal inference from explicit observations to latent attributes. Experimental results on the Nutrition5k dataset show that RDINet achieves percentage mean absolute errors (PMAE) of 14.9%, 11.2%, 19.7%, 18.9%, and 19.5% for estimating calories, mass, fat, carbohydrates, and protein, respectively, with a mean PMAE of 16.8% across all metrics, outperforming existing mainstream methods. The results demonstrate that the appearance–geometry–semantics fusion framework is effective.
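The PMAE metric and the reported mean can be made concrete with a short sketch. The definition used here, mean absolute error divided by the mean ground-truth value, is an assumption based on common practice for this dataset rather than a formula given in the abstract; the sample predictions are hypothetical, while the five per-nutrient figures are taken from the abstract.

```python
def pmae(predictions, ground_truth):
    """Percentage MAE: mean absolute error relative to the mean true value."""
    mae = sum(abs(p - t) for p, t in zip(predictions, ground_truth)) / len(ground_truth)
    mean_truth = sum(ground_truth) / len(ground_truth)
    return 100.0 * mae / mean_truth

# Hypothetical calorie predictions for three dishes (kcal).
example_pmae = pmae([110.0, 190.0, 310.0], [100.0, 200.0, 300.0])

# Per-nutrient PMAE values quoted in the abstract (percent).
reported = {"calories": 14.9, "mass": 11.2, "fat": 19.7,
            "carbohydrates": 18.9, "protein": 19.5}
mean_pmae = round(sum(reported.values()) / len(reported), 1)
```

Averaging the five quoted values reproduces the 16.8% mean PMAE stated above.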

23 pages, 6289 KB

Open Access Article

Suitability of UAV-Based RGB and Multispectral Photogrammetry for Riverbed Topography in Hydrodynamic Modelling

by Vytautas Akstinas, Karolina Gurjazkaitė, Diana Meilutytė-Lukauskienė, Andrius Kriščiūnas, Dalia Čalnerytė and Rimantas Barauskas

Water 2026, 18(1), 38; https://doi.org/10.3390/w18010038 - 22 Dec 2025

Cited by 2 | Viewed by 879

Abstract

This study assesses the suitability of UAV aerial imagery-based photogrammetry for reconstructing underwater riverbed topography and its application in two-dimensional (2D) hydrodynamic modelling, with a particular focus on comparing RGB, multispectral, and fused RGB–multispectral imagery. Four Lithuanian rivers—Verknė, Šušvė, Jūra, and Mūša—were selected to represent a wide range of hydromorphological and hydraulic conditions, including variations in bed texture, vegetation cover, and channel complexity. High-resolution digital elevation models (DEMs) were generated from field-based surveys and UAV imagery processed using Structure-from-Motion photogrammetry. Two-dimensional hydrodynamic models were created and calibrated in HEC-RAS 6.5 using measurement-based DEMs and subsequently applied using photogrammetry-derived DEMs to isolate the influence of terrain input on model performance. The results showed that UAV-derived DEMs systematically overestimate riverbed elevation, particularly in deeper or vegetated sections, resulting in underestimated water depths. RGB imagery provided greater spatial detail but was more susceptible to local anomalies, whereas multispectral imagery produced smoother surfaces with a stronger positive elevation bias. The fusion of RGB and multispectral imagery consistently reduced spatial noise and improved hydrodynamic simulation performance across all river types. Despite moderate vertical deviations of 0.10–0.25 m, relative flow patterns and velocity distributions were reproduced with acceptable accuracy. The findings demonstrate that combined spectral UAV aerial imagery in photogrammetry is a robust and cost-effective alternative for hydrodynamic modelling in shallow lowland rivers, particularly where relative hydraulic characteristics are of primary interest. Full article

(This article belongs to the Section Hydraulics and Hydrodynamics)

20 pages, 5222 KB

Open Access Article

A Real-Time Tractor Recognition and Positioning Method in Fields Based on Machine Vision

by Liang Wang, Dashuang Zhou and Zhongxiang Zhu

Agriculture 2025, 15(24), 2548; https://doi.org/10.3390/agriculture15242548 - 9 Dec 2025

Viewed by 801

Abstract

Multi-machine collaborative navigation in agricultural machinery can significantly improve field operation efficiency. Most existing multi-machine collaborative navigation systems rely on satellite navigation systems, which are costly and cannot meet the obstacle avoidance needs of field operations. In this paper, a real-time tractor recognition and positioning method in fields based on machine vision is proposed. First, we collected tractor images, annotated them, and constructed a tractor dataset. Second, we implemented lightweight improvements to the YOLOv4 algorithm, incorporating sparse training, channel pruning, layer pruning, and knowledge distillation fine-tuning based on the baseline model training. The test results of the lightweight model show that the model size was reduced by 98.73%, the recognition speed increased by 43.74%, and the recognition accuracy remained largely comparable to that of the baseline high-precision model. Then, we proposed a tractor positioning method based on an RGB-D camera. Finally, we established a field vehicle recognition and positioning experimental platform and designed a test plan. The results indicate that when IYO-RGBD recognized and positioned the leader tractor within a 10 m range, the root mean square (RMS) values of the longitudinal and lateral errors during straight-line travel were 0.0687 m and 0.025 m, respectively. During S-curve travel, the RMS values of the longitudinal and lateral errors were 0.1101 m and 0.0481 m, respectively. IYO-RGBD can meet the accuracy requirements for recognizing and positioning the leader tractor by the follower tractor in practical autonomous following field operations. Our research outcomes can provide a new solution and technical reference for visual navigation in multi-machine collaborative field operations of agricultural machinery.

(This article belongs to the Section Agricultural Technology)

23 pages, 2403 KB

Open Access Article

LI-AGCN: A Lightweight Initialization-Enhanced Adaptive Graph Convolutional Network for Effective Skeleton-Based Action Recognition

by Qingsheng Xie and Hongmin Deng

Sensors 2025, 25(23), 7282; https://doi.org/10.3390/s25237282 - 29 Nov 2025

Viewed by 1051

Abstract

The graph convolutional network (GCN) has become a mainstream technology in skeleton-based action recognition since it was first applied to this field. However, previous studies have often overlooked the pivotal role of heuristic model initialization in the extraction of spatial features, which prevents the model from achieving its optimal performance. To address this issue, a lightweight initialization-enhanced adaptive graph convolutional network (LI-AGCN) is proposed, which effectively captures spatiotemporal features while maintaining low computational complexity. LI-AGCN employs three coordinate-based input branches (CIB) to dynamically adjust graph structures, which facilitates the extraction of informative spatial features. In addition, the model incorporates a lightweight multi-scale temporal module to extract temporal features, and employs an attention module that considers the temporal, spatial, and channel dimensions simultaneously to enhance key features. Finally, the performance of the proposed model is evaluated on three large-scale public datasets: NTU RGB+D, NTU RGB+D 120, and UAV-Human. The experimental results demonstrate that LI-AGCN achieves excellent overall performance on these datasets, notably obtaining 90.03% accuracy on the cross-subject benchmark of the NTU RGB+D dataset with only 0.18 million parameters, which demonstrates the effectiveness of the model.
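The abstract does not give the layer equations of LI-AGCN. As a generic illustration of what an adaptive graph convolution over skeleton joints can look like, the sketch below adds a learnable offset to a row-normalised skeleton adjacency matrix before aggregating joint features. This is an assumed, simplified form and not the authors' design; the function names, the three-joint chain, and all weights are hypothetical.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def normalize_adjacency(adj):
    """Add self-loops, then row-normalise so each joint averages over itself and its neighbours."""
    n = len(adj)
    with_self = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
                 for i in range(n)]
    return [[value / sum(row) for value in row] for row in with_self]


def adaptive_graph_conv(features, adj, offset, weight):
    """Compute (A_hat + B) X W.

    A_hat is the fixed, normalised skeleton graph; B (offset) stands in for a
    learned adjustment to the graph; W (weight) is the feature projection.
    """
    a_hat = normalize_adjacency(adj)
    n = len(adj)
    a_adapt = [[a_hat[i][j] + offset[i][j] for j in range(n)] for i in range(n)]
    return matmul(matmul(a_adapt, features), weight)


# Hypothetical 3-joint chain (0-1-2) with 2 features per joint.
adj = [[0.0, 1.0, 0.0],
       [1.0, 0.0, 1.0],
       [0.0, 1.0, 0.0]]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
zero_offset = [[0.0] * 3 for _ in range(3)]   # no learned adjustment yet
identity = [[1.0, 0.0], [0.0, 1.0]]           # pass features through unchanged

out = adaptive_graph_conv(features, adj, zero_offset, identity)
```

With a zero offset the layer reduces to an ordinary graph convolution; training would move `offset` away from zero so that joints not physically connected can still exchange information.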

(This article belongs to the Special Issue Computer Vision Sensing and Pattern Recognition)

20 pages, 2894 KB

Open Access Article

End-to-End Swallowing Event Localization via Blue-Channel-to-Depth Substitution in RGB-D: GRNConvNeXt-Modified AdaTAD with KAN-Chebyshev Decoder

by Derek Ka-Hei Lai, Zi-An Zhao, Andy Yiu-Chau Tam, Jing Li, Jason Zhi-Shen Zhang, Duo Wai-Chi Wong and James Chung-Wai Cheung

AI 2025, 6(11), 276; https://doi.org/10.3390/ai6110276 - 22 Oct 2025

Viewed by 1345

Abstract

Background: Swallowing is a complex biomechanical process, and its impairment (dysphagia) poses major health risks for older adults. Current diagnostic methods such as the videofluoroscopic swallowing study (VFSS) and fiberoptic endoscopic evaluation of swallowing (FEES) are effective but invasive, resource-intensive, and unsuitable for continuous monitoring. This study proposes a novel end-to-end RGB–D framework for automated swallowing event localization in continuous video streams. Methods: The framework enhances the AdaTAD backbone through three key innovations: (i) finding the optimal strategy for integrating depth information to capture subtle neck movements, (ii) examining the best adapter design for efficient temporal feature adaptation, and (iii) introducing a Kolmogorov–Arnold Network (KAN) decoder that leverages Chebyshev polynomials for non-linear temporal modeling. The framework was evaluated on a proprietary swallowing dataset comprising 641 clips and 3153 annotated events. We analysed and compared modification strategies across adapter designs, decoders, input channel combinations, regression methods, and patch embedding techniques. Results: The optimized configuration (VideoMAE + GRNConvNeXtAdapter + KAN + RGD + boundary regression + sinusoidal embedding) achieved an average mAP of 83.25%, significantly surpassing the baseline I3D + RGB + MLP model (61.55%). Ablation studies further confirmed that each architectural component contributed incrementally to the overall improvement. Conclusions: These results establish the feasibility of accurate, non-invasive, and automated swallowing event localization using depth-augmented video. The proposed framework paves the way for practical dysphagia screening and long-term monitoring in clinical and home-care environments.
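Two ideas named in this entry lend themselves to a short illustration: the blue-channel-to-depth substitution behind the "RGD" input, and a KAN-style layer built on Chebyshev polynomials. The sketch below is a minimal, assumed formulation rather than the authors' implementation. It evaluates the Chebyshev basis with the standard recurrence T_{k+1}(x) = 2x T_k(x) - T_{k-1}(x), squashes inputs with `tanh` so they fall in [-1, 1] (an assumption, common in Chebyshev-based KAN variants), and sums learnable coefficients over inputs and polynomial orders. All function names and coefficient values are hypothetical.

```python
import math


def chebyshev_basis(x, degree):
    """Return [T_0(x), ..., T_degree(x)] using T_{k+1} = 2x*T_k - T_{k-1}."""
    values = [1.0, x]
    for _ in range(2, degree + 1):
        values.append(2.0 * x * values[-1] - values[-2])
    return values[: degree + 1]  # slice handles degree == 0


def cheby_kan_layer(inputs, coeffs):
    """KAN-style layer: output_j = sum_i sum_k coeffs[i][j][k] * T_k(tanh(input_i)).

    coeffs is indexed [input][output][polynomial order]; every edge between an
    input and an output carries its own learnable 1-D function.
    """
    n_out = len(coeffs[0])
    degree = len(coeffs[0][0]) - 1
    outputs = [0.0] * n_out
    for i, value in enumerate(inputs):
        basis = chebyshev_basis(math.tanh(value), degree)  # map into [-1, 1]
        for j in range(n_out):
            outputs[j] += sum(c * t for c, t in zip(coeffs[i][j], basis))
    return outputs


def rgd_pixel(rgb, depth):
    """Replace the blue channel with a depth value, keeping red and green (assumed RGD layout)."""
    red, green, _blue = rgb
    return (red, green, depth)


# Hypothetical single-input, single-output layer of degree 2.
result = cheby_kan_layer([0.0], [[[1.0, 2.0, 3.0]]])
pixel = rgd_pixel((120, 64, 200), 87)
```

Keeping three channels in the RGD layout means a backbone pretrained on RGB video can consume depth-augmented frames without changing its input shape, which is one plausible motivation for the substitution.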

(This article belongs to the Special Issue Artificial Intelligence in Biomedical Engineering: Challenges and Developments)
