Search Results (71)

Search Parameters:
Keywords = monocular benchmark

17 pages, 1167 KB  
Article
HOIMamba: Bidirectional State-Space Modeling for Monocular 3D Human–Object Interaction Reconstruction
by Jinsong Zhang and Yuqin Lin
Biomimetics 2026, 11(3), 214; https://doi.org/10.3390/biomimetics11030214 - 17 Mar 2026
Viewed by 474
Abstract
Monocular 3D human–object interaction (HOI) reconstruction requires jointly recovering articulated human geometry, object pose, and physically plausible contact from a single RGB image. While recent token-based methods commonly employ dense self-attention to capture global dependencies, isotropic all-to-all mixing tends to entangle spatial-geometric cues (e.g., contact locality) with channel-wise semantic cues (e.g., action/affordance), and provides limited control for representing directional and asymmetric physical influence between humans and objects. This paper presents HOIMamba, a state-space sequence modeling framework that reformulates HOI reconstruction as bidirectional, multi-scale interaction state inference. Instead of relying on symmetric correlation aggregation, HOIMamba uses structured state evolution to propagate interaction evidence. We introduce a multi-scale state-space module (MSSM) to capture interaction dependencies spanning local contact details and global body–object coordination. Building on MSSM, we propose a spatial-channel grouped SSM (SCSSM) block that factorizes interaction modeling into a spatial pathway for geometric/contact dependencies and a channel pathway for semantic/functional correlations, followed by gated fusion. HOIMamba further performs explicit bidirectional propagation between human and object states to better reflect asymmetric reciprocity in physical interactions. We evaluate HOIMamba on two public benchmarks, BEHAVE and InterCap, using Chamfer distance for human/object meshes and contact precision/recall induced by reconstructed geometry. HOIMamba achieves consistent improvements over representative prior methods. On the BEHAVE dataset, it reduces human Chamfer distance by 8.6% and improves contact recall by 13.5% compared to the strongest Transformer-based baseline, with similar gains observed on the InterCap dataset. 
Ablation studies on BEHAVE verify the contributions of state-space modeling, multi-scale inference, spatial-channel factorization, and bidirectional interaction reasoning. Full article
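The Chamfer distance used above as the mesh metric is the symmetric average of nearest-neighbor distances between predicted and ground-truth point sets. A minimal NumPy sketch, assuming unsquared distances and mean aggregation (conventions vary across papers; the function name is illustrative):

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3):
    mean distance from each point in a to its nearest neighbor in b,
    plus the same in the other direction."""
    # Pairwise Euclidean distances via broadcasting, shape (N, M)
    d = np.sqrt(np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1))
    return d.min(axis=1).mean() + d.min(axis=0).mean()
```

For dense meshes, a KD-tree query (e.g., `scipy.spatial.cKDTree`) would replace the O(NM) broadcast.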

21 pages, 23671 KB  
Article
Zero-Shot Polarization-Intensity Physical Fusion Monocular Depth Estimation for High Dynamic Range Scenes
by Renhao Rao, Zhizhao Ouyang, Shuang Chen, Liang Chen, Guoqin Huang and Changcai Cui
Photonics 2026, 13(3), 268; https://doi.org/10.3390/photonics13030268 - 11 Mar 2026
Viewed by 356
Abstract
Monocular 3D reconstruction remains a persistent challenge for autonomous driving systems in Degraded Visual Environments (DVEs) with extreme glare and low illumination, such as highway tunnels, due to the lack of reliable texture cues. This paper proposes a physics-aware deep learning framework that overcomes these limitations by fusing polarization sensing with conventional intensity imaging. Unlike traditional end-to-end data-driven fusion strategies, we propose a Modality-Aligned Parameter Injection strategy. By remapping the weight space of the input layer, this strategy achieves a smooth transfer of the pre-trained Vision Transformer (i.e., MiDaS) to multi-modal inputs. Its core advantage lies in the seamless integration of four-channel polarization geometric information while fully preserving the pre-trained semantic representation capabilities of the backbone network, thereby avoiding the overfitting risk associated with training from scratch on small-sample data. Furthermore, we design a Reliability-Aware Gating mechanism that dynamically re-weights appearance and geometric cues based on intensity saturation and the physical validity of polarization signals as measured by the Degree of Linear Polarization (DoLP). We validate the proposed method on our self-constructed POLAR-GLV benchmark, a real-world dataset collected specifically for high dynamic range tunnel scenarios. Extensive experiments demonstrate that our method consistently outperforms intensity-only baselines, reducing geometric reconstruction error by 24.2% in high-glare tunnel exit zones and 10.0% at tunnel entrances. Crucially, compared to multi-stream fusion architectures, these performance gains come with negligible additional computational cost, making the framework highly suitable for resource-constrained onboard inference environments. Full article
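The gating signal named above, the Degree of Linear Polarization (DoLP), is conventionally computed from the Stokes parameters. With a four-channel polarization input (intensities behind 0°, 45°, 90°, and 135° polarizers, as division-of-focal-plane sensors provide), a standard sketch looks like this (the function name and epsilon guard are illustrative, not from the paper):

```python
import numpy as np

def dolp(i0, i45, i90, i135, eps=1e-8):
    """Degree of Linear Polarization from intensities measured behind
    0°, 45°, 90°, and 135° polarizers (the four-channel input of a
    division-of-focal-plane polarization sensor)."""
    s0 = 0.5 * (i0 + i45 + i90 + i135)  # total intensity (Stokes S0)
    s1 = i0 - i90                       # horizontal/vertical preference (Stokes S1)
    s2 = i45 - i135                     # diagonal preference (Stokes S2)
    return np.sqrt(s1**2 + s2**2) / (s0 + eps)  # 0 = unpolarized, 1 = fully polarized
```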

23 pages, 16353 KB  
Article
RepACNet: A Lightweight Reparameterized Asymmetric Convolution Network for Monocular Depth Estimation
by Wanting Jiang, Jun Li, Yaoqian Niu, Hao Chen and Shuang Peng
Sensors 2026, 26(4), 1199; https://doi.org/10.3390/s26041199 - 12 Feb 2026
Viewed by 397
Abstract
Monocular depth estimation (MDE) is a cornerstone task in 2D/3D scene reconstruction and recognition with widespread applications in autonomous driving, robotics, and augmented reality. However, existing state-of-the-art methods face a fundamental trade-off between computational efficiency and estimation accuracy, limiting their deployment in resource-constrained real-world scenarios. It is therefore of high interest to design lightweight yet effective models that can be deployed on resource-constrained mobile devices. To address this problem, we present RepACNet, a novel lightweight network built on reparameterized asymmetric convolution designs and a CNN architecture that integrates MLP-Mixer components. First, we propose the Reparameterized Token Mixer with Asymmetric Convolution (RepTMAC), an efficient block that captures long-range dependencies while maintaining linear computational complexity. Unlike Transformer-based methods, our approach achieves global feature interaction with minimal overhead. Second, we introduce Squeeze-and-Excitation Consecutive Dilated Convolutions (SECDCs), which integrate adaptive channel attention with dilated convolutions to capture depth-specific features across multiple scales. We validate the effectiveness of our approach through extensive experiments on two widely recognized benchmarks, NYU Depth v2 and KITTI Eigen. The experimental results demonstrate that our model achieves competitive performance while using significantly fewer parameters than state-of-the-art models. Full article
(This article belongs to the Section Sensing and Imaging)

20 pages, 2389 KB  
Article
A Monocular Depth Estimation Method for Autonomous Driving Vehicles Based on Gaussian Neural Radiance Fields
by Ziqin Nie, Zhouxing Zhao, Jieying Pan, Yilong Ren, Haiyang Yu and Liang Xu
Sensors 2026, 26(3), 896; https://doi.org/10.3390/s26030896 - 29 Jan 2026
Viewed by 604
Abstract
Monocular depth estimation, which derives depth information of a scene from a single image, is one of the key tasks in autonomous driving and a fundamental component of vehicle perception and decision-making. However, current approaches face challenges such as visual artifacts, scale ambiguity, and occlusion handling. These limitations lead to suboptimal performance in complex environments, reducing model efficiency and generalization and hindering broader use in autonomous driving and other applications. To address these challenges, this paper introduces a Neural Radiance Field (NeRF)-based monocular depth estimation method for autonomous driving. It introduces a Gaussian probability-based ray sampling strategy that effectively reduces the massive number of sampling points required in large, complex scenes and thereby lowers computational costs. To improve generalization, a lightweight spherical network incorporating a fine-grained adaptive channel attention mechanism is designed to capture detailed pixel-level features. These features are subsequently mapped to 3D spatial sampling locations, resulting in diverse and expressive point representations that improve the generalizability of the NeRF model. Our approach exhibits remarkable performance on the KITTI benchmark, surpassing traditional methods in depth estimation tasks. This work contributes significant technical advancements for practical monocular depth estimation in autonomous driving applications. Full article
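The abstract does not spell out the sampling strategy, but Gaussian probability-based ray sampling generally means concentrating NeRF sample depths near likely surfaces rather than spacing them uniformly along the ray. A hypothetical sketch under that reading (all names and parameters are assumptions, not the paper's method):

```python
import numpy as np

def gaussian_ray_samples(t_near, t_far, n_samples, mu, sigma, rng=None):
    """Draw sample depths along a ray from a Gaussian centered at an
    expected surface depth mu, clipped to the ray's [t_near, t_far] range,
    so samples concentrate near likely surfaces instead of being uniform."""
    if rng is None:
        rng = np.random.default_rng(0)
    t = np.sort(rng.normal(mu, sigma, n_samples))  # sorted for front-to-back compositing
    return np.clip(t, t_near, t_far)
```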

14 pages, 10199 KB  
Article
Relaxing Accurate Initialization for Monocular Dynamic Scene Reconstruction with Gaussian Splatting
by Xinyu Wang, Jiafu Chen, Wei Xing, Huaizhong Lin and Lei Zhao
Appl. Sci. 2026, 16(3), 1321; https://doi.org/10.3390/app16031321 - 28 Jan 2026
Viewed by 546
Abstract
Monocular dynamic scene reconstruction is a challenging task due to the inherent limitation of observing the scene from a single viewpoint at each timestamp, particularly in the presence of object motion and illumination changes. Recent methods combine Gaussian Splatting with deformation modeling to enable fast training and rendering; however, their performance in real-world scenarios strongly depends on accurate point cloud initialization. When such initialization is unavailable and random point clouds are used instead, reconstruction quality degrades significantly. To address this limitation, we propose an optimization strategy that relaxes the requirement for accurate initialization in Gaussian-Splatting-based monocular dynamic scene reconstruction. The scene is first reconstructed under a static assumption using all monocular frames, allowing stable convergence of background regions. Based on reconstruction errors, a subset of Gaussians is then activated as dynamic to model motion and deformation. In addition, an annealing jitter regularization term is introduced to improve robustness to camera pose inaccuracies commonly observed in real-world datasets. Extensive experiments on established benchmarks demonstrate that the proposed method enables stable training from randomly initialized point clouds and achieves reconstruction performance comparable to approaches relying on accurate point cloud initialization. Full article
(This article belongs to the Section Computing and Artificial Intelligence)

20 pages, 7566 KB  
Article
Temporal Probability-Guided Graph Topology Learning for Robust 3D Human Mesh Reconstruction
by Hongsheng Wang, Jie Yang, Feng Lin and Fei Wu
Mathematics 2026, 14(2), 367; https://doi.org/10.3390/math14020367 - 21 Jan 2026
Viewed by 302
Abstract
Reconstructing 3D human motion from monocular video presents challenges when frames contain occlusions or blur, as conventional approaches depend on features extracted within limited temporal windows, resulting in structural distortions. In this paper, we introduce a novel framework that combines temporal probability guidance with graph topology learning to achieve robust 3D human mesh reconstruction from incomplete observations. Our method leverages topology-aware probability distributions spanning entire motion sequences to recover missing anatomical regions. The Graph Topological Modeling (GTM) component captures structural relationships among body parts by learning the inherent connectivity patterns in human anatomy. Building upon GTM, our Temporal-alignable Probability Distribution (TPDist) mechanism predicts missing features through probabilistic inference, establishing temporal coherence across frames. Additionally, we propose a Hierarchical Human Loss (HHLoss) that hierarchically regularizes probability distribution errors for inter-frame features while accounting for topological variations. Experimental validation demonstrates that our approach outperforms state-of-the-art methods on the 3DPW benchmark, particularly excelling in scenarios involving occlusions and motion blur. Full article

23 pages, 3329 KB  
Article
MogaDepth: Multi-Order Feature Hierarchy Fusion for Lightweight Monocular Depth Estimation
by Gengsheng Lin and Guangping Li
Sensors 2026, 26(2), 685; https://doi.org/10.3390/s26020685 - 20 Jan 2026
Cited by 1 | Viewed by 516
Abstract
Monocular depth estimation is a fundamental task with broad applications in autonomous driving and augmented reality. While recent lightweight methods achieve impressive performance, they often neglect the interaction of mid-order semantic features, which are crucial for capturing object structures and spatial relationships that directly impact depth accuracy. To address this limitation, we propose MogaDepth, a lightweight yet expressive architecture. It introduces a novel Continuous Multi-Order Gated Aggregation (CMOGA) module that explicitly enhances mid-level feature representations through multi-order receptive fields. In addition, we present MambaSync, a global–local interaction unit that enables efficient feature communication across different contexts. Extensive experiments demonstrate that MogaDepth achieves highly competitive or superior performance on KITTI, improving key error metrics while maintaining comparable model size. On the Make3D benchmark, it consistently outperforms existing methods, showing strong robustness to domain shifts and challenging scenarios such as low-texture regions. Moreover, MogaDepth achieves an improved trade-off between accuracy and efficiency, running up to 13% faster on edge devices without compromising performance. These results establish MogaDepth as an effective and efficient solution for real-world monocular depth estimation. Full article
(This article belongs to the Section Vehicular Sensing)

19 pages, 1885 KB  
Article
A Hierarchical Multi-Resolution Self-Supervised Framework for High-Fidelity 3D Face Reconstruction Using Learnable Gabor-Aware Texture Modeling
by Pichet Mareo and Rerkchai Fooprateepsiri
J. Imaging 2026, 12(1), 26; https://doi.org/10.3390/jimaging12010026 - 5 Jan 2026
Viewed by 692
Abstract
High-fidelity 3D face reconstruction from a single image is challenging, owing to the inherently ambiguous depth cues and the strong entanglement of multi-scale facial textures. In this regard, we propose a hierarchical multi-resolution self-supervised framework (HMR-Framework), which reconstructs coarse-, medium-, and fine-scale facial geometry progressively through a unified pipeline. A coarse geometric prior is first estimated via 3D morphable model regression, followed by medium-scale refinement using a vertex deformation map constrained by a global–local Markov random field loss to preserve structural coherence. To improve fine-scale fidelity, a learnable Gabor-aware texture enhancement module is proposed to decouple spatial–frequency information and thus improve sensitivity to high-frequency facial attributes. Additionally, we employ a wavelet-based detail perception loss to preserve edge-aware texture features while mitigating noise commonly observed in in-the-wild images. Extensive qualitative and quantitative evaluations on benchmark datasets indicate that the proposed framework provides better fine-detail reconstruction than existing state-of-the-art methods, while maintaining robustness over pose variations. Notably, the hierarchical design increases semantic consistency across multiple geometric scales, providing a functional solution for high-fidelity 3D face reconstruction from monocular images. Full article

27 pages, 26025 KB  
Article
LFP-Mono: Lightweight Self-Supervised Network Applying Monocular Depth Estimation to Low-Altitude Environment Scenarios
by Hao Cai, Jiafu Liu, Jinhong Zhang, Jingxuan Xu, Yi Zhang and Qin Yang
Computers 2026, 15(1), 19; https://doi.org/10.3390/computers15010019 - 4 Jan 2026
Viewed by 696
Abstract
UAV obstacle avoidance currently relies on expensive sensors. A significant challenge arises from the scarcity of high-quality depth estimation datasets tailored for low-altitude environments, which hinders the advancement of self-supervised learning methods in these settings. Furthermore, mainstream depth estimation models capable of achieving obstacle avoidance through image recognition are built upon convolutional neural networks or hybrid Transformers, and their high computational costs make deployment on resource-constrained edge devices challenging. While existing lightweight convolutional networks reduce parameter counts, they struggle to simultaneously capture essential features and fine details in complex scenes. In this work, we introduce LFP-Mono, a lightweight self-supervised monocular depth estimation network comprising a Pooling Convolution Downsampling (PCD) module, a Continuously Dilated and Weighted Convolution (CDWC) module, and a Cross-level Feature Integration (CFI) module. Results show that LFP-Mono outperforms existing lightweight methods on the KITTI benchmark, and evaluation on the Make3D dataset shows that the method generalizes to outdoor scenes. Finally, training and testing on the Syndrone dataset show that LFP-Mono exceeds state-of-the-art baselines on low-altitude drone imagery. Full article

24 pages, 9828 KB  
Article
A Novel Object Detection Algorithm Combined YOLOv11 with Dual-Encoder Feature Aggregation
by Haisong Chen, Pengfei Yuan, Wenbai Liu, Fuling Li and Aili Wang
Sensors 2025, 25(23), 7270; https://doi.org/10.3390/s25237270 - 28 Nov 2025
Cited by 2 | Viewed by 1102
Abstract
To address the limitations of unimodal visual detection in complex scenarios involving low illumination, occlusion, and texture-sparse environments, this paper proposes an improved YOLOv11-based dual-branch RGB-D fusion framework. The symmetric architecture processes RGB images and depth maps in parallel, integrating a Dual-Encoder Cross-Attention (DECA) module for cross-modal feature weighting and a Dual-Encoder Feature Aggregation (DEPA) module for hierarchical fusion—where the RGB branch captures texture semantics while the depth branch extracts geometric priors. To comprehensively validate the effectiveness and generalization capability of the proposed framework, we designed a multi-stage evaluation strategy leveraging complementary benchmark datasets. On the M3FD dataset, the model was evaluated under both RGB-depth and RGB-infrared configurations to verify core fusion performance and extensibility to diverse modalities. Additionally, the VOC2007 dataset was augmented with pseudo-depth maps generated by Depth Anything, assessing adaptability under monocular input constraints. Experimental results demonstrate that our method achieves mAP50 scores of 82.59% on VOC2007 and 81.14% on M3FD in RGB-infrared mode, outperforming the baseline YOLOv11 by 5.06% and 9.15%, respectively. Notably, in the RGB-depth configuration on M3FD, the model attains a mAP50 of 77.37% with precision of 88.91%, highlighting its robustness in geometric-aware detection tasks. Ablation studies confirm the critical roles of the Dynamic Branch Enhancement (DBE) module in adaptive feature calibration and the Dual-Encoder Attention (DEA) mechanism in multi-scale fusion, significantly enhancing detection stability under challenging conditions. With only 2.47M parameters, the framework provides an efficient and scalable solution for high-precision spatial perception in autonomous driving and robotics applications. Full article
(This article belongs to the Section Sensing and Imaging)

17 pages, 6551 KB  
Article
AdaLite: A Distilled AdaBins Model for Depth Estimation on Resource-Limited Devices
by Mohammed Chaouki Ziara, Mohamed Elbahri, Nasreddine Taleb, Kidiyo Kpalma and Sid Ahmed El Mehdi Ardjoun
AI 2025, 6(11), 298; https://doi.org/10.3390/ai6110298 - 20 Nov 2025
Viewed by 1095
Abstract
This paper presents AdaLite, a knowledge distillation framework for monocular depth estimation designed for efficient deployment on resource-limited devices, without relying on quantization or pruning. While large-scale depth estimation networks achieve high accuracy, their computational and memory demands hinder real-time use. To address this problem, a large model is adopted as a teacher, and a compact encoder–decoder student with few trainable parameters is trained under a dual-supervision scheme that aligns its predictions with both teacher feature maps and ground-truth depths. AdaLite is evaluated on the NYUv2, SUN-RGBD and KITTI benchmarks using standard depth metrics and deployment-oriented measures, including inference latency. The distilled model achieves a 94% reduction in size and reaches 1.02 FPS on a Raspberry Pi 2 (2 GB CPU), while preserving 96.8% of the teacher’s accuracy (δ1) and providing over 11× faster inference. These results demonstrate the effectiveness of distillation-driven compression for real-time depth estimation in resource-limited environments. The code is publicly available. Full article
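The δ1 accuracy cited above is the standard depth-estimation threshold metric: the fraction of pixels whose prediction/ground-truth ratio, taken in whichever direction exceeds 1, falls below 1.25. A minimal sketch (the function name is illustrative):

```python
import numpy as np

def delta1(pred, gt, thresh=1.25):
    """Threshold accuracy delta-1: fraction of pixels where
    max(pred/gt, gt/pred) < 1.25."""
    ratio = np.maximum(pred / gt, gt / pred)
    return float(np.mean(ratio < thresh))
```

The companion metrics δ2 and δ3 use thresholds of 1.25² and 1.25³.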

15 pages, 2020 KB  
Article
3D Human Reconstruction from Monocular Vision Based on Neural Fields and Explicit Mesh Optimization
by Kaipeng Wang, Xiaolong Xie, Wei Li, Jie Liu and Zhuo Wang
Electronics 2025, 14(22), 4512; https://doi.org/10.3390/electronics14224512 - 18 Nov 2025
Viewed by 2020
Abstract
Three-dimensional human reconstruction from monocular vision is a key technology in virtual reality and digital humans. It aims to recover the 3D structure and pose of the human body from 2D images or video. Dynamic 3D reconstruction of the human body from monocular views remains challenging, and current methods suffer from low accuracy. This paper proposes a fast reconstruction method based on Instant Human Model (IHM) generation, which achieves highly realistic 3D reconstruction of the human body in arbitrary poses. First, the efficient dynamic human body reconstruction method InstantAvatar is utilized to learn the shape and appearance of the human body in different poses. However, because it directly uses low-resolution voxels as the canonical human representation, it cannot achieve satisfactory reconstruction results across a wide range of datasets. Next, a voxel occupancy grid is initialized in the A-pose, and a voxel attention module is constructed to enhance reconstruction quality. Finally, the IHM method is employed to define continuous fields on the surface, enabling highly realistic dynamic 3D human reconstruction. Experimental results show that, compared to the representative InstantAvatar method, IHM achieves a 0.1% improvement in SSIM and a 2% improvement in PSNR on the PeopleSnapshot benchmark dataset, demonstrating improvements in both reconstruction quality and detail. Specifically, IHM, through voxel attention mechanisms and adaptive iterative mesh optimization, achieves highly realistic 3D mesh models of human bodies in various poses while ensuring efficiency. Full article
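PSNR, one of the two image-quality metrics reported above, is derived directly from mean squared error. A sketch for images normalized to [0, max_val] (not the paper's exact evaluation code):

```python
import numpy as np

def psnr(pred, gt, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images with values in [0, max_val]."""
    mse = np.mean((pred - gt) ** 2)
    return 10.0 * np.log10(max_val**2 / mse)
```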
(This article belongs to the Special Issue 3D Computer Vision and 3D Reconstruction)

13 pages, 16914 KB  
Article
Traversal by Touch: Tactile-Based Robotic Traversal with Artificial Skin in Complex Environments
by Adam Mazurick and Alex Ferworn
Sensors 2025, 25(21), 6569; https://doi.org/10.3390/s25216569 - 25 Oct 2025
Cited by 1 | Viewed by 1018
Abstract
We evaluate tactile-first robotic traversal on the Department of Homeland Security (DHS) figure-8 mobility test using a two-way repeated-measures design across various algorithms (three tactile policies—M1 reactive, M2 terrain-weighted, M3 memory-augmented; a monocular camera baseline, CB-V; a tactile histogram baseline, T-VFH; and an optional tactile-informed replanner, T-D* Lite) and lighting conditions (Indoor, Outdoor, and Dark). The platform is the custom-built Eleven robot—a quadruped integrating a joint-mounted tactile tentacle with a tip force-sensitive resistor (FSR; Walfront 9snmyvxw25, China; 0–10 kg range, ≈0.1 N resolution @ 83 Hz) and a woven Galvorn carbon-nanotube (CNT) yarn for proprioceptive bend sensing. Control and sensing are fully wireless via an ESP32-S3, Arduino Nano 33 BLE, Raspberry Pi 400, and a mini VESC controller. Across 660 trials, the tactile stack maintained ∼21 ms (p50) policy latency and mid-80% success across all lighting conditions, including total darkness. The memory-augmented tactile policy (M3) exhibited consistent robustness relative to the camera baseline (CB-V), trailing by only ≈3–4% in Indoor and ≈13–16% in Outdoor and Dark conditions. Pre-specified, two one-sided tests (TOSTs) confirmed no speed equivalence in any M3↔CB-V comparison. Unlike vision-based approaches, tactile-first traversal is invariant to illumination and texture—an essential capability for navigation in darkness, smoke, or texture-poor, confined environments. Overall, these results show that a tactile-first, memory-augmented control stack achieves lighting-independent traversal on DHS benchmarks while maintaining competitive latency and success, trading modest speed for robustness and sensing independence. Full article
(This article belongs to the Special Issue Intelligent Robots: Control and Sensing)

19 pages, 49708 KB  
Article
MonoLENS: Monocular Lightweight Efficient Network with Separable Convolutions for Self-Supervised Monocular Depth Estimation
by Genki Higashiuchi, Tomoyasu Shimada, Xiangbo Kong, Haimin Yan and Hiroyuki Tomiyama
Appl. Sci. 2025, 15(19), 10393; https://doi.org/10.3390/app151910393 - 25 Sep 2025
Cited by 3 | Viewed by 1301
Abstract
Self-supervised monocular depth estimation is gaining significant attention because it can learn depth from video without needing expensive ground-truth data. However, many self-supervised models remain too heavy for edge devices, and simply shrinking them tends to degrade accuracy. To address this trade-off, we present MonoLENS, an extension of Lite-Mono. MonoLENS follows a design that reduces computation while preserving geometric fidelity (relative depth relations, boundaries, and planar structures). MonoLENS advances Lite-Mono by suppressing computation on paths with low geometric contribution, focusing compute and attention on layers rich in structural cues, and pruning redundant operations in later stages. Our model incorporates two new modules, the DS-Upsampling Block and the MCACoder, along with a simplified encoder. Specifically, the DS-Upsampling Block uses depthwise separable convolutions throughout the decoder, which greatly lowers floating-point operations (FLOPs). Furthermore, the MCACoder applies Multidimensional Collaborative Attention (MCA) to the output of the second encoder stage, helping to make edge details sharper in high-resolution feature maps. Additionally, we simplified the encoder’s architecture by reducing the number of blocks in its fourth stage from 10 to 4, which resulted in a further reduction of model parameters. When tested on both the KITTI and Cityscapes benchmarks, MonoLENS achieved leading performance. On the KITTI benchmark, MonoLENS reduced the number of model parameters by 42% (1.8M) compared with Lite-Mono, while simultaneously improving the squared relative error by approximately 4.5%. Full article
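The FLOP savings claimed for depthwise separable convolutions follow from factorizing a k×k convolution into a per-channel depthwise step and a 1×1 pointwise step. Counting parameters (bias terms omitted; FLOPs per output pixel scale the same way) makes the ratio concrete:

```python
def conv_params(k, c_in, c_out):
    """Parameters of a standard k x k convolution (bias omitted)."""
    return k * k * c_in * c_out

def dsconv_params(k, c_in, c_out):
    """Depthwise separable equivalent: a k x k depthwise convolution
    (k*k weights per input channel) followed by a 1x1 pointwise convolution."""
    return k * k * c_in + c_in * c_out
```

For a 3×3 layer with 64 input and 64 output channels, this gives 4672 vs. 36,864 parameters, matching the well-known 1/C_out + 1/k² ≈ 12.7% ratio.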
(This article belongs to the Special Issue Convolutional Neural Networks and Computer Vision)

24 pages, 5065 KB  
Article
Benchmark Dataset and Deep Model for Monocular Camera Calibration from Single Highway Images
by Wentao Zhang, Wei Jia and Wei Li
Sensors 2025, 25(18), 5815; https://doi.org/10.3390/s25185815 - 18 Sep 2025
Viewed by 1346
Abstract
Single-image based camera auto-calibration holds significant value for improving perception efficiency in traffic surveillance systems. However, existing approaches face dual challenges: scarcity of real-world datasets and poor adaptability to multi-view scenarios. This paper presents a systematic solution framework. First, we constructed a large-scale synthetic dataset containing 36 highway scenarios using the CARLA 0.9.15 simulation engine, generating approximately 336,000 virtual frames with precise calibration parameters. The dataset achieves statistical consistency with real-world scenes by incorporating diverse view distributions, complex weather conditions, and varied road geometries. Second, we developed DeepCalib, a deep calibration network that explicitly models perspective projection features through the triplet attention mechanism. This network simultaneously achieves road direction vanishing point localization and camera pose estimation using only a single image. Finally, we adopted a progressive learning paradigm: robust pre-training on synthetic data establishes universal feature representations in the first stage, followed by fine-tuning on real-world datasets in the second stage to enhance practical adaptability. Experimental results indicate that DeepCalib attains an average calibration precision of 89.6%. Compared to conventional multi-stage algorithms, our method achieves a single-frame processing speed of 10 frames per second, showing robust adaptability to dynamic calibration tasks across diverse surveillance views. Full article
