
Search Results (59)

Search Parameters: Keywords = monocular benchmark

13 pages, 16914 KB  
Article
Traversal by Touch: Tactile-Based Robotic Traversal with Artificial Skin in Complex Environments
by Adam Mazurick and Alex Ferworn
Sensors 2025, 25(21), 6569; https://doi.org/10.3390/s25216569 - 25 Oct 2025
Viewed by 275
Abstract
We evaluate tactile-first robotic traversal on the Department of Homeland Security (DHS) figure-8 mobility test using a two-way repeated-measures design across various algorithms (three tactile policies—M1 reactive, M2 terrain-weighted, M3 memory-augmented; a monocular camera baseline, CB-V; a tactile histogram baseline, T-VFH; and an optional tactile-informed replanner, T-D* Lite) and lighting conditions (Indoor, Outdoor, and Dark). The platform is the custom-built Eleven robot—a quadruped integrating a joint-mounted tactile tentacle with a tip force-sensitive resistor (FSR; Walfront 9snmyvxw25, China; 0–10 kg range, ≈0.1 N resolution @ 83 Hz) and a woven Galvorn carbon-nanotube (CNT) yarn for proprioceptive bend sensing. Control and sensing are fully wireless via an ESP32-S3, Arduino Nano 33 BLE, Raspberry Pi 400, and a mini VESC controller. Across 660 trials, the tactile stack maintained ∼21 ms (p50) policy latency and mid-80% success across all lighting conditions, including total darkness. The memory-augmented tactile policy (M3) exhibited consistent robustness relative to the camera baseline (CB-V), trailing by only ≈3–4% in Indoor and ≈13–16% in Outdoor and Dark conditions. Pre-specified, two one-sided tests (TOSTs) confirmed no speed equivalence in any M3↔CB-V comparison. Unlike vision-based approaches, tactile-first traversal is invariant to illumination and texture—an essential capability for navigation in darkness, smoke, or texture-poor, confined environments. Overall, these results show that a tactile-first, memory-augmented control stack achieves lighting-independent traversal on DHS benchmarks while maintaining competitive latency and success, trading modest speed for robustness and sensing independence. Full article
(This article belongs to the Special Issue Intelligent Robots: Control and Sensing)
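For readers unfamiliar with the equivalence test cited above, here is a minimal sketch of a paired two one-sided test (TOST) in Python; the traversal-time samples and the ±2 s equivalence margin are hypothetical, not values from the paper.

```python
import numpy as np
from scipy import stats

def tost_paired(x, y, delta):
    """Two one-sided tests (TOST) for equivalence of paired means.

    Equivalence is declared at level alpha if both one-sided p-values
    fall below alpha, i.e. the mean difference lies inside (-delta, +delta).
    """
    d = np.asarray(x) - np.asarray(y)
    n = d.size
    mean, se = d.mean(), d.std(ddof=1) / np.sqrt(n)
    # H0_lower: mean <= -delta  vs  H1: mean > -delta
    p_lower = 1.0 - stats.t.cdf((mean + delta) / se, df=n - 1)
    # H0_upper: mean >= +delta  vs  H1: mean < +delta
    p_upper = stats.t.cdf((mean - delta) / se, df=n - 1)
    return max(p_lower, p_upper)   # equivalence only if this is < alpha

# Hypothetical per-trial traversal times (s) for M3 and CB-V
rng = np.random.default_rng(0)
m3, cbv = rng.normal(42, 3, 30), rng.normal(38, 3, 30)
print(tost_paired(m3, cbv, delta=2.0))   # large p-value -> no speed equivalence
```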

19 pages, 49708 KB  
Article
MonoLENS: Monocular Lightweight Efficient Network with Separable Convolutions for Self-Supervised Monocular Depth Estimation
by Genki Higashiuchi, Tomoyasu Shimada, Xiangbo Kong, Haimin Yan and Hiroyuki Tomiyama
Appl. Sci. 2025, 15(19), 10393; https://doi.org/10.3390/app151910393 - 25 Sep 2025
Viewed by 403
Abstract
Self-supervised monocular depth estimation is gaining significant attention because it can learn depth from video without needing expensive ground-truth data. However, many self-supervised models remain too heavy for edge devices, and simply shrinking them tends to degrade accuracy. To address this trade-off, we present MonoLENS, an extension of Lite-Mono. MonoLENS follows a design that reduces computation while preserving geometric fidelity (relative depth relations, boundaries, and planar structures). MonoLENS advances Lite-Mono by suppressing computation on paths with low geometric contribution, focusing compute and attention on layers rich in structural cues, and pruning redundant operations in later stages. Our model incorporates two new modules, the DS-Upsampling Block and the MCACoder, along with a simplified encoder. Specifically, the DS-Upsampling Block uses depthwise separable convolutions throughout the decoder, which greatly lowers floating-point operations (FLOPs). Furthermore, the MCACoder applies Multidimensional Collaborative Attention (MCA) to the output of the second encoder stage, helping to make edge details sharper in high-resolution feature maps. Additionally, we simplified the encoder’s architecture by reducing the number of blocks in its fourth stage from 10 to 4, which resulted in a further reduction of model parameters. When tested on both the KITTI and Cityscapes benchmarks, MonoLENS achieved leading performance. On the KITTI benchmark, MonoLENS reduced the number of model parameters by 42% (1.8M) compared with Lite-Mono, while simultaneously improving the squared relative error by approximately 4.5%. Full article
(This article belongs to the Special Issue Convolutional Neural Networks and Computer Vision)
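The FLOP savings come largely from replacing dense 3×3 convolutions with depthwise separable ones. A generic PyTorch sketch of such a block (not MonoLENS's exact DS-Upsampling Block) illustrates the parameter reduction; the channel counts are arbitrary.

```python
import torch
import torch.nn as nn

class DSConv(nn.Module):
    """Depthwise separable convolution: a depthwise 3x3 followed by a
    pointwise 1x1, replacing a dense 3x3 convolution at a fraction of the
    parameters and FLOPs."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

def n_params(m):
    return sum(p.numel() for p in m.parameters())

dense = nn.Conv2d(64, 128, 3, padding=1, bias=False)
sep = DSConv(64, 128)
print(n_params(dense), n_params(sep))   # 73728 vs 9024 (incl. BatchNorm)
```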

24 pages, 5065 KB  
Article
Benchmark Dataset and Deep Model for Monocular Camera Calibration from Single Highway Images
by Wentao Zhang, Wei Jia and Wei Li
Sensors 2025, 25(18), 5815; https://doi.org/10.3390/s25185815 - 18 Sep 2025
Viewed by 529
Abstract
Single-image based camera auto-calibration holds significant value for improving perception efficiency in traffic surveillance systems. However, existing approaches face dual challenges: scarcity of real-world datasets and poor adaptability to multi-view scenarios. This paper presents a systematic solution framework. First, we constructed a large-scale synthetic dataset containing 36 highway scenarios using the CARLA 0.9.15 simulation engine, generating approximately 336,000 virtual frames with precise calibration parameters. The dataset achieves statistical consistency with real-world scenes by incorporating diverse view distributions, complex weather conditions, and varied road geometries. Second, we developed DeepCalib, a deep calibration network that explicitly models perspective projection features through the triplet attention mechanism. This network simultaneously achieves road direction vanishing point localization and camera pose estimation using only a single image. Finally, we adopted a progressive learning paradigm: robust pre-training on synthetic data establishes universal feature representations in the first stage, followed by fine-tuning on real-world datasets in the second stage to enhance practical adaptability. Experimental results indicate that DeepCalib attains an average calibration precision of 89.6%. Compared to conventional multi-stage algorithms, our method achieves a single-frame processing speed of 10 frames per second, showing robust adaptability to dynamic calibration tasks across diverse surveillance views. Full article
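As background for the calibration task, a minimal geometric sketch shows how camera pan and tilt follow from a road-direction vanishing point once the intrinsics are known. This is textbook projective geometry, not the DeepCalib network, and the intrinsics and vanishing point below are made up.

```python
import numpy as np

def pose_from_road_vp(vp, K):
    """Recover camera pan/tilt from the road-direction vanishing point.

    The ray K^-1 [u, v, 1]^T points along the 3D road direction expressed
    in the camera frame; assuming a horizontal road, its azimuth and
    elevation in that frame give the camera pan and tilt.
    """
    u, v = vp
    r = np.linalg.inv(K) @ np.array([u, v, 1.0])
    r /= np.linalg.norm(r)   # road direction, camera coords (x right, y down, z forward)
    pan = np.degrees(np.arctan2(r[0], r[2]))
    tilt = np.degrees(np.arctan2(-r[1], np.hypot(r[0], r[2])))
    return pan, tilt

# Hypothetical intrinsics and a vanishing point above the image centre
K = np.array([[1200.0, 0, 960], [0, 1200.0, 540], [0, 0, 1]])
print(pose_from_road_vp((980.0, 400.0), K))   # small pan, a few degrees of tilt
```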

40 pages, 1026 KB  
Review
A Survey of Deep Learning-Based 3D Object Detection Methods for Autonomous Driving Across Different Sensor Modalities
by Miguel Valverde, Alexandra Moutinho and João-Vitor Zacchi
Sensors 2025, 25(17), 5264; https://doi.org/10.3390/s25175264 - 24 Aug 2025
Viewed by 3377
Abstract
This paper presents a comprehensive survey of deep learning-based methods for 3D object detection in autonomous driving, focusing on their use of diverse sensor modalities, including monocular cameras, stereo vision, LiDAR, radar, and multi-modal fusion. To systematically organize the literature, a structured taxonomy is proposed that categorizes methods by input modality. The review also outlines the chronological evolution of these approaches, highlighting major architectural developments and paradigm shifts. Furthermore, the surveyed methods are quantitatively compared using standard evaluation metrics across benchmark datasets in autonomous driving scenarios. Overall, this work provides a detailed and modality-agnostic overview of the current landscape of deep learning approaches for 3D object detection in autonomous driving. The results of this work are available in an open GitHub repository. Full article
(This article belongs to the Special Issue Sensors and Sensor Fusion Technology in Autonomous Vehicles)

42 pages, 5531 KB  
Article
Preliminary Analysis and Proof-of-Concept Validation of a Neuronally Controlled Visual Assistive Device Integrating Computer Vision with EEG-Based Binary Control
by Preetam Kumar Khuntia, Prajwal Sanjay Bhide and Pudureddiyur Venkataraman Manivannan
Sensors 2025, 25(16), 5187; https://doi.org/10.3390/s25165187 - 21 Aug 2025
Viewed by 999
Abstract
Contemporary visual assistive devices often lack immersive user experience due to passive control systems. This study introduces a neuronally controlled visual assistive device (NCVAD) that aims to assist visually impaired users in performing reach tasks with active, intuitive control. The developed NCVAD integrates computer vision, electroencephalogram (EEG) signal processing, and robotic manipulation to facilitate object detection, selection, and assistive guidance. The monocular vision-based subsystem implements the YOLOv8n algorithm to detect objects of daily use. Then, audio prompting conveys the detected objects’ information to the user, who selects their targeted object using a voluntary trigger decoded through real-time EEG classification. The target’s physical coordinates are extracted using ArUco markers, and a gradient descent-based path optimization algorithm (POA) guides a 3-DoF robotic arm to reach the target. The classification algorithm achieves over 85% precision and recall in decoding EEG data, even with coexisting physiological artifacts. Similarly, the POA achieves approximately 650 ms of actuation time with a 0.001 learning rate and 0.1 cm2 error threshold settings. In conclusion, the study also validates the preliminary analysis results on a working physical model and benchmarks the robotic arm’s performance against human users, establishing the proof-of-concept for future assistive technologies integrating EEG and computer vision paradigms. Full article
(This article belongs to the Section Intelligent Sensors)
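A rough sketch of the gradient-descent path optimization idea, using the reported learning rate (0.001) and error threshold (0.1 cm²); the planar 3-DoF arm geometry and the finite-difference gradient are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

LINKS_CM = (10.0, 10.0, 8.0)   # hypothetical link lengths of a planar 3-DoF arm

def fk(theta):
    """End-effector position of a planar 3-DoF arm (cm)."""
    acc, x, y = 0.0, 0.0, 0.0
    for t, l in zip(theta, LINKS_CM):
        acc += t
        x += l * np.cos(acc)
        y += l * np.sin(acc)
    return np.array([x, y])

def reach_target(target, theta0, lr=1e-3, err_thresh=0.1, max_iter=20000):
    """Gradient descent on the squared end-effector error, mirroring the
    reported settings (learning rate 0.001, 0.1 cm^2 threshold)."""
    target = np.asarray(target, dtype=float)
    theta = np.array(theta0, dtype=float)
    eps = 1e-5
    for _ in range(max_iter):
        err = fk(theta) - target
        cost = err @ err
        if cost < err_thresh:                  # stop once inside 0.1 cm^2
            break
        grad = np.zeros_like(theta)            # finite-difference gradient of the cost
        for i in range(theta.size):
            step = np.zeros_like(theta)
            step[i] = eps
            e = fk(theta + step) - target
            grad[i] = ((e @ e) - cost) / eps
        theta -= lr * grad
    return theta

print(fk(reach_target([15.0, 10.0], [0.3, 0.3, 0.3])))   # ~ (15, 10)
```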

19 pages, 2717 KB  
Article
EASD: Exposure Aware Single-Step Diffusion Framework for Monocular Depth Estimation in Autonomous Vehicles
by Chenyuan Zhang and Deokwoo Lee
Appl. Sci. 2025, 15(16), 9130; https://doi.org/10.3390/app15169130 - 19 Aug 2025
Viewed by 567
Abstract
Monocular depth estimation (MDE) is a cornerstone of computer vision and is applied to diverse practical areas such as autonomous vehicles, robotics, etc., yet even the latest methods suffer substantial errors in high-dynamic-range (HDR) scenes where over- or under-exposure erases critical texture. To address this challenge in real-world autonomous driving scenarios, we propose the Exposure-Aware Single-Step Diffusion Framework for Monocular Depth Estimation (EASD). EASD leverages a pre-trained Stable Diffusion variational auto-encoder, freezing its encoder to extract exposure-robust latent RGB and depth representations. A single-step diffusion process then predicts the clean depth latent vector, eliminating iterative error accumulation and enabling real-time inference suitable for autonomous vehicle perception pipelines. To further enhance robustness under extreme lighting conditions, EASD introduces an Exposure-Aware Feature Fusion (EAF) module—an attention-based pyramid that dynamically modulates multi-scale features according to global brightness statistics. This mechanism suppresses bias in saturated regions while restoring detail in under-exposed areas. Furthermore, an Exposure-Balanced Loss (EBL) jointly optimises global depth accuracy, local gradient coherence and reliability in exposure-extreme regions—key metrics for safety-critical perception tasks such as obstacle detection and path planning. Experimental results on NYU-v2, KITTI, and related benchmarks demonstrate that EASD reduces absolute relative error by an average of 20% under extreme illumination, using only 60,000 labelled images. The framework achieves real-time performance (<50 ms per frame) and strikes a superior balance between accuracy, computational efficiency, and data efficiency, offering a promising solution for robust monocular depth estimation in challenging automotive lighting conditions such as tunnel transitions, night driving and sun glare. Full article
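The 20% figure refers to the absolute relative error, a standard monocular-depth metric; a short reference implementation (with the usual δ<1.25 accuracy alongside it) is sketched below on synthetic data. The 80 m depth cap is a KITTI-style assumption, not a value from the paper.

```python
import numpy as np

def abs_rel(pred, gt, min_depth=1e-3, max_depth=80.0):
    """Absolute relative error: mean(|d_pred - d_gt| / d_gt) over valid pixels."""
    mask = (gt > min_depth) & (gt < max_depth)
    return np.mean(np.abs(pred[mask] - gt[mask]) / gt[mask])

def delta_accuracy(pred, gt, thresh=1.25):
    """Fraction of pixels with max(pred/gt, gt/pred) below 1.25 (the delta_1 score)."""
    ratio = np.maximum(pred / gt, gt / pred)
    return np.mean(ratio < thresh)

gt = np.random.uniform(1.0, 60.0, size=(375, 1242))
pred = gt * np.random.normal(1.0, 0.05, size=gt.shape)   # synthetic 5% depth noise
print(abs_rel(pred, gt), delta_accuracy(pred, gt))
```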

22 pages, 30414 KB  
Article
Metric Scaling and Extrinsic Calibration of Monocular Neural Network-Derived 3D Point Clouds in Railway Applications
by Daniel Thomanek and Clemens Gühmann
Appl. Sci. 2025, 15(10), 5361; https://doi.org/10.3390/app15105361 - 11 May 2025
Viewed by 1432
Abstract
Three-dimensional reconstruction using monocular camera images is a well-established research topic. While multi-image approaches like Structure from Motion produce sparse point clouds, single-image depth estimation via machine learning promises denser results. However, many models estimate relative depth, and even those providing metric depth often struggle with unseen data due to unfamiliar camera parameters or domain-specific challenges. Accurate metric 3D reconstruction is critical for railway applications, such as ensuring structural gauge clearance from vegetation to meet legal requirements. We propose a novel method to scale 3D point clouds using the track gauge, which takes only a small set of standard values across large regions or countries worldwide (e.g., 1.435 m in Europe). Our approach leverages state-of-the-art image segmentation to detect rails and measure the track gauge from a train driver’s perspective. Additionally, we extend our method to estimate a reasonable railway-specific extrinsic camera calibration. Evaluations show that our method reduces the average Chamfer distance to LiDAR point clouds from 1.94 m (benchmark UniDepth) to 0.41 m for image-wise calibration and 0.71 m for average calibration. Full article
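The core idea, rescaling an up-to-scale point cloud so the measured rail separation equals the known gauge and then scoring against LiDAR with the Chamfer distance, can be sketched in a few lines; the rail-gauge measurement itself and the example numbers are placeholders.

```python
import numpy as np
from scipy.spatial import cKDTree

STANDARD_GAUGE_M = 1.435   # European standard gauge used by the paper

def scale_to_metric(points, measured_gauge):
    """Scale an up-to-scale monocular point cloud so that the rail distance
    measured in it matches the known track gauge. How 'measured_gauge' is
    obtained (rail segmentation from the driver's view) is not reproduced here."""
    return points * (STANDARD_GAUGE_M / measured_gauge)

def chamfer_distance(a, b):
    """Symmetric average nearest-neighbour distance between two point clouds,
    the evaluation measure quoted against the LiDAR reference."""
    d_ab = cKDTree(b).query(a)[0]
    d_ba = cKDTree(a).query(b)[0]
    return 0.5 * (d_ab.mean() + d_ba.mean())

cloud = np.random.rand(10000, 3) * 20.0                  # hypothetical unscaled cloud
scaled = scale_to_metric(cloud, measured_gauge=0.9)
print(chamfer_distance(scaled[:2000], scaled[:2000] + 0.05))
```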

40 pages, 10575 KB  
Review
A Survey of the State of the Art in Monocular 3D Human Pose Estimation: Methods, Benchmarks, and Challenges
by Yan Guo, Tianhan Gao, Aoshuang Dong, Xinbei Jiang, Zichen Zhu and Fuxin Wang
Sensors 2025, 25(8), 2409; https://doi.org/10.3390/s25082409 - 10 Apr 2025
Cited by 4 | Viewed by 8759
Abstract
Three-dimensional human pose estimation (3D HPE) from monocular RGB cameras is a fundamental yet challenging task in computer vision, forming the basis of a wide range of applications such as action recognition, metaverse, self-driving, and healthcare. Recent advances in deep learning have significantly propelled the field, particularly with the incorporation of state-space models (SSMs) and diffusion models. However, systematic reviews that comprehensively cover these emerging techniques remain limited. This survey contributes to the literature by providing the first comprehensive analysis of recent innovative approaches, featuring diffusion models and SSMs within 3D HPE. It categorizes and analyzes various techniques, highlighting their strengths, limitations, and notable innovations. Additionally, it provides a detailed overview of commonly employed datasets and evaluation metrics. Furthermore, this survey offers an in-depth discussion on key challenges, particularly depth ambiguity and occlusion issues arising from single-view setups, thoroughly reviewing effective solutions proposed in recent studies. Finally, current applications and promising avenues for future research are highlighted to guide and inspire ongoing innovation in the area, with emerging trends such as integrating large language models (LLMs) to provide semantic priors and prompt-based supervision for improved 3D pose estimation. Full article
(This article belongs to the Special Issue Computer Vision and Sensors-Based Application for Intelligent Systems)
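Among the evaluation metrics such surveys cover, MPJPE and its Procrustes-aligned variant are the most common for monocular 3D HPE; a compact reference sketch follows, with random joints standing in for real predictions.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean Per-Joint Position Error (mm): mean Euclidean distance per joint."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def pa_mpjpe(pred, gt):
    """Procrustes-aligned MPJPE: rigidly align the prediction (rotation,
    scale, translation) to the ground truth before measuring the error."""
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    p, g = pred - mu_p, gt - mu_g
    u, s, vt = np.linalg.svd(p.T @ g)
    r = vt.T @ u.T
    if np.linalg.det(r) < 0:        # avoid reflections
        vt[-1] *= -1
        s[-1] *= -1
        r = vt.T @ u.T
    scale = s.sum() / (p ** 2).sum()
    return mpjpe(scale * p @ r.T + mu_g, gt)

joints_gt = np.random.rand(17, 3) * 1000                 # 17 joints, millimetres
joints_pred = joints_gt + np.random.normal(0, 20, joints_gt.shape)
print(mpjpe(joints_pred, joints_gt), pa_mpjpe(joints_pred, joints_gt))
```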

20 pages, 8973 KB  
Article
UE-SLAM: Monocular Neural Radiance Field SLAM with Semantic Mapping Capabilities
by Yuquan Zhang, Guangan Jiang, Mingrui Li and Guosheng Feng
Symmetry 2025, 17(4), 508; https://doi.org/10.3390/sym17040508 - 27 Mar 2025
Viewed by 1795
Abstract
Neural Radiance Fields (NeRF) have transformed 3D reconstruction by enabling high-fidelity scene generation from sparse views. However, existing neural SLAM systems face challenges such as limited scene understanding and heavy reliance on depth sensors. We propose UE-SLAM, a real-time monocular SLAM system integrating semantic segmentation, depth fusion, and robust tracking modules. By leveraging the inherent symmetry between semantic segmentation and depth estimation, UE-SLAM utilizes DINOv2 for instance segmentation and combines monocular depth estimation, radiance field-rendered depth, and an uncertainty framework to produce refined proxy depth. This approach enables high-quality semantic mapping and eliminates the need for depth sensors. Experiments on benchmark datasets demonstrate that UE-SLAM achieves robust semantic segmentation, detailed scene reconstruction, and accurate tracking, significantly outperforming existing monocular SLAM methods. The modular and symmetrical architecture of UE-SLAM ensures a balance between computational efficiency and reconstruction quality, aligning with the thematic focus of symmetry in engineering and computational systems. Full article
(This article belongs to the Section Engineering and Materials)
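One plausible reading of the uncertainty framework is a per-pixel inverse-variance fusion of the monocular and rendered depths; the sketch below shows that generic scheme, not UE-SLAM's exact formulation, with made-up depths and variances.

```python
import torch

def fuse_proxy_depth(d_mono, var_mono, d_render, var_render):
    """Inverse-variance fusion of a monocular depth prediction and a
    radiance-field-rendered depth into a single proxy depth."""
    w_mono = 1.0 / (var_mono + 1e-8)
    w_render = 1.0 / (var_render + 1e-8)
    fused = (w_mono * d_mono + w_render * d_render) / (w_mono + w_render)
    fused_var = 1.0 / (w_mono + w_render)
    return fused, fused_var

# Hypothetical per-pixel depths (m) and variances for a 480x640 frame
d_mono = torch.full((480, 640), 2.0)
d_render = torch.full((480, 640), 2.3)
fused, var = fuse_proxy_depth(d_mono, torch.full_like(d_mono, 0.04),
                              d_render, torch.full_like(d_render, 0.01))
print(fused.mean().item())   # closer to the lower-variance rendered depth
```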

20 pages, 6052 KB  
Article
Representation Learning for Vision-Based Autonomous Driving via Probabilistic World Modeling
by Haoqiang Chen, Yadong Liu and Dewen Hu
Machines 2025, 13(3), 231; https://doi.org/10.3390/machines13030231 - 12 Mar 2025
Viewed by 1544
Abstract
Representation learning plays a vital role in autonomous driving by extracting meaningful features from raw sensory inputs. World models emerge as an effective approach to representation learning by capturing predictive features that can anticipate multiple possible futures, which is particularly suited for driving scenarios. However, existing world model approaches face two critical limitations: First, conventional methods rely heavily on computationally expensive variational inference that requires decoding back to high-dimensional observation space. Second, current end-to-end autonomous driving systems demand extensive labeled data for training, resulting in prohibitive annotation costs. To address these challenges, we present BYOL-Drive, a novel method that firstly introduces the self-supervised representation-learning paradigm BYOL (Bootstrap Your Own Latent) to implement world modeling. Our method eliminates the computational burden of observation space decoding while requiring substantially fewer labeled data compared to mainstream approaches. Additionally, our model only relies on monocular camera images as input, making it easy to deploy and generalize. Based on this learned representation, experiments on the standard closed-loop CARLA benchmark demonstrate that our BYOL-Drive achieves competitive performance with improved computational efficiency and significantly reduced annotation requirements compared to the state-of-the-art methods. Our work contributes to the development of end-to-end autonomous driving. Full article
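For context, BYOL's core ingredients are a negative-cosine regression loss between the online prediction and the target projection plus an exponential-moving-average target update; a minimal PyTorch sketch follows, with hypothetical embedding shapes.

```python
import torch
import torch.nn.functional as F

def byol_loss(online_pred, target_proj):
    """BYOL regression loss: negative cosine similarity (equivalently, MSE
    between L2-normalised vectors) between the online network's prediction
    and the EMA target network's projection."""
    p = F.normalize(online_pred, dim=-1)
    z = F.normalize(target_proj, dim=-1)
    return (2 - 2 * (p * z).sum(dim=-1)).mean()

@torch.no_grad()
def ema_update(target_net, online_net, tau=0.996):
    """Target parameters track the online parameters as an exponential
    moving average; the target receives no gradients."""
    for t, o in zip(target_net.parameters(), online_net.parameters()):
        t.mul_(tau).add_((1 - tau) * o)

# Shapes are hypothetical: a batch of 32 embeddings of dimension 256
print(byol_loss(torch.randn(32, 256), torch.randn(32, 256)))
```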

13 pages, 2200 KB  
Article
Deep Neural Networks for Accurate Depth Estimation with Latent Space Features
by Siddiqui Muhammad Yasir and Hyunsik Ahn
Biomimetics 2024, 9(12), 747; https://doi.org/10.3390/biomimetics9120747 - 9 Dec 2024
Cited by 1 | Viewed by 2905
Abstract
Depth estimation plays a pivotal role in advancing human–robot interactions, especially in indoor environments where accurate 3D scene reconstruction is essential for tasks like navigation and object handling. Monocular depth estimation, which relies on a single RGB camera, offers a more affordable solution compared to traditional methods that use stereo cameras or LiDAR. However, despite recent progress, many monocular approaches struggle with accurately defining depth boundaries, leading to less precise reconstructions. In response to these challenges, this study introduces a novel depth estimation framework that leverages latent space features within a deep convolutional neural network to enhance the precision of monocular depth maps. The proposed model features dual encoder–decoder architecture, enabling both color-to-depth and depth-to-depth transformations. This structure allows for refined depth estimation through latent space encoding. To further improve the accuracy of depth boundaries and local features, a new loss function is introduced. This function combines latent loss with gradient loss, helping the model maintain the integrity of depth boundaries. The framework is thoroughly tested using the NYU Depth V2 dataset, where it sets a new benchmark, particularly excelling in complex indoor scenarios. The results clearly show that this approach effectively reduces depth ambiguities and blurring, making it a promising solution for applications in human–robot interaction and 3D scene reconstruction. Full article
(This article belongs to the Special Issue Biologically Inspired Vision and Image Processing 2024)
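A hedged sketch of what a combined latent-plus-gradient objective can look like: an L1 reconstruction term, an MSE between the color-to-depth and depth-to-depth latent codes, and an image-gradient term for sharper boundaries. The weighting and the latent pairing are assumptions, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def depth_gradient_loss(pred, gt):
    """Penalise differences of horizontal/vertical depth gradients so that
    depth boundaries stay sharp (pred, gt: B x 1 x H x W)."""
    dx_p, dx_g = pred[..., :, 1:] - pred[..., :, :-1], gt[..., :, 1:] - gt[..., :, :-1]
    dy_p, dy_g = pred[..., 1:, :] - pred[..., :-1, :], gt[..., 1:, :] - gt[..., :-1, :]
    return (dx_p - dx_g).abs().mean() + (dy_p - dy_g).abs().mean()

def combined_loss(pred, gt, z_rgb, z_depth, w_latent=0.1, w_grad=1.0):
    """Reconstruction + latent-consistency + gradient terms; the weights are
    illustrative assumptions."""
    recon = F.l1_loss(pred, gt)
    latent = F.mse_loss(z_rgb, z_depth)        # color-to-depth vs depth-to-depth codes
    grad = depth_gradient_loss(pred, gt)
    return recon + w_latent * latent + w_grad * grad

pred, gt = torch.rand(2, 1, 240, 320), torch.rand(2, 1, 240, 320)
print(combined_loss(pred, gt, torch.randn(2, 512), torch.randn(2, 512)))
```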

7 pages, 1849 KB  
Proceeding Paper
Inverse Perspective Mapping Correction for Aiding Camera-Based Autonomous Driving Tasks
by Norbert Markó, Péter Kőrös and Miklós Unger
Eng. Proc. 2024, 79(1), 67; https://doi.org/10.3390/engproc2024079067 - 7 Nov 2024
Cited by 2 | Viewed by 2518
Abstract
Inverse perspective mapping (IPM) is a crucial technique in camera-based autonomous driving, transforming the perspective view captured by the camera into a bird’s-eye view. This can be beneficial for accurate environmental perception, path planning, obstacle detection, and navigation. IPM faces challenges such as distortion and inaccuracies due to varying road inclinations and intrinsic camera properties. Herein, we revealed inaccuracies inherent in our current IPM approach so proper correction techniques can be applied later. We aimed to explore correction possibilities to enhance the accuracy of IPM and examine other methods that could be used as a benchmark or even a replacement, such as stereo vision and deep learning-based monocular depth estimation methods. With this work, we aimed to provide an analysis and direction for working with IPM. Full article
(This article belongs to the Proceedings of The Sustainable Mobility and Transportation Symposium 2024)
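For reference, the flat-road homography underlying IPM can be written down directly from the intrinsics, camera height, and pitch; the sketch below uses hypothetical values, and a non-flat road breaking this assumption is precisely the inaccuracy the paper examines.

```python
import numpy as np

def ipm_homography(K, pitch_deg, cam_height):
    """Road-plane-to-image homography for a camera 'cam_height' metres above
    a flat road, pitched down by 'pitch_deg'."""
    th = np.radians(pitch_deg)
    # Road frame: X right, Y down, Z forward along the road; road plane is Y = 0.
    # World-to-camera rotation for a pitch-down of 'th' about the X axis:
    R = np.array([[1, 0, 0],
                  [0, np.cos(th), -np.sin(th)],
                  [0, np.sin(th),  np.cos(th)]])
    C = np.array([[0.0], [-cam_height], [0.0]])   # camera centre (above the road => negative Y)
    t = -R @ C
    # A road point (X, 0, Z) maps to the image as p ~ K [r1 r3 t] [X, Z, 1]^T
    return K @ np.hstack([R[:, [0]], R[:, [2]], t])

K = np.array([[800.0, 0, 640], [0, 800.0, 360], [0, 0, 1]])
H = ipm_homography(K, pitch_deg=10.0, cam_height=1.5)
# Bird's-eye warp, e.g. cv2.warpPerspective(frame, S @ np.linalg.inv(H), (bev_w, bev_h)),
# with S a scale/offset matrix from road-plane metres to output pixels.
print(H / H[2, 2])
```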

20 pages, 1373 KB  
Article
A Sparsity-Invariant Model via Unifying Depth Prediction and Completion
by Shuling Wang, Fengze Jiang and Xiaojin Gong
Algorithms 2024, 17(7), 298; https://doi.org/10.3390/a17070298 - 6 Jul 2024
Cited by 1 | Viewed by 1367
Abstract
The development of a sparse-invariant depth completion model capable of handling varying levels of input depth sparsity is highly desirable in real-world applications. However, existing sparse-invariant models tend to degrade when the input depth points are extremely sparse. In this paper, we propose a new model that combines the advantageous designs of depth completion and monocular depth estimation tasks to achieve sparse invariance. Specifically, we construct a dual-branch architecture with one branch dedicated to depth prediction and the other to depth completion. Additionally, we integrate the multi-scale local planar module in the decoders of both branches. Experimental results on the NYU Depth V2 benchmark and the OPPO prototype dataset equipped with the Spot-iToF316 sensor demonstrate that our model achieves reliable results even in cases with irregularly distributed, limited or absent depth information. Full article
(This article belongs to the Special Issue Machine Learning Algorithms for Image Understanding and Analysis)
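As background on the sparsity-invariant property, the classic sparsity-invariant (mask-normalised) convolution of Uhrig et al. is sketched below; it is not the dual-branch model proposed here, and the sparsity level in the example is arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseConv(nn.Module):
    """Sparsity-invariant convolution: convolve only valid depth pixels and
    renormalise by the number of valid pixels in each window, so the response
    does not depend on how sparse the input is."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, k, padding=k // 2, bias=False)
        self.pool = nn.MaxPool2d(k, stride=1, padding=k // 2)
        self.k = k

    def forward(self, x, mask):
        x = self.conv(x * mask)                                   # ignore invalid pixels
        valid = F.avg_pool2d(mask, self.k, stride=1, padding=self.k // 2) * self.k ** 2
        x = x / valid.clamp(min=1.0)                              # renormalise per window
        return x, self.pool(mask)                                 # propagate the validity mask

sparse_depth = torch.rand(1, 1, 64, 64)
mask = (torch.rand(1, 1, 64, 64) > 0.95).float()                  # ~5% of pixels carry depth
out, new_mask = SparseConv(1, 16)(sparse_depth, mask)
print(out.shape, new_mask.mean().item())
```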

18 pages, 4124 KB  
Article
MoNA Bench: A Benchmark for Monocular Depth Estimation in Navigation of Autonomous Unmanned Aircraft System
by Yongzhou Pan, Binhong Liu, Zhen Liu, Hao Shen, Jianyu Xu, Wenxing Fu and Tao Yang
Drones 2024, 8(2), 66; https://doi.org/10.3390/drones8020066 - 16 Feb 2024
Cited by 3 | Viewed by 3921
Abstract
Efficient trajectory and path planning (TPP) is essential for the autonomy of unmanned aircraft systems (UASs) in challenging environments. Despite the scale ambiguity inherent in monocular vision, characteristics like compact size make a monocular camera ideal for micro-aerial vehicle (MAV)-based UASs. This work introduces a real-time MAV system using monocular depth estimation (MDE) with a novel scale recovery module for autonomous navigation. We present MoNA Bench, a benchmark for Monocular depth estimation in Navigation of the Autonomous unmanned Aircraft system (MoNA), emphasizing its obstacle avoidance and safe target tracking capabilities. We highlight key attributes—estimation efficiency, depth map accuracy, and scale consistency—for efficient TPP through MDE. Full article
(This article belongs to the Special Issue Efficient UAS Trajectory and Path Planning)
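Scale recovery for monocular depth is commonly done by median alignment against sparse metric references; the sketch below shows that textbook approach and one simple scale-consistency score, with synthetic frames. It is not necessarily the module proposed in the paper.

```python
import numpy as np

def recover_scale(pred_depth, ref_depth, mask=None):
    """Per-frame median scale alignment: put an up-to-scale monocular depth
    map into metric units given metric references (e.g. from VIO or a
    rangefinder)."""
    if mask is None:
        mask = ref_depth > 0
    return np.median(ref_depth[mask]) / np.median(pred_depth[mask])

def scale_consistency(scales):
    """One simple scale-consistency score: relative spread of the per-frame
    scale factors across a sequence (lower is more consistent)."""
    scales = np.asarray(scales)
    return scales.std() / scales.mean()

frames = [np.random.uniform(0.5, 1.0, (240, 320)) for _ in range(30)]    # relative depth
metric = [f * 12.0 + np.random.normal(0, 0.1, f.shape) for f in frames]  # hypothetical reference
per_frame = [recover_scale(p, r) for p, r in zip(frames, metric)]
print(np.mean(per_frame), scale_consistency(per_frame))
```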

26 pages, 12281 KB  
Article
MonoGhost: Lightweight Monocular GhostNet 3D Object Properties Estimation for Autonomous Driving
by Ahmed El-Dawy, Amr El-Zawawi and Mohamed El-Habrouk
Robotics 2023, 12(6), 155; https://doi.org/10.3390/robotics12060155 - 17 Nov 2023
Cited by 2 | Viewed by 3222
Abstract
Effective environmental perception is critical for autonomous driving; thus, the perception system requires collecting 3D information of the surrounding objects, such as their dimensions, locations, and orientation in space. Recently, deep learning has been widely used in perception systems that convert image features from a camera into semantic information. This paper presents the MonoGhost network, a lightweight Monocular GhostNet deep learning technique for full 3D object properties estimation from a single frame monocular image. Unlike other techniques, the proposed MonoGhost network first estimates relatively reliable 3D object properties depending on efficient feature extractor. The proposed MonoGhost network estimates the orientation of the 3D object as well as the 3D dimensions of that object, resulting in reasonably small errors in the dimensions estimations versus other networks. These estimations, combined with the translation projection constraints imposed by the 2D detection coordinates, allow for the prediction of a robust and dependable Bird’s Eye View bounding box. The experimental outcomes prove that the proposed MonoGhost network performs better than other state-of-the-art networks in the Bird’s Eye View of the KITTI dataset benchmark by scoring 16.73% on the moderate class and 15.01% on the hard class while preserving real-time requirements. Full article
(This article belongs to the Special Issue Autonomous Navigation of Mobile Robots in Unstructured Environments)
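The Bird's Eye View box the abstract refers to is plain box geometry once centre, dimensions, and heading are known; a short sketch in the KITTI camera convention follows, including the standard observation-angle-to-global-yaw conversion. The numbers are illustrative, and this is generic evaluation geometry rather than the MonoGhost network itself.

```python
import numpy as np

def bev_box_corners(x, z, w, l, yaw):
    """Four bird's-eye-view corners of a 3D box from its ground-plane centre
    (x, z), width/length (w, l), and heading angle, in the KITTI camera
    convention (x right, z forward, yaw about the vertical axis)."""
    c, s = np.cos(yaw), np.sin(yaw)
    corners = np.array([[ l / 2,  w / 2],
                        [ l / 2, -w / 2],
                        [-l / 2, -w / 2],
                        [-l / 2,  w / 2]])
    R = np.array([[c, s], [-s, c]])
    return corners @ R.T + np.array([x, z])

def alpha_to_yaw(alpha, x, z):
    """Convert a network's observation angle (alpha) to the global heading
    by adding the viewing-ray angle of the object centre."""
    return alpha + np.arctan2(x, z)

print(bev_box_corners(x=2.0, z=15.0, w=1.8, l=4.2, yaw=alpha_to_yaw(0.3, 2.0, 15.0)))
```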
