Article

Depth Prior-Guided 3D Voxel Feature Fusion for 3D Semantic Estimation from Monocular Videos

Mingyun Wen and Kyungeun Cho
1 Department of Multimedia Engineering, Dongguk University-Seoul, 30 Pildong-ro 1-gil, Jung-gu, Seoul 04620, Republic of Korea
2 Division of AI Software Convergence, Dongguk University-Seoul, Seoul 04620, Republic of Korea
* Author to whom correspondence should be addressed.
Mathematics 2024, 12(13), 2114; https://doi.org/10.3390/math12132114
Submission received: 17 June 2024 / Revised: 1 July 2024 / Accepted: 4 July 2024 / Published: 5 July 2024
(This article belongs to the Special Issue New Advances and Applications in Image Processing and Computer Vision)

Abstract

Existing 3D semantic scene reconstruction methods use the same set of features, extracted by deep learning networks, for both 3D semantic estimation and geometry reconstruction, ignoring the differing requirements of the semantic segmentation and geometry construction tasks. Additionally, current methods allocate 2D image features to all voxels along camera rays during back-projection, without accounting for empty or occluded voxels. To address these issues, we propose separating the features used for 3D semantic estimation from those used for 3D mesh reconstruction. We use a pretrained vision transformer network for image feature extraction, and depth priors estimated by a pretrained multi-view stereo network guide the allocation of image features to 3D voxels during back-projection. The back-projected image features are aggregated within each 3D voxel by averaging, creating coherent voxel features. The resulting 3D feature volume, composed of unified voxel feature vectors, is fed into a 3D CNN with a semantic classification head to produce a 3D semantic volume. This volume can be combined with existing 3D mesh reconstruction networks to produce a 3D semantic mesh. Experimental results on real-world datasets demonstrate that the proposed method significantly improves 3D semantic estimation accuracy.
Keywords: 3D semantic scene reconstruction; depth priors; vision transformer; multi-view stereo network; voxel feature fusion
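
The core idea summarized in the abstract, back-projecting 2D image features into a voxel grid only where the depth prior indicates a surface and then averaging per voxel across views, can be sketched as follows. This is a minimal illustrative sketch rather than the authors' implementation; the function name, the tolerance-based gating rule, and parameters such as depth_tol are assumptions.

```python
import numpy as np

def back_project_with_depth_prior(feats, depth_priors, K, cam_to_world,
                                  voxel_centers, depth_tol=0.12):
    """Assign 2D features only to voxels near the depth prior, then average.

    feats:         (V, C, H, W) per-view 2D feature maps
    depth_priors:  (V, H, W)    per-view depth priors (e.g., from an MVS network)
    K:             (3, 3)       shared camera intrinsics
    cam_to_world:  (V, 4, 4)    per-view camera-to-world poses
    voxel_centers: (N, 3)       voxel centers in world coordinates
    Returns (N, C) averaged voxel features and an (N,) count of contributing views.
    """
    V, C, H, W = feats.shape
    N = voxel_centers.shape[0]
    accum = np.zeros((N, C), dtype=np.float32)
    count = np.zeros(N, dtype=np.float32)
    homog = np.concatenate([voxel_centers, np.ones((N, 1))], axis=1)   # (N, 4)

    for v in range(V):
        # Transform voxel centers into the camera frame and project with K.
        cam = (np.linalg.inv(cam_to_world[v]) @ homog.T).T[:, :3]      # (N, 3)
        z = cam[:, 2]
        uvw = (K @ cam.T).T                                            # (N, 3)
        px = np.round(uvw[:, 0] / np.clip(z, 1e-6, None)).astype(int)  # column
        py = np.round(uvw[:, 1] / np.clip(z, 1e-6, None)).astype(int)  # row

        in_view = (z > 0) & (px >= 0) & (px < W) & (py >= 0) & (py < H)
        # Depth prior gate: only voxels whose depth along the ray lies close to the
        # predicted surface receive features, so empty/occluded voxels are skipped.
        near_surface = np.zeros(N, dtype=bool)
        near_surface[in_view] = np.abs(
            z[in_view] - depth_priors[v, py[in_view], px[in_view]]) < depth_tol

        sel = in_view & near_surface
        accum[sel] += feats[v, :, py[sel], px[sel]]                    # (M, C)
        count[sel] += 1

    # Average across the views that contributed to each voxel.
    voxel_feats = accum / np.clip(count[:, None], 1.0, None)
    return voxel_feats, count
```

In the described pipeline, the resulting (N, C) voxel feature volume would then be reshaped to the 3D grid and passed to a 3D CNN with a semantic classification head to produce the 3D semantic volume.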

