Topic Editors

Dr. Wei Zhou
School of Computer Science and Informatics, Cardiff University, Cathays, Cardiff CF24 4AG, UK
Dr. Guanghui Yue
School of Biomedical Engineering, Shenzhen University Health Science Center, Shenzhen 518037, China
Dr. Wenhan Yang
Peng Cheng Laboratory, Shenzhen 518066, China

Visual Computing and Understanding: New Developments and Trends

Abstract submission deadline: 31 December 2025
Manuscript submission deadline: 31 March 2026
Viewed by 9991

Topic Information

Dear Colleagues,

As humans, we handle massive amounts of visual information in our daily lives. As a result, there is growing interest in advancing artificial intelligence-based perception and analysis algorithms in the fields of computer vision and image processing.

Despite significant successes in visual computing and understanding in recent years, the developments and trends underlying these achievements are still in their infancy, especially for many complex real-world applications.

The aim of this topic is to advance the field by collecting research on both the theoretical and applied issues related to advances in visual computing and understanding. All interested authors are invited to submit their innovative manuscripts on (but not limited to) the following topics:

  • Image/video acquisition, fusion, and generation;
  • Image/video coding, restoration, and quality assessment;
  • Image/video classification, segmentation, and detection;
  • Deep learning-based methods for image processing and analysis;
  • Deep learning-based methods for video processing and analysis;
  • Deep learning-based computer vision methods for 3D models;
  • Intelligent vision methods for autonomous driving systems;
  • Robotic vision and its applications;
  • Biomedical vision analysis and applications;
  • Advances in visual computing theories.

Dr. Wei Zhou
Dr. Guanghui Yue
Dr. Wenhan Yang
Topic Editors

Keywords

  • image processing and video processing
  • visual computing and deep learning
  • computer vision and robotic vision
  • autonomous driving
  • biomedical vision
  • image acquisition and image fusion
  • generative models
  • video coding and image restoration
  • quality assessment
  • visual understanding
  • feature extraction and object detection
  • image classification
  • semantic segmentation
  • saliency detection
  • perception modelling

Participating Journals

Journal Name | Impact Factor | CiteScore | Launched Year | First Decision (median) | APC
Applied Sciences (applsci) | 2.5 | 5.3 | 2011 | 18.4 days | CHF 2400
Computers (computers) | 2.6 | 5.4 | 2012 | 15.5 days | CHF 1800
Electronics (electronics) | 2.6 | 5.3 | 2012 | 16.4 days | CHF 2400
Information (information) | 2.4 | 6.9 | 2010 | 16.4 days | CHF 1600
Journal of Imaging (jimaging) | 2.7 | 5.9 | 2015 | 18.3 days | CHF 1800

Preprints.org is a multidisciplinary platform offering a preprint service designed to facilitate the early sharing of your research. It supports and empowers your research journey from the very beginning.

MDPI Topics is collaborating with Preprints.org and has established a direct connection between MDPI journals and the platform. Authors are encouraged to take advantage of this opportunity by posting their preprints at Preprints.org prior to publication:

  1. Share your research immediately: Disseminate your ideas prior to publication and establish priority for your work.
  2. Safeguard your intellectual contribution: Protect your ideas with a time-stamped preprint that serves as proof of your research timeline.
  3. Boost visibility and impact: Increase the reach and influence of your research by making it accessible to a global audience.
  4. Gain early feedback: Receive valuable input and insights from peers before submitting to a journal.
  5. Ensure broad indexing: Preprints are indexed in Web of Science (Preprint Citation Index), Google Scholar, Crossref, SHARE, PrePubMed, Scilit, and Europe PMC.

Published Papers (13 papers)

33 pages, 36906 KiB  
Article
Making Images Speak: Human-Inspired Image Description Generation
by Chifaa Sebbane, Ikram Belhajem and Mohammed Rziza
Information 2025, 16(5), 356; https://doi.org/10.3390/info16050356 - 28 Apr 2025
Abstract
Despite significant advances in deep learning-based image captioning, many state-of-the-art models still struggle to balance visual grounding (i.e., accurate object and scene descriptions) with linguistic coherence (i.e., grammatical fluency and appropriate use of non-visual tokens such as articles and prepositions). To address these limitations, we propose a hybrid image captioning framework that integrates handcrafted and deep visual features. Specifically, we combine local descriptors—Scale-Invariant Feature Transform (SIFT) and Bag of Features (BoF)—with high-level semantic features extracted using ResNet50. This dual representation captures both fine-grained spatial details and contextual semantics. The decoder employs Bahdanau attention refined with an Attention-on-Attention (AoA) mechanism to optimize visual-textual alignment, while GloVe embeddings and a GRU-based sequence model ensure fluent language generation. The proposed system is trained on 200,000 image-caption pairs from the MS COCO train2014 dataset and evaluated on 50,000 held-out MS COCO pairs plus the Flickr8K benchmark. Our model achieves a CIDEr score of 128.3 and a SPICE score of 29.24, reflecting clear improvements over baselines in both semantic precision—particularly for spatial relationships—and grammatical fluency. These results validate that combining classical computer vision techniques with modern attention mechanisms yields more interpretable and linguistically precise captions, addressing key limitations in neural caption generation. Full article
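The decoder design described above lends itself to a compact illustration. The following sketch (not the authors' code) shows a Bahdanau-attention decoder step refined with an Attention-on-Attention gate over precomputed regional features; the feature dimensions, the GRU cell configuration, and the way SIFT/BoF and ResNet50 features would be combined into a single feature tensor are assumptions made for illustration.

```python
# A minimal sketch, assuming precomputed regional features of dimension 2048.
import torch
import torch.nn as nn

class BahdanauAoAAttention(nn.Module):
    def __init__(self, feat_dim, hid_dim, att_dim):
        super().__init__()
        self.w_feat = nn.Linear(feat_dim, att_dim)
        self.w_hid = nn.Linear(hid_dim, att_dim)
        self.v = nn.Linear(att_dim, 1)
        # AoA gate: refine the attended vector using the decoder state
        self.info = nn.Linear(feat_dim + hid_dim, feat_dim)
        self.gate = nn.Linear(feat_dim + hid_dim, feat_dim)

    def forward(self, feats, hidden):
        # feats: (B, N, feat_dim) regional features; hidden: (B, hid_dim)
        scores = self.v(torch.tanh(self.w_feat(feats) + self.w_hid(hidden).unsqueeze(1)))
        alpha = torch.softmax(scores, dim=1)            # (B, N, 1)
        context = (alpha * feats).sum(dim=1)            # (B, feat_dim)
        cat = torch.cat([context, hidden], dim=-1)
        return torch.sigmoid(self.gate(cat)) * self.info(cat)   # AoA-refined context

class CaptionDecoder(nn.Module):
    def __init__(self, vocab, emb_dim=300, feat_dim=2048, hid_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb_dim)       # GloVe vectors would be loaded here
        self.att = BahdanauAoAAttention(feat_dim, hid_dim, 256)
        self.gru = nn.GRUCell(emb_dim + feat_dim, hid_dim)
        self.out = nn.Linear(hid_dim, vocab)

    def step(self, word_ids, feats, hidden):
        ctx = self.att(feats, hidden)
        hidden = self.gru(torch.cat([self.embed(word_ids), ctx], dim=-1), hidden)
        return self.out(hidden), hidden

# toy usage: 4 images, 36 regional features (e.g., fused ResNet50 + SIFT/BoF projections)
dec = CaptionDecoder(vocab=10000)
feats = torch.randn(4, 36, 2048)
logits, h = dec.step(torch.zeros(4, dtype=torch.long), feats, torch.zeros(4, 512))
print(logits.shape)  # (4, 10000)
```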
17 pages, 39878 KiB  
Article
Real-Time Volume-Rendering Image Denoising Based on Spatiotemporal Weighted Kernel Prediction
by Xinran Xu, Chunxiao Xu and Lingxiao Zhao
J. Imaging 2025, 11(4), 126; https://doi.org/10.3390/jimaging11040126 - 21 Apr 2025
Abstract
Volumetric Path Tracing (VPT) based on Monte Carlo (MC) sampling often requires numerous samples for high-quality images, but real-time applications limit samples to maintain interaction rates, leading to significant noise. Traditional real-time denoising methods use radiance and geometric features as neural network inputs, but lightweight networks struggle with temporal stability and complex mapping relationships, causing blurry results. To address these issues, a spatiotemporal lightweight neural network is proposed to enhance the denoising performance of VPT-rendered images with low samples per pixel. First, the reprojection technique was employed to obtain features from historical frames. Next, a dual-input convolutional neural network architecture was designed to predict filtering kernels. Radiance and geometric features were encoded independently. The encoding of geometric features guided the pixel-wise fitting of radiance feature filters. Finally, learned weight filtering kernels were applied to images’ spatiotemporal filtering to produce denoised results. The experimental results across multiple denoising datasets demonstrate that this approach outperformed the baseline models in terms of feature extraction and detail representation capabilities while effectively suppressing noise with superior performance and enhanced temporal stability. Full article
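As a rough illustration of the kernel-prediction idea, the sketch below encodes radiance and geometric features in separate branches, predicts a per-pixel filtering kernel, and applies it to the noisy radiance; the layer sizes, feature channels, and the omission of the temporal reprojection step are simplifying assumptions rather than the paper's architecture.

```python
# A minimal sketch of dual-input per-pixel kernel prediction and filtering.
import torch
import torch.nn as nn
import torch.nn.functional as F

class KernelPredictor(nn.Module):
    def __init__(self, rad_ch=3, geo_ch=7, k=5):
        super().__init__()
        self.k = k
        self.rad_enc = nn.Sequential(nn.Conv2d(rad_ch, 32, 3, padding=1), nn.ReLU(),
                                     nn.Conv2d(32, 32, 3, padding=1), nn.ReLU())
        self.geo_enc = nn.Sequential(nn.Conv2d(geo_ch, 32, 3, padding=1), nn.ReLU(),
                                     nn.Conv2d(32, 32, 3, padding=1), nn.ReLU())
        # geometry features guide the per-pixel kernel weights for the radiance filter
        self.head = nn.Conv2d(64, k * k, 3, padding=1)

    def forward(self, radiance, geometry):
        feat = torch.cat([self.rad_enc(radiance), self.geo_enc(geometry)], dim=1)
        weights = torch.softmax(self.head(feat), dim=1)        # (B, k*k, H, W)
        return apply_kernels(radiance, weights, self.k)

def apply_kernels(img, weights, k):
    B, C, H, W = img.shape
    patches = F.unfold(img, k, padding=k // 2)                 # (B, C*k*k, H*W)
    patches = patches.view(B, C, k * k, H, W)
    return (patches * weights.unsqueeze(1)).sum(dim=2)         # weighted neighborhood average

net = KernelPredictor()
denoised = net(torch.rand(1, 3, 64, 64), torch.rand(1, 7, 64, 64))
print(denoised.shape)  # (1, 3, 64, 64)
```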
Show Figures

Figure 1

16 pages, 16551 KiB  
Article
Camera Pose Generation Based on Unity3D
by Hao Luo, Wenjie Luo and Wenzhu Yang
Information 2025, 16(4), 315; https://doi.org/10.3390/info16040315 - 16 Apr 2025
Viewed by 142
Abstract
Deep learning models performing complex tasks require the support of datasets. With the advancement of virtual reality technology, the use of virtual datasets in deep learning models is becoming more and more widespread. Indoor scenes represent a significant area of interest for the application of machine vision technologies. Existing virtual indoor datasets exhibit deficiencies with regard to camera poses, resulting in problems such as occlusion, object omission, and objects occupying too small a proportion of the image, and they perform poorly when used for training in object detection and simultaneous localization and mapping (SLAM) tasks. Aiming at the problems regarding the capacity of cameras to comprehensively capture scene objects, this study presents an enhanced algorithm based on rapidly exploring random tree star (RRT*) for the generation of camera poses in a 3D indoor scene. Meanwhile, in order to generate multimodal data for various deep learning tasks, this study designs an automatic image acquisition module under the Unity3D platform. The experimental results from running the model on several mainstream virtual indoor datasets, such as 3D-FRONT and Hypersim, indicate that the image sequences generated in this study show enhancements in terms of object capture rate and efficiency. Even in cluttered environments such as those in SceneNet RGB-D, the object capture rate remains stable at around 75%. Compared with the image sequences from the original datasets, those generated in this study achieve improvements in the object detection and SLAM tasks, with increases of up to approximately 30% in mAP for the YOLOv10 object detection task and up to approximately 10% in SR for the ORB-SLAM algorithm. Full article
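For readers unfamiliar with RRT*, the following 2D sketch shows the basic sampling-and-rewiring loop in its simplest form; it is not the paper's Unity3D implementation, it omits edge collision checks and camera orientation selection, and the bounds and obstacle are toy assumptions.

```python
# A greatly simplified 2D RRT* sketch for sampling positions in free space.
import math, random

def rrt_star(start, goal, is_free, bounds, iters=2000, step=0.5, radius=1.0):
    nodes, parent, cost = [start], {0: None}, {0: 0.0}
    dist = lambda a, b: math.hypot(a[0] - b[0], a[1] - b[1])
    for _ in range(iters):
        sample = (random.uniform(*bounds[0]), random.uniform(*bounds[1]))
        near = min(range(len(nodes)), key=lambda i: dist(nodes[i], sample))
        d = dist(nodes[near], sample)
        new = nodes[near] if d < step else (
            nodes[near][0] + step * (sample[0] - nodes[near][0]) / d,
            nodes[near][1] + step * (sample[1] - nodes[near][1]) / d)
        if not is_free(new):            # NOTE: edge collision checks are omitted here
            continue
        idx = len(nodes)
        nodes.append(new)
        neighbors = [i for i in range(idx) if dist(nodes[i], new) <= radius]
        # choose the lowest-cost parent within the rewiring radius (RRT* refinement)
        best = min(neighbors or [near], key=lambda i: cost[i] + dist(nodes[i], new))
        parent[idx], cost[idx] = best, cost[best] + dist(nodes[best], new)
        # rewire neighbors through the new node when it shortens their path
        for i in neighbors:
            c = cost[idx] + dist(new, nodes[i])
            if c < cost[i]:
                parent[i], cost[i] = idx, c
    goal_idx = min(range(len(nodes)), key=lambda i: dist(nodes[i], goal))
    path, i = [], goal_idx
    while i is not None:
        path.append(nodes[i]); i = parent[i]
    return path[::-1]

# toy room: everything free except a square obstacle in the middle
free = lambda p: not (2.0 < p[0] < 3.0 and 2.0 < p[1] < 3.0)
print(rrt_star((0.5, 0.5), (4.5, 4.5), free, ((0, 5), (0, 5)))[:3])
```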
21 pages, 1718 KiB  
Article
Gaze Estimation Based on a Multi-Stream Adaptive Feature Fusion Network
by Changli Li, Enrui Tong, Kao Zhang, Nenglun Cheng, Zhongyuan Lai and Zhigeng Pan
Appl. Sci. 2025, 15(7), 3684; https://doi.org/10.3390/app15073684 - 27 Mar 2025
Viewed by 253
Abstract
Recently, with the widespread application of deep learning networks, appearance-based gaze estimation has made breakthrough progress. However, most methods focus on feature extraction from the facial region while neglecting the critical role of the eye region in gaze estimation, leading to insufficient eye detail representation. To address this issue, this paper proposes a multi-stream multi-input network architecture (MSMI-Net) based on appearance. The model consists of two independent streams designed to extract high-dimensional eye features and low-dimensional features, integrating both eye and facial information. A parallel channel and spatial attention mechanism is employed to fuse low-dimensional eye and facial features, while an adaptive weight adjustment mechanism (AWAM) dynamically determines the contribution ratio of eye and facial features. The concatenated high-dimensional and fused low-dimensional features are processed through fully connected layers to predict the final gaze direction. Extensive experiments on the EYEDIAP, MPIIFaceGaze, and Gaze360 datasets validate the superiority of the proposed method. Full article
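The adaptive fusion idea can be illustrated with a short sketch that applies squeeze-excitation channel attention and spatial attention to eye and face feature maps and then learns a two-way contribution ratio; the attention layout and dimensions are assumptions rather than the published MSMI-Net definition.

```python
# A minimal sketch of adaptively weighted eye/face feature fusion.
import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    def __init__(self, ch):
        super().__init__()
        # channel attention (squeeze-excitation style), shared by both streams
        self.se = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Conv2d(ch, ch // 4, 1), nn.ReLU(),
                                nn.Conv2d(ch // 4, ch, 1), nn.Sigmoid())
        # spatial attention computed over the concatenated streams
        self.spatial = nn.Sequential(nn.Conv2d(2 * ch, 1, 7, padding=3), nn.Sigmoid())
        # adaptive weights deciding the eye/face contribution ratio
        self.weigher = nn.Sequential(nn.Linear(2 * ch, 2), nn.Softmax(dim=-1))

    def forward(self, eye, face):
        eye, face = eye * self.se(eye), face * self.se(face)
        sa = self.spatial(torch.cat([eye, face], dim=1))
        eye, face = eye * sa, face * sa
        pooled = torch.cat([eye.mean(dim=(2, 3)), face.mean(dim=(2, 3))], dim=-1)
        w = self.weigher(pooled)                                   # (B, 2)
        return w[:, :1, None, None] * eye + w[:, 1:, None, None] * face

fused = AdaptiveFusion(64)(torch.randn(2, 64, 14, 14), torch.randn(2, 64, 14, 14))
print(fused.shape)  # (2, 64, 14, 14)
```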
18 pages, 9766 KiB  
Article
MSG-YOLO: A Multi-Scale Dynamically Enhanced Network for the Real-Time Detection of Small Impurities in Large-Volume Parenterals
by Ziqi Li, Dongyao Jia, Zihao He and Nengkai Wu
Electronics 2025, 14(6), 1149; https://doi.org/10.3390/electronics14061149 - 14 Mar 2025
Cited by 1 | Viewed by 381
Abstract
The detection of small targets holds significant application value in the identification of small foreign objects within large-volume parenterals. However, existing methods often face challenges such as inadequate feature expression capabilities, the loss of detailed information, and difficulties in suppressing background interference. To tackle the task of the high-speed and high-precision detection of tiny foreign objects in production scenarios involving large infusions, this paper introduces a multi-scale dynamic enhancement network (MSG-YOLO) based on an improved YOLO framework. The primary innovation is the design of a multi-scale dynamic grouped channel enhancement convolution module (MSG-CECM). This module captures multi-scale contextual features through parallel dilated convolutions, enhances the response of critical areas by integrating channel-space joint attention mechanisms, and employs a dynamic grouping strategy for adaptive feature reorganization. In the channel dimension, cross-scale feature fusion and a squeeze-excitation mechanism optimize feature weight distribution; in the spatial dimension, local maximum responses and spatial attention enhance edge details. Furthermore, the module features a lightweight design that reduces computational costs through grouped convolutions. The experiments conducted on our custom large infusion dataset (LVPD) demonstrate that our method improves the mean Average Precision (mAP) by 2.2% compared to the baseline YOLOv9 and increases small target detection accuracy (AP_small) by 3.1% while maintaining a real-time inference speed of 58 FPS. Full article
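A minimal sketch of the multi-scale enhancement idea follows: parallel dilated grouped convolutions, squeeze-excitation channel attention, and a spatial attention map computed from channel statistics. The dilation rates, group count, and channel widths are illustrative assumptions, not the published MSG-CECM definition.

```python
# A minimal sketch of a multi-scale, grouped channel-enhancement block.
import torch
import torch.nn as nn

class MultiScaleEnhance(nn.Module):
    def __init__(self, ch, groups=4):
        super().__init__()
        # parallel dilated grouped convolutions capture multi-scale context cheaply
        self.branches = nn.ModuleList(
            nn.Conv2d(ch, ch, 3, padding=d, dilation=d, groups=groups) for d in (1, 2, 4))
        self.fuse = nn.Conv2d(3 * ch, ch, 1)
        # squeeze-excitation channel attention
        self.se = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Conv2d(ch, ch // 4, 1), nn.ReLU(),
                                nn.Conv2d(ch // 4, ch, 1), nn.Sigmoid())
        # spatial attention built from channel-wise max and mean maps
        self.spatial = nn.Sequential(nn.Conv2d(2, 1, 7, padding=3), nn.Sigmoid())

    def forward(self, x):
        ms = self.fuse(torch.cat([b(x) for b in self.branches], dim=1))
        ms = ms * self.se(ms)
        stats = torch.cat([ms.max(dim=1, keepdim=True).values,
                           ms.mean(dim=1, keepdim=True)], dim=1)
        return x + ms * self.spatial(stats)     # residual enhancement of the input features

out = MultiScaleEnhance(64)(torch.randn(1, 64, 80, 80))
print(out.shape)  # (1, 64, 80, 80)
```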
25 pages, 104688 KiB  
Article
No-Reference Quality Assessment of Infrared Image Colorization with Color–Spatial Features
by Dian Sheng, Weiqi Jin, Xia Wang and Li Li
Electronics 2025, 14(6), 1126; https://doi.org/10.3390/electronics14061126 - 12 Mar 2025
Viewed by 444
Abstract
LDANet represents an innovative no-reference quality assessment model specifically engineered to evaluate colorized infrared images. This is a crucial task for various applications, and existing methods often fail to capture color-specific distortions. The proposed model distinguishes itself by uniquely combining color feature extraction through latent Dirichlet allocation (LDA) with spatial feature extraction enhanced by multichannel and spatial attention mechanisms. It employs a dual-feature approach that facilitates thorough assessment of both color fidelity and detail preservation in colorized images. The architecture of LDANet encompasses two critical components: an LDA-based color feature extraction module which meticulously analyzes and learns color distribution patterns, and a spatial feature extraction module that leverages an inception network bolstered by attention mechanisms to effectively capture multiscale spatial characteristics. Rigorous experimental validation conducted on a specialized dataset of colorized infrared images demonstrates that LDANet significantly outperforms existing leading no-reference image quality assessment methods. This study reports the effectiveness of integrating color-specific features within a quality assessment framework tailored for infrared image colorization, representing a meaningful advancement in this domain. These findings emphasize the essential role of color feature integration in the evaluation of colorized infrared images, providing a robust tool for optimizing colorization algorithms and enhancing their practical applications. Full article
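The color-feature branch can be approximated with standard tools: quantize pixels into "color words", build per-image histograms, and fit latent Dirichlet allocation over them. The sketch below does exactly that with an assumed vocabulary size and topic count; it is not LDANet's implementation and omits the spatial branch and the quality regression head.

```python
# A minimal sketch of LDA over quantized color histograms as per-image color features.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)
images = [rng.integers(0, 256, size=(32, 32, 3)) for _ in range(20)]    # stand-in colorized images

pixels = np.concatenate([im.reshape(-1, 3) for im in images]).astype(float)
codebook = KMeans(n_clusters=32, n_init=4, random_state=0).fit(pixels)  # color vocabulary

def color_histogram(im):
    words = codebook.predict(im.reshape(-1, 3).astype(float))
    return np.bincount(words, minlength=32)

counts = np.stack([color_histogram(im) for im in images])                # (n_images, 32)
lda = LatentDirichletAllocation(n_components=8, random_state=0).fit(counts)
color_topics = lda.transform(counts)                                     # per-image color features
print(color_topics.shape)  # (20, 8)
```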
18 pages, 8347 KiB  
Article
Shallow Subsurface Wavefield Data Interpolation Method Based on Transfer Learning
by Danfeng Zang, Jian Li, Chuankun Li, Hengran Zhang, Zhipeng Pei and Yixiang Ma
Appl. Sci. 2025, 15(4), 1964; https://doi.org/10.3390/app15041964 - 13 Feb 2025
Viewed by 453
Abstract
The deployment density of surface sensors can significantly impact the accuracy of subsurface shallow seismic field energy inversion. With finite budget constraints, it is often not feasible to deploy a large number of sensors, resulting in limited seismic signal acquisition that hinders accurate inversion of the shallow subsurface explosions. To address the challenge of insufficient sensor signals needed for inversion, we conducted a study on a subsurface shallow wavefield data interpolation method based on transfer learning. This method is designed to increase the overall signal acquisition by interpolating signals at target locations from limited measurement points. Our research employs neural networks to interpolate real seismic data, supplementing the sampled signals. Given the lack of extensive samples from actual data collection, we devised a training approach that combines numerically simulated signals with real collected signals. Initially, we performed conventional interpolation training using a deep interpolation network with complete synthetic gather images obtained from numerical simulations. Subsequently, the feature extraction part was frozen, and the interpolation network was transferred to real datasets, where it was trained using incomplete gather images. Finally, these incomplete gather images were re-input into the trained network to obtain interpolated results at the target locations. Our study demonstrates the superiority of our method by comparing it with two other interpolation networks and validating the effectiveness of transfer learning through four sets of ablation experiments in the actual test. This method can also be applied to other shallow geological structures to generate a large number of seismic signals for energy inversion. Full article
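The transfer step itself is straightforward to sketch: pretrain an interpolation network on complete synthetic gathers, freeze its feature-extraction layers, and fine-tune the remaining layers on real, incomplete gathers. The network below is a toy stand-in with assumed shapes and masking, not the paper's architecture.

```python
# A minimal sketch of pretrain-on-synthetic, freeze-features, fine-tune-on-real transfer.
import torch
import torch.nn as nn

class InterpNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
                                      nn.Conv2d(32, 32, 3, padding=1), nn.ReLU())
        self.reconstruct = nn.Sequential(nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
                                         nn.Conv2d(32, 1, 3, padding=1))
    def forward(self, gather):                # gather: (B, 1, traces, time)
        return self.reconstruct(self.features(gather))

net, loss_fn = InterpNet(), nn.L1Loss()

# stage 1: supervised pretraining on complete synthetic gathers (masked traces as input)
synthetic = torch.randn(8, 1, 64, 128)
mask = (torch.rand(8, 1, 64, 1) > 0.3).float()
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
opt.zero_grad()
loss_fn(net(synthetic * mask), synthetic).backward()
opt.step()

# stage 2: freeze the feature extractor, fine-tune the rest on real, incomplete gathers
for p in net.features.parameters():
    p.requires_grad = False
real = torch.randn(4, 1, 64, 128)
real_mask = (torch.rand(4, 1, 64, 1) > 0.3).float()
opt_ft = torch.optim.Adam(net.reconstruct.parameters(), lr=1e-4)
opt_ft.zero_grad()
# real ground truth exists only at sampled traces, so the loss is computed there
loss = loss_fn(net(real * real_mask) * real_mask, real * real_mask)
loss.backward()
opt_ft.step()
print(loss.item())
```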
22 pages, 5498 KiB  
Article
Small-Sample Target Detection Across Domains Based on Supervision and Distillation
by Fusheng Sun, Jianli Jia, Xie Han, Liqun Kuang and Huiyan Han
Electronics 2024, 13(24), 4975; https://doi.org/10.3390/electronics13244975 - 18 Dec 2024
Cited by 1 | Viewed by 679
Abstract
To address the issues of significant object discrepancies, low similarity, and image noise interference between source and target domains in object detection, we propose a supervised learning approach combined with knowledge distillation. Initially, student and teacher models are jointly trained through supervised and distillation-based approaches, iteratively refining the inter-model weights to mitigate the issue of model overfitting. Secondly, a combined convolutional module is integrated into the feature extraction network of the student model to minimize redundant computational effort; an explicit visual center module is embedded within the feature pyramid network to bolster feature representation; and a spatial grouping enhancement module is incorporated into the region proposal network to mitigate the adverse effects of noise on the outcomes. Ultimately, the model undergoes a comprehensive optimization process that leverages the loss functions originating from both the supervised and knowledge distillation phases. The experimental results demonstrate that this strategy significantly boosts classification and identification accuracy on cross-domain datasets; compared to TFA (Task-agnostic Fine-tuning and Adapter), CD-FSOD (Cross-Domain Few-Shot Object Detection), and DeFRCN (Decoupled Faster R-CNN for Few-Shot Object Detection), it increases detection accuracy by 1.67% and 1.87% with sample sizes of 1 and 5, respectively. Full article
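The joint supervised-plus-distillation objective can be summarized in a few lines: a cross-entropy term on ground-truth labels combined with a temperature-scaled KL term that matches the teacher's softened predictions. The snippet below is a generic formulation with an assumed weighting, not the paper's exact loss.

```python
# A minimal sketch of a combined supervised + knowledge-distillation classification loss.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # supervised term on ground-truth labels
    ce = F.cross_entropy(student_logits, labels)
    # distillation term: match the teacher's softened class distribution
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                  F.softmax(teacher_logits / T, dim=1),
                  reduction="batchmean") * T * T
    return alpha * ce + (1 - alpha) * kd

student = torch.randn(16, 20, requires_grad=True)   # per-RoI class logits (20 classes assumed)
teacher = torch.randn(16, 20)
labels = torch.randint(0, 20, (16,))
loss = distillation_loss(student, teacher, labels)
loss.backward()
print(loss.item())
```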
19 pages, 7944 KiB  
Article
Method for Reconstructing Velocity Field Images of the Internal Structures of Bridges Based on Group Sparsity
by Jian Li, Jin Li, Chenli Guo, Hongtao Wu, Chuankun Li, Rui Liu and Lujun Wei
Electronics 2024, 13(22), 4574; https://doi.org/10.3390/electronics13224574 - 20 Nov 2024
Viewed by 682
Abstract
Non-destructive testing (NDT) enables the determination of internal defects and flaws in concrete structures without damaging them, making it a common application in current bridge concrete inspections. However, due to the complexity of the internal structure of this type of concrete, limitations regarding measurement point placement, and the extensive detection area, accurate defect detection cannot be guaranteed. This paper proposes a method that combines the Simultaneous Algebraic Reconstruction Technique with Group Sparsity Regularization (SART-GSR) to achieve tomographic imaging of bridge concrete under sparse measurement conditions. Firstly, a mathematical model is established based on the principles of the tomographic imaging of bridge concrete; secondly, the SART algorithm is used to solve for its velocity values; thirdly, on the basis of the SART results, GSR is applied for optimized solution processing; finally, simulation experiments are conducted to verify the reconstruction effects of the SART-GSR algorithm compared with those of the SART and ART algorithms. The results show that the SART-GSR algorithm reduced the relative error to 1.5% and the root mean square error to 89.76 m/s compared to the SART and ART algorithms. This improvement in accuracy makes it valuable for the tomographic imaging of bridge concrete and provides a reference for defect detection in bridge concrete. Full article
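As a toy illustration of the SART-GSR combination, the sketch below alternates a SART update with a group-wise soft threshold on a small synthetic travel-time system; the ray matrix, cell grouping, relaxation factor, and threshold are arbitrary assumptions, and the paper's actual formulation may differ.

```python
# A toy numpy sketch: SART iterations interleaved with group-wise soft thresholding.
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((40, 25)) * (rng.random((40, 25)) > 0.7)       # sparse ray-length matrix
x_true = np.full(25, 2e-4)                                     # background slowness (s/m)
x_true[12] = 4e-4                                              # a slow (defective) cell
b = A @ x_true                                                 # observed travel times

def sart_step(x, A, b, lam=0.5):
    row_sums = A.sum(axis=1) + 1e-12
    col_sums = A.sum(axis=0) + 1e-12
    residual = (b - A @ x) / row_sums
    return x + lam * (A.T @ residual) / col_sums

def group_soft_threshold(x, groups, tau=1e-5):
    out = x.copy()
    for g in groups:                                           # shrink each cell group jointly
        norm = np.linalg.norm(x[g])
        out[g] = 0.0 if norm <= tau else x[g] * (1 - tau / norm)
    return out

groups = [list(range(i, i + 5)) for i in range(0, 25, 5)]      # 5 groups of 5 neighboring cells
x = np.zeros(25)
for _ in range(200):
    x = group_soft_threshold(sart_step(x, A, b), groups)
print(np.linalg.norm(x - x_true) / np.linalg.norm(x_true))     # relative reconstruction error
```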
19 pages, 5749 KiB  
Article
Video Anomaly Detection Based on Global–Local Convolutional Autoencoder
by Fusheng Sun, Jiahao Zhang, Xiaodong Wu, Zhong Zheng and Xiaowen Yang
Electronics 2024, 13(22), 4415; https://doi.org/10.3390/electronics13224415 - 11 Nov 2024
Viewed by 1029
Abstract
Video anomaly detection (VAD) plays a crucial role in fields such as security, production, and transportation. To address the issue of overgeneralization in anomaly behavior prediction by deep neural networks, we propose a network called AMFCFBMem-Net (appearance and motion feature cross-fusion block memory network), which combines appearance and motion feature cross-fusion blocks. Firstly, dual encoders for appearance and motion are employed to separately extract these features, which are then integrated into the skip connection layer to mitigate the model’s tendency to predict abnormal behavior, ultimately enhancing the prediction accuracy for abnormal samples. Secondly, a motion foreground extraction module is integrated into the network to generate a foreground mask map based on speed differences, thereby widening the prediction error margin between normal and abnormal behaviors. To capture the latent features of various models for normal samples, a memory module is introduced at the bottleneck of the encoder and decoder structures. This further enhances the model’s anomaly detection capabilities and diminishes its predictive generalization towards abnormal samples. The experimental results on the UCSD Pedestrian dataset 2 (UCSD Ped2) and CUHK Avenue anomaly detection dataset (CUHK Avenue) demonstrate that, compared to current cutting-edge video anomaly detection algorithms, our proposed method achieves frame-level AUCs of 97.5% and 88.8%, respectively, effectively enhancing anomaly detection capabilities. Full article
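The memory module at the bottleneck can be sketched generically: latent vectors are re-expressed as attention-weighted combinations of learned memory items, which constrains prediction to patterns seen in normal data and so widens the error gap for anomalies. The memory size and cosine addressing below are assumptions, not the AMFCFBMem-Net definition.

```python
# A minimal sketch of a memory module that recomposes bottleneck features from learned items.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryModule(nn.Module):
    def __init__(self, num_items=100, dim=256):
        super().__init__()
        self.memory = nn.Parameter(torch.randn(num_items, dim))   # learned "normal" prototypes

    def forward(self, z):                        # z: (B, dim, H, W) bottleneck features
        B, C, H, W = z.shape
        flat = z.permute(0, 2, 3, 1).reshape(-1, C)                # (B*H*W, C)
        # cosine-similarity addressing over memory items
        attn = torch.softmax(F.linear(F.normalize(flat, dim=1),
                                      F.normalize(self.memory, dim=1)), dim=1)
        read = attn @ self.memory                                  # recompose from memory items
        return read.view(B, H, W, C).permute(0, 3, 1, 2)

mem = MemoryModule()
out = mem(torch.randn(2, 256, 16, 16))
print(out.shape)  # (2, 256, 16, 16)
```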
16 pages, 6488 KiB  
Article
3D-CNN Method for Drowsy Driving Detection Based on Driving Pattern Recognition
by Jimin Lee, Soomin Woo and Changjoo Moon
Electronics 2024, 13(17), 3388; https://doi.org/10.3390/electronics13173388 - 26 Aug 2024
Viewed by 1216
Abstract
Drowsiness impairs drivers’ concentration and reaction time, doubling the risk of car accidents. Various methods for detecting drowsy driving have been proposed that rely on facial changes. However, they have poor detection for drivers wearing a mask or sunglasses, and they do not reflect the driver’s drowsiness habits. Therefore, this paper proposes a novel method to detect drowsy driving even with facial detection obstructions, such as masks or sunglasses, and regardless of the driver’s different drowsiness habits, by recognizing behavioral patterns. We achieve this by constructing both normal driving and drowsy driving datasets and developing a 3D-CNN (3D Convolutional Neural Network) model reflecting the Inception structure of GoogleNet. This binary classification model classifies normal driving and drowsy driving videos. Using actual videos captured inside real vehicles, this model achieved a classification accuracy of 85% for detecting drowsy driving without facial obstructions and 75% for detecting drowsy driving when masks and sunglasses are worn. Our results demonstrate that the behavioral pattern recognition method is effective in detecting drowsy driving. Full article
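A minimal 3D-CNN binary classifier over short clips conveys the core of the behavioral-pattern approach; the sketch below uses plain 3D convolutions rather than the paper's Inception-style blocks, and the clip length, resolution, and layer widths are assumptions.

```python
# A minimal sketch of a 3D-CNN binary classifier over short driving clips.
import torch
import torch.nn as nn

class Drowsy3DCNN(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool3d(2),
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool3d(2),
            nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1))                     # pool over time and space
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, clip):                             # clip: (B, 3, frames, H, W)
        return self.classifier(self.features(clip).flatten(1))

model = Drowsy3DCNN()
logits = model(torch.randn(2, 3, 16, 112, 112))          # two 16-frame clips (assumed size)
print(logits.shape)  # (2, 2): normal vs. drowsy
```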
17 pages, 6653 KiB  
Article
Supervised-Learning-Based Method for Restoring Subsurface Shallow-Layer Q Factor Distribution
by Danfeng Zang, Jian Li, Chuankun Li, Mingxing Ma, Chenli Guo and Jiangang Wang
Electronics 2024, 13(11), 2145; https://doi.org/10.3390/electronics13112145 - 30 May 2024
Cited by 1 | Viewed by 697
Abstract
The distribution of shallow subsurface quality factors (Q) is a crucial indicator for assessing the integrity of subsurface structures and serves as a primary parameter for evaluating the attenuation characteristics of seismic waves propagating through subsurface media. As the complexity of underground spaces increases, regions expand, and testing environments diversify, the survivability of test nodes is compromised, resulting in sparse effective seismic data with a low signal-to-noise ratio (SNR). Within the confined area defined by the source and sensor placement, only the Q factor along the wave propagation path can be estimated with relative accuracy. Estimating the Q factor in other parts of the area is challenging. Additionally, in recent years, deep neural networks have been employed to address the issue of missing values in seismic data; however, these methods typically require large datasets to train networks that can effectively fit the data, making them less applicable to our specific problem. In response to this challenge, we have developed a supervised learning method for the restoration of shallow subsurface Q factor distributions. The process begins with the construction of an incomplete labeled data volume, followed by the application of a block-based data augmentation technique to enrich the training samples and train the network. The uniformly partitioned initial data are then fed into the trained network to obtain output data, which are subsequently combined to form a complete Q factor data volume. We have validated this training approach using various networks, all yielding favorable results. Additionally, we compared our method with a data augmentation approach that involves creating random masks, demonstrating that our method reduces the mean absolute percentage error (MAPE) by 5%. Full article
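The block-based augmentation step can be sketched as random sub-block extraction from the incomplete labeled volume, keeping only blocks with sufficient label coverage; the volume size, block size, and coverage threshold below are assumptions made for illustration.

```python
# A minimal sketch of block-based augmentation over an incomplete Q-factor volume.
import numpy as np

rng = np.random.default_rng(0)
q_volume = rng.uniform(10, 100, size=(64, 64, 32))        # Q factors (only partly reliable)
label_mask = rng.random((64, 64, 32)) > 0.6               # True where Q is reliably estimated

def sample_blocks(volume, mask, block=(16, 16, 16), n=200, min_coverage=0.3):
    blocks = []
    for _ in range(n):
        i, j, k = (rng.integers(0, volume.shape[d] - block[d] + 1) for d in range(3))
        sub_m = mask[i:i + block[0], j:j + block[1], k:k + block[2]]
        if sub_m.mean() >= min_coverage:                  # keep blocks with enough labels
            sub_v = volume[i:i + block[0], j:j + block[1], k:k + block[2]]
            blocks.append((sub_v * sub_m, sub_m))          # masked block plus its label mask
    return blocks

training_blocks = sample_blocks(q_volume, label_mask)
print(len(training_blocks), training_blocks[0][0].shape)
```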
14 pages, 5118 KiB  
Article
Domain Adaptive Subterranean 3D Pedestrian Detection via Instance Transfer and Confidence Guidance
by Zengyun Liu, Zexun Zheng, Tianyi Qin, Liying Xu and Xu Zhang
Electronics 2024, 13(5), 982; https://doi.org/10.3390/electronics13050982 - 4 Mar 2024
Cited by 3 | Viewed by 1269
Abstract
With the exploration of subterranean scenes, determining how to ensure the safety of subterranean pedestrians has gradually become a hot research topic. Considering the poor illumination and lack of annotated data in subterranean scenes, it is essential to explore the LiDAR-based domain adaptive detectors for localizing the spatial location of pedestrians, thus providing instruction for evacuation and rescue. In this paper, a novel domain adaptive subterranean 3D pedestrian detection method is proposed to adapt pre-trained detectors from the annotated road scenes to the unannotated subterranean scenes. Specifically, an instance transfer-based scene updating strategy is designed to update the subterranean scenes by transferring instances from the road scenes to the subterranean scenes, aiming to create sufficient high-quality pseudo labels for fine-tuning the pre-trained detector. In addition, a pseudo label confidence-guided learning mechanism is constructed to fully utilize pseudo labels of different qualities under the guidance of confidence scores. Extensive experiments validate the superiority of our proposed domain adaptive subterranean 3D pedestrian detection method. Full article
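The confidence-guided mechanism can be illustrated as a simple weighting of per-box pseudo-label losses by their confidence scores; the thresholds and weighting scheme below are assumptions, not the paper's exact rule.

```python
# A minimal sketch of confidence-guided weighting of pseudo-label losses.
import torch

def confidence_guided_loss(per_box_loss, confidences, high=0.7, low=0.3):
    # per_box_loss: (N,) detection loss per pseudo-labeled box; confidences: (N,) scores
    weights = torch.ones_like(confidences)                  # high-confidence boxes count fully
    mid = (confidences < high) & (confidences >= low)
    weights[mid] = confidences[mid]                         # soft weighting for medium confidence
    weights[confidences < low] = 0.0                        # ignore very unreliable pseudo labels
    return (weights * per_box_loss).sum() / weights.sum().clamp_min(1.0)

loss = confidence_guided_loss(torch.rand(8),
                              torch.tensor([0.9, 0.8, 0.6, 0.5, 0.4, 0.35, 0.2, 0.1]))
print(loss.item())
```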