Deep Learning in Computer Vision

A special issue of Journal of Imaging (ISSN 2313-433X). This special issue belongs to the section "Computer Vision and Pattern Recognition".

Deadline for manuscript submissions: 15 September 2024 | Viewed by 5743

Special Issue Editors


Guest Editor
Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology, Hong Kong 999077, China
Interests: image classification; object detection; semantic segmentation; pose estimation

Guest Editor
Department of Computer Science and Technology, Nanjing University, Nanjing 210023, China
Interests: complex human behavior understanding; video-language understanding

Special Issue Information

Dear Colleagues,

The field of computer vision has undergone a significant transformation with the advent of deep learning techniques, enabling the development of innovative applications across various domains. This Special Issue focuses on exploring the latest advancements, methodologies, and applications in this rapidly evolving area.

Deep learning methods, such as convolutional neural networks (CNNs), vision transformers (ViTs), diffusion models and generative adversarial networks (GANs), have demonstrated remarkable success in tasks such as object recognition, semantic segmentation and image synthesis. These techniques have paved the way for a myriad of applications, including autonomous vehicles, facial recognition, biomedical image analysis and video surveillance, among others.

We invite contributions that present cutting-edge research, novel techniques, methods, tools and ideas related to the integration of deep learning in computer vision. Submissions may cover a wide range of topics, including, but not limited to:

  • Advances in deep learning architectures for computer vision tasks;
  • Transfer learning and domain adaptation in computer vision;
  • Deep reinforcement learning for vision-based control;
  • Generative models for image synthesis and manipulation;
  • Application of computer vision technology in biomedical imaging;
  • Applications of deep learning in fields such as remote sensing, robotics and art.

We encourage submissions that propose innovative and scientifically grounded research lines for the future development of deep learning techniques in computer vision. Together, we aim to advance the field of computer vision and its applications in various industries.

Dr. Dong Zhang
Dr. Rui Yan
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once registered, proceed to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Journal of Imaging is an international peer-reviewed open access monthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 1800 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • image classification
  • object detection
  • semantic segmentation
  • pose estimation
  • multimedia analysis and retrieval
  • few-shot learning
  • human behavior understanding
  • video language understanding
  • video understanding and analysis

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • e-Book format: Special Issues with more than 10 articles can be published as dedicated e-books, ensuring wide and rapid dissemination.

Further information on MDPI's Special Issue policies can be found here.

Published Papers (5 papers)


Research

20 pages, 5650 KiB  
Article
Unleashing the Power of Contrastive Learning for Zero-Shot Video Summarization
by Zongshang Pang, Yuta Nakashima, Mayu Otani and Hajime Nagahara
J. Imaging 2024, 10(9), 229; https://doi.org/10.3390/jimaging10090229 - 14 Sep 2024
Abstract
Video summarization aims to select the most informative subset of frames in a video to facilitate efficient video browsing. Past efforts have invariably involved training summarization models with annotated summaries or heuristic objectives. In this work, we reveal that features pre-trained on image-level tasks contain rich semantic information that can be readily leveraged to quantify frame-level importance for zero-shot video summarization. Leveraging pre-trained features and contrastive learning, we propose three metrics characterizing a desirable keyframe: local dissimilarity, global consistency, and uniqueness. We show that these metrics effectively capture the diversity and representativeness of frames commonly used for the unsupervised generation of video summaries, achieving performance competitive with or better than past methods while requiring no training. We further propose a contrastive learning-based pre-training strategy on unlabeled videos to enhance the quality of the proposed metrics and, thus, improve the evaluated performance on the public benchmarks TVSum and SumMe.
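
As a rough illustration of how pretrained features can score frame importance with no training at all, the sketch below computes one plausible reading of the three metrics named in the abstract over a feature matrix from any pretrained image encoder. The function name, the exact formulas and the equal weighting are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def frame_importance(feats, window=5):
    """Score frames by three heuristic metrics computed from pretrained
    image features (a plausible reading of local dissimilarity, global
    consistency and uniqueness; not the paper's actual code)."""
    # L2-normalize so that dot products are cosine similarities.
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = f @ f.T                                     # (n, n) pairwise similarity
    n = len(f)

    # Local dissimilarity: a keyframe stands out from its temporal neighbors.
    local_dissim = np.empty(n)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        neighbors = np.delete(sim[i, lo:hi], i - lo)  # drop self-similarity
        local_dissim[i] = 1.0 - neighbors.mean()

    # Global consistency: a keyframe is representative of the whole video.
    g = f.mean(axis=0)
    global_consistency = f @ (g / np.linalg.norm(g))

    # Uniqueness: a keyframe is far even from its nearest other frame.
    np.fill_diagonal(sim, -np.inf)
    uniqueness = 1.0 - sim.max(axis=1)

    return local_dissim + global_consistency + uniqueness
```

A zero-shot summary would then simply keep the top-scoring frames, e.g. `keep = np.argsort(-scores)[:k]`.
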
12 pages, 8025 KiB  
Article
Deep Learning for Single-Shot Structured Light Profilometry: A Comprehensive Dataset and Performance Analysis
by Rhys G. Evans, Ester Devlieghere, Robrecht Keijzer, Joris J. J. Dirckx and Sam Van der Jeught
J. Imaging 2024, 10(8), 179; https://doi.org/10.3390/jimaging10080179 - 24 Jul 2024
Viewed by 842
Abstract
In 3D optical metrology, single-shot deep learning-based structured light profilometry (SS-DL-SLP) has gained attention because of its measurement speed, simplicity of optical setup, and robustness to noise and motion artefacts. However, gathering a sufficiently large training dataset for these techniques remains challenging because of practical limitations. This paper presents a comprehensive DL-SLP dataset of over 10,000 physical data pairs. The dataset was constructed by 3D-printing a calibration target featuring randomly varying surface profiles and storing the height profiles and the corresponding deformed fringe patterns. Our dataset aims to serve as a benchmark for evaluating and comparing different models and network architectures in DL-SLP. We performed an analysis of several established neural networks, demonstrating high accuracy in obtaining full-field height information from previously unseen fringe patterns. In addition, the networks were validated on unique objects to test the overall robustness of the trained models. To facilitate further research and promote reproducibility, all code and the dataset are made publicly available. This dataset will enable researchers to explore, develop, and benchmark novel DL-based approaches for SS-DL-SLP.
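
For readers unfamiliar with the setup, a single-shot DL-SLP model is essentially an image-to-image regressor from one deformed fringe pattern to a full-field height map. The toy PyTorch sketch below illustrates that framing only; it is not one of the architectures benchmarked in the paper, and all layer sizes and the choice of loss are placeholders.

```python
import torch
import torch.nn as nn

class FringeToHeight(nn.Module):
    """Toy encoder-decoder regressing a height map from a single
    deformed fringe pattern (1 channel in, 1 channel out)."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),
        )

    def forward(self, fringe):
        return self.decoder(self.encoder(fringe))

# Supervised regression against the 3D-printed ground-truth profiles.
model = FringeToHeight()
loss_fn = nn.L1Loss()                   # MAE is a common choice for height maps
fringe = torch.randn(8, 1, 128, 128)    # stand-in for dataset samples
height = torch.randn(8, 1, 128, 128)
loss = loss_fn(model(fringe), height)
loss.backward()
```
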
15 pages, 1666 KiB  
Article
MResTNet: A Multi-Resolution Transformer Framework with CNN Extensions for Semantic Segmentation
by Nikolaos Detsikas, Nikolaos Mitianoudis and Ioannis Pratikakis
J. Imaging 2024, 10(6), 125; https://doi.org/10.3390/jimaging10060125 - 21 May 2024
Viewed by 989
Abstract
A fundamental task in computer vision is the differentiation and identification of objects or entities in a visual scene using semantic segmentation methods. The advancement of transformer networks has surpassed traditional convolutional neural network (CNN) architectures in terms of segmentation performance. The continuous pursuit of optimal performance on popular evaluation metrics has led to very large architectures that require a significant amount of computational power to operate, making them prohibitive for real-time applications, including autonomous driving. In this paper, we propose a model that leverages a visual transformer encoder with a parallel twin decoder, consisting of a visual transformer decoder and a CNN decoder with multi-resolution connections working in parallel. The two decoders are merged with the aid of two trainable CNN blocks: the fuser, which combines the information from the two decoders, and the scaler, which scales the contribution of each decoder. The proposed model achieves state-of-the-art performance on the Cityscapes and ADE20K datasets while maintaining a low-complexity network that can be used in real-time applications.
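
One plausible reading of the fuser/scaler merge is sketched below in PyTorch; the paper's exact block design is not reproduced here, and the channel counts, softmax weighting and class count are assumptions for illustration.

```python
import torch
import torch.nn as nn

class FuserScaler(nn.Module):
    """Illustrative merge of twin decoder outputs: a 'scaler' learns how much
    each decoder contributes per pixel, and a 'fuser' mixes the scaled
    features into class logits (a sketch, not the paper's exact blocks)."""
    def __init__(self, channels, num_classes):
        super().__init__()
        # Scaler: two per-pixel contribution weights from the joint features.
        self.scaler = nn.Sequential(
            nn.Conv2d(2 * channels, 2, kernel_size=1), nn.Softmax(dim=1))
        # Fuser: combines the weighted streams into segmentation logits.
        self.fuser = nn.Conv2d(2 * channels, num_classes, kernel_size=3, padding=1)

    def forward(self, vit_feat, cnn_feat):
        cat = torch.cat([vit_feat, cnn_feat], dim=1)
        w = self.scaler(cat)                            # (B, 2, H, W) weights
        scaled = torch.cat([vit_feat * w[:, :1], cnn_feat * w[:, 1:]], dim=1)
        return self.fuser(scaled)

merge = FuserScaler(channels=64, num_classes=19)        # 19 = Cityscapes classes
logits = merge(torch.randn(2, 64, 64, 128), torch.randn(2, 64, 64, 128))
```
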

18 pages, 1810 KiB  
Article
Knowledge Distillation in Video-Based Human Action Recognition: An Intuitive Approach to Efficient and Flexible Model Training
by Fernando Camarena, Miguel Gonzalez-Mendoza and Leonardo Chang
J. Imaging 2024, 10(4), 85; https://doi.org/10.3390/jimaging10040085 - 30 Mar 2024
Viewed by 1298
Abstract
Training a model to recognize human actions in videos is computationally intensive. While modern strategies employ transfer learning methods to make the process more efficient, they still face challenges regarding flexibility and efficiency. Existing solutions are limited in functionality and rely heavily on pretrained architectures, which can restrict their applicability to diverse scenarios. Our work explores knowledge distillation (KD) for enhancing the training of self-supervised video models in three aspects: improving classification accuracy, accelerating model convergence, and increasing model flexibility under regular and limited-data scenarios. We tested our method on the UCF101 dataset using different data proportions: 100%, 50%, 25%, and 2%. We found that guiding the model's training with knowledge distillation outperforms traditional training, maintaining classification accuracy while accelerating convergence in both standard settings and data-scarce environments. Additionally, knowledge distillation enables cross-architecture flexibility, allowing model customization for various applications, from resource-limited to high-performance scenarios.
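
The distillation signal referred to here is most commonly implemented as the classic soft-target loss of Hinton et al.; a minimal sketch of that generic recipe follows, assuming the paper uses some variant of it (the temperature and weighting values are placeholders, not the authors' settings).

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Standard soft-target knowledge distillation: the student matches the
    teacher's temperature-softened distribution in addition to the hard
    labels (the generic recipe; the paper's exact setup may differ)."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)                         # rescale gradients, per Hinton et al.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```

Because the loss only compares output distributions, the teacher and student need not share an architecture, which is what enables the cross-architecture flexibility described above.
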

15 pages, 980 KiB  
Article
FishSegSSL: A Semi-Supervised Semantic Segmentation Framework for Fish-Eye Images
by Sneha Paul, Zachary Patterson and Nizar Bouguila
J. Imaging 2024, 10(3), 71; https://doi.org/10.3390/jimaging10030071 - 15 Mar 2024
Viewed by 1715
Abstract
The application of large field-of-view (FoV) cameras equipped with fish-eye lenses brings notable advantages to various real-world computer vision applications, including autonomous driving. While deep learning has proven successful in conventional computer vision applications using regular perspective images, its potential in fish-eye camera contexts remains largely unexplored due to limited datasets for fully supervised learning. Semi-supervised learning is a potential solution to this challenge. In this study, we explore and benchmark two popular semi-supervised methods from the perspective-image domain for fish-eye image segmentation. We further introduce FishSegSSL, a novel fish-eye image segmentation framework featuring three semi-supervised components: pseudo-label filtering, dynamic confidence thresholding, and robust strong augmentation. Evaluation on the WoodScape dataset, collected from vehicle-mounted fish-eye cameras, demonstrates that the proposed method improves performance by up to 10.49% over fully supervised methods using the same amount of labeled data, and by 2.34% over existing image segmentation methods. To the best of our knowledge, this is the first work on semi-supervised semantic segmentation of fish-eye images. Additionally, we conduct a comprehensive ablation study and sensitivity analysis to demonstrate the efficacy of each proposed component.
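
To make the three components concrete, the sketch below shows a generic semi-supervised update combining pseudo-label filtering with a dynamically updated per-class confidence threshold on a weak/strong augmentation pair. It is a plausible reconstruction for illustration only; the function name, the EMA update and the masking rule are assumptions, not FishSegSSL's actual code.

```python
import torch
import torch.nn.functional as F

def pseudo_label_step(model, weak_img, strong_img, thresholds, momentum=0.99):
    """Generic semi-supervised step: pseudo-labels come from the weakly
    augmented view; low-confidence pixels are filtered with a per-class
    threshold updated dynamically (names and details are assumptions).
    `thresholds` is a (num_classes,) tensor, e.g. torch.full((C,), 0.9)."""
    with torch.no_grad():
        probs = F.softmax(model(weak_img), dim=1)       # (B, C, H, W)
        conf, pseudo = probs.max(dim=1)                 # confidence + hard labels
        # Dynamic thresholding: running mean confidence per predicted class.
        for c in range(probs.shape[1]):
            sel = pseudo == c
            if sel.any():
                thresholds[c] = (momentum * thresholds[c]
                                 + (1 - momentum) * conf[sel].mean())
        mask = conf >= thresholds[pseudo]               # pseudo-label filtering

    # The model must reproduce the filtered pseudo-labels on the strong view.
    loss = F.cross_entropy(model(strong_img), pseudo, reduction="none")
    return (loss * mask).sum() / mask.sum().clamp(min=1)
```
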
