Advances in Computer Vision and Multimedia Information Processing

A special issue of Electronics (ISSN 2079-9292). This special issue belongs to the section "Electronic Multimedia".

Deadline for manuscript submissions: closed (1 November 2023) | Viewed by 7876

Special Issue Editors

Guest Editor
School of Computer Science and Technology, East China Normal University, Shanghai 200062, China
Interests: computer vision; machine learning; deep learning; compression and speeding-up of large capacity models

Guest Editor
Department of Computer Science, The University of Hong Kong, Pokfulam 999077, Hong Kong
Interests: computer vision; natural language processing; image/video captioning; machine learning

Guest Editor
Tencent YouTu Lab, Shanghai 200233, China
Interests: computer vision; machine learning; object detection; semi/weakly-supervised learning

Special Issue Information

Dear Colleagues,

With the rapid evolution of Computer Vision (CV) and Multimedia Information Processing (MIP), various deep neural networks (DNNs) have been developed in these areas, including ResNets, CLIPs and Transformers. Currently, the application of CV and MIP is extensive, feasible and sound, especially for intelligent video surveillance, remote sensing, healthcare and robotics. Meanwhile, advanced CV and MIP technologies significantly impact humanity, improving quality of life. However, there are still several challenges regarding the implementation of CV and MIP, including noisy samples, the multimodal semantic gap, high computational cost, privacy issues and model interpretability. Advanced CV and MIP technologies are urgently needed to mitigate these issues.

The focus of this Special Issue is on the state-of-the-art research related to the model design and implementation of advanced CV and MIP technologies. Topics of interest include (but are not limited to):

  1. Novel CV and MIP learning methods and algorithms;
  2. Compression and acceleration for CV and MIP models, to be applied to resource-limited devices;
  3. Effective multi-modality fusion methods for Multimedia applications;
  4. High-performance CV and MIP methods for image classification, object detection, segmentation, understanding and generation;
  5. Interpretable methods for model understanding and data analysis;
  6. Data-privacy protected CV and MIP technologies;
  7. Effective learning from noisy data;
  8. Model attack and defense for CV and MIP models.

Prof. Dr. Shaohui Lin
Dr. Fuhai Chen
Dr. Yunhang Shen
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Electronics is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • computer vision
  • model compression and acceleration
  • multimedia information processing
  • data privacy
  • model interpretability
  • model attack and defense
  • noisy data learning
  • information fusion

Published Papers (6 papers)

Research

21 pages, 6072 KiB  
Article
Human Motion Prediction Based on a Multi-Scale Hypergraph for Intangible Cultural Heritage Dance Videos
by Xingquan Cai, Pengyan Cheng, Shike Liu, Haoyu Zhang and Haiyan Sun
Electronics 2023, 12(23), 4830; https://doi.org/10.3390/electronics12234830 - 29 Nov 2023
Viewed by 746
Abstract
Compared to traditional dance, intangible cultural heritage dance often involves the isotropic extension of choreographic actions, utilizing both upper and lower limbs. This characteristic choreography style makes the remote joints lack interaction, consequently reducing accuracy in existing human motion prediction methods. Therefore, we propose a human motion prediction method based on the multi-scale hypergraph convolutional network of the intangible cultural heritage dance video. Firstly, this method inputs the 3D human posture sequence from intangible cultural heritage dance videos. The hypergraph is designed according to the synergistic relationship of the human joints in the intangible cultural heritage dance video, which is used to represent the spatial correlation of the 3D human posture. Then, a multi-scale hypergraph convolutional network is constructed, utilizing multi-scale transformation operators to segment the human skeleton into different scales. This network adopts a graph structure to represent the 3D human posture at different scales, which is then used by the single-scalar fusion operator to extract spatial features in the 3D human posture sequence by fusing the feature information of the hypergraph and the multi-scale graph. Finally, the Temporal Graph Transformer network is introduced to capture the temporal dependence among adjacent frames within the time domain. This facilitates the extraction of temporal features from the 3D human posture sequence, ultimately enabling the prediction of future 3D human posture sequences. Experiments show that we achieve the best performance in both short-term and long-term human motion prediction when compared to Motion-Mixer and Motion-Attention algorithms on Human3.6M and 3DPW datasets. In addition, ablation experiments show that our method can predict more precise 3D human pose sequences, even in the presence of isotropic extensions of upper and lower limbs in intangible cultural heritage dance videos. This approach effectively addresses the issue of missing segments in intangible cultural heritage dance videos.
(This article belongs to the Special Issue Advances in Computer Vision and Multimedia Information Processing)

16 pages, 5855 KiB  
Article
Enhancing Underwater Image Quality Assessment with Influential Perceptual Features
by Feifei Liu, Zihao Huang, Tianrang Xie, Runze Hu and Bingbing Qi
Electronics 2023, 12(23), 4760; https://doi.org/10.3390/electronics12234760 - 23 Nov 2023
Viewed by 881
Abstract
In the multifaceted field of oceanic engineering, the quality of underwater images is paramount for a range of applications, from marine biology to robotic exploration. This paper presents a novel approach to underwater image quality assessment (UIQA) that addresses the current limitations by effectively combining low-level image properties with high-level semantic features. Traditional UIQA methods predominantly focus on either low-level attributes such as brightness and contrast or high-level semantic content, but rarely both, which leads to a gap in achieving a comprehensive assessment of image quality. Our proposed methodology bridges this gap by integrating these two critical aspects of underwater imaging. We employ the least-angle regression technique for balanced feature selection, particularly in high-level semantics, to ensure that the extensive feature dimensions of high-level content do not overshadow the fundamental low-level properties. The experimental results of our method demonstrate a remarkable improvement over existing UIQA techniques, establishing a new benchmark in both accuracy and reliability for underwater image assessment.
(This article belongs to the Special Issue Advances in Computer Vision and Multimedia Information Processing)

14 pages, 3509 KiB  
Article
An Enhanced DOA Estimation Method for Coherent Sources via Toeplitz Matrix Reconstruction and Khatri–Rao Subspace
by Bingbing Qi, Xiaogang Liu, Daowei Dou, Yan Zhang and Runze Hu
Electronics 2023, 12(20), 4268; https://doi.org/10.3390/electronics12204268 - 16 Oct 2023
Viewed by 818
Abstract
The Toeplitz matrix reconstruction methods are capable of resolving coherent signals, playing a crucial role in the direction-of-arrival (DOA) estimation of acoustic sources. However, the decoherence processing sacrifices the array aperture and thus reduces the number of identifiable sources. To solve this issue, we propose an enhanced method using the Khatri–Rao subspace to resolve more coherent sources than the existing Toeplitz matrix reconstruction methods. Firstly, a full set of Toeplitz matrices with full rank is obtained. Then, the virtual array aperture can be obtained using the Khatri–Rao product of the array response, and the degrees of freedom inherently provided by the virtual array structure are about twice those of the existing Toeplitz methods. Next, linear processing is further used to achieve complexity reduction without losing the effective degrees of freedom. Finally, the DOA estimation for more coherent sources can be achieved by combining it with conventional methods. Numerical simulations verify the superiority of the proposed method.
(This article belongs to the Special Issue Advances in Computer Vision and Multimedia Information Processing)
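
The aperture-doubling claim in the abstract above can be illustrated with a small sketch. It is not the authors' algorithm: it assumes a uniform linear array with half-wavelength spacing (the abstract does not state the geometry), and the function names are hypothetical. It only shows that the Khatri–Rao product of the conjugated and original steering matrices is indexed by 2M − 1 distinct sensor lags for M physical sensors.

```python
import numpy as np


def steering_matrix(num_sensors, angles_deg):
    """Steering matrix of a half-wavelength uniform linear array (assumed geometry)."""
    m = np.arange(num_sensors)[:, None]  # sensor indices 0..M-1
    theta = np.deg2rad(np.asarray(angles_deg, dtype=float))[None, :]
    return np.exp(-1j * np.pi * m * np.sin(theta))  # shape (M, K)


def khatri_rao(a, b):
    """Column-wise Kronecker product: column k equals kron(a[:, k], b[:, k])."""
    if a.shape[1] != b.shape[1]:
        raise ValueError("both matrices need the same number of columns")
    return np.einsum("ik,jk->ijk", a, b).reshape(-1, a.shape[1])


def virtual_lags(num_sensors):
    """Distinct position differences (j - i) that index the virtual array."""
    pos = np.arange(num_sensors)
    return np.unique(pos[None, :] - pos[:, None])


M = 6                                       # physical sensors (illustrative value)
A = steering_matrix(M, [-20.0, 15.0, 40.0])  # three illustrative source angles
virtual_A = khatri_rao(A.conj(), A)          # row i*M + j carries the lag (j - i)
lags = virtual_lags(M)                       # lags -(M-1) .. (M-1)
print(virtual_A.shape, lags.size)            # (36, 3) 11
```

Many of the M² rows of `virtual_A` repeat the same lag, which is why a linear reduction step, as the abstract mentions, can shrink the problem without discarding distinct virtual sensors.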

14 pages, 3267 KiB  
Article
THANet: Transferring Human Pose Estimation to Animal Pose Estimation
by Jincheng Liao, Jianzhong Xu, Yunhang Shen and Shaohui Lin
Electronics 2023, 12(20), 4210; https://doi.org/10.3390/electronics12204210 - 11 Oct 2023
Cited by 1 | Viewed by 1133
Abstract
Animal pose estimation (APE) boosts the understanding of animal behaviors. Recent vision-based APE has attracted extensive attention due to the advantages of contactless and sensorless applications. One of the main challenges in APE is the lack of high-quality keypoint annotations for different animal species since manually annotating the animal keypoints is very expensive and time-consuming. Existing works alleviate this problem by synthesizing APE data and generating pseudo-labels for unlabeled animal images. However, feature representations learned from synthetic images cannot be directly transferred to real-world scenarios, and the generated pseudo-labels are usually noisy, which limits the model’s performance. To address the above challenge, we propose a novel cross-domain vision transformer for APE to Transfer Human pose estimation to Animal pose estimation, termed THANet, as humans share skeleton similarities with some animals. Inspired by the success of ViTPose in HPE, we design a unified vision transformer encoder to extract universal features for both animals and humans followed by two task-specific decoders. We further introduce a simple but effective cross-domain discriminator to bridge the domain gaps between the human pose and the animal pose. We evaluated the proposed THANet on the AP-10K and Animal-Pose benchmarks, and the extensive experiments show that our method achieves a promising performance. Specifically, the proposed vision transformer and cross-domain method significantly improve the model’s accuracy and generalization ability for APE.
(This article belongs to the Special Issue Advances in Computer Vision and Multimedia Information Processing)

14 pages, 3489 KiB  
Article
3DAGNet: 3D Deep Attention and Global Search Network for Pulmonary Nodule Detection
by Muwei Jian, Linsong Zhang, Haodong Jin and Xiaoguang Li
Electronics 2023, 12(10), 2333; https://doi.org/10.3390/electronics12102333 - 22 May 2023
Cited by 6 | Viewed by 1241
Abstract
In traditional clinical medicine, respiratory physicians or radiologists often identify the location of lung nodules by highlighting targets in consecutive CT slices, which is labor-intensive and prone to misdiagnosis. To achieve intelligent detection and diagnosis of CT lung nodules, we designed a 3D convolutional neural network, called 3DAGNet, for pulmonary nodule detection. Inspired by the diagnostic process of lung nodule localization by physicians, 3DAGNet includes a spatial attention module and a global search module. A multi-scale cascade module has also been introduced to enhance the model detection using attention enhancement, global information search, and contextual feature fusion. The experimental results showed that the proposed network achieved accurate detection of lung nodule information, and our method achieved a high sensitivity of 88.08% in terms of the average FROC score on the LUNA16 dataset. In addition, ablation experiments also demonstrated the effectiveness of our method.
(This article belongs to the Special Issue Advances in Computer Vision and Multimedia Information Processing)
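
For readers unfamiliar with the metric quoted above, the average FROC score on LUNA16 is commonly taken as the mean sensitivity at seven false-positive rates per scan (1/8, 1/4, 1/2, 1, 2, 4 and 8). The sketch below shows that arithmetic only. The function name, the linear interpolation between curve points and the sample curve are assumptions for illustration, not the paper's evaluation code or results.

```python
import numpy as np

# False-positive rates per scan used for the average FROC score on LUNA16.
LUNA16_FP_RATES = (0.125, 0.25, 0.5, 1.0, 2.0, 4.0, 8.0)


def average_froc_score(fps_per_scan, sensitivity, fp_rates=LUNA16_FP_RATES):
    """Mean sensitivity at the given false-positive rates (hypothetical helper).

    The FROC curve is sampled by linear interpolation, which is an assumption
    of this sketch; official evaluation scripts may sample it differently.
    """
    fps = np.asarray(fps_per_scan, dtype=float)
    sens = np.asarray(sensitivity, dtype=float)
    order = np.argsort(fps)  # np.interp needs increasing x values
    sampled = np.interp(fp_rates, fps[order], sens[order])
    return float(sampled.mean())


# Made-up FROC curve, only to exercise the function.
curve_fps = [0.125, 0.25, 0.5, 1.0, 2.0, 4.0, 8.0]
curve_sens = [0.50, 0.60, 0.70, 0.80, 0.90, 0.95, 1.00]
score = average_froc_score(curve_fps, curve_sens)
print(f"average FROC score: {score:.4f}")
```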

Other

11 pages, 1009 KiB  
Brief Report
Contrastive Learning via Local Activity
by He Zhu, Yang Chen, Guyue Hu and Shan Yu
Electronics 2023, 12(1), 147; https://doi.org/10.3390/electronics12010147 - 29 Dec 2022
Cited by 1 | Viewed by 1810
Abstract
Contrastive learning (CL) helps deep networks discriminate between positive and negative pairs in learning. As a powerful unsupervised pretraining method, CL has greatly reduced the performance gap with supervised training. However, current CL approaches mainly rely on sophisticated augmentations, a large number of negative pairs and chained gradient calculations, which are complex to use. To address these issues, in this paper, we propose the local activity contrast (LAC) algorithm, which is an unsupervised method based on two forward passes and a locally defined loss to learn meaningful representations. The learning target of each layer is to minimize the activation value difference between two forward passes, effectively overcoming the above-mentioned limitations of applying CL. We demonstrated that LAC could be a very useful pretraining method using reconstruction as the pretext task. Moreover, through pretraining with LAC, the networks exhibited competitive performance in various downstream tasks compared with other unsupervised learning methods.
(This article belongs to the Special Issue Advances in Computer Vision and Multimedia Information Processing)
