Multimedia Content Analysis, Management and Retrieval: Trends and Challenges

A special issue of Electronics (ISSN 2079-9292). This special issue belongs to the section "Electronic Multimedia".

Deadline for manuscript submissions: closed (15 October 2022) | Viewed by 21739

Special Issue Editors

Dr. Zheng Wang
School of Computer Science, Wuhan University, Wuhan 430072, China
Interests: multimedia content analysis; image retrieval; artificial intelligence
Dr. Jian Zhao
Institute of North Electronic Equipment, Beijing 100191, China
Interests: artificial intelligence; pattern recognition; machine learning; computer vision; multimedia analytics
Dr. Hong Liu
National Institute of Informatics, Tokyo 101-8430, Japan
Interests: machine learning; computer vision; ML safety/reliability
Dr. Zhun Zhong
Multimedia and Human Understanding Group, University of Trento, 38123 Povo-Trento, Italy
Interests: person re-identification; novel class discovery; domain adaptation

Special Issue Information

Dear Colleagues,

In recent years, we have witnessed rapid advances in computing, communication, and storage technologies. Multimedia technology has gained enormous potential to improve processes in a wide range of areas, such as advertising, education, entertainment, healthcare, surveillance, wearable computing, biometrics, and remote sensing. Huge quantities of multimedia data require new and innovative approaches to modelling, processing, mining, organizing, and indexing these data so that multimedia content can be searched, retrieved, delivered, managed, and shared effectively and efficiently, as required by applications in the aforementioned fields. The main objective of this Special Issue is to bring together researchers and professionals from academia and industry around the world to discuss the wide spectrum of technological opportunities, challenges, solutions, and emerging applications in multimedia content analysis, management, and retrieval. We particularly encourage original work based on interdisciplinary research, such as research spanning computer science and social science. Topics of interest include, but are not limited to, the following:

  • Multimedia annotation, search, and retrieval;
  • Multimedia signal processing and analysis;
  • Multimedia content analysis and event detection;
  • Content-based analysis for multimedia data;
  • Image and video indexing and classification;
  • Multimodal processing and analysis;
  • Multimedia applications in education, medicine, surveillance, and remote sensing;
  • Human-computer interaction.

Dr. Zheng Wang
Dr. Jian Zhao
Dr. Hong Liu
Dr. Zhun Zhong
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Electronics is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • Multimedia analysis
  • Multimedia management
  • Multimedia retrieval
  • Signal processing
  • Image and video understanding
  • Human-computer interaction
  • Multimodal processing
  • Multimedia applications

Published Papers (10 papers)


Research


14 pages, 8366 KiB  
Article
Transformer-Based Multimodal Infusion Dialogue Systems
by Bo Liu, Lejian He, Yafei Liu, Tianyao Yu, Yuejia Xiang, Li Zhu and Weijian Ruan
Electronics 2022, 11(20), 3409; https://doi.org/10.3390/electronics11203409 - 20 Oct 2022
Cited by 1 | Viewed by 1951
Abstract
Multimodal dialogue systems have been gaining importance in several domains, such as retail, travel, and fashion. Several existing works have improved the understanding and generation of multimodal dialogues. However, there is still considerable room to improve the quality of output textual responses due to insufficient information infusion between the visual and textual semantics. Moreover, existing dialogue systems often generate defective knowledge-aware responses for tasks such as providing product attributes and celebrity endorsements. To address these issues, we present a Transformer-based Multimodal Infusion Dialogue (TMID) system that extracts the visual and textual information from dialogues via a transformer-based multimodal context encoder and employs a cross-attention mechanism to achieve information infusion between images and texts for each utterance. Furthermore, TMID uses adaptive decoders to generate appropriate multimodal responses based on the user intentions it has determined using a state classifier, and it enriches the output responses by incorporating domain knowledge into the decoders. The results of extensive experiments on a multimodal dialogue dataset demonstrate that TMID achieves state-of-the-art performance, improving the BLEU-4 score by 13.03, NIST by 2.77, and image selection Recall@1 by 1.84%.
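As a rough illustration of the cross-attention infusion step described above, the sketch below lets text-token features attend to image-region features with a standard attention layer; the class name, dimensions, and data are hypothetical and not taken from the paper.

```python
import torch
import torch.nn as nn

class CrossModalInfusion(nn.Module):
    """Minimal sketch: text tokens attend to image region features."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_feats, image_feats):
        # text_feats: (batch, n_tokens, dim); image_feats: (batch, n_regions, dim)
        fused, _ = self.attn(query=text_feats, key=image_feats, value=image_feats)
        return self.norm(text_feats + fused)  # residual keeps the textual semantics

fusion = CrossModalInfusion()
text = torch.randn(2, 20, 512)    # toy per-utterance textual features
image = torch.randn(2, 49, 512)   # toy visual region features
print(fusion(text, image).shape)  # torch.Size([2, 20, 512])
```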

25 pages, 3087 KiB  
Article
Integrated Framework to Assess the Extent of the Pandemic Impact on the Size and Structure of the E-Commerce Retail Sales Sector and Forecast Retail Trade E-Commerce
by Cristiana Tudor
Electronics 2022, 11(19), 3194; https://doi.org/10.3390/electronics11193194 - 5 Oct 2022
Cited by 7 | Viewed by 3476
Abstract
With customers’ increasing reliance on e-commerce and multimedia content after the outbreak of COVID-19, it has become crucial for companies to digitize their business methods and models. Consequently, COVID-19 has highlighted the prominence of e-commerce and new business models while disrupting conventional business activities. Hence, assessing and forecasting e-commerce growth is currently paramount for e-market planners, market players, and policymakers alike. This study sources data for the global e-commerce market leader, the US, and proposes an integrated framework that encompasses automated algorithms able to estimate six statistical and machine-learning univariate methods in order to accomplish two main tasks: (i) to produce accurate forecasts for e-commerce retail sales (e-sales) and the share of e-commerce in total retail sales (e-share); and (ii) to assess in quantitative terms the pandemic impact on the size and structure of the e-commerce retail sales sector. The results confirm that COVID-19 has significantly impacted the trend and structure of the US retail sales sector, producing cumulative excess (or abnormal) retail e-sales of $227.820 billion and a cumulative additional e-share of 10.61 percent. Additionally, estimates indicate a continuation of the increasing trend, with point estimates of $378.691 billion for US e-commerce retail sales, projected to account for 16.72 percent of total US retail sales by the end of 2025. Nonetheless, the current findings also document that the growth of e-commerce is not a consequence of the COVID-19 crisis; rather, the pandemic has accelerated the evolution of the e-commerce sector by at least five years. Overall, the study concludes that the shift towards e-commerce is permanent and, thus, governments (especially in developing countries) should prioritize policies aimed at harnessing e-commerce for sustainable development. Furthermore, in light of the research findings, digital transformation should constitute a top management priority for retail businesses.
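The "excess sales" arithmetic can be illustrated with a toy counterfactual: fit a trend on pre-pandemic quarters only, extend it through the pandemic period, and sum the gap between actual and counterfactual values. The linear trend and all numbers below are illustrative stand-ins for the six statistical and machine-learning methods the study actually estimates.

```python
import numpy as np

# Quarterly e-commerce retail sales in $bn (toy numbers); first 8 quarters pre-pandemic
actual = np.array([130.0, 134, 139, 145, 150, 156, 161, 168,   # pre-pandemic
                   205, 212, 220, 215])                        # pandemic quarters
t = np.arange(len(actual))
pre = slice(0, 8)

# Counterfactual: linear trend fitted on pre-pandemic data only
coef = np.polyfit(t[pre], actual[pre], deg=1)
counterfactual = np.polyval(coef, t)

excess = (actual - counterfactual)[8:]   # abnormal e-sales per pandemic quarter
print(f"cumulative excess e-sales: ${excess.sum():.1f}bn")
```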

11 pages, 1330 KiB  
Article
SCA-MMA: Spatial and Channel-Aware Multi-Modal Adaptation for Robust RGB-T Object Tracking
by Run Shi, Chaoqun Wang, Gang Zhao and Chunyan Xu
Electronics 2022, 11(12), 1820; https://doi.org/10.3390/electronics11121820 - 8 Jun 2022
Cited by 1 | Viewed by 1252
Abstract
The RGB and thermal (RGB-T) object tracking task is challenging, especially with various target changes caused by deformation, abrupt motion, background clutter, and occlusion. It is critical to exploit the complementary nature of visual RGB and thermal infrared data. In this work, we address the RGB-T object tracking task with a novel spatial- and channel-aware multi-modal adaptation (SCA-MMA) framework, which builds an adaptive feature learning process to better mine object-aware information in a unified network. For each modality, a spatial-aware adaptation mechanism is introduced to dynamically learn the location-based characteristics of specific tracking objects at multiple convolution layers. Furthermore, a channel-aware multi-modal adaptation mechanism is proposed to adaptively learn the feature fusion/aggregation of the different modalities. To perform object tracking, we employ a binary classification module with two fully connected layers to predict the bounding boxes of specific targets. Comprehensive evaluations on the GTOT and RGBT234 datasets demonstrate the significant superiority of the proposed SCA-MMA for robust RGB-T object tracking. In particular, the precision rate (PR) and success rate (SR) on the GTOT and RGBT234 datasets reach 90.5%/73.2% and 80.2%/56.9%, respectively, significantly higher than those of state-of-the-art algorithms.
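One plausible reading of the channel-aware adaptation is a squeeze-and-excitation-style gate that weighs RGB against thermal channels before fusion; the sketch below is an assumption-laden stand-in for the paper's mechanism, with invented names and sizes.

```python
import torch
import torch.nn as nn

class ChannelAwareFusion(nn.Module):
    """Sketch: learn per-channel weights to fuse RGB and thermal feature maps."""
    def __init__(self, channels=256, reduction=16):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(2 * channels, 2 * channels // reduction), nn.ReLU(),
            nn.Linear(2 * channels // reduction, 2 * channels), nn.Sigmoid(),
        )

    def forward(self, rgb, thermal):
        # rgb, thermal: (batch, C, H, W) feature maps from the two branches
        w = self.gate(torch.cat([rgb, thermal], dim=1))
        w = w.unsqueeze(-1).unsqueeze(-1)            # per-channel weights in [0, 1]
        w_rgb, w_th = w.chunk(2, dim=1)
        return w_rgb * rgb + w_th * thermal          # adaptively fused features

fuse = ChannelAwareFusion()
out = fuse(torch.randn(2, 256, 14, 14), torch.randn(2, 256, 14, 14))
print(out.shape)  # torch.Size([2, 256, 14, 14])
```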

20 pages, 6179 KiB  
Article
A Comparative Study of Reduction Methods Applied on a Convolutional Neural Network
by Aurélie Cools, Mohammed Amin Belarbi and Sidi Ahmed Mahmoudi
Electronics 2022, 11(9), 1422; https://doi.org/10.3390/electronics11091422 - 28 Apr 2022
Cited by 3 | Viewed by 1386
Abstract
With the emergence of smartphones, video surveillance cameras, social networks, and multimedia engines, as well as the development of the internet and connected objects (the Internet of Things, IoT), the number of available images is increasing very quickly. This creates the need to manage huge amounts of data using Big Data technologies. In this context, several sectors, such as security and medicine, need to extract image features (indexes) in order to find these data quickly and efficiently with high precision. To reach this goal, two main approaches exist in the literature. The first uses classical methods based on the extraction of visual features, such as color, texture, and shape, for indexing. The accuracy of these methods was acceptable until the early 2010s. The second approach is based on convolutional neural networks (CNNs), which offer better precision due to the large size of their descriptors, but this can increase search time and storage space. To decrease the search time, one needs to reduce the size of these vectors (descriptors) using dimensionality reduction methods. In this paper, we propose an approach that solves the “curse of dimensionality” problem through an efficient combination of convolutional neural networks and dimensionality reduction methods. Our contribution consists of defining the best combination of CNN layers and the regional maximum activation of convolutions (RMAC) method and its variants. Our combined approach provides reduced descriptors that accelerate the search and reduce storage space while maintaining precision. We conclude by proposing the best position of an RMAC layer, with an increase in accuracy ranging from 4.03% to 27.34%, a decrease in search time ranging from 89.66% to 98.14% depending on the CNN architecture, and a reduction in the size of the descriptor vector of 97.96% on the GHIM-10K benchmark database.
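For readers unfamiliar with RMAC, the sketch below shows the pooling idea in its simplest form: max-activation per channel over a few regions of a conv feature map, l2-normalized and summed into one compact descriptor. The grid-based region sampling here is a simplification, not the exact scheme used in the paper.

```python
import torch
import torch.nn.functional as F

def rmac_descriptor(fmap, grid=2):
    """fmap: (C, H, W) conv feature map -> (C,) compact global descriptor."""
    C, H, W = fmap.shape
    regions = [fmap]                              # whole-image region
    hs, ws = H // grid, W // grid
    for i in range(grid):                         # plus a simple grid of sub-regions
        for j in range(grid):
            regions.append(fmap[:, i*hs:(i+1)*hs, j*ws:(j+1)*ws])
    desc = torch.zeros(C)
    for r in regions:
        v = r.amax(dim=(1, 2))                    # max-activation per channel
        desc += F.normalize(v, dim=0)             # l2-normalize, then sum-aggregate
    return F.normalize(desc, dim=0)               # final l2-normalized descriptor

d = rmac_descriptor(torch.randn(512, 14, 14))
print(d.shape)  # torch.Size([512]) -- far smaller than the flattened activations
```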

10 pages, 7456 KiB  
Article
Quality Assessment of View Synthesis Based on Visual Saliency and Texture Naturalness
by Lijuan Tang, Kezheng Sun, Shuaifeng Huang, Guangcheng Wang and Kui Jiang
Electronics 2022, 11(9), 1384; https://doi.org/10.3390/electronics11091384 - 26 Apr 2022
Cited by 2 | Viewed by 1594
Abstract
Depth-Image-Based Rendering (DIBR) is one of the core techniques for generating new views in 3D video applications. However, the distortion characteristics of DIBR-synthesized views differ from those of 2D images. It is therefore necessary to study the unique distortion characteristics of DIBR views and to design effective and efficient algorithms that evaluate DIBR-synthesized images and guide DIBR algorithms. In this work, visual saliency and texture naturalness features are extracted to evaluate the quality of DIBR views. After extracting these features, we adopt a machine learning method to map them to the quality scores of the DIBR views. Experiments conducted on two synthesized-view databases, IETR and IRCCyN/IVC, show that the proposed algorithm performs better than the compared synthesized-view quality evaluation methods.
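The "machine learning method" that maps features to a quality score is, in much image-quality work, a support vector regressor; here is a generic, hypothetical sketch of that mapping step with made-up feature vectors and scores.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 12))    # toy saliency + naturalness features per view
y = rng.uniform(1, 5, size=200)   # toy subjective quality scores (e.g., MOS)

model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0))
model.fit(X[:150], y[:150])       # learn the feature-to-quality mapping
print(model.predict(X[150:155]))  # predicted quality scores for held-out views
```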

12 pages, 3207 KiB  
Article
Part-Aware Refinement Network for Occlusion Vehicle Detection
by Qifan Wang, Ning Xu, Baojin Huang and Guangcheng Wang
Electronics 2022, 11(9), 1375; https://doi.org/10.3390/electronics11091375 - 25 Apr 2022
Cited by 3 | Viewed by 1593
Abstract
Traditional machine learning approaches are susceptible to factors such as object scale and occlusion, leading to low detection efficiency and poor versatility in vehicle detection applications. To tackle this issue, we propose a part-aware refinement network that combines multi-scale training and component confidence generation strategies for vehicle detection. Specifically, we divide the original single-valued prediction confidence and adopt the confidence of the visible part of the vehicle to correct the absolute detection confidence of the vehicle, which reduces the impact of occlusion on detection performance. Simultaneously, we relabel the KITTI data, adding detailed occlusion information for the vehicles; the deep neural network model is then trained and tested on the new images. Our proposed method automatically extracts vehicle features and mitigates the large localization errors of traditional approaches. Extensive experimental results on the KITTI dataset show that our method significantly outperforms state-of-the-art methods while maintaining the detection time.
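The confidence-correction idea, tempering a box's whole-vehicle score with the score of its visible part, can be sketched as a simple re-scoring rule; the blending weight below is a hypothetical choice, not the paper's exact formula.

```python
def corrected_confidence(det_conf, visible_conf, alpha=0.5):
    """Blend whole-vehicle confidence with visible-part confidence.

    det_conf:     score for the full (possibly occluded) vehicle box
    visible_conf: score for the visible part of the vehicle
    alpha:        blending weight (hypothetical; the paper's rule may differ)
    """
    return alpha * det_conf + (1 - alpha) * visible_conf

# An occluded vehicle: low full-box score, but a confident visible part
print(corrected_confidence(0.35, 0.90))  # 0.625 -- rescued from suppression
```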

20 pages, 2735 KiB  
Article
Vehicle Re-Identification with Spatio-Temporal Model Leveraging by Pose View Embedding
by Wenxin Huang, Xian Zhong, Xuemei Jia, Wenxuan Liu, Meng Feng, Zheng Wang and Shin’ichi Satoh
Electronics 2022, 11(9), 1354; https://doi.org/10.3390/electronics11091354 - 24 Apr 2022
Cited by 3 | Viewed by 1448
Abstract
Vehicle re-identification (Re-ID) research has intensified as numerous advancements have been made alongside the rapid development of person Re-ID. In this paper, we tackle the vehicle Re-ID problem in open scenarios. This research differs from early-stage studies that focused on a fixed view, and it faces more challenges due to view variations, illumination changes, occlusions, etc. Inspired by person Re-ID research, we propose leveraging pose views to enhance the discrimination performance of visual features and utilizing keypoints to improve the accuracy of pose recognition. However, visual appearance information remains limited by changing surroundings and the extremely similar appearances of vehicles. To the best of our knowledge, few methods have exploited spatio-temporal information to supplement visual appearance information, and those that have neglect the influence of driving direction. Considering the peculiar characteristics of vehicle movement, we observe that vehicle poses in camera views, which indicate driving direction, are closely related to spatio-temporal cues. Consequently, we design a two-branch framework for vehicle Re-ID, consisting of a Keypoint-based Pose Embedding Visual (KPEV) model and a Keypoint-based Pose-Guided Spatio-Temporal (KPGST) model. These models are integrated into the framework, and the results of KPEV and KPGST are fused based on a Bayesian network. Extensive experiments performed on the VeRi-776 and VehicleID datasets, which relate to functional urban surveillance scenarios, demonstrate the competitive performance of our proposed approach.
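In its simplest naive form, fusing an appearance score with a spatio-temporal probability amounts to multiplying the two likelihoods and normalizing over gallery candidates; the paper's Bayesian network is richer than this sketch, and the numbers are invented.

```python
import numpy as np

def bayes_fuse(visual_sim, st_prob):
    """Naive fusion: joint score proportional to the product of the two cues."""
    joint = visual_sim * st_prob
    return joint / joint.sum()               # normalize over gallery candidates

visual_sim = np.array([0.70, 0.65, 0.20])    # KPEV-style appearance similarities
st_prob    = np.array([0.10, 0.80, 0.50])    # KPGST-style transition probabilities
print(bayes_fuse(visual_sim, st_prob))       # candidate 2 wins after fusion
```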

12 pages, 10733 KiB  
Article
PM2.5 Concentration Measurement Based on Image Perception
by Guangcheng Wang, Quan Shi and Kui Jiang
Electronics 2022, 11(9), 1298; https://doi.org/10.3390/electronics11091298 - 20 Apr 2022
Cited by 3 | Viewed by 2105
Abstract
PM2.5 in the atmosphere causes severe air pollution and dramatically affects the normal production and daily lives of residents. Real-time monitoring of PM2.5 concentrations has important practical significance for the construction of an ecological civilization. Mainstream PM2.5 concentration prediction algorithms based on electrochemical sensors have disadvantages such as high economic cost, high labor cost, and time delays. To this end, we propose a simple and effective PM2.5 concentration prediction algorithm based on image perception. Specifically, the proposed method develops a natural scene statistics prior to estimate the saturation loss caused by the ’haze’ formed by PM2.5. After extracting the prior features, we use a feedforward neural network to map the proposed prior features to PM2.5 concentration values. Experiments conducted on the public Air Quality Image Dataset (AQID) show the superiority of our proposed PM2.5 concentration measurement method compared to related state-of-the-art PM2.5 concentration monitoring methods.
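A toy rendering of the pipeline: haze depresses color saturation, so a statistic of the saturation channel can serve as a prior feature, regressed to PM2.5 with a small feedforward network. The single feature, network size, and data below are all illustrative assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def saturation_feature(img):
    """Mean HSV-style saturation of an RGB image in [0, 1]; haze pushes this down."""
    mx, mn = img.max(axis=-1), img.min(axis=-1)
    return np.mean((mx - mn) / (mx + 1e-8))

rng = np.random.default_rng(0)
imgs = rng.uniform(size=(100, 64, 64, 3))     # toy scene images
X = np.array([[saturation_feature(im)] for im in imgs])
y = rng.uniform(10, 200, size=100)            # toy PM2.5 labels (ug/m^3)

net = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
net.fit(X[:80], y[:80])
print(net.predict(X[80:85]))                  # predicted concentrations
```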

27 pages, 1358 KiB  
Article
Bagged Tree and ResNet-Based Joint End-to-End Fast CTU Partition Decision Algorithm for Video Intra Coding
by Yixiao Li, Lixiang Li, Yuan Fang, Haipeng Peng and Nam Ling
Electronics 2022, 11(8), 1264; https://doi.org/10.3390/electronics11081264 - 16 Apr 2022
Cited by 8 | Viewed by 1907
Abstract
Video coding standards, such as high-efficiency video coding (HEVC), versatile video coding (VVC), and AOMedia Video 2 (AV2), achieve optimal encoding performance by traversing all possible combinations of coding unit (CU) partitions and selecting the combination with the minimum coding cost. It is still necessary to further reduce the encoding time of HEVC, because HEVC is one of the most widely used coding standards, and this search for the best partition is the source of most of the encoding complexity. To reduce the complexity of coding block partitioning in HEVC, a new end-to-end fast algorithm is presented to aid the partition structure decisions of the coding tree unit (CTU) in intra coding. In the proposed method, the partition structure decision problem of a CTU is solved with a novel two-stage strategy. In the first stage, a bagged tree model predicts the splitting of a CTU. In the second stage, the partition problem of a 32 × 32-sized CU is modeled as a 17-output classification task for the first time, so that it can be solved by a single prediction. To achieve high prediction accuracy, a residual network (ResNet) with 34 layers is employed. Jointly using the bagged tree and ResNet, the proposed fast CTU partition algorithm generates the partition quad-tree structure of a CTU through an end-to-end prediction process, abandoning the traditional scheme of making multiple decisions at various depth levels. In addition, several datasets are used in this paper to lay the foundation for high prediction accuracy. Compared with the original HM16.7 encoder, the experimental results show that the proposed algorithm reduces the encoding time by 60.29% on average, while the Bjøntegaard delta rate (BD-rate) loss is as low as 2.03%, outperforming most state-of-the-art approaches in the field of fast intra CU partitioning.
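The two-stage decision flow can be mocked up as follows: a bagged-tree classifier first decides whether the 64 × 64 CTU splits at all, and a second 17-way classifier (a ResNet-34 in the paper, replaced here by a placeholder) picks the partition pattern of each 32 × 32 CU. Features and labels are synthetic.

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Stage 1: binary decision -- does the 64x64 CTU split into four 32x32 CUs?
X_ctu = rng.normal(size=(500, 10))       # toy CTU features (e.g., texture statistics)
y_ctu = rng.integers(0, 2, size=500)
stage1 = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50).fit(X_ctu, y_ctu)

def predict_partition(ctu_feat, cu_feats, classify_cu_17way):
    """End-to-end partition sketch for one CTU."""
    if stage1.predict(ctu_feat[None])[0] == 0:
        return "64x64, no split"
    # Stage 2: one 17-class prediction per 32x32 CU (ResNet-34 in the paper)
    return [classify_cu_17way(f) for f in cu_feats]

fake_resnet = lambda f: int(rng.integers(0, 17))   # placeholder for the CNN stage
print(predict_partition(rng.normal(size=10), rng.normal(size=(4, 10)), fake_resnet))
```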

Review


18 pages, 2168 KiB  
Review
Visible-Infrared Person Re-Identification: A Comprehensive Survey and a New Setting
by Huantao Zheng, Xian Zhong, Wenxin Huang, Kui Jiang, Wenxuan Liu and Zheng Wang
Electronics 2022, 11(3), 454; https://doi.org/10.3390/electronics11030454 - 3 Feb 2022
Cited by 10 | Viewed by 3462
Abstract
Person re-identification (ReID) plays a crucial role in video surveillance, with the aim of searching for a specific person across disjoint cameras, and it has progressed notably in recent years. However, visible cameras may not record enough information about a pedestrian’s appearance under low illumination. In contrast, thermal infrared images can significantly mitigate this issue. Combining visible images with infrared images is therefore a natural trend, even though the two are considerably heterogeneous modalities. Several recent attempts have been devoted to visible-infrared person re-identification (VI-ReID). This paper provides a complete overview of current VI-ReID approaches that employ deep learning algorithms. To align with practical application scenarios, we first propose a new testing setting and systematically evaluate state-of-the-art methods under it. Then, we compare ReID with VI-ReID in three aspects: data composition, challenges, and performance. Based on a summary of previous work, we classify the existing methods into two categories. Additionally, we elaborate on frequently used datasets and metrics for performance evaluation. We give insights into the historical development, summarize the limitations of off-the-shelf methods, and finally discuss the future directions of VI-ReID that the community should address.
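The comparisons the survey reports rest on the standard ReID metrics; as a reference point, here is a compact single-query sketch of rank-k accuracy and average precision given similarity scores against a gallery (synthetic data).

```python
import numpy as np

def rank_k(sim, gallery_ids, query_id, k=1):
    """1 if a correct match appears among the k most similar gallery items."""
    order = np.argsort(-sim)
    return int(query_id in gallery_ids[order[:k]])

def average_precision(sim, gallery_ids, query_id):
    order = np.argsort(-sim)
    rel = (gallery_ids[order] == query_id).astype(float)
    if rel.sum() == 0:
        return 0.0
    prec_at_hits = np.cumsum(rel) / np.arange(1, len(rel) + 1)
    return float((prec_at_hits * rel).sum() / rel.sum())

sim  = np.array([0.90, 0.40, 0.75, 0.60])  # query-to-gallery similarities
gids = np.array([3, 7, 3, 1])              # gallery identity labels
print(rank_k(sim, gids, query_id=3), average_precision(sim, gids, query_id=3))
```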
