#### **1. Introduction**

At the 2012 ImageNet Large Scale Visual Recognition Challenge (ILSVRC), a global image recognition contest, the University of Toronto SuperVision team led by Prof. Geoffrey Hinton took first and second place by a landslide, sparking an explosion of interest in deep learning. Since then, experts and companies worldwide, such as Google, Microsoft, NVIDIA, and Intel, have been competing to lead in artificial intelligence technologies such as deep learning. They are now developing deep-learning-based technologies that can be applied across industries to solve many classification and recognition problems.

These artificial intelligence technologies are also actively applied to broadcasting and multimedia processing technologies based on recognition and classification [1–3]. A vast amount of research has been conducted in a wide variety of fields, such as content creation, transmission, and security, and attempts have been made in the past two to three years to improve the compression efficiency of image, video, speech, and other data in areas related to MPEG media processing technology [4–6]. Additionally, media creation, processing, editing, and scenario creation are very important areas of research in multimedia processing and engineering. In this issue, we present excellent papers related to advanced computational intelligence algorithms and technologies for emerging multimedia processing.

#### **2. Emerging Multimedia Signal Processing**

Thirteen papers related to artificial intelligence for multimedia signal processing have been published in this Special Issue. They deal with a broad range of topics concerning advanced computational intelligence algorithms and technologies for emerging multimedia signal processing.

We present the following works in relation to the computer vision field. Lee et al. propose a densely cascading image restoration network (DCRN) consisting of an input layer, a densely cascading feature extractor, a channel attention block, and an output layer [7]. The densely cascading feature extractor has three densely cascading (DC) blocks, and each DC block contains two convolutional layers. With this design, they achieve better quality measures on Joint Photographic Experts Group (JPEG) compressed images than existing methods. In [8], an image de-raining approach is developed using the generative capabilities of recently introduced conditional generative adversarial networks (cGANs). This method could be very useful for recovering visual quality that has been degraded by diverse weather conditions, recording conditions, or motion blur.
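As a rough illustration of what a channel attention block does, the sketch below rescales each feature channel by a learned gate. It assumes a squeeze-and-excitation style design (global average pooling, two small fully connected layers, a sigmoid gate); the editorial does not say which variant DCRN uses, and the function name `channel_attention` and the weight matrices `w1` and `w2` are hypothetical.

```python
import math

def channel_attention(feature_maps, w1, w2):
    """Rescale each channel by a gate computed from its global average.

    feature_maps: list of channels, each a 2-D list (height x width).
    w1: hidden x channels weights, w2: channels x hidden weights.
    Both weight matrices are hypothetical stand-ins for learned parameters.
    """
    # Squeeze: one scalar per channel via global average pooling.
    squeezed = [
        sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
        for ch in feature_maps
    ]
    # Excite: fully connected layer with ReLU, then one with sigmoid.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w2
    ]
    # Scale: multiply every value in a channel by that channel's gate.
    scaled = [
        [[g * v for v in row] for row in ch]
        for ch, g in zip(feature_maps, gates)
    ]
    return scaled, gates

# Toy input: two 2x2 channels with constant values (assumed data).
maps = [[[1.0, 1.0], [1.0, 1.0]],
        [[3.0, 3.0], [3.0, 3.0]]]
w1 = [[1.0, 1.0]]        # one hidden unit
w2 = [[0.0], [1.0]]      # one gate per channel
scaled, gates = channel_attention(maps, w1, w2)
```

Each gate lies strictly between 0 and 1, so the block can only attenuate channels relative to one another; which channels matter is learned during training.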

**Citation:** Kim, B.-G.; Jun, D.-S. Artificial Intelligence for Multimedia Signal Processing. *Appl. Sci.* **2022**, *12*, 7358. https://doi.org/10.3390/app12157358

Received: 15 July 2022 Accepted: 21 July 2022 Published: 22 July 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Additionally, Wu et al. suggest a framework to leverage the sentimental interaction characteristic based on a graph convolutional network (GCN) [9]. They first utilize an off-the-shelf tool to recognize the objects and build a graph over them. Visual features are represented as nodes, and the emotional distances between the objects act as edges. Then, they employ GCNs to obtain the interaction features among the objects, which are fused with the CNN output of the whole image to predict the result. This approach is very useful for human sentiment analysis. In [10], two lightweight neural networks with a hybrid residual and dense connection structure are suggested by Kim et al. to improve super-resolution performance. They show that the proposed methods could significantly reduce both the inference time and the memory required to store parameters and intermediate feature maps, while maintaining similar image quality compared to the previous methods.
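Returning to the graph-based approach of [9] described above, the following is a minimal sketch of a single graph convolution followed by fusion with a whole-image feature. It assumes the commonly used propagation rule ReLU(D^{-1/2}(A + I)D^{-1/2} H W); the editorial does not state which rule the authors use. The adjacency values, node features, weight matrix, and `cnn_global` vector are all made-up illustrative data.

```python
import math

def normalize_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2} for a square, non-negative adjacency matrix."""
    n = len(adj)
    # Self-loops let every node keep part of its own feature.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def gcn_layer(adj, features, weights):
    """One graph convolution: ReLU(A_norm @ H @ W)."""
    a_norm = normalize_adjacency(adj)
    out = matmul(matmul(a_norm, features), weights)
    return [[max(0.0, v) for v in row] for row in out]

# Three detected objects (nodes) with 2-D visual features (assumed values).
# Edge weights stand in for affinities derived from emotional distance.
adj = [[0.0, 1.0, 0.0],
       [1.0, 0.0, 1.0],
       [0.0, 1.0, 0.0]]
features = [[1.0, 0.0],
            [0.0, 1.0],
            [1.0, 1.0]]
weights = [[1.0, 0.0],
           [0.0, 1.0]]  # identity, so the effect of propagation is easy to see

interaction = gcn_layer(adj, features, weights)
# Average the node features, then concatenate with a whole-image CNN feature.
pooled = [sum(col) / len(col) for col in zip(*interaction)]
cnn_global = [0.5, 0.25]
fused = pooled + cnn_global
```

Concatenation is only one possible fusion strategy; the fused vector would then be passed to a classifier that predicts the sentiment label.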

Kim et al. propose an efficient scene classification algorithm for three different classes by detecting objects in the scene [11]. The authors utilize a pre-trained semantic segmentation model to extract objects from an image. After that, they construct a weighting matrix to better determine the scene class. Finally, the method classifies an image into one of three scene classes (i.e., indoor, nature, city) using the designed weighting matrix. This technique can be utilized for semantic searches in multimedia databases.
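The idea of mapping segmented objects to a scene class through a weighting matrix can be sketched as follows. This is an assumption-laden toy, not the authors' design: the object classes, the weight values in `WEIGHTS`, and the use of pixel fractions as object evidence are all hypothetical.

```python
SCENES = ["indoor", "nature", "city"]

# Hypothetical weighting matrix: one row per object class,
# one column per scene class (indoor, nature, city).
WEIGHTS = {
    "sofa":     [0.90, 0.00, 0.10],
    "tree":     [0.00, 0.80, 0.20],
    "building": [0.05, 0.05, 0.90],
    "sky":      [0.00, 0.50, 0.50],
}

def classify_scene(pixel_counts):
    """Pick the scene with the highest weighted object evidence.

    pixel_counts maps an object class to the number of pixels that a
    segmentation model assigned to it (assumed input format).
    """
    total = sum(pixel_counts.values())
    if total == 0:
        raise ValueError("no segmented pixels")
    scores = [0.0] * len(SCENES)
    for obj, count in pixel_counts.items():
        row = WEIGHTS.get(obj)
        if row is None:
            continue  # object classes without a weight row are ignored
        fraction = count / total
        for k, w in enumerate(row):
            scores[k] += fraction * w
    best = max(range(len(SCENES)), key=scores.__getitem__)
    return SCENES[best], scores

label, scores = classify_scene({"tree": 600, "sky": 300, "building": 100})
```

With these made-up numbers, the image is dominated by trees and sky, so the nature column accumulates the largest score.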

Lastly, an estimation method for human height is proposed by Lee et al. using color and depth information [12]. They apply Mask R-CNN to color images to detect a human body and a human head separately. If color images cannot be used to extract the human body region because of a low-light environment, then the human body region is extracted by comparison with the current frame in the depth video.
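One simple way to turn a detected head and body region plus a depth reading into a metric height is the pinhole camera relation, sketched below. The editorial does not describe the authors' actual computation, so this is only an assumption: it presumes an upright person roughly parallel to the image plane, and the function name and all parameter names are hypothetical.

```python
def estimate_height_m(head_top_px, body_bottom_px, depth_m, focal_length_px):
    """Estimate height with the pinhole relation: size = pixels * depth / focal.

    head_top_px:     image row of the top of the detected head region
    body_bottom_px:  image row of the bottom of the detected body region
    depth_m:         distance from the camera to the person, in meters
    focal_length_px: camera focal length expressed in pixels
    """
    if depth_m <= 0 or focal_length_px <= 0:
        raise ValueError("depth and focal length must be positive")
    pixel_extent = body_bottom_px - head_top_px
    if pixel_extent <= 0:
        raise ValueError("body bottom must lie below head top in the image")
    return pixel_extent * depth_m / focal_length_px

# Assumed example values: a 350-pixel-tall person seen 2.5 m away
# by a camera with a 500-pixel focal length.
height = estimate_height_m(100, 450, 2.5, 500.0)
```

A practical system would also need to handle posture, camera tilt, and depth noise, none of which this sketch addresses.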

For speech, sound, and text processing, Lin et al. improve on a raw-signal-input network from other research by using deeper network architectures [13]. They also propose a network architecture that can combine different kinds of network feeds with different features. In the experiment, the proposed scheme achieves an accuracy of 73.55% on the open audio dataset, "Dataset for Environmental Sound Classification 50" (ESC50). A multi-scale discriminator that discriminates between real and generated speech at various sampling rates is devised by Kim et al. to stabilize GAN training [14]. In this paper, the proposed structure is compared with conventional GAN-based speech enhancement algorithms using the VoiceBank-DEMAND dataset. They show that the proposed approach can make the training faster and more stable.

To convert speech, a multimodal unsupervised scheme is proposed by Lee and Park [15]. They build a variational autoencoder (VAE)-based speech conversion network that decomposes the spectral features of the speech into a speaker-invariant content factor and a speaker-specific style factor to estimate diverse and robust speech styles. This approach can help second language (L2) speech education. To develop a 3D avatar-based sign language learning system, Chakladar et al. suggest a system that converts the input speech/text into corresponding sign movements for Indian Sign Language (ISL) [16]. The translation module achieves a sign error rate (SER) of 10.50 in the actual test.
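For readers unfamiliar with the metric, a sign error rate can be computed like a word error rate. The sketch below assumes that SER is the edit distance between the reference and predicted sign sequences divided by the reference length, expressed as a percentage; the editorial does not give the exact definition used in [16], and the gloss sequences are made up.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences of tokens."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def sign_error_rate(ref, hyp):
    """Edit distance normalized by reference length, as a percentage."""
    if not ref:
        raise ValueError("reference sequence must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)

# Hypothetical sign glosses: one substitution in a four-sign reference.
reference = ["I", "GO", "SCHOOL", "TOMORROW"]
predicted = ["I", "GO", "HOME", "TOMORROW"]
ser = sign_error_rate(reference, predicted)
```

Under this definition, lower is better, and a value can exceed 100 when the prediction contains many insertions.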

Two papers concern content analysis and information mining. The first one, by Krishna Kumar Thirukokaranam Chandrasekar and Steven Verstockt, presents a context-based structure mining pipeline [17]. The proposed scheme not only attempts to enrich the content, but also simultaneously splits it into shots and logical story units (LSU). They demonstrate quantitatively that the pipeline outperforms existing state-of-the-art methods for shot boundary detection, scene detection, and re-identification tasks. The other paper outlines a framework that can learn the multimodal joint representation of pins, including text representation, image representation, and multimodal fusion [18]. In this work, the authors combine image representations and text representations in a multimodal form. It is shown that the proposed multimodal joint representation outperforms unimodal representation in different recommendation tasks.

For ECG signal processing, Tanoh and Napoletano propose a 1D convolutional neural network (CNN) that exploits a novel analysis of the correlation between the two leads of the noisy electrocardiogram (ECG) to classify heartbeats [19]. This approach is one-dimensional, enabling complex structures while maintaining reasonable computational complexity.
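To make the notion of inter-lead correlation concrete, the sketch below computes the plain Pearson correlation between two ECG leads over one heartbeat window. This is only the textbook statistic; the "novel analysis" in [19] is not described in this editorial, and the sample values are invented.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat signal carries no correlation information
    return cov / math.sqrt(var_x * var_y)

# Made-up samples for one beat window on two leads.
lead_1 = [0.0, 0.1, 0.9, 0.2, 0.0, -0.1]
lead_2 = [0.0, 0.2, 1.7, 0.5, 0.1, -0.2]
rho = pearson(lead_1, lead_2)
```

A per-beat value like `rho` could, for example, be supplied to a classifier alongside the raw samples, although whether [19] does so is not stated here.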

We hope that the technical papers published in this Special Issue can help researchers and readers to understand the emerging theories and technologies in the field of multimedia signal processing.

**Funding:** This research received no external funding.

**Acknowledgments:** We thank all authors who submitted excellent research work to this Special Issue. We are grateful to all reviewers who contributed evaluations of scientific merits and quality of the manuscripts and provided countless valuable suggestions to improve their quality and the overall value for the scientific community. Our special thanks go to the editorial board of MDPI Applied Sciences journal for the opportunity to guest edit this Special Issue, and to the Applied Sciences Editorial Office staff for the hard and precise work required to keep to a rigorous peer-review schedule and complete timely publication.

**Conflicts of Interest:** The authors declare no conflict of interest.
