### *3.3. Domain Adaptation*

In the training phase, the network was trained to match audio with objects and scenes. In the domain adaptation phase, we developed a method to train our network to match music with a painting, which is challenging because no related dataset exists. We therefore focused on emotionally appealing advertisements for our dataset. The purpose of an emotionally appealing advertisement is to convey emotional impressions to the customer rather than rational product information. A new paradigm of emotionally appealing product advertisements has recently emerged, in which the aim is not to showcase the product itself but to convey a brand image and overall atmosphere. These advertisements create a sense of artistry and atmosphere that transcends the boundary between usability and beauty by using well-matched colors and atmospheric music. Thus, an advertisement video can be regarded as labeling data that experts have already matched well. In this work, we matched paintings and music by using color and audio atmosphere rather than simply matching audio to objects or scenes; therefore, we used product advertisements. Examples of the emotionally appealing advertisements used are provided in Figure 3. The data were collected manually according to the following criteria: first, whether background music was included; second, whether the advertisement emphasized color; and third, whether conversation was present. If conversation was frequent, the advertisement was excluded from the collection. If a collected advertisement contained several atmospheric parts, the video was divided into multiple clips at the points where the atmosphere changed. For example, each row in Figure 3 comes from the same advertisement, divided at the points where the atmosphere changed.

**Figure 3.** Dataset of emotionally appealing advertisements for domain adaptation.

Figure 4 shows our domain adaptation method, whose key feature is mutual learning on the advertisement data with symmetric *KL* divergence. The purpose of the mutual learning was not to increase task accuracy, but to train the two features to become the same. We therefore used the advertising videos to fine-tune the network through mutual learning with symmetric *KL* divergence for domain adaptation. This domain adaptation had two main objectives. First, we wanted our system to learn to match sounds and images based on color and atmosphere. Second, through the symmetric *KL* divergence, we aimed to extract the two features equally rather than mapping the sound into the image feature space.

$$D_{\text{SymmetricKL}}(P, Q) = D_{\text{KL}}(P \parallel Q) + D_{\text{KL}}(Q \parallel P) = \sum_{x \in \chi} P(x) \ln \frac{P(x)}{Q(x)} + \sum_{x \in \chi} Q(x) \ln \frac{Q(x)}{P(x)}.\tag{1}$$

Equation (1) is the symmetric *KL* divergence used for mutual learning in this study. The objective of the mutual learning is to train the model so that the image and audio features are extracted equally; the image distribution is denoted by *P* and the audio distribution by *Q*. Importantly, unlike in the training phase, no model was frozen, so that the gap between the distributions of the two features could be reduced.
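
For concreteness, the following is a minimal PyTorch sketch of Equation (1) used as a mutual-learning loss. The function and argument names (`symmetric_kl_loss`, `image_logits`, `audio_logits`) are hypothetical and introduced only for illustration; the sketch assumes each branch produces logits over a shared feature dimension.

```python
import torch
import torch.nn.functional as F

def symmetric_kl_loss(image_logits: torch.Tensor, audio_logits: torch.Tensor) -> torch.Tensor:
    """Equation (1): D_KL(P || Q) + D_KL(Q || P), with P the image distribution
    and Q the audio distribution, both obtained here via a softmax over logits."""
    p = F.softmax(image_logits, dim=-1)                 # P(x): image branch
    q = F.softmax(audio_logits, dim=-1)                 # Q(x): audio branch
    log_p = F.log_softmax(image_logits, dim=-1)
    log_q = F.log_softmax(audio_logits, dim=-1)
    # F.kl_div(input, target) expects log-probabilities as `input` and probabilities
    # as `target`, and computes D_KL(target || exp(input)).
    kl_pq = F.kl_div(log_q, p, reduction="batchmean")   # D_KL(P || Q)
    kl_qp = F.kl_div(log_p, q, reduction="batchmean")   # D_KL(Q || P)
    return kl_pq + kl_qp
```

Because no model is frozen in this phase, gradients from this loss flow into both the image and audio branches, which is what pulls the two feature distributions together.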

**Figure 4.** Domain adaptation phase with symmetric KL divergence based on mutual learning.

### *3.4. Audio Feature Extraction via Multi-Time-Scale Transform*

Algorithm 1 describes audio feature extraction via the multi-time-scale transform, which extracts features from an RGB image rather than a conventional grayscale image. This method improves upon WaveMsNet [45,46], the SpecAugment method [65], and time-wise multi-inference strategies. In the initialization step, the hyper-parameters of the FFT were obtained experimentally from values commonly used on ESC-50, and the hyper-parameters of the MTST were obtained via a greedy search. The search ranges were as follows: the steps were [50, 100, 150, 200, 250, 300] and x_size was [224, 401, 501, 601, 701, 801]. M denotes the conversion of raw audio into a mel-spectrogram followed by power-to-decibel (dB) conversion. Audio features were then extracted in a similar manner, except that we used the multi-time-scale transform. The model we used was a Wide ResNet-101 with double width.
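
As an illustration of the conversion M, the librosa-based sketch below converts raw audio into a mel-spectrogram and then converts power to decibels. The sampling rate, FFT size, hop length, and mel-band count shown are placeholder values, not the hyper-parameters obtained in our search.

```python
import librosa
import numpy as np

def mel_db(waveform: np.ndarray, sr: int = 44100, n_fft: int = 1024,
           hop_length: int = 512, n_mels: int = 128) -> np.ndarray:
    """Conversion M: raw audio -> mel-spectrogram -> power-to-dB (placeholder parameters)."""
    mel = librosa.feature.melspectrogram(y=waveform, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels, power=2.0)
    return librosa.power_to_db(mel, ref=np.max)
```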

Figure 5 shows the key idea of the multi-time-scale transform. The two figures are the same graph, with the three-dimensional visualization shown on the left and the two-dimensional visualization on the right. The *x*-axis shows frequency, the *y*-axis shows time, and the *z*-axis shows power. A heatmap-based RGB image is shown on the left, while the multi-time-scale transform image is shown on the right. The heatmap-based RGB image is determined by power alone; in other words, even when the information from the three channels is combined, only grayscale information is available, because that grayscale information is merely distributed among the three channels according to the power level. We instead distributed information across the RGB channels to increase the total information. The spectrogram was up-scaled to match the input shape via bilinear interpolation and then divided into three parts along the time axis for multi-scale inference (see Algorithm 1).
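
To make the channel distribution concrete, the sketch below follows one plausible reading of the multi-time-scale transform: a mel-spectrogram is computed for each of three step (hop) values, each is up-scaled to the model input size via bilinear interpolation, and the three results are stacked as the R, G, and B channels. The specific step values and the 224 × 224 target size are illustrative picks from the search ranges above; the exact channel assignment is an assumption made for this example.

```python
import librosa
import numpy as np
import torch
import torch.nn.functional as F

def multi_time_scale_rgb(waveform: np.ndarray, sr: int = 44100,
                         steps=(50, 150, 300), size: int = 224) -> torch.Tensor:
    """Sketch: one mel-spectrogram per time step, each bilinearly up-scaled to
    (size, size) and assigned to one of the three RGB channels."""
    channels = []
    for hop in steps:
        mel = librosa.feature.melspectrogram(y=waveform, sr=sr, hop_length=hop)
        mel_db = librosa.power_to_db(mel, ref=np.max)
        t = torch.from_numpy(mel_db).float()[None, None]          # (1, 1, freq, time)
        t = F.interpolate(t, size=(size, size),
                          mode="bilinear", align_corners=False)   # bilinear up-scaling
        channels.append(t[0, 0])
    return torch.stack(channels, dim=0)                           # (3, size, size) RGB-like input
```

The resulting three-channel tensor can then be fed to the Wide ResNet-101 backbone like an ordinary RGB image.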

**Figure 5.** Lighting-chart-based simulation plots to help understand the key idea of the multi-time-scale transform.

