### *2.4. Knowledge Distillation*

Knowledge distillation originated with the mimic model [54] and is a method for learning the differences between distributions for model compression. It has therefore been used in various fields related to distribution learning, such as domain adaptation [55] and knowledge transfer [56], and it is also used as a learning method in SoundNet frameworks. There are two main considerations when applying knowledge distillation. The first is the choice of the knowledge to be learned, such as score maps [57], feature maps [58], attention maps [59], Jacobian matrices [60], or decision boundaries [61]. The second is how the distilled knowledge is transferred, such as through mutual learning [62], knowledge projection [63], or teacher assistance [64].
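To make the first consideration concrete, the following is a minimal sketch of score-map (logit) distillation in the style of [57], written in plain NumPy. The function names, the temperature value, and the batch-averaging choice are our illustrative assumptions, not details taken from the cited works.

```python
import numpy as np

def softmax(x, t=1.0):
    """Temperature-scaled softmax along the last axis (numerically stable)."""
    z = x / t
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def soft_target_loss(student_logits, teacher_logits, t=4.0, eps=1e-12):
    """Score-map distillation: KL divergence between the teacher's and
    the student's temperature-softened class distributions, averaged
    over the batch and scaled by t^2 to keep gradient magnitudes
    comparable across temperatures."""
    p = softmax(teacher_logits, t)  # teacher soft targets
    q = softmax(student_logits, t)  # student predictions
    kl = (p * (np.log(p + eps) - np.log(q + eps))).sum(axis=-1)
    return float(kl.mean() * t * t)
```

In practice this term is usually combined with the ordinary task loss; feature-map [58] or attention-map [59] distillation replaces the softened logits with intermediate activations.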

In this study, we selected two previous studies as baselines. The first was FitNet [58], which performed knowledge distillation based on a feature map with abundant information. The FitNet study used a regressor to solve the size mismatch that arises when knowledge is transferred from wide features to narrow features. A later attention study [59] showed that learning without an additional regressor is possible by changing the learning structure from deep to shallow. In this paper, we transfer knowledge by directly measuring the feature distance without an additional regressor, using the same model structure so that knowledge is transferred from wide and deep features to wide and deep features. Our learning method is based on a deep mutual learning strategy [62] with a symmetric Kullback–Leibler divergence. Furthermore, this strategy allows the audio feature extractors to be configured in the same way as the image feature extractors. We chose this learning method because it aims to reduce the difference between the two sample spaces rather than to increase the accuracy of a specific task.
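The symmetric Kullback–Leibler term used in the deep mutual learning strategy [62] can be sketched as follows. This is a minimal NumPy illustration of the objective as described above; the function names and the batch-averaging convention are our assumptions.

```python
import numpy as np

def _softmax(x):
    """Numerically stable softmax along the last axis."""
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def symmetric_kl(logits_a, logits_b, eps=1e-12):
    """Symmetric KL divergence KL(a||b) + KL(b||a) between the softmax
    outputs of two peer networks, averaged over the batch. Unlike the
    one-directional teacher-student loss, both networks receive a
    gradient signal, which suits mutual learning."""
    p = _softmax(logits_a)
    q = _softmax(logits_b)
    kl_pq = (p * (np.log(p + eps) - np.log(q + eps))).sum(axis=-1)
    kl_qp = (q * (np.log(q + eps) - np.log(p + eps))).sum(axis=-1)
    return float((kl_pq + kl_qp).mean())
```

Because the divergence is symmetric in its two arguments, neither network plays a fixed teacher role, which matches the goal of reducing the difference between the two sample spaces rather than optimizing one task's accuracy.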

#### **3. Proposed Architecture and Learning Method for Constructing a Soundscape**

In this section, we present the proposed architecture of our application and the learning method for constructing a soundscape in four subsections: music–artwork matching for the soundscape, the training phase, the domain adaptation phase, and the multi-time-scale transform for audio feature extraction.

### *3.1. Music–Artwork Matching for the Soundscape*

Figure 1 shows the architecture of our service application. In the preprocessing phase, each piece of music is converted into RGB images via a multi-time-scale transform, audio features are extracted from these images, and the resulting data are stored in a database. Later, when a painting is entered through the service application, the application matches and recommends the *n* nearest music pieces based on the audio features stored in the database; the distance is measured by cross-entropy. The feature extractors for audio and images were built on a Wide ResNet-101 with double width. For audio at a standard sample rate (44.1 kHz), the extracted features are stored as a JSON object. The stored JSON object was about 3.48 times larger than the original audio because of feature characteristics such as high resolution and multiple channels; we did not use any additional compression. Our music database consisted of 2000 pieces of classical music stored as key–value pairs, and building it took about 2.5 days without parallel processing. In addition, music–painting matching was performed by an exhaustive search via cross-entropy without an additional search algorithm, which took about 3.2 h. Our device settings were as follows: CPU, Intel® Core™ i7-8700K; GPU, 2× RTX 2080 Ti. However, because we used unoptimized code, these processing times are likely to improve in a real-world service after optimization. In the next section, we describe how the deep neural networks were trained.
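The matching step above, an exhaustive cross-entropy search over the stored audio features, can be sketched as follows. This is an illustrative NumPy sketch only: the softmax normalization of feature vectors, the function names, and the dictionary schema of the database are our assumptions, not the paper's implementation.

```python
import numpy as np

def cross_entropy_distance(image_feat, audio_feat, eps=1e-12):
    """Cross-entropy H(p, q) between two feature vectors treated as
    probability distributions (softmax-normalized). It is minimized
    when the two distributions coincide."""
    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()
    p = softmax(image_feat)
    q = softmax(audio_feat)
    return float(-(p * np.log(q + eps)).sum())

def nearest_n_music(image_feat, music_db, n=5):
    """Exhaustive search: score every stored track against the artwork
    feature and return the n closest track ids. music_db maps a track
    id to its stored feature vector (hypothetical schema)."""
    scored = sorted(music_db.items(),
                    key=lambda kv: cross_entropy_distance(image_feat, kv[1]))
    return [track_id for track_id, _ in scored[:n]]
```

With 2000 stored tracks this linear scan matches the total-inspection approach described above; an index structure could replace the `sorted` call if the database grew.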

**Figure 1.** Our service application architecture, consisting of a preprocessing phase before the service and an inference phase at runtime. This figure shows a music–artwork matching method for soundscape construction.
