### **2. Related Studies**

#### *2.1. Multi-Modal Artwork Platform for People with Visual Impairments*

Modern information systems rely heavily on vision, which creates a gap in access to information between people with and without visual impairments. Technological developments have narrowed this gap [18,19], but access to cultural content beyond daily life remains challenging. In particular, the 2008 Convention on the Rights of Persons with Disabilities states that persons with disabilities should be provided with opportunities for cultural, physical, and artistic activities [20]. Nevertheless, as Kim Hyung-sik, a former member of the United Nations Committee on the Rights of Persons with Disabilities (CRPD), has pointed out, this right has not been realized in the domain of arts and culture [21].

The Blind Touch project [22] was launched in response to Article 30(2) of the Convention on the Rights of Persons with Disabilities, which states that "Parties shall take appropriate measures to enable persons with disabilities to have the opportunity to develop and utilize their creative, artistic, and intellectual potential, not only for their own benefit, but also for the enrichment of society". The project aimed to improve the artwork appreciation environment for people with visual impairments [23,24] and had two main objectives. The first was to allow blind people to experience, understand, and interpret art through multiple non-visual modalities, such as hearing, touch, temperature, and texture. The second was to develop a framework and technologies that allow visually impaired people not only to experience art through these senses, but also to understand, interpret, and reflect upon it.

As part of the Blind Touch project, Cavazos et al. [25] developed a multi-modal artwork guide platform (see Table 1) that transforms an existing 2D visual artwork into a 2.5D (relief) replica using 3D printing technology, making it accessible through touch, audio descriptions, and sound. With this platform, visually impaired individuals can appreciate an artwork freely, independently, and comfortably through touch and sound, without the need for a professional commentator. Gamification concepts [26] were also included to engage other non-visual senses and maximize enjoyment of an artwork; for example, vivid verbal descriptions and sound effects were provided to deepen immersion during appreciation. The recreated artworks with their multi-modal guides support user-friendly interaction by detecting a tactile input event on a part of the artwork and providing the related information. In addition, background music was composed to elicit emotions similar to those of the artwork, taking into account the instruments' timbre, minor/major mode, tempo, and pitch. In this paper, we aim to replace this composed background music with classical music recommended by deep neural networks and soundscape concepts.
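
As a rough illustration of this touch-to-audio interaction loop, consider the sketch below, which maps a touched region of the relief to its spoken description and ambient sound. The region identifiers, file names, and playback structure are hypothetical; the platform in [25] is not open source, so this only mirrors the idea.

```python
# Minimal sketch of the touch-to-audio interaction described above.
# Region names and audio file paths are hypothetical stand-ins.
from dataclasses import dataclass

@dataclass
class RegionGuide:
    description_audio: str  # spoken description of this part of the artwork
    sound_effect: str       # ambient sound linked to the region

# Hypothetical mapping from tactile sensor regions to guide content.
GUIDE = {
    "sky": RegionGuide("audio/sky_description.wav", "audio/wind.wav"),
    "river": RegionGuide("audio/river_description.wav", "audio/water.wav"),
}

def on_touch(region_id: str) -> list[str]:
    """Return the audio files to play when a region of the relief is touched."""
    guide = GUIDE.get(region_id)
    if guide is None:
        return []  # untracked region: stay silent
    return [guide.description_audio, guide.sound_effect]

print(on_touch("sky"))  # ['audio/sky_description.wav', 'audio/wind.wav']
```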


**Table 1.** Interactive multi-modal guideline for appreciating visual artworks and museum objects [25].

#### *2.2. Soundscape Construction Using Deep Neural Networks*

In this study, we constructed soundscapes from music that matches a given artwork well, using deep neural networks. We considered three technical approaches to constructing soundscapes. The first is a generative approach, in which a deep generative model translates a painting into music [27]. This method is characterized by consistent features that allow interconversion between music and painting. However, consistent and interchangeable features do not guarantee well-matched music, and in our experience this method did not produce music that fits paintings well; we therefore did not adopt it. Instead, we retrieve existing music that matches a painting rather than generating new music.

The second is an approach based on music recommendation systems. Recommendation systems are commonly divided into user-based, content-based, and hybrid methods [28,29]. We focused on content-based recommendation [30] because our goal was matching between music and paintings, not personalized recommendation. The key to this approach is the vectorization of content, since matching is performed on the similarity of the content vectors. Examples of vectorization methods include video and description summarization [31,32] and image captioning [33–35]. In particular, the authors of [36] matched poetry and images through captioning, the authors of [33] presented an automatic caption generation method for Impressionist artworks for people with visual impairments, and the authors of [28] used emotional features for music recommendation. However, applying these methods to our setting would require not only a captioning module for both music and images, but also a user-based recommendation component.
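
For concreteness, a minimal sketch of the content-based matching step assumed here: given fixed-length content vectors for a painting and for candidate music tracks (however they are produced, e.g., by captioning followed by text embedding), recommendation reduces to ranking by cosine similarity. The vectors below are random stand-ins for real encoder outputs.

```python
# Minimal sketch of content-based matching: rank candidate music tracks by the
# cosine similarity between their content vectors and the painting's vector.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

rng = np.random.default_rng(0)
painting_vec = rng.normal(size=128)                  # vectorized painting
music_vecs = {f"track_{i}": rng.normal(size=128) for i in range(5)}

ranked = sorted(music_vecs.items(),
                key=lambda kv: cosine_similarity(painting_vec, kv[1]),
                reverse=True)
print([name for name, _ in ranked])  # best-matching tracks first
```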

Thus, as a third approach, we selected baseline networks [10,37–41] that perform feature matching with kernel density estimation on top of weakly supervised learning. Our training baseline was SoundNet [37], in which a kernel maps an audio sample space to an image sample space via weakly supervised learning for vectorization. In the earlier Imaginary Soundscape work [10], an application based on SoundNet was used to improve users' experience with Google Maps; it mapped image and audio vectors into the same sample space using SoundNet. Our approach uses a similar framework. A limitation of this approach is that it cannot exploit textual descriptions of the audio or images, but it has the advantage of focusing on natural audiovisual features. In the following sections, we describe our audio feature extraction method and our knowledge distillation method.
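
A minimal sketch of this retrieval scheme under the shared-embedding assumption: once image and audio embeddings live in one space (as in SoundNet-style weakly supervised training), candidate audio clips can be scored with a Gaussian kernel around the image embedding. The embedding dimension, bandwidth, and data below are illustrative, not the exact procedure of [10,37].

```python
# Minimal sketch of cross-modal retrieval in a shared embedding space.
# Assumes image and audio embeddings were already mapped into the same space;
# the values here are random stand-ins for trained network outputs.
import numpy as np

def kernel_scores(image_vec: np.ndarray, audio_vecs: np.ndarray,
                  bandwidth: float = 1.0) -> np.ndarray:
    """Gaussian-kernel affinity of each audio embedding to the image embedding."""
    sq_dists = np.sum((audio_vecs - image_vec) ** 2, axis=1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

rng = np.random.default_rng(1)
image_vec = rng.normal(size=64)            # embedding of the painting
audio_vecs = rng.normal(size=(100, 64))    # embeddings of 100 candidate clips

scores = kernel_scores(image_vec, audio_vecs, bandwidth=4.0)
best = np.argsort(scores)[::-1][:5]
print("top-5 candidate clips:", best)
```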

#### *2.3. Audio Feature Extraction*

Audio feature extraction is a major component of SoundNet-style frameworks, and there are three major families of methods. The first treats audio data as a spectrogram image obtained with a short-time Fourier transform [42–45]. The second is an end-to-end method that processes raw audio with a shallow and wide raw feature extractor [46,47] instead of a Fourier transform. The third uses improved learning techniques, such as data augmentation [47–49], pre-processing (or post-processing) [50,51], and other learning schemes such as self-supervised (or weakly supervised) learning [10,37–41,47,51].
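
To make the first family concrete, the snippet below computes a log-mel spectrogram with librosa; the FFT size, hop length, and mel-band count are common defaults, not the values used in the cited works.

```python
# Minimal sketch of spectrogram-based feature extraction (the first family):
# turn a raw waveform into a log-mel spectrogram "image" via the STFT.
import librosa
import numpy as np

y, sr = librosa.load(librosa.ex("trumpet"))  # bundled example clip

mel = librosa.feature.melspectrogram(
    y=y, sr=sr,
    n_fft=1024,       # STFT window size (trades time vs. frequency resolution)
    hop_length=512,   # stride between successive windows
    n_mels=64,        # number of mel frequency bands
)
log_mel = librosa.power_to_db(mel, ref=np.max)
print(log_mel.shape)  # (64, n_frames): a single-channel "image"
```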

Spectrogram-based audio feature extraction depends on the hyper-parameters of the short-time Fourier transform. For this reason, many state-of-the-art networks [37,47,51] use raw feature extractors; nevertheless, we adopted a spectrogram-based approach because of its advantages for knowledge distillation and for our application domain. In particular, we selected WaveMsNet [45] as our baseline feature extractor. WaveMsNet fuses time-domain and frequency-domain features to reduce the Fourier transform's dependence on the window-size hyper-parameter via multi-scale feature extraction in the time domain. However, the network still receives a single gray channel as input. In this paper, we propose a multi-time-scale transform [52,53] that converts audio data into an RGB image so that the network receives RGB input instead of gray input. This not only improves the receptive field, but also enables direct measurement of feature distances. In the next section, we address knowledge distillation for mapping audio features and image features into the same sample space.
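
A minimal sketch of how such a multi-time-scale transform could look, assuming it stacks spectrograms computed at three different STFT window sizes as the R, G, and B channels; the specific window sizes and per-channel normalization are illustrative, not the exact formulation of [52,53].

```python
# Minimal sketch of a multi-time-scale transform: compute log-mel spectrograms
# at three STFT window sizes and stack them as an RGB-like image.
import librosa
import numpy as np

def multi_time_scale_rgb(y: np.ndarray, sr: int,
                         n_ffts=(512, 1024, 2048), n_mels: int = 64) -> np.ndarray:
    channels = []
    for n_fft in n_ffts:  # one "color" channel per time scale
        mel = librosa.feature.melspectrogram(
            y=y, sr=sr, n_fft=n_fft, hop_length=256, n_mels=n_mels)
        log_mel = librosa.power_to_db(mel, ref=np.max)
        # min-max normalize each channel to [0, 1], like image data
        log_mel = (log_mel - log_mel.min()) / (log_mel.max() - log_mel.min() + 1e-12)
        channels.append(log_mel)
    # same hop length at every scale => identical frame counts, so stacking works
    return np.stack(channels, axis=-1)  # (n_mels, n_frames, 3)

y, sr = librosa.load(librosa.ex("trumpet"))
print(multi_time_scale_rgb(y, sr).shape)
```

Stacking scales as channels lets a standard RGB image backbone see fine temporal detail and fine frequency detail simultaneously, which is the motivation stated above for replacing the single gray-channel input.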
