**4. Experiments**

We performed three main experiments. The first was a quantitative evaluation of how effective the multi-time-scale transform was and whether it improved audio feature extraction. In the second, we evaluated how reasonably our system matched music and paintings. In the third, we evaluated whether the user experience improved when artworks were appreciated together with the soundscape music.

**Hypothesis 1.** *"If the soundscape music and painting are well matched, user appreciation of the experience will increase."*

The above hypothesis is the central focus of this study. However, assessing the match between the music and painting and the improvement of the user experience is challenging. We therefore developed strategies for measuring these factors, as described in the following subsections.

### *4.1. Audio Feature Extraction via the Multi-Time-Scale Transform*

We performed an ablation and benchmark study to determine the effectiveness of the multi-time-scale transform. This experiment was conducted using ESC-50, an environmental sound classification dataset. The software environment was PyTorch 1.1 with Python 3.6. The batch size was 32, and random shuffling was used. The optimizer was Adam with an initial learning rate of 1 × 10<sup>−4</sup>, and a cosine annealing learning rate scheduler was used. The hyper-parameter T was set to 50. The training hyper-parameters were obtained via a greedy search over the following ranges: batch sizes of [8, 16, 32] and initial learning rates of [1 × 10<sup>−4</sup>, 2 × 10<sup>−4</sup>, ..., 5 × 10<sup>−4</sup>, 1 × 10<sup>−5</sup>, 2 × 10<sup>−5</sup>, ..., 5 × 10<sup>−5</sup>].
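The search ranges stated above can be enumerated with a few lines of Python. This is only a minimal sketch of the candidate space: the helper name `candidate_grid` is hypothetical, the intermediate learning rates elided in the text are assumed to follow the 1 to 5 pattern, and the order in which the greedy search visits the candidates is not specified in the text and is not modeled here.

```python
from itertools import product

# Ranges stated in the text.
BATCH_SIZES = [8, 16, 32]

# Assumption: the elided values follow the pattern 1..5 times 1e-4,
# then 1..5 times 1e-5.
LEARNING_RATES = [
    1e-4, 2e-4, 3e-4, 4e-4, 5e-4,
    1e-5, 2e-5, 3e-5, 4e-5, 5e-5,
]


def candidate_grid():
    """Return every (batch_size, learning_rate) pair in the stated ranges."""
    return list(product(BATCH_SIZES, LEARNING_RATES))


grid = candidate_grid()
print(len(grid))  # 3 batch sizes x 10 learning rates = 30 candidates
```

The configuration reported as final in the text, a batch size of 32 with an initial learning rate of 1 × 10<sup>−4</sup>, is one of these 30 candidates.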

Table 2 shows the results of the ablation study. First, we conducted experiments using various backbone networks, such as Mobilenet v2, Efficientnet, VGG, Resnet, and Densenet, with different widths and depths. Networks other than Resnet and Densenet were omitted from Table 2 because of their poor results. We varied the depth using the settings provided with each network and scaled the width by factors of one, two, and four. Doubling the width gave the best performance, and deeper networks tended to perform better, although not in all cases. Resnet101 with double width showed the best performance. As shown in Table 2, applying our multi-time-scale transform to Resnet resulted in about 2.8% higher accuracy than the baseline network, WaveMsNet.


**Table 2.** Ablation study of the multi-time-scale transform.

Table 3 shows the sound classification results for various methods on the ESC-50 benchmark. Methods without an asterisk in Table 3 are supervised learning methods that use backbone and data augmentation methods, while methods with an asterisk were trained using other approaches, such as weakly (or self) supervised learning or between-class learning. Our network performed better than the baseline, but worse than the state-of-the-art methods. AclNet uses raw feature extractors, such as Envnet v2. This is a weakness for mutual learning; therefore, we did not use these methods. Ensemble-fusing CNN is a state-of-the-art method, but it is not a single-model method. Our model performed as well as the single-model methods and had advantages for mutual learning because it is spectrogram-based. Therefore, in this work, we selected our network as the baseline. In future research, various augmentations with ensemble-fusing CNNs should be explored for applications in our work. Our model's performance was about 12.2% poorer than that of WEANET; therefore, work is needed to enhance the performance of our network. Nevertheless, our network has the advantage of being spectrogram-based, and it outperformed human performance. The validation results for each fold of our proposed method are given in detail in Table 4.

**Table 3.** ESC-50 dataset benchmark.


### *4.2. Relevance of Music–Artwork Matching*

**Definition.** *"Well-Matched Soundscape Music–Painting"*: In this work, we defined the detailed factors of a "Well-Matched Soundscape Music–Painting" in order to measure the quality of a match. Blind touch is intended to provide direct and material interactions that go beyond metaphorical and psychological interactions between the art and the appreciator. However, this does not imply the absence of traditional art. Therefore, in this study, we wanted to measure the metaphorical and psychological relevance of the match between music and a painting. The metaphorical scale measured how much of the same implicit multi-sensory experience was transferred, because blind touch is a multi-sensory-based media art. The psychological scale was measured through subjective fitness, which is an individual's appreciation, because it is the most representative single scalar value of individual appreciation.

**Table 4.** Results of five-fold cross-validation. The F1 score is the weighted F1 score, and the MCC is the Matthews correlation coefficient, which was obtained as the average of the MCC for each class. These metric tables were written by referring to [68,69].
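The two metrics named in the Table 4 caption can be sketched in plain Python as follows. This is an illustrative reading rather than the evaluation code used for Table 4: the function names are hypothetical, the weighted F1 score is taken to be the support-weighted mean of per-class F1 scores, and the per-class MCC is assumed to be computed one-vs-rest before averaging.

```python
from collections import Counter
from math import sqrt


def per_class_counts(y_true, y_pred, label):
    """One-vs-rest confusion counts (tp, fp, fn, tn) for a single class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    tn = len(pairs) - tp - fp - fn
    return tp, fp, fn, tn


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's support."""
    total = 0.0
    for label, support in Counter(y_true).items():
        tp, fp, fn, _ = per_class_counts(y_true, y_pred, label)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += support * f1
    return total / len(y_true)


def mean_per_class_mcc(y_true, y_pred):
    """Unweighted mean of the one-vs-rest MCC of every class in y_true."""
    scores = []
    for label in sorted(set(y_true)):
        tp, fp, fn, tn = per_class_counts(y_true, y_pred, label)
        denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
        scores.append((tp * tn - fp * fn) / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels, not data from the paper.
y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]
print(weighted_f1(y_true, y_pred), mean_per_class_mcc(y_true, y_pred))
```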


**Experimental Setting:** Twenty-four men and 16 women, all of whom were Korean, participated in this study. They were divided into four groups of 10 people each, with a balanced gender ratio. The criteria for dividing the groups were as follows. First, the artistic culture of the subjects was evaluated via a simple test with the music and paintings used in this experiment. Second, the subjects were allocated to the groups based on their test scores, which we averaged. The average music score was 2.3 and the average painting score was 3.4, out of a perfect score of 5. The roles of the groups were as follows: Group 1 participated in the subjective fitness experiments. Groups 2 and 3 participated in the multi-sensory concordance experiments. Group 4 participated in the multi-sensory concordance inverse measurement experiments. The groups were designed to avoid duplication of the experiments and to ensure that prior knowledge did not affect the experimental results. Furthermore, we did not provide any information other than information related to the experimental progress. This meant that the subjects did not know that the paintings and music were recommended via a deep neural network. Figure 6 shows the paintings used in our experiment, and the music allocated to the paintings by our system is indicated in Table 5.

**Figure 6.** Paintings used in the experiments. (**a**) Starry Night; (**b**) Starry Night over the Rhône; (**c**) Café Terrace at Night; (**d**) Femme avec parasol dans un jardin. This figure is associated with Table 5.


**Table 5.** Music–painting matching scheme. The top three pieces of music selected by our system for each painting are shown. Rows 1, 2, and 3 correspond to musical choices 1, 2, and 3, respectively.

### 4.2.1. Measurements of Subjective Fitness

**Experimental Methods:** Subjective fitness was measured in group 1, which comprised five men and five women. First, we input the paintings into our system and extracted the top three music recommendations. Second, when both the paintings and music were experienced at the same time, we measured the subjective fitness using a five-point Likert scale. We changed the paintings in the painting–music pairs to ensure that previous and subsequent experiments did not affect the current experiments; for example, painting 1–music 1, painting 2–music 1, painting 3–music 1, painting 4–music 1, then painting 1–music 2, painting 2–music 2, etc.

Table 6 shows the results of the subjective fitness experiments. Columns (a), (b), (c), and (d) correspond to the paintings in Figure 6, and Music pieces 1, 2, and 3 correspond to Table 5. Music 4 was used in the soundscape music exhibition [12] "Bunker de Lumières Van Gogh" held in Jeju, Korea; therefore, Music 4 can be considered a type of ground truth labeled by experts. The F-score is the fitness score, and the P-score is the preference ratio. The fitness score was measured on a five-point Likert scale, and the P-score was the proportion of people who gave an F-score of three or more. For Music 4, we used the same values as those reported previously [70]. Overall, the system matched the artwork and music well in terms of subjective fitness. However, the P-score was not stable for Music 2, and Music 3 had a low F-score and P-score. The average F-score for all music items was 3.24, and the average P-score for all music items was 75%. The average F-score of Music 1 was 3.68, and the average P-score of Music 1 was 97.5%. These results are similar to those of a previous study, which reported an average F-score of 3.16, an average F-score for Music 1 of 3.74, an average P-score of 87.6%, and an average P-score for Music 1 of 94%. Thus, our system matched music and artworks well, but did not perform better than previous systems, despite the improved feature representation. This is likely due to the small size of the music database. Nevertheless, the attainment of results similar to those reported in previous studies demonstrates the reliability of the system. In future studies, we need to assess whether the stability and performance of the system increase as the music database is expanded.
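The two scores defined above reduce to a mean and a thresholded proportion. The following sketch uses made-up Likert ratings for a single painting–music pair, not data from Table 6, and the function names are hypothetical.

```python
def fitness_score(ratings):
    """F-score: mean of the five-point Likert ratings."""
    return sum(ratings) / len(ratings)


def preference_ratio(ratings, threshold=3):
    """P-score: proportion of raters who gave `threshold` points or more."""
    return sum(r >= threshold for r in ratings) / len(ratings)


# Hypothetical ratings from a group of 10 participants.
ratings = [4, 3, 5, 2, 4, 3, 4, 5, 3, 4]
print(fitness_score(ratings))     # 3.7
print(preference_ratio(ratings))  # 0.9
```

Because the P-score only counts ratings at or above the threshold, a single participant moving from 3 to 2 changes it by 10 percentage points in a group of 10, which is one reason it can be less stable than the F-score.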


**Table 6.** Results of the subjective fitness experiment. The F-score is the fitness score and the P-score is the preference ratio. Fitness was measured on a five-point Likert scale, whereas the P-score reflects the proportion of individuals with three or more points.

### 4.2.2. Measurements of Implicit Multi-Sensory Concordance

**Experimental Methods:** This experiment was conducted in groups 2 and 3, each consisting of five men and five women. First, the members of group 2 wrote a review after viewing the four paintings. The members of group 3 then wrote a review after listening to the 12 music items. The subjects were not given any information about the content of the experiment; the only guideline they received was to use all five senses when assessing the artworks or music. Finally, we measured multi-sensory concordance by comparing the similarity of the sensory language in the reviews. We mapped the words in the reviews using a Korean sensory word classification table [71]. The table classifies sensory words as gustatory, tactile, or temperature sensations; examples of such words are bitter, sweet, salty, sour, nutty, astringent, spicy, plain, rough, smooth, soft (texture), soft (material), hard, moist, sharp, cold, cool, lukewarm, warm, hot, etc.
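The mapping from free-text reviews to a sensory representation can be sketched as a word-count vector. Everything below is illustrative: the small English lexicon stands in for the Korean classification table [71], the names `SENSORY_WORDS`, `VOCAB`, and `sensory_vector` are hypothetical, and using one dimension per sensory word with simple whitespace tokenization is an assumption, since the text does not specify how the reviews were vectorized.

```python
from collections import Counter

# Hypothetical stand-in for the Korean sensory word table [71].
SENSORY_WORDS = {
    "gustatory": ["bitter", "sweet", "salty", "sour"],
    "tactile": ["rough", "smooth", "soft", "hard"],
    "temperature": ["cold", "cool", "warm", "hot"],
}

# Fixed word order so that vectors from different groups are comparable.
VOCAB = [word for words in SENSORY_WORDS.values() for word in words]


def sensory_vector(reviews):
    """Count how often each sensory word occurs across a list of reviews."""
    counts = Counter()
    for review in reviews:
        for token in review.lower().split():
            token = token.strip(".,;:!?\"'")
            if token in VOCAB:
                counts[token] += 1
    return [counts[word] for word in VOCAB]


reviews = ["The sky felt cold and smooth.", "A sweet, warm night"]
print(sensory_vector(reviews))
```

One vector would be built from the painting reviews of group 2 and one from the music reviews of group 3, and the two vectors would then be compared.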

Table 7 shows the results of the implicit multi-sensory concordance experiments. The D-score refers to the Euclidean distance and the C-score to the cosine similarity. The C-scores in Table 7 show that the two groups exhibited similar sensory trends. However, the C-score is not proportional to the P-score or F-score in Table 6. Cosine similarity is well suited to comparing vectors in a high-dimensional positive space, but it disregards the magnitude of each dimension. In other words, only the trends in each dimension can be evaluated, while the size of the difference is difficult to assess. We used the D-score to overcome this limitation. The C-score in Table 7 indicates the sensory similarity for both the music matched by experts and the music matched by our system, and the D-score was measured in the same way. The results show not only that media art can provide similar multi-sensory appreciation, but also that our system can provide results similar to those provided by experts. However, these results are limited in that no criterion was provided for determining significance from the magnitude of a score. Future studies should develop such a criterion.
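The difference between the two scores can be seen with a small example. The sketch below implements both measures with the standard library and applies them to two made-up sensory count vectors that point in the same direction but differ in magnitude; the function names and the vectors are illustrative assumptions.

```python
from math import sqrt


def euclidean_distance(u, v):
    """D-score: Euclidean distance between two equal-length vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_similarity(u, v):
    """C-score: cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


# Hypothetical count vectors with the same trend but different magnitudes.
painting = [1, 2, 0]
music = [2, 4, 0]
print(cosine_similarity(painting, music))   # close to 1: identical trend
print(euclidean_distance(painting, music))  # greater than 0: sizes differ
```

The cosine similarity is insensitive to the doubling, whereas the Euclidean distance reflects it, which is why the two scores are reported together.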

### 4.2.3. Inverse Measurements of Implicit Multi-Sensory Concordance for Validation

**Experimental Methods:** This experiment was conducted in group 4, which comprised nine men and one woman. The purpose of this experiment was to verify the experiments in Section 4.2.2. A questionnaire was created based on a multi-sensory word table constructed in Section 4.2.2. The experiment was conducted in the same order as that described in Section 4.2.1, but the questionnaire was completed instead of a written review. The questionnaire evaluated the extent to which the appreciator agreed with the sensory table using a five-point Likert scale.

Table 8 presents the results of the inverse implicit multi-sensory concordance experiment. The A-score is the agreement scale, which was the average score obtained using the five-point Likert scale. The C-score is the cosine similarity of the A-scores calculated based on an active value (3 or higher). The purpose of this experiment was to measure how much the appreciator agreed with the implicit multi-sensory data. The A-scores and C-scores were low because the experiment failed. Several problems were encountered during this experiment. The first problem was the mechanical marking phenomenon. When responding to the questionnaire, subjects mechanically gave low scores to senses that were opposed to the first sense they had scored highly. This was in contrast to the free reviews of appreciation, in which opposite senses were expressed together. The second problem was the phenomenon of monotonous responses, which occurred when subjects became familiar with the experiment. This phenomenon demonstrated the tendency of subjects to exclude complex senses before appreciation. For example, the subjects gave low scores to combinations of options, such as the bitter taste of wine and the sour taste of candy, before even listening to the music. Thus, in further studies, the validation of the experimental design should be improved.


**Table 7.** Results of the implicit multi-sensory concordance experiments. The D-score is the Euclidean distance and the C-score is the cosine similarity.

**Table 8.** Results of the inverse implicit multi-sensory concordance experiment. The A-score is the agreement scale and the C-score is the cosine similarity.


### *4.3. Improvement of the Appreciation Experience with the Soundscape*

**Definition. "Improvement of the appreciation experience"**: In this work, we defined the measurable factor of "Improvement of the appreciation experience" as the combination of appreciation and subjective satisfaction; however, this is an ambiguous concept. To address this, we proposed an evaluation method for immersion based on the flow theory of Csikszentmihalyi. Following flow theory, we differentiated between flow and cognitive absorption in this study. Flow was indirectly measured via time distortion phenomena, and cognitive absorption was measured through a simple test of working memory and attention concentration. The subjective satisfaction score of the appreciation experience comprised experiences of the environment and of appreciation. The subjective environmental satisfaction was not related to improvements in the appreciation experience; however, the environmental score was a useful indicator of whether the environmental setting was appropriate. Appreciation scores were measured with a questionnaire that we prepared based on the SSID [72] and WHO-5 Well-Being indices. The SSID is an evaluation index for soundscapes, and the WHO-5 index is an evaluation index for quality of life.

**Experimental Setting:** The participants in this experiment included a total of 29 men and 1 woman. They were divided into three groups, with each group consisting of 10 participants. The roles of each group were as follows: Groups 5 and 6 participated in the immersion experiments. Group 7 participated in the subjective satisfaction experiments. The criteria for dividing the groups and other experimental conditions were similar to those described for the previous experiments.

### 4.3.1. Measurements of Immersion

**Experimental Methods:** This experiment was conducted in groups 5 and 6. The purpose of this experiment was to indirectly measure immersion. Time distortion was measured for the flow measurement experiments. Group 5 appreciated artworks without listening to music, and group 6 appreciated artworks while listening to music. Time distortion was measured as the subjective assessment of time between the two groups, and flow was evaluated indirectly from the time distortion measurements. Cognitive absorption was measured with a simple test of working memory and attention concentration. This test asked questions about the color, location, shape, and texture of an object or scene.

Table 9 presents the measurement results for the immersion experiment. Each row presents the subjective times that participants felt had passed while appreciating the exhibition. The ground truth refers to the real length of the piece of music. The *p*-value is the result of the *t*-test, which was performed under the assumption that the variances were different. The results of the time distortion experiment indicate that group 5 predicted a time closer to the ground truth than group 6. In particular, participants in group 6 felt that more time had elapsed than actually had. Two interesting phenomena were discovered during the analysis of the interviews. The first was the "sleepy" phenomenon. When music pieces 1 or 4 were played, participants in group 5 did not experience sleepiness, while more than 80% of the participants in group 6 experienced sleepiness. The subjective viewing time was 2 min 36 s on average, while the experimental time was longer than the appropriate viewing time, ranging from 3 min 37 s to 6 min 52 s. The second was the phenomenon of ambiguous answers. Group 5 answered with specific times, such as 4 min 20 s and 5 min 30 s, while group 6 tended to answer with ambiguous numbers, such as "about 5 minutes" and "about 10 minutes". In the experiments on cognitive absorption, the averages of the answers given by each group in our simple test were used. The perfect score for this test was 10. As shown in Table 9, groups 5 and 6 had an average score difference of 2.2. This can be attributed to the sleepy phenomenon. Thus, with the results obtained from this experiment, it was not possible to accurately confirm whether cognitive absorption was affected. However, the interviews suggest qualitatively that music is conducive to cognitive absorption. Examples of answers from participants included the following: **"The bouncy rhythm in the song reminded me of stars"; "The music felt like a young man in the country was leaving the village, and it helped me remember because there was a real village in the artwork"; "it wasn't hard to find the location of the lover because I watched the couple carefully because of the sentimental music being played."**

**Table 9.** Results of immersion experiments for assessing time distortion and cognitive absorption. A *t*-test was performed with the assumption of different variances.
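A *t*-test that does not assume equal variances is commonly known as Welch's *t*-test. The sketch below computes its test statistic and the Welch-Satterthwaite degrees of freedom using only the standard library; the function name `welch_t` and the sample values are hypothetical and are not data from Table 9. Converting the statistic into a *p*-value requires the CDF of the *t* distribution, which is omitted here.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Return (t, df) for Welch's unequal-variance two-sample t-test."""
    # Squared standard error of each sample mean (sample variance / n).
    se2_a = variance(a) / len(a)
    se2_b = variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(a) - 1) + se2_b ** 2 / (len(b) - 1)
    )
    return t, df


# Made-up perceived durations (in minutes) for two groups.
without_music = [1, 2, 3, 4, 5]
with_music = [2, 4, 6, 8, 10]
t_stat, dof = welch_t(without_music, with_music)
print(t_stat, dof)
```

If SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=False)` performs the same test and also returns the *p*-value.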


### 4.3.2. Measurements of Subjective Satisfaction

**Experimental Methods:** This experiment was conducted in group 7. The purpose of this experiment was to measure the subjective satisfaction score. The subjects appreciated the paintings together with the top-ranked music and then filled out a questionnaire that was prepared based on the SSID and WHO-5 Well-Being indices.

The subjective satisfaction scores, which were based on the environment and the appreciation experience, are shown in Table 10. The top three questions were about environmental experience. The environmental satisfaction scores ranged from 3.3 to 3.6, which are fairly high scores. For the first question, discomfort was associated with 0 points and comfort with 5 points. Therefore, this experiment was appropriately constructed to assess the appreciation experience. The subjective satisfaction scores of appreciation ranged from 3.1 to 4.0. Therefore, soundscape music can aid in appreciation.

**Table 10.** Questionnaire based on the SSID and WHO-5 Well-Being indices.

