**1. Introduction**

According to the 2020 data on visual impairment from the WHO, an estimated 285 million people of all ages worldwide are visually impaired, of whom 39 million are blind [1]. People with visual impairments are interested in visiting museums and enjoying visual art [2]. Although many museums have improved the accessibility of their exhibitions and artworks through specialized tours and access to tactile representations of artworks [3–5], this is still not enough to meet the needs of the visually impaired [6].

Multisensory (or multimodal) integration is an essential part of information processing by which various forms of sensory information, such as sight, hearing, touch, and proprioception (also called kinesthesia, the sense of self-movement and body position), are combined into a single experience [7]. Cross-sensation between sight and the other senses here refers to representing visual content through other senses at the same time; the aim of this paper is to employ two or more non-visual senses, such as touch and audio, simultaneously alongside or in place of vision. Making art accessible to the visually impaired requires the ability to convey explicit and implicit visual images through non-visual forms, and this paper argues that a multisensory system is needed to convey artistic images successfully. What art teachers most want for their blind students is to have them imagine colors using a variety of senses: audio, touch, scent, music, poetry, or literature.

For the viewing of artworks by the visually impaired, museums generally provide audio guides that describe the visual representation of the objects in paintings [8]. Brule et al. [9] created Mapsense, a raised-line multisensory interactive map overlaid on a projected capacitive touch screen for visually impaired children, after a five-week field study in a specialized institute. The map consisted of several multisensory tangibles that could be explored by touch but also smelled or tasted, allowing users to interact through touch, taste, and smell together. A sliding gesture in the dedicated Mapsense menu filters geographical information (e.g., cities, seas, etc.), and the design used conductive tangibles that the screen can detect. Some tangibles could be filled with "scents", such as olive puree, mashed raisins, and honey; these modalities (scent and taste) promote reflexive learning, and the objects support storytelling. The Metropolitan Museum of Art in New York has displayed replicas of the artworks exhibited in the museum [10]. The Art Talking Tactile Exhibit Panel in the San Diego Museum allows visitors to touch Juan Sánchez Cotán's still-life masterpiece, "Quince, Cabbage, Melon, and Cucumber", painted in Toledo, Spain, in 1602 [11]. If users touch one of these panels with bare hands or light gloves, they hear information about the touched part. This is like tapping on an iPad to make something happen; however, instead of a smooth, flat touch screen, these exhibit panels can include textures, bas-relief, raised lines, and other tactile surface treatments. Dobbelstein et al. [12] introduced inScent, a wearable olfactory display that allows users to receive notifications through scent in a mobile environment. Anagnostakis et al. [13] used proximity and touch sensors to provide voice guidance on museum exhibits through mobile devices. Reichinger et al. [14] introduced a gesture-controlled interactive audio guide for visual artworks that uses depth-sensing cameras to sense the location and gestures of the user's hands during tactile exploration of a bas-relief model of an artwork; the guide provides location-dependent audio descriptions based on hand positions and gestures. Recently, Cavazos et al. [15] provided audio descriptions as well as related sound effects when the user touched a 2.5D-printed model with a finger. Thus, visually impaired visitors can freely, independently, and comfortably feel the shapes and textures of artworks through touch and listen to explanations of the objects that interest them, without needing a professional curator.

Binaural techniques have long been used to express the direction of sound, but they are rarely used to express the colors in works of art for the visually impaired. In particular, the connection between color and spatial audio, i.e., appreciating the colors in artworks through binaural sound [16], has not been addressed. When spatial audio is used to artificially represent the color wheel, it is necessary to investigate whether this confuses listeners or has a positive effect on color perception. Binaural technology enables the spatial positioning of sound with a simple pair of headphones. Binaural recording and rendering refer specifically to recording and reproducing sounds at the two ears [16]; the technique is designed to resemble the human two-ear auditory system and normally works with headphones [17]. Lessard et al. [18] investigated how three-dimensional spatial mapping is carried out by early blind individuals with or without residual vision; subjects were tested under monaural and binaural listening conditions. They found that early blind subjects could map their auditory environment with equal or better accuracy than sighted subjects. In [19], 3D sound was useful for visually impaired people, who reported significantly higher confidence with 3D sound.

This paper proposes a tool for intuitively recognizing and understanding the three elements of color (hue, value, and saturation) using spatial audio. In addition, when the user touches objects in an artwork with a finger, a description of the work is provided by voice, and the color, brightness, and depth of the object are expressed through modulation of that voice.

### **2. Background and Related Works**

### *2.1. Review of Tactile and Sound Coding Color*

In order to convey color to visually impaired people, methods of coding color with tactile patterns or sounds have been proposed [20–23]. Taras et al. [20] presented a color code created for display on braille devices. The primary colors, red, blue, and yellow, are each coded by two dots. Mixed colors, for example, violet, green, orange, and brown, are coded as combinations of the dots representing the primary colors. Additionally, light and dark shades are indicated using the second and third dots in the left column of the braille cell.

Ramsamy-Iranah et al. [21] designed color symbols for children. The design process for the symbols was influenced by the children's prior knowledge of shapes and linked to their surroundings. For example, a small square box was associated with dark blue, reflecting a blue square soap; a circle represented red because it was associated with the red "dot", called a "bindi", worn on the forehead of a Hindu woman; and yellow was represented by small dots reflecting the pollen of flowers. Since orange is a mixture of yellow and red, circles of smaller dimensions were used to represent orange. Horizontal lines represented purple, and curved lines were associated with green, representative of bendable grass stems.

Shin et al. [22] coded nine chromatic colors (pink, red, orange, yellow, green, blue, navy, purple, and brown) plus achromatic colors using grating orientation (a regularly spaced collection of identical, parallel, elongated elements). The texture stimuli for color were structured by matching variations of orientation to hue, line width to chroma, and the interval between the lines to value. The chromatic colors were separated by 20° of orientation, with the achromatic colors at 90°. Each color had nine levels of value and of chroma.

Cho et al. [23] developed a tactile color pictogram that used the shape of the sky, earth, and people derived from thoughts of heaven, earth, and people as metaphors. Colors could thus be recognized easily and intuitively by touching the different patterns. An experiment comparing the cognitive capacity for color codes found that users could intuitively recognize 24 chromatic and 5 achromatic colors with tactile codes [23].

Besides tactile patterns, sound patterns have been proposed [24–27], for example, classical music sounds played on different instruments. Cho et al. [27] used the tone, intensity, and pitch of melodic sounds extracted from classical music to express the brightness and saturation of colors. Their sound code system represented 18 chromatic and 5 achromatic colors using classical music played on different instruments. When using sound to depict color, tapping a relief-shaped embossed outline area transformed the color of that area into the sound of an orchestral instrument. Furthermore, the overall color composition of Van Gogh's "The Starry Night" was expressed as a single piece of music that accounted for color using the tone, key, tempo, and pitch of the instruments. The shapes could be distinguished by touch, while the overall color composition was conveyed as a single piece of music, reducing the effort required to recognize color by touching each pattern one by one [27].

Jabber et al. [28] developed an interface that automatically translated reference colors into spatial tactile patterns. A range of achromatic colors and six prominent basic colors were represented with three levels of chroma and value through a color watch design. Hue was represented by combination discs and lightness by square discs, both perceived by touch.

This paper introduces two sound color codes, a six-color wheel and an eight-color wheel, created with 3D sound, based on the aforementioned observations. Table 1 shows a comparison between the previous color codes and the two sound color codes proposed in this paper.

### *2.2. Review of HRTF Systems*

The Head-Related Transfer Function (HRTF) is a filter defined over a spherical area that describes how the shape of the listener's head, torso, and ears affects incoming sound from all directions [29]. When sound reaches the listener, the size and shape of the head, ears, and ear canal, the density of the head, and the size and shape of the nasal and oral cavities all alter the sound and affect the way it is perceived, boosting some frequencies and attenuating others. Therefore, the time difference between the two ears, the level difference between the two ears, and the interaction between sound and personal body anatomy are all important for HRTF calculation. In this way, ordinary audio is converted to 3D sound. Although binaural synthesis with HRTFs has been implemented in real-time applications, only a few commercial applications utilize it, and limited research exists on the differences between audio systems that use HRTFs and systems that do not [30]. Systems that do not use HRTFs in their binaural synthesis often use a simplified interaural intensity difference (IID) instead [30]. This simplified IID alters the amplitude equally for all frequencies, relative to the orientation and distance from the audio source to each of the listener's ears. Such systems provide no audio cues for vertical placement and will therefore be referred to as "panning systems", whereas systems that use HRTFs do provide cues for vertical placement and will be referred to as "3D audio systems". Three-dimensional audio systems differ from panning systems in human localization performance because they provide more precise spatial audio cues. Results suggest that 3D audio systems are better than panning systems in terms of precision, speed, and navigation in an audio-only virtual environment [31]. Additionally, the non-individualized HRTF filters currently in use may fall short of the published accuracy [32], whereas better-personalized HRTFs increase accuracy. Most virtual auditory displays employ generic, non-individualized HRTF filters, which leads to decreased sound localization accuracy [33].
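The simplified IID panning described above can be sketched as follows. This is an illustrative sketch, not the implementation evaluated in [30,31]; the function name, the azimuth convention, and the constant-power pan law are assumptions.

```python
import numpy as np

def iid_pan(mono, azimuth_deg):
    """Simplified interaural intensity difference (IID) panning.

    Scales the amplitude equally for all frequencies in each channel,
    based only on azimuth, so it carries no cues for vertical placement.
    Azimuth convention (an assumption): -90 = hard left, 0 = center,
    +90 = hard right.
    """
    pan = np.clip(azimuth_deg, -90.0, 90.0) / 90.0   # -1 .. +1
    angle = (pan + 1.0) * np.pi / 4.0                # 0 .. pi/2
    left = np.cos(angle) * mono                      # constant-power law
    right = np.sin(angle) * mono
    return np.stack([left, right], axis=-1)
```

Because the gains are frequency-independent scalars, two sources at the same azimuth but different elevations are rendered identically, which is exactly the limitation that distinguishes panning systems from 3D audio systems.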


**Table 1.** Existing color codes with instruments and the color codes in this paper.

Use cases of individualized HRTFs can be found in hearing aids [34], dereverberation [35], stereo recording enhancement [36], emotion recognition [37], 3D detection that helps blind people avoid obstacles [38], etc.

In [18,19,38], spatial sound was shown to be useful for visually impaired people, who felt significantly higher confidence with spatial sound. This paper shows through experiments that spatial sound expressing colors through HRTFs is an effective way to convey color information. The paper's spatial sound strategy is based on cognitive training and sensory adaptation to spatial sounds synthesized with a non-individualized HRTF. To the best of our knowledge, HRTFs have not previously been applied to represent color wheels.

Drossos et al. [39] used binaural technology to provide accessible games for blind children. In their Tic-Tac-Toe game, binaural processing of selected audio material was performed using a KEMAR HRTF library [40], and three kinds of sound presentation were used to convey information and feedback in the game. The first method used eight different azimuths in the 0° elevation plane to represent the Tic-Tac-Toe board, as shown in Figure 1. The second method used a combination of three elevations and three azimuths to simulate a board standing upright in front of the user. The third method was the same as the second but used pitch instead of elevation.

**Figure 1.** Illustration of a sound spatial positioning from Drossos et al. [39].

### *2.3. Review of the Sound Representations of Colors*

Newton's *Opticks* [41] showed a correspondence between the colors of the spectrum and the pitches of the musical scale (for example, "red" and "C", "green" and "Ab"). Maryon [42] also explored the similarity between the ratios of the tones and the wavelengths of the colors in order to connect them. Associating the pitch frequencies of the scale with colors in this way allows colors and notes to be substituted for one another [43]. However, the varied sensibilities that color can evoke are limited when colors are simply mapped onto the musical scale. Lavigna [44] suggested that a composer's technique in organizing an orchestra is very similar to a painter's technique in applying colors; in other words, a musician's palette is the list of orchestral instruments.

A comprehensive survey of associations between color and sound can be found in [45], including how color properties such as value and hue map onto acoustic properties such as pitch and loudness. Using an implicit association test, the researchers [45] confirmed the following cross-modal correspondences between visual and acoustic features. Pitch was associated with color lightness, whereas loudness mapped onto greater visual saliency. Associations between vowels and colors were mediated by differences in the overall balance of low- and high-frequency energy in the spectrum rather than by vowel identity as such. The hue of colors with the same luminance and saturation was not associated with any of the tested acoustic features, except for a weak preference to match higher pitch with blue (vs. yellow). In other research, high loudness was associated with orange/yellow rather than blue, and high pitch was associated with yellow rather than blue [46].

Chroma is related to sound intensity [46,47]. When the intensity of a sound is strong and loud, the associated color feels close, intense, and deep; when the intensity is weak, the color feels pale, faint, and far away. A higher value is associated with a higher pitch [48,49]. Children of all ages and adults matched pitch to value and loudness to chroma. Value (i.e., lightness) depends heavily on the light and dark levels of a color; using the same concept in music, sound divides into light and heavy feelings according to the high and low octaves of the scale. Another way to match color and sound is to associate an instrument's timbre with color, as Kandinsky did [24]: a low-pitched cello evokes a dark, low-brightness blue; a violin or trumpet-like instrument with a sharp tone feels red or yellow; and a high-pitched flute feels like a bright, saturated sky blue.

### **3. Binaural Audio Coding Colors with Spatial Color Wheel**

### *3.1. Spatial Sound Representations of Colors*

The purpose of this study is to convey the concept of the spatial arrangement of the color wheel. Just as a clock face makes it easy to become familiar with the concept of relative time and helps the reader understand the adjacency and complementarity of times, this paper applies the same concept to color presentation. In particular, for secondary colors such as orange, green, and purple, the color wheel can simultaneously express the basic concept of how primary and secondary colors are created.

Figure 2 illustrates the RYB color wheel created by Johannes Itten [50]. There are two simplified color wheels that we express using 3D sound: a 6-color wheel composed of three primary colors (red, yellow, blue) and three secondary colors (orange, green, purple), shown in Figure 2a, and an 8-color wheel consisting of red, orange, yellow, yellow-green, green, blue-green, blue, and purple, shown in Figure 2b. In addition, for each hue, three color tones (light, saturated, dark), shown in Figure 2c, are expressed in 3D sound, as are the three achromatic colors white, black, and gray.

**Figure 2.** (**a**) 6-color wheel; (**b**) 8-color wheel; (**c**) saturated (S), light (L), and dark (D) for red.

For easy identification of the color code, the HRTF is applied at fixed azimuth angles (0°, 45°, 90°, 135°, 180°, 225°, 270°, and 315°) to represent each color. However, because the same HRTF affects each listener differently, 45°, 135°, 225°, and 315° can be confused with adjacent angles, which is not ideal. Therefore, the primary colors are represented by fixed 3D sounds, and the secondary colors are represented by moving 3D sounds, making it easier to recognize how two primary colors are mixed. The color representation of the six-color wheel codes is shown in Figure 3a and Table 2, and the eight-color wheel codes are shown in Figure 3b and Table 3. The six-color wheel codes are not laid out exactly like the six-color wheel in Figure 2a, because the fixed azimuth angles of 120° and 240° are perceived relatively vaguely and are not as accurately localized as 90° and 270°. Thus, yellow and blue are represented by 90° and 270°, and the range of green is relatively expanded.

**Figure 3.** (**a**) Sound representations of 6-color wheel codes; (**b**) sound representations of 8-color wheel codes.



**Table 2.** Sound representations of 6-color wheel codes.

The sound files developed in this research are provided separately as Supplementary Materials.

**Table 3.** Sound representations of 8-color wheel codes.



Each of the eight colors has three levels of brightness, expressed by changing the pitch of the sound. The unmodified audio represents the saturated color, audio raised by three semitones represents the lighter color, and audio lowered by three semitones represents the darker color. In this way, this paper proposes a color-coding system that can represent 24 chromatic colors and three achromatic colors. The strategy complies with the definition of light and dark colors in the Munsell color system, as shown in Figure 2c. Shifts of three semitones were chosen because they have little effect on the pitch character of the original sound. For the achromatic colors, gray is represented by a 3D sound moving from 360° to 0°; black is the gray sound lowered by three semitones, and white is the gray sound raised by three semitones.
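Under equal temperament, the three-semitone shifts above correspond to fixed frequency ratios; a minimal sketch of that mapping (the names are illustrative):

```python
# Tone levels and their pitch offsets in semitones, as described above.
SEMITONES = {"light": +3, "saturated": 0, "dark": -3}

def pitch_ratio(tone):
    """Frequency scaling factor for a tone level under equal temperament:
    each semitone multiplies frequency by 2**(1/12)."""
    return 2.0 ** (SEMITONES[tone] / 12.0)
```

A shift of three semitones scales frequency by 2^(3/12) ≈ 1.189 up (light) or its reciprocal down (dark), small enough to preserve the character of the original sound.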

There are many HRTF databases, such as the CIPIC HRTF database [51], the Listen HRTF database [52], and the MIT HRTF database [53]. This paper used the ITA HRTF database [54,55] to change the audio direction in MATLAB, and Adobe Audition was used to shift the pitch of the sound.
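The directional rendering step, performed here in MATLAB with the ITA HRTF database, amounts to convolving the mono source with the head-related impulse response (HRIR) pair measured at the desired azimuth. A minimal NumPy sketch of that step, with illustrative names:

```python
import numpy as np

def render_binaural(mono, hrir_left, hrir_right):
    """Place a mono signal at the direction of an HRIR pair by
    convolving it with the left- and right-ear impulse responses."""
    left = np.convolve(mono, hrir_left)
    right = np.convolve(mono, hrir_right)
    return np.stack([left, right], axis=-1)
```

Rendering a moving secondary-color sound would repeat this per trajectory step with the HRIR pair for each azimuth and cross-fade the segments; that bookkeeping is omitted here.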

### *3.2. Sound Representations of Depth*

In order to find the most suitable sound variables for expressing depth, this paper tested candidate variables experimentally and applied the results to the sound code.
